2014-07-25 17:37:41

by Abhi Das

[permalink] [raw]
Subject: [RFC PATCH 0/2] dirreadahead system call

This system call takes 3 arguments:
fd - file descriptor of the directory being readahead
*offset - offset in dir from which to resume. This is updated
as we move along in the directory
count - The max number of entries to readahead

The syscall is supposed to read up to 'count' entries starting at
'*offset' and cache the inodes corresponding to those entries. It
returns a negative error code or a positive number indicating
the number of inodes it has issued readaheads for. It also
updates the '*offset' value so that repeated calls to dirreadahead
can resume at the right location. Returns 0 when there are no more
entries left.
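
For illustration, a minimal userspace caller might look like the sketch
below (not part of the patch series). There is no glibc wrapper, so it
goes through syscall(2) directly, using the x86_64 slot (317) proposed
by patch 1/2:

/* Sketch of a userspace caller for the proposed syscall.  The syscall
 * number below is the x86_64 entry added by this series and is not
 * allocated in mainline. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/types.h>

#ifndef __NR_dirreadahead
#define __NR_dirreadahead 317    /* from syscall_64.tbl in patch 1/2 */
#endif

int main(int argc, char **argv)
{
    loff_t offset = 0;
    long n;
    int fd;

    if (argc < 2)
        return 1;
    fd = open(argv[1], O_RDONLY | O_DIRECTORY);
    if (fd < 0)
        return 1;

    /* Ask for readahead of up to 1000 entries at a time, resuming at
     * the offset the kernel hands back, until 0 (no more entries) or
     * an error is returned. */
    while ((n = syscall(__NR_dirreadahead, fd, &offset, 1000U)) > 0)
        fprintf(stderr, "issued readahead for %ld inodes\n", n);

    close(fd);
    return n < 0 ? 1 : 0;
}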

Abhi Das (2):
fs: Add dirreadahead syscall and VFS hooks
gfs2: GFS2's implementation of the dir_readahead file operation

arch/x86/syscalls/syscall_32.tbl | 1 +
arch/x86/syscalls/syscall_64.tbl | 1 +
fs/gfs2/Makefile | 3 +-
fs/gfs2/dir.c | 49 ++++++---
fs/gfs2/dir.h | 15 +++
fs/gfs2/dir_readahead.c | 209 +++++++++++++++++++++++++++++++++++++++
fs/gfs2/file.c | 2 +
fs/gfs2/main.c | 10 +-
fs/gfs2/super.c | 1 +
fs/readdir.c | 49 +++++++++
include/linux/fs.h | 3 +
11 files changed, 328 insertions(+), 15 deletions(-)
create mode 100644 fs/gfs2/dir_readahead.c

--
1.8.1.4


2014-07-25 17:37:43

by Abhi Das

[permalink] [raw]
Subject: [RFC PATCH 1/2] fs: Add dirreadahead syscall and VFS hooks

Also adds a void *opaque field to struct dir_context that can be
used by filesystems to temporarily store any context as this
struct gets passed around in the fs.

Signed-off-by: Abhi Das <[email protected]>
---
arch/x86/syscalls/syscall_32.tbl | 1 +
arch/x86/syscalls/syscall_64.tbl | 1 +
fs/readdir.c | 49 ++++++++++++++++++++++++++++++++++++++++
include/linux/fs.h | 3 +++
4 files changed, 54 insertions(+)

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index d6b8679..3e0ef85 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -360,3 +360,4 @@
351 i386 sched_setattr sys_sched_setattr
352 i386 sched_getattr sys_sched_getattr
353 i386 renameat2 sys_renameat2
+354 i386 dirreadahead sys_dirreadahead
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index ec255a1..2ec0991 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -323,6 +323,7 @@
314 common sched_setattr sys_sched_setattr
315 common sched_getattr sys_sched_getattr
316 common renameat2 sys_renameat2
+317 common dirreadahead sys_dirreadahead

#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/fs/readdir.c b/fs/readdir.c
index 33fd922..d216db7 100644
--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -198,6 +198,7 @@ SYSCALL_DEFINE3(getdents, unsigned int, fd,
struct linux_dirent __user * lastdirent;
struct getdents_callback buf = {
.ctx.actor = filldir,
+ .ctx.opaque = NULL,
.count = count,
.current_dir = dirent
};
@@ -278,6 +279,7 @@ SYSCALL_DEFINE3(getdents64, unsigned int, fd,
struct linux_dirent64 __user * lastdirent;
struct getdents_callback64 buf = {
.ctx.actor = filldir64,
+ .ctx.opaque = NULL,
.count = count,
.current_dir = dirent
};
@@ -304,3 +306,50 @@ SYSCALL_DEFINE3(getdents64, unsigned int, fd,
fdput(f);
return error;
}
+
+SYSCALL_DEFINE3(dirreadahead, unsigned int, fd,
+ loff_t __user *, offset, unsigned int, count)
+{
+ struct fd f;
+ struct inode *inode;
+ int error = -ENOTDIR;
+ loff_t off = 0;
+ struct dir_context ctx = {.actor = NULL, .opaque = NULL};
+
+ if (!count)
+ return -EINVAL;
+
+ f = fdget(fd);
+ if (!f.file)
+ return -EBADF;
+
+ inode = f.file->f_path.dentry->d_inode;
+
+ error = -ENOTSUPP;
+ if (!f.file->f_op || !f.file->f_op->dir_readahead)
+ goto out;
+
+ error = security_file_permission(f.file, MAY_READ);
+ if (error)
+ goto out;
+
+ error = -EFAULT;
+ if (__get_user(ctx.pos, offset))
+ goto out;
+
+ error = mutex_lock_killable(&inode->i_mutex);
+ if (error)
+ goto out;
+
+ error = -ENOENT;
+ if (!IS_DEADDIR(inode)) {
+ error = f.file->f_op->dir_readahead(f.file, &ctx, count);
+ if (__put_user(ctx.pos, offset))
+ error = -EFAULT;
+ file_accessed(f.file);
+ }
+ mutex_unlock(&inode->i_mutex);
+out:
+ fdput(f);
+ return error;
+}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 338e6f7..fae4a6e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1438,9 +1438,11 @@ int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
* to have different dirent layouts depending on the binary type.
*/
typedef int (*filldir_t)(void *, const char *, int, loff_t, u64, unsigned);
+
struct dir_context {
const filldir_t actor;
loff_t pos;
+ void *opaque;
};

struct block_device_operations;
@@ -1463,6 +1465,7 @@ struct file_operations {
ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
int (*iterate) (struct file *, struct dir_context *);
+ int (*dir_readahead) (struct file *, struct dir_context *, unsigned int);
unsigned int (*poll) (struct file *, struct poll_table_struct *);
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
--
1.8.1.4

2014-07-25 17:41:17

by Abhi Das

[permalink] [raw]
Subject: [RFC PATCH 2/2] gfs2: GFS2's implementation of the dir_readahead file operation

This patch adds support to GFS2 for the dirreadahead system call.

There's a hard limit (128K) on the number of inodes that can be
read ahead at one time. There's also a memory cap on the number of
inode numbers we collect. We read ahead whatever number of inodes
we have collected once the first of these two caps is hit.

Readahead is done in two stages. In the initial stage, the
directory is read through and the inode numbers of all its entries
are collected. In the second stage, readaheads are performed
asynchronously using workqueues, so the syscall can return to the
user before the readaheads complete.
Subsequent getdents calls on the directory and stat calls on the
inodes will have the time-consuming lookups already done for them
and will therefore be quick.

Signed-off-by: Abhi Das <[email protected]>
---
fs/gfs2/Makefile | 3 +-
fs/gfs2/dir.c | 49 +++++++++---
fs/gfs2/dir.h | 15 ++++
fs/gfs2/dir_readahead.c | 209 ++++++++++++++++++++++++++++++++++++++++++++++++
fs/gfs2/file.c | 2 +
fs/gfs2/main.c | 10 ++-
fs/gfs2/super.c | 1 +
7 files changed, 274 insertions(+), 15 deletions(-)
create mode 100644 fs/gfs2/dir_readahead.c

diff --git a/fs/gfs2/Makefile b/fs/gfs2/Makefile
index 8612820..2765c83 100644
--- a/fs/gfs2/Makefile
+++ b/fs/gfs2/Makefile
@@ -4,7 +4,8 @@ gfs2-y := acl.o bmap.o dir.o xattr.o glock.o \
glops.o log.o lops.o main.o meta_io.o \
aops.o dentry.o export.o file.o \
ops_fstype.o inode.o quota.o \
- recovery.o rgrp.o super.o sys.o trans.o util.o
+ recovery.o rgrp.o super.o sys.o trans.o \
+ dir_readahead.o util.o

gfs2-$(CONFIG_GFS2_FS_LOCKING_DLM) += lock_dlm.o

diff --git a/fs/gfs2/dir.c b/fs/gfs2/dir.c
index 1a349f9..f068763 100644
--- a/fs/gfs2/dir.c
+++ b/fs/gfs2/dir.c
@@ -1217,6 +1217,20 @@ static int compare_dents(const void *a, const void *b)
return ret;
}

+static int gfs2_dirent_dot_or_dotdot(const struct gfs2_dirent *dent)
+{
+ const char *name = (char *)(dent + 1);
+
+ if (be16_to_cpu(dent->de_type) == DT_DIR) {
+ if (be16_to_cpu(dent->de_name_len) == 1 && name[0] == '.')
+ return 1;
+ if (be16_to_cpu(dent->de_name_len) == 2 &&
+ strncmp(name, "..", 2) == 0)
+ return 1;
+ }
+ return 0;
+}
+
/**
* do_filldir_main - read out directory entries
* @dip: The GFS2 inode
@@ -1262,8 +1276,12 @@ static int do_filldir_main(struct gfs2_inode *dip, struct dir_context *ctx,
ctx->pos = off;

if (off_next == off) {
- if (*copied && !run)
+ if (*copied && !run) {
+ struct gfs2_dir_ra *d_ra = ctx->opaque;
+ if (d_ra)
+ set_bit(RA_FL_HASHCOLL, &d_ra->flags);
return 1;
+ }
run = 1;
} else
run = 0;
@@ -1273,11 +1291,18 @@ static int do_filldir_main(struct gfs2_inode *dip, struct dir_context *ctx,
ctx->pos = off;
}

- if (!dir_emit(ctx, (const char *)(dent + 1),
- be16_to_cpu(dent->de_name_len),
- be64_to_cpu(dent->de_inum.no_addr),
- be16_to_cpu(dent->de_type)))
- return 1;
+ if (ctx->actor) {
+ if (!dir_emit(ctx, (const char *)(dent + 1),
+ be16_to_cpu(dent->de_name_len),
+ be64_to_cpu(dent->de_inum.no_addr),
+ be16_to_cpu(dent->de_type)))
+ return 1;
+ } else { /* we were called by dir_readahead */
+ if (gfs2_dirent_dot_or_dotdot(dent))
+ continue;
+ if (collect_inode_blocks(ctx, be64_to_cpu(dent->de_inum.no_addr)))
+ return 1;
+ }

*copied = 1;
}
@@ -1311,8 +1336,7 @@ static void gfs2_free_sort_buffer(void *ptr)
}

static int gfs2_dir_read_leaf(struct inode *inode, struct dir_context *ctx,
- int *copied, unsigned *depth,
- u64 leaf_no)
+ int *copied, unsigned *depth, u64 leaf_no)
{
struct gfs2_inode *ip = GFS2_I(inode);
struct gfs2_sbd *sdp = GFS2_SB(inode);
@@ -1399,14 +1423,14 @@ out:
}

/**
- * gfs2_dir_readahead - Issue read-ahead requests for leaf blocks.
+ * gfs2_dir_leaf_ra - Issue read-ahead requests for leaf blocks.
*
* Note: we can't calculate each index like dir_e_read can because we don't
* have the leaf, and therefore we don't have the depth, and therefore we
* don't have the length. So we have to just read enough ahead to make up
* for the loss of information.
*/
-static void gfs2_dir_readahead(struct inode *inode, unsigned hsize, u32 index,
+static void gfs2_dir_leaf_ra(struct inode *inode, unsigned hsize, u32 index,
struct file_ra_state *f_ra)
{
struct gfs2_inode *ip = GFS2_I(inode);
@@ -1474,11 +1498,10 @@ static int dir_e_read(struct inode *inode, struct dir_context *ctx,
if (IS_ERR(lp))
return PTR_ERR(lp);

- gfs2_dir_readahead(inode, hsize, index, f_ra);
+ gfs2_dir_leaf_ra(inode, hsize, index, f_ra);

while (index < hsize) {
- error = gfs2_dir_read_leaf(inode, ctx,
- &copied, &depth,
+ error = gfs2_dir_read_leaf(inode, ctx, &copied, &depth,
be64_to_cpu(lp[index]));
if (error)
break;
diff --git a/fs/gfs2/dir.h b/fs/gfs2/dir.h
index 126c65d..429eea9 100644
--- a/fs/gfs2/dir.h
+++ b/fs/gfs2/dir.h
@@ -25,6 +25,21 @@ struct gfs2_diradd {
struct buffer_head *bh;
};

+extern struct workqueue_struct *gfs2_dir_ra_wq;
+#define RA_MAX_INOS 131072 /*128K */
+#define RA_FL_HASHCOLL 1
+
+struct gfs2_dir_ra {
+ u64 *inos;
+ size_t size;
+ size_t count;
+ unsigned int req;
+ unsigned long flags;
+};
+
+extern int gfs2_dir_readahead(struct file *file, struct dir_context *ctx,
+ unsigned int count);
+extern int collect_inode_blocks(struct dir_context *ctx, u64 offset);
extern struct inode *gfs2_dir_search(struct inode *dir,
const struct qstr *filename,
bool fail_on_exist);
diff --git a/fs/gfs2/dir_readahead.c b/fs/gfs2/dir_readahead.c
new file mode 100644
index 0000000..98888ad
--- /dev/null
+++ b/fs/gfs2/dir_readahead.c
@@ -0,0 +1,209 @@
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/completion.h>
+#include <linux/buffer_head.h>
+#include <linux/pagemap.h>
+#include <linux/uio.h>
+#include <linux/blkdev.h>
+#include <linux/mm.h>
+#include <linux/mount.h>
+#include <linux/sort.h>
+#include <linux/fs.h>
+#include <linux/gfs2_ondisk.h>
+#include <linux/falloc.h>
+#include <linux/swap.h>
+#include <linux/crc32.h>
+#include <linux/writeback.h>
+#include <asm/uaccess.h>
+#include <linux/dlm.h>
+#include <linux/dlm_plock.h>
+
+#include "gfs2.h"
+#include "incore.h"
+#include "bmap.h"
+#include "dir.h"
+#include "glock.h"
+#include "glops.h"
+#include "inode.h"
+#include "log.h"
+#include "meta_io.h"
+#include "quota.h"
+#include "rgrp.h"
+#include "trans.h"
+#include "util.h"
+
+struct workqueue_struct *gfs2_dir_ra_wq;
+
+static int compare_inos(const void *a, const void *b)
+{
+ u64 ino_a, ino_b;
+
+ ino_a = *(u64 *)a;
+ ino_b = *(u64 *)b;
+
+ if (ino_a > ino_b)
+ return 1;
+ return -1;
+}
+
+static int collect_more(struct gfs2_dir_ra *d_ra)
+{
+ return (d_ra->count < d_ra->req &&
+ (d_ra->count * sizeof(u64)) < d_ra->size);
+}
+
+int collect_inode_blocks(struct dir_context *ctx, u64 ino)
+{
+ struct gfs2_dir_ra *d_ra = (struct gfs2_dir_ra *) ctx->opaque;
+
+ if (!collect_more(d_ra))
+ return 1; /* Collected requested blocks */
+
+ d_ra->inos[d_ra->count++] = ino;
+ return 0;
+}
+
+struct dir_ra_work {
+ struct work_struct work;
+ u64 ino;
+ struct gfs2_sbd *sdp;
+};
+
+static void dir_ra_work_func(struct work_struct *work)
+{
+ struct dir_ra_work *w = container_of(work, struct dir_ra_work, work);
+
+ /* XXX: What to do if sdp disappears by the time we get here? */
+ struct inode *inode = gfs2_lookup_by_inum(w->sdp, w->ino, NULL,
+ GFS2_BLKST_DINODE);
+ if (IS_ERR(inode)) {
+ fs_err(w->sdp, "can't read in inode at addr:%llu: %ld\n",
+ w->ino, PTR_ERR(inode));
+ }
+ gfs2_inode_refresh(GFS2_I(inode));
+ iput(inode);
+ kfree(work);
+}
+
+int gfs2_queue_dir_ra(struct dir_context *ctx, struct gfs2_sbd *sdp)
+{
+ int i;
+ struct gfs2_dir_ra *d_ra = (struct gfs2_dir_ra *) ctx->opaque;
+
+ sort(d_ra->inos, d_ra->count, sizeof(u64), compare_inos, NULL);
+
+ for (i=0; i<d_ra->count; i++) {
+ struct dir_ra_work *w;
+
+ w = kmalloc(sizeof(struct dir_ra_work), GFP_NOFS | __GFP_NOWARN);
+ if (!w)
+ break;
+
+ w->ino = d_ra->inos[i];
+ w->sdp = sdp;
+ INIT_WORK(&w->work, dir_ra_work_func);
+ queue_work(gfs2_dir_ra_wq, &w->work);
+ }
+ if (!i)
+ return -ENOMEM;
+ if (i != d_ra->count)
+ ctx->pos = 0; /* Don't know the resume offset for a short RA */
+ return i;
+}
+
+static inline unsigned int compute_ra_bufsize(unsigned int count)
+{
+ unsigned int size = count * (sizeof(u64));
+
+ if (size > KMALLOC_MAX_SIZE)
+ return KMALLOC_MAX_SIZE;
+ if (size < KMALLOC_MIN_SIZE)
+ return KMALLOC_MIN_SIZE;
+
+ return size;
+}
+
+static int init_ra_context(struct gfs2_inode *ip, struct dir_context *ctx,
+ unsigned int count)
+{
+ unsigned int bufsize;
+ struct gfs2_dir_ra *d_ra = (struct gfs2_dir_ra *) ctx->opaque;
+
+ memset(d_ra, 0, sizeof(struct gfs2_dir_ra));
+ count = count > RA_MAX_INOS ? RA_MAX_INOS : count;
+ count = count > ip->i_entries ? ip->i_entries : count;
+
+ bufsize = compute_ra_bufsize(count);
+ d_ra->inos = kmalloc(bufsize, GFP_NOFS | __GFP_NOWARN);
+ if (!d_ra->inos)
+ return -ENOMEM;
+
+ d_ra->size = bufsize;
+ d_ra->req = count;
+
+ return 0;
+}
+
+static void uninit_ra_context(struct dir_context *ctx)
+{
+ struct gfs2_dir_ra *d_ra;
+
+ if (!ctx || !ctx->opaque)
+ return;
+ d_ra = (struct gfs2_dir_ra *) ctx->opaque;
+ if (d_ra->inos)
+ kfree(d_ra->inos);
+ memset(d_ra, 0, sizeof(struct gfs2_dir_ra));
+}
+/**
+ * gfs2_dir_readahead - GFS2's implementation of readdir readahead
+ * @file : The directory to be read from
+ * @ctx : Context contains buffer to collect inode numbers
+ *
+ * Readahead inode disk blocks (and extended attribute blocks if requested)
+ * of every directory entry
+ *
+ * Returns: +ve number: The number of entries for which readahead calls
+ * were issued
+ * -ve values: For error conditions
+ */
+int gfs2_dir_readahead(struct file *file, struct dir_context *ctx, unsigned int count)
+{
+ int error = -EINVAL;
+ struct inode *dir = file->f_mapping->host;
+ struct gfs2_inode *dip = GFS2_I(dir);
+ struct gfs2_holder d_gh;
+ struct gfs2_dir_ra d_ra;
+
+ if (!ctx)
+ goto out;
+
+ ctx->opaque = &d_ra;
+ error = init_ra_context(dip, ctx, count);
+ if (error)
+ goto out;
+
+ gfs2_holder_init(dip->i_gl, LM_ST_SHARED, 0, &d_gh);
+ error = gfs2_glock_nq(&d_gh);
+ if (error) {
+ gfs2_holder_uninit(&d_gh);
+ goto uninit;
+ }
+
+retry:
+ error = gfs2_dir_read(dir, ctx, &file->f_ra);
+ if (test_bit(RA_FL_HASHCOLL, &d_ra.flags)) {
+ clear_bit(RA_FL_HASHCOLL, &d_ra.flags);
+ goto retry;
+ }
+
+ /* Pass the collected inos to the workqueues to be read ahead */
+ if (d_ra.count)
+ error = gfs2_queue_dir_ra(ctx, GFS2_SB(dir));
+
+ gfs2_glock_dq_uninit(&d_gh);
+uninit:
+ uninit_ra_context(ctx);
+out:
+ return error;
+}
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 26b3f95..6135bb9 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -1075,6 +1075,7 @@ const struct file_operations gfs2_file_fops = {

const struct file_operations gfs2_dir_fops = {
.iterate = gfs2_readdir,
+ .dir_readahead = gfs2_dir_readahead,
.unlocked_ioctl = gfs2_ioctl,
.open = gfs2_open,
.release = gfs2_release,
@@ -1105,6 +1106,7 @@ const struct file_operations gfs2_file_fops_nolock = {

const struct file_operations gfs2_dir_fops_nolock = {
.iterate = gfs2_readdir,
+ .dir_readahead = gfs2_dir_readahead,
.unlocked_ioctl = gfs2_ioctl,
.open = gfs2_open,
.release = gfs2_release,
diff --git a/fs/gfs2/main.c b/fs/gfs2/main.c
index 82b6ac8..71e8ce5 100644
--- a/fs/gfs2/main.c
+++ b/fs/gfs2/main.c
@@ -161,9 +161,14 @@ static int __init init_gfs2_fs(void)
if (!gfs2_control_wq)
goto fail_recovery;

+ gfs2_dir_ra_wq = alloc_workqueue("gfs2_dir_ra",
+ WQ_MEM_RECLAIM | WQ_FREEZABLE, 0);
+ if (!gfs2_dir_ra_wq)
+ goto fail_control;
+
gfs2_page_pool = mempool_create_page_pool(64, 0);
if (!gfs2_page_pool)
- goto fail_control;
+ goto fail_ra;

gfs2_register_debugfs();

@@ -171,6 +176,8 @@ static int __init init_gfs2_fs(void)

return 0;

+fail_ra:
+ destroy_workqueue(gfs2_dir_ra_wq);
fail_control:
destroy_workqueue(gfs2_control_wq);
fail_recovery:
@@ -224,6 +231,7 @@ static void __exit exit_gfs2_fs(void)
unregister_filesystem(&gfs2meta_fs_type);
destroy_workqueue(gfs_recovery_wq);
destroy_workqueue(gfs2_control_wq);
+ destroy_workqueue(gfs2_dir_ra_wq);
list_lru_destroy(&gfs2_qd_lru);

rcu_barrier();
diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index 669e92e..8654636 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -849,6 +849,7 @@ static int gfs2_make_fs_ro(struct gfs2_sbd *sdp)
kthread_stop(sdp->sd_quotad_process);
kthread_stop(sdp->sd_logd_process);

+ flush_workqueue(gfs2_dir_ra_wq);
flush_workqueue(gfs2_delete_workqueue);
gfs2_quota_sync(sdp->sd_vfs, 0);
gfs2_statfs_sync(sdp->sd_vfs, 0);
--
1.8.1.4

2014-07-26 05:27:30

by Andreas Dilger

[permalink] [raw]
Subject: Re: [RFC PATCH 0/2] dirreadahead system call

Is there a time when this doesn't get called to prefetch entries in
readdir() order? It isn't clear to me what benefit there is of returning
the entries to userspace instead of just doing the statahead implicitly
in the kernel?

The Lustre client has had what we call "statahead" for a while,
and similar to regular file readahead it detects the sequential access
pattern for readdir() + stat() in readdir() order (taking into account if ".*"
entries are being processed or not) and starts fetching the inode
attributes asynchronously with a worker thread.

This syscall might be more useful if userspace called readdir() to get
the dirents and then passed the kernel the list of inode numbers
to prefetch before starting on the stat() calls. That way, userspace
could generate an arbitrary list of inodes (e.g. names matching a
regexp) and the kernel doesn't need to guess if every inode is needed.
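
A purely illustrative sketch of such a flow follows; the inodes_prefetch()
call is hypothetical and stubbed out so the fragment compiles - it stands
in for whatever syscall or ioctl would accept the list of inode numbers:

/* Walk the directory with readdir(3), filter the names of interest, and
 * hand the kernel a batch of inode numbers to prefetch before the stat()
 * pass begins.  The prefetch entry point is hypothetical. */
#include <dirent.h>
#include <fnmatch.h>
#include <stddef.h>
#include <stdint.h>

static long inodes_prefetch(int dfd, const uint64_t *inos, size_t count)
{
    /* Placeholder for a (non-existent) bulk inode prefetch call. */
    (void)dfd; (void)inos;
    return (long)count;
}

void prefetch_matching(DIR *dir, const char *pattern)
{
    uint64_t inos[1024];
    size_t n = 0;
    struct dirent *de;

    while ((de = readdir(dir)) != NULL) {
        /* Only prefetch the entries userspace actually wants,
         * e.g. names matching "*.c". */
        if (fnmatch(pattern, de->d_name, 0) != 0)
            continue;
        inos[n++] = de->d_ino;
        if (n == sizeof(inos) / sizeof(inos[0])) {
            inodes_prefetch(dirfd(dir), inos, n);
            n = 0;
        }
    }
    if (n)
        inodes_prefetch(dirfd(dir), inos, n);
    /* ... the stat() calls on the matching names would follow ... */
}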

As it stands, this syscall doesn't help in anything other than readdir
order (or of the directory is small enough to be handled in one
syscall), which could be handled by the kernel internally already,
and it may fetch a considerable number of extra inodes from
disk if not every inode needs to be touched.

Cheers, Andreas

> On Jul 25, 2014, at 11:37, Abhi Das <[email protected]> wrote:
>
> This system call takes 3 arguments:
> fd - file descriptor of the directory being readahead
> *offset - offset in dir from which to resume. This is updated
> as we move along in the directory
> count - The max number of entries to readahead
>
> The syscall is supposed to read upto 'count' entries starting at
> '*offset' and cache the inodes corresponding to those entries. It
> returns a negative error code or a positive number indicating
> the number of inodes it has issued readaheads for. It also
> updates the '*offset' value so that repeated calls to dirreadahead
> can resume at the right location. Returns 0 when there are no more
> entries left.
>
> Abhi Das (2):
> fs: Add dirreadahead syscall and VFS hooks
> gfs2: GFS2's implementation of the dir_readahead file operation
>
> arch/x86/syscalls/syscall_32.tbl | 1 +
> arch/x86/syscalls/syscall_64.tbl | 1 +
> fs/gfs2/Makefile | 3 +-
> fs/gfs2/dir.c | 49 ++++++---
> fs/gfs2/dir.h | 15 +++
> fs/gfs2/dir_readahead.c | 209 +++++++++++++++++++++++++++++++++++++++
> fs/gfs2/file.c | 2 +
> fs/gfs2/main.c | 10 +-
> fs/gfs2/super.c | 1 +
> fs/readdir.c | 49 +++++++++
> include/linux/fs.h | 3 +
> 11 files changed, 328 insertions(+), 15 deletions(-)
> create mode 100644 fs/gfs2/dir_readahead.c
>
> --
> 1.8.1.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2014-07-28 12:52:17

by Abhi Das

[permalink] [raw]
Subject: Re: [RFC PATCH 0/2] dirreadahead system call



----- Original Message -----
> From: "Andreas Dilger" <[email protected]>
> To: "Abhi Das" <[email protected]>
> Cc: [email protected], [email protected], [email protected]
> Sent: Saturday, July 26, 2014 12:27:19 AM
> Subject: Re: [RFC PATCH 0/2] dirreadahead system call
>
> Is there a time when this doesn't get called to prefetch entries in
> readdir() order? It isn't clear to me what benefit there is of returning
> the entries to userspace instead of just doing the statahead implicitly
> in the kernel?
>
> The Lustre client has had what we call "statahead" for a while,
> and similar to regular file readahead it detects the sequential access
> pattern for readdir() + stat() in readdir() order (taking into account if
> ".*"
> entries are being processed or not) and starts fetching the inode
> attributes asynchronously with a worker thread.

Does this heuristic work well in practice? In the use case we were trying to
address, a Samba server is aware beforehand if it is going to stat all the
inodes in a directory.

>
> This syscall might be more useful if userspace called readdir() to get
> the dirents and then passed the kernel the list of inode numbers
> to prefetch before starting on the stat() calls. That way, userspace
> could generate an arbitrary list of inodes (e.g. names matching a
> regexp) and the kernel doesn't need to guess if every inode is needed.

Were you thinking arbitrary inodes across the filesystem or just a subset
from a directory? Arbitrary inodes may potentially throw up locking issues.
But yeah, as Steve mentioned in a previous email, limiting the inodes
readahead in some fashion other than a range in readdir() order is something
that we are thinking of (list of inodes based on regexps, filenames etc). We
just chose to do an offset range of the directory for a quick, early
implementation.

>
> As it stands, this syscall doesn't help in anything other than readdir
> order (or of the directory is small enough to be handled in one
> syscall), which could be handled by the kernel internally already,
> and it may fetch a considerable number of extra inodes from
> disk if not every inode needs to be touched.

The need for this syscall came up from a specific use case - Samba. I'm told
that Windows clients like to stat every file in a directory as soon as it is
read in and this has been a slow operation.

Cheers!
--Abhi

2014-07-28 21:19:41

by Andreas Dilger

[permalink] [raw]
Subject: Re: [RFC PATCH 0/2] dirreadahead system call

On Jul 28, 2014, at 6:52 AM, Abhijith Das <[email protected]> wrote:
> On July 26, 2014 12:27:19 AM "Andreas Dilger" <[email protected]> wrote:
>> Is there a time when this doesn't get called to prefetch entries in
>> readdir() order? It isn't clear to me what benefit there is of returning
>> the entries to userspace instead of just doing the statahead implicitly
>> in the kernel?
>>
>> The Lustre client has had what we call "statahead" for a while,
>> and similar to regular file readahead it detects the sequential access
>> pattern for readdir() + stat() in readdir() order (taking into account if
>> ".*"
>> entries are being processed or not) and starts fetching the inode
>> attributes asynchronously with a worker thread.
>
> Does this heuristic work well in practice? In the use case we were trying to
> address, a Samba server is aware beforehand if it is going to stat all the
> inodes in a directory.

Typically this works well for us, because this is done by the Lustre
client, so the statahead is hiding the network latency of the RPCs to
fetch attributes from the server. I imagine the same could be seen with
GFS2. I don't know if this approach would help very much for local
filesystems because the latency is low.

>> This syscall might be more useful if userspace called readdir() to get
>> the dirents and then passed the kernel the list of inode numbers
>> to prefetch before starting on the stat() calls. That way, userspace
>> could generate an arbitrary list of inodes (e.g. names matching a
>> regexp) and the kernel doesn't need to guess if every inode is needed.
>
> Were you thinking arbitrary inodes across the filesystem or just a subset
> from a directory? Arbitrary inodes may potentially throw up locking issues.

I was thinking about inodes returned from readdir(), but the syscall
would be much more useful if it could handle arbitrary inodes. For
example, if directories are small then it may be more efficient to
aggregate inodes from multiple directories for each prefetch syscall.
I can't really think of any locking issues that could exist with
"arbitrary list of inodes" that couldn't be hit by having a directory
with hard links to the same list of inodes, so this is something that
needs to be handled by the filesystem anyway.

Since this would be an advisory syscall (i.e. it doesn't really
return anything and cannot guarantee that all the inodes will be in
memory), then if the filesystem is having trouble prefetching the
inodes (e.g. invalid inode number(s) or lock ordering or contention
issues) it could always bail out and leave it to stat() to actually
fetch the inodes into memory when accessed.

There is no way it would be sane to keep inodes locked in the kernel
after prefetch, in case the "stat" never arrives, so the best it can
do is cache the inodes in memory (on the client for network fs), and
it is possible cache pressure or lock contention drops them from cache.

There are always going to be race conditions even if limited to a
single directory (e.g. another client modifies the inode after calling
dirreadahead(), but before calling stat()) that need to be handled.

I think there are a lot of benefits that could be had by the generic
syscall, possibly similar to what XFS is doing with the "bulkstat"
interfaces that Dave always mentions. This would be much more so for
cases were you don't want to stat all of the entries in a directory.

> But yeah, as Steve mentioned in a previous email, limiting the inodes
> readahead in some fashion other than a range in readdir() order is
> something that we are thinking of (list of inodes based on regexps,
> filenames etc). We just chose to do an offset range of the directory
> for a quick, early implementation.



>> As it stands, this syscall doesn't help in anything other than readdir
>> order (or of the directory is small enough to be handled in one
>> syscall), which could be handled by the kernel internally already,
>> and it may fetch a considerable number of extra inodes from
>> disk if not every inode needs to be touched.
>
> The need for this syscall came up from a specific use case - Samba.
> I'm told that Windows clients like to stat every file in a directory
> as soon as it is read in and this has been a slow operation.

Cheers, Andreas






Attachments:
signature.asc (833.00 B)
Message signed with OpenPGP using GPGMail

by Michael Kerrisk

[permalink] [raw]
Subject: Re: [RFC PATCH 0/2] dirreadahead system call

On Fri, Jul 25, 2014 at 7:37 PM, Abhi Das <[email protected]> wrote:
> This system call takes 3 arguments:
> fd - file descriptor of the directory being readahead
> *offset - offset in dir from which to resume. This is updated
> as we move along in the directory
> count - The max number of entries to readahead
>
> The syscall is supposed to read upto 'count' entries starting at
> '*offset' and cache the inodes corresponding to those entries. It
> returns a negative error code or a positive number indicating
> the number of inodes it has issued readaheads for. It also
> updates the '*offset' value so that repeated calls to dirreadahead
> can resume at the right location. Returns 0 when there are no more
> entries left.

Hello Abhi,

As per Documentation/SubmitChecklist, please CC linux-api on patches
that change the kernel-user-space API/ABI. (See
https://www.kernel.org/doc/man-pages/linux-api-ml.html for more
details.)

Cheers,

Michael


> Abhi Das (2):
> fs: Add dirreadahead syscall and VFS hooks
> gfs2: GFS2's implementation of the dir_readahead file operation
>
> arch/x86/syscalls/syscall_32.tbl | 1 +
> arch/x86/syscalls/syscall_64.tbl | 1 +
> fs/gfs2/Makefile | 3 +-
> fs/gfs2/dir.c | 49 ++++++---
> fs/gfs2/dir.h | 15 +++
> fs/gfs2/dir_readahead.c | 209 +++++++++++++++++++++++++++++++++++++++
> fs/gfs2/file.c | 2 +
> fs/gfs2/main.c | 10 +-
> fs/gfs2/super.c | 1 +
> fs/readdir.c | 49 +++++++++
> include/linux/fs.h | 3 +
> 11 files changed, 328 insertions(+), 15 deletions(-)
> create mode 100644 fs/gfs2/dir_readahead.c
>
> --
> 1.8.1.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/



--
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/

by Michael Kerrisk

[permalink] [raw]
Subject: Re: [RFC PATCH 1/2] fs: Add dirreadahead syscall and VFS hooks

[CC+=linux-api]

On Fri, Jul 25, 2014 at 7:37 PM, Abhi Das <[email protected]> wrote:
> Also adds a void *opaque field to struct dir_context that can be
> used by filesystems to temporarily store any context as this
> struct gets passed around in the fs.
>
> Signed-off-by: Abhi Das <[email protected]>
> ---
> arch/x86/syscalls/syscall_32.tbl | 1 +
> arch/x86/syscalls/syscall_64.tbl | 1 +
> fs/readdir.c | 49 ++++++++++++++++++++++++++++++++++++++++
> include/linux/fs.h | 3 +++
> 4 files changed, 54 insertions(+)
>
> diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
> index d6b8679..3e0ef85 100644
> --- a/arch/x86/syscalls/syscall_32.tbl
> +++ b/arch/x86/syscalls/syscall_32.tbl
> @@ -360,3 +360,4 @@
> 351 i386 sched_setattr sys_sched_setattr
> 352 i386 sched_getattr sys_sched_getattr
> 353 i386 renameat2 sys_renameat2
> +354 i386 dirreadahead sys_dirreadahead
> diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
> index ec255a1..2ec0991 100644
> --- a/arch/x86/syscalls/syscall_64.tbl
> +++ b/arch/x86/syscalls/syscall_64.tbl
> @@ -323,6 +323,7 @@
> 314 common sched_setattr sys_sched_setattr
> 315 common sched_getattr sys_sched_getattr
> 316 common renameat2 sys_renameat2
> +317 common dirreadahead sys_dirreadahead
>
> #
> # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/fs/readdir.c b/fs/readdir.c
> index 33fd922..d216db7 100644
> --- a/fs/readdir.c
> +++ b/fs/readdir.c
> @@ -198,6 +198,7 @@ SYSCALL_DEFINE3(getdents, unsigned int, fd,
> struct linux_dirent __user * lastdirent;
> struct getdents_callback buf = {
> .ctx.actor = filldir,
> + .ctx.opaque = NULL,
> .count = count,
> .current_dir = dirent
> };
> @@ -278,6 +279,7 @@ SYSCALL_DEFINE3(getdents64, unsigned int, fd,
> struct linux_dirent64 __user * lastdirent;
> struct getdents_callback64 buf = {
> .ctx.actor = filldir64,
> + .ctx.opaque = NULL,
> .count = count,
> .current_dir = dirent
> };
> @@ -304,3 +306,50 @@ SYSCALL_DEFINE3(getdents64, unsigned int, fd,
> fdput(f);
> return error;
> }
> +
> +SYSCALL_DEFINE3(dirreadahead, unsigned int, fd,
> + loff_t __user *, offset, unsigned int, count)
> +{
> + struct fd f;
> + struct inode *inode;
> + int error = -ENOTDIR;
> + loff_t off = 0;
> + struct dir_context ctx = {.actor = NULL, .opaque = NULL};
> +
> + if (!count)
> + return -EINVAL;
> +
> + f = fdget(fd);
> + if (!f.file)
> + return -EBADF;
> +
> + inode = f.file->f_path.dentry->d_inode;
> +
> + error = -ENOTSUPP;
> + if (!f.file->f_op || !f.file->f_op->dir_readahead)
> + goto out;
> +
> + error = security_file_permission(f.file, MAY_READ);
> + if (error)
> + goto out;
> +
> + error = -EFAULT;
> + if (__get_user(ctx.pos, offset))
> + goto out;
> +
> + error = mutex_lock_killable(&inode->i_mutex);
> + if (error)
> + goto out;
> +
> + error = -ENOENT;
> + if (!IS_DEADDIR(inode)) {
> + error = f.file->f_op->dir_readahead(f.file, &ctx, count);
> + if (__put_user(ctx.pos, offset))
> + error = -EFAULT;
> + file_accessed(f.file);
> + }
> + mutex_unlock(&inode->i_mutex);
> +out:
> + fdput(f);
> + return error;
> +}
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 338e6f7..fae4a6e 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1438,9 +1438,11 @@ int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags);
> * to have different dirent layouts depending on the binary type.
> */
> typedef int (*filldir_t)(void *, const char *, int, loff_t, u64, unsigned);
> +
> struct dir_context {
> const filldir_t actor;
> loff_t pos;
> + void *opaque;
> };
>
> struct block_device_operations;
> @@ -1463,6 +1465,7 @@ struct file_operations {
> ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
> ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
> int (*iterate) (struct file *, struct dir_context *);
> + int (*dir_readahead) (struct file *, struct dir_context *, unsigned int);
> unsigned int (*poll) (struct file *, struct poll_table_struct *);
> long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
> long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
> --
> 1.8.1.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html



--
Michael Kerrisk Linux man-pages maintainer;
http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface", http://blog.man7.org/

2014-07-29 09:36:40

by Steven Whitehouse

[permalink] [raw]
Subject: Re: [Cluster-devel] [RFC PATCH 0/2] dirreadahead system call

Hi,

On 28/07/14 22:19, Andreas Dilger wrote:
> On Jul 28, 2014, at 6:52 AM, Abhijith Das <[email protected]> wrote:
>> On July 26, 2014 12:27:19 AM "Andreas Dilger" <[email protected]> wrote:
>>> Is there a time when this doesn't get called to prefetch entries in
>>> readdir() order? It isn't clear to me what benefit there is of returning
>>> the entries to userspace instead of just doing the statahead implicitly
>>> in the kernel?
>>>
>>> The Lustre client has had what we call "statahead" for a while,
>>> and similar to regular file readahead it detects the sequential access
>>> pattern for readdir() + stat() in readdir() order (taking into account if
>>> ".*"
>>> entries are being processed or not) and starts fetching the inode
>>> attributes asynchronously with a worker thread.
>> Does this heuristic work well in practice? In the use case we were trying to
>> address, a Samba server is aware beforehand if it is going to stat all the
>> inodes in a directory.
> Typically this works well for us, because this is done by the Lustre
> client, so the statahead is hiding the network latency of the RPCs to
> fetch attributes from the server. I imagine the same could be seen with
> GFS2. I don't know if this approach would help very much for local
> filesystems because the latency is low.
Well GFS2 is more or less a local fs in the case of no contention. There
are issues that arise in the contention case, which is why it is very
important to ensure that there is as close to a 100% chance of using the
inodes which are pre-fetched as possible. That's why we've not yet tried
out algorithms that try to guess when the inodes will be required, since
one wrong guess can be quite costly in terms of lock bouncing between
nodes. However it does seem that the suggestions made in this thread
about using ->lookup in the directory as one way to gather information
may be a good way to go, since it will only be called for uncached
inodes and in addition it would be equally useful to any possible
operation which may follow the lookup.

>>> This syscall might be more useful if userspace called readdir() to get
>>> the dirents and then passed the kernel the list of inode numbers
>>> to prefetch before starting on the stat() calls. That way, userspace
>>> could generate an arbitrary list of inodes (e.g. names matching a
>>> regexp) and the kernel doesn't need to guess if every inode is needed.
>> Were you thinking arbitrary inodes across the filesystem or just a subset
>> from a directory? Arbitrary inodes may potentially throw up locking issues.
> I was thinking about inodes returned from readdir(), but the syscall
> would be much more useful if it could handle arbitrary inodes. For
> example, if directories are small then it may be more efficient to
> aggregate inodes from multiple directories for each prefetch syscall.
> I can't really think of any locking issues that could exist with
> "arbitrary list of inodes" that couldn't be hit by having a directory
> with hard links to the same list of inodes, so this is something that
> needs to be handled by the filesystem anyway.
>
> Since this would be an advisory syscall (i.e. it doesn't really
> return anything and cannot guarantee that all the inodes will be in
> memory), then if the filesystem is having trouble prefetching the
> inodes (e.g. invalid inode number(s) or lock ordering or contention
> issues) it could always bail out and leave it to stat() to actually
> fetch the inodes into memory when accessed.
>
> There is no way it would be sane to keep inodes locked in the kernel
> after prefetch, in case the "stat" never arrives, so the best it can
> do is cache the inodes in memory (on the client for network fs), and
> it is possible cache pressure or lock contention drops them from cache.
>
> There are always going to be race conditions even if limited to a
> single directory (e.g. another client modifies the inode after calling
> dirreadahead(), but before calling stat()) that need to be handled.
>
> I think there are a lot of benefits that could be had by the generic
> syscall, possibly similar to what XFS is doing with the "bulkstat"
> interfaces that Dave always mentions. This would be much more so for
> cases were you don't want to stat all of the entries in a directory.
>
Yes, it probably would be possible to do a kind of bulk lookup like
that, although it would also be more complicated locking-wise if we
allowed arbitrary inode numbers, rather than doing it on a per directory
basis. I can see arguments for both approaches though,

Steve.

2014-07-31 03:18:17

by NeilBrown

[permalink] [raw]
Subject: Re: [RFC PATCH 0/2] dirreadahead system call

On Fri, 25 Jul 2014 12:37:29 -0500 Abhi Das <[email protected]> wrote:

> This system call takes 3 arguments:
> fd - file descriptor of the directory being readahead
> *offset - offset in dir from which to resume. This is updated
> as we move along in the directory
> count - The max number of entries to readahead
>
> The syscall is supposed to read upto 'count' entries starting at
> '*offset' and cache the inodes corresponding to those entries. It
> returns a negative error code or a positive number indicating
> the number of inodes it has issued readaheads for. It also
> updates the '*offset' value so that repeated calls to dirreadahead
> can resume at the right location. Returns 0 when there are no more
> entries left.

Hi Abhi,

I like the idea of enhanced read-ahead on a directory.
It isn't clear to me why you have included these particular fields in the
interface though.

- why have an 'offset'? Why not just use the current offset of the
directory 'fd'?
- Why have a count? How would a program choose what count to give?

Maybe you imagine using 'getdents' first to get a list of names, then
selectively calling 'dirreadahead' on the offsets of the names you are
interested it? That would be racy as names can be added and removed which
might change offsets. So maybe you have another reason?

I would like to suggest an alternate interface (I love playing the API
game....).

1/ Add a flag to 'fstatat' AT_EXPECT_MORE.
If the pathname does not contain a '/', then the 'dirfd' is marked
to indicate that stat information for all names returned by getdents will
be wanted. The filesystem can choose to optimise that however it sees
fit.

2/ Add a flag to 'fstatat' AT_NONBLOCK.
This tells the filesystem that you want this information, so if it can
return it immediately it should, and if not it should start pulling it
into cache. Possibly this should be two flags: AT_NONBLOCK just avoids
any IO, and AT_ASYNC instigates IO even if NONBLOCK is set.

Then an "ls -l" could use AT_EXPECT_MORE and then just stat each name.
An "ls -l *.c", might avoid AT_EXPECT_MORE, but would use AT_NONBLOCK
against all names, then try again with all the names that returned
EWOULDBLOCK the first time.
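
A sketch of that "ls -l" pattern using the proposed (non-existent) flags;
the flag values here are placeholders invented purely for illustration:

/* One reading of the proposal above: mark the dirfd via a pathname with
 * no '/' (here "."), then stat each name as usual.  AT_EXPECT_MORE and
 * AT_NONBLOCK do not exist; the values are placeholders. */
#define _GNU_SOURCE
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>

#ifndef AT_EXPECT_MORE
#define AT_EXPECT_MORE 0x10000    /* placeholder value, not allocated */
#endif
#ifndef AT_NONBLOCK
#define AT_NONBLOCK    0x20000    /* placeholder value, not allocated */
#endif

void list_long(int dfd, DIR *dir)
{
    struct dirent *de;
    struct stat st;

    /* Hint that stat information for every name returned by getdents
     * will be wanted. */
    fstatat(dfd, ".", &st, AT_EXPECT_MORE);

    while ((de = readdir(dir)) != NULL) {
        if (fstatat(dfd, de->d_name, &st, 0) == 0)
            printf("%s %lld\n", de->d_name, (long long)st.st_size);
    }

    /* The "ls -l *.c" variant would instead skip AT_EXPECT_MORE, pass
     * AT_NONBLOCK on each fstatat() and retry whatever returned
     * EWOULDBLOCK. */
}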


I would really like to see the 'xstat' syscall too, but there is no point
having both "xstat" and "fxstat". Follow the model of "fstatat" and provide
just "fxstatat" which can do both.
With fxstatat, AT_EXPECT_MORE would tell the dirfd exactly which attributes
would be wanted so it can fetch only that which is desired.

I'm not very keen on the xgetdents idea of including name information and
stat information into the one syscall - I would prefer getdents and xstat be
kept separate. Of course if a genuine performance cost of the separation can
be demonstrated, I could well change my mind.

It does, however, have the advantage that the kernel doesn't need to worry
about how long read-ahead data needs to be kept, and the application doesn't
need to worry about how soon to retry an fstatat which failed with
EWOULDBLOCK.

Thanks for raising this issue again. I hope it gets fixed one day...

NeilBrown


Attachments:
signature.asc (828.00 B)

2014-07-31 03:32:04

by Dave Chinner

[permalink] [raw]
Subject: Re: [RFC PATCH 1/2] fs: Add dirreadahead syscall and VFS hooks

On Tue, Jul 29, 2014 at 10:21:50AM +0200, Michael Kerrisk wrote:
> [CC+=linux-api]
>
> On Fri, Jul 25, 2014 at 7:37 PM, Abhi Das <[email protected]> wrote:
> > Also adds a void *opaque field to struct dir_context that can be
> > used by filesystems to temporarily store any context as this
> > struct gets passed around in the fs.

So the prototype is:

int dir_readahead(int fd, off64_t offset, unsigned int count);

Why do we need a new syscall for this?

$ man 2 readahead
....
ssize_t readahead(int fd, off64_t offset, size_t count);
....
EINVAL fd does not refer to a file type to which readahead() can be applied.
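
A minimal sketch of reusing that existing entry point on a directory fd
(on current kernels this is expected to fail with EINVAL for directories,
as the excerpt above notes):

/* Call the existing readahead(2) on a directory fd.  Today this returns
 * EINVAL for directories; the signature already matches what a directory
 * readahead would need. */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    int fd = open(argc > 1 ? argv[1] : ".", O_RDONLY | O_DIRECTORY);

    if (fd < 0)
        return 1;
    if (readahead(fd, 0, 128 * 1024) != 0)
        fprintf(stderr, "readahead: %s\n", strerror(errno));
    return 0;
}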


Cheers,

Dave.
--
Dave Chinner
[email protected]

2014-07-31 04:49:15

by Dave Chinner

[permalink] [raw]
Subject: Re: [RFC PATCH 0/2] dirreadahead system call

On Mon, Jul 28, 2014 at 03:19:31PM -0600, Andreas Dilger wrote:
> On Jul 28, 2014, at 6:52 AM, Abhijith Das <[email protected]> wrote:
> > On July 26, 2014 12:27:19 AM "Andreas Dilger" <[email protected]> wrote:
> >> Is there a time when this doesn't get called to prefetch entries in
> >> readdir() order? It isn't clear to me what benefit there is of returning
> >> the entries to userspace instead of just doing the statahead implicitly
> >> in the kernel?
> >>
> >> The Lustre client has had what we call "statahead" for a while,
> >> and similar to regular file readahead it detects the sequential access
> >> pattern for readdir() + stat() in readdir() order (taking into account if
> >> ".*"
> >> entries are being processed or not) and starts fetching the inode
> >> attributes asynchronously with a worker thread.
> >
> > Does this heuristic work well in practice? In the use case we were trying to
> > address, a Samba server is aware beforehand if it is going to stat all the
> > inodes in a directory.
>
> Typically this works well for us, because this is done by the Lustre
> client, so the statahead is hiding the network latency of the RPCs to
> fetch attributes from the server. I imagine the same could be seen with
> GFS2. I don't know if this approach would help very much for local
> filesystems because the latency is low.
>
> >> This syscall might be more useful if userspace called readdir() to get
> >> the dirents and then passed the kernel the list of inode numbers
> >> to prefetch before starting on the stat() calls. That way, userspace
> >> could generate an arbitrary list of inodes (e.g. names matching a
> >> regexp) and the kernel doesn't need to guess if every inode is needed.
> >
> > Were you thinking arbitrary inodes across the filesystem or just a subset
> > from a directory? Arbitrary inodes may potentially throw up locking issues.
>
> I was thinking about inodes returned from readdir(), but the syscall
> would be much more useful if it could handle arbitrary inodes.

I'm not sure we can do that. The only way to safely identify a
specific inode in the filesystem from userspace is via a filehandle.
Plain inode numbers are susceptible to TOCTOU race conditions that
the kernel cannot resolve. Also, lookup by inode number bypasses
directory access permissions, so is not something we would expose
to arbitrary unprivileged users.

> There are always going to be race conditions even if limited to a
> single directory (e.g. another client modifies the inode after calling
> dirreadahead(), but before calling stat()) that need to be handled.

Unlink/realloc to a different directory with different access
permissions is the big issue.

> I think there are a lot of benefits that could be had by the generic
> syscall, possibly similar to what XFS is doing with the "bulkstat"
> interfaces that Dave always mentions. This would be much more so for
> cases were you don't want to stat all of the entries in a directory.

Bulkstat is not really suited to this - it gets its speed
specifically by avoiding directory traversal to find inodes. Hence
it is a root-only operation because it bypasses directory-based
access restrictions and hence is really only useful for
full-filesystem traversal operations like backups, defragmentation,
etc.

Bulkstat output also contains enough information to construct a
valid file handle in userspace and so access to inodes found via
bulkstat can be gained via the XFS open-by-handle interfaces. Again,
this bypasses permissions checking and hence is a root-only
operation. It does, however, avoid TOCTOU races because the open-by-handle
will fail if the inode is unlinked and reallocated between the
bulkstat call and the open-by-handle as the generation number in the
handle will no longer match that of the inode.

Cheers,

Dave.



--
Dave Chinner
[email protected]

2014-07-31 11:19:51

by Andreas Dilger

[permalink] [raw]
Subject: Re: [RFC PATCH 0/2] dirreadahead system call

On Jul 31, 2014, at 6:49, Dave Chinner <[email protected]> wrote:
>
>> On Mon, Jul 28, 2014 at 03:19:31PM -0600, Andreas Dilger wrote:
>>> On Jul 28, 2014, at 6:52 AM, Abhijith Das <[email protected]> wrote:
>>> On July 26, 2014 12:27:19 AM "Andreas Dilger" <[email protected]> wrote:
>>>> Is there a time when this doesn't get called to prefetch entries in
>>>> readdir() order? It isn't clear to me what benefit there is of returning
>>>> the entries to userspace instead of just doing the statahead implicitly
>>>> in the kernel?
>>>>
>>>> The Lustre client has had what we call "statahead" for a while,
>>>> and similar to regular file readahead it detects the sequential access
>>>> pattern for readdir() + stat() in readdir() order (taking into account if
>>>> ".*"
>>>> entries are being processed or not) and starts fetching the inode
>>>> attributes asynchronously with a worker thread.
>>>
>>> Does this heuristic work well in practice? In the use case we were trying to
>>> address, a Samba server is aware beforehand if it is going to stat all the
>>> inodes in a directory.
>>
>> Typically this works well for us, because this is done by the Lustre
>> client, so the statahead is hiding the network latency of the RPCs to
>> fetch attributes from the server. I imagine the same could be seen with
>> GFS2. I don't know if this approach would help very much for local
>> filesystems because the latency is low.
>>
>>>> This syscall might be more useful if userspace called readdir() to get
>>>> the dirents and then passed the kernel the list of inode numbers
>>>> to prefetch before starting on the stat() calls. That way, userspace
>>>> could generate an arbitrary list of inodes (e.g. names matching a
>>>> regexp) and the kernel doesn't need to guess if every inode is needed.
>>>
>>> Were you thinking arbitrary inodes across the filesystem or just a subset
>>> from a directory? Arbitrary inodes may potentially throw up locking issues.
>>
>> I was thinking about inodes returned from readdir(), but the syscall
>> would be much more useful if it could handle arbitrary inodes.
>
> I'm not sure we can do that. The only way to safely identify a
> specific inode in the filesystem from userspace is via a filehandle.
> Plain inode numbers are susceptible to TOCTOU race conditions that
> the kernel cannot resolve. Also, lookup by inode number bypasses
> directory access permissions, so is not something we would expose
> to arbitrary unprivileged users.

None of these issues are relevant in the API that I'm thinking about.
The syscall just passes the list of inode numbers to be prefetched
into kernel memory, and then stat() is used to actually get the data into
userspace (or whatever other operation is to be done on them),
so there is no danger if the wrong inode is prefetched. If the inode
number is bad the filesystem can just ignore it.

>> There are always going to be race conditions even if limited to a
>> single directory (e.g. another client modifies the inode after calling
>> dirreadahead(), but before calling stat()) that need to be handled.
>
> Unlink/realloc to a different directory with different access
> permissions is the big issue.
>
>> I think there are a lot of benefits that could be had by the generic
>> syscall, possibly similar to what XFS is doing with the "bulkstat"
>> interfaces that Dave always mentions. This would be much more so for
>> cases were you don't want to stat all of the entries in a directory.
>
> Bulkstat is not really suited to this - it gets it's speed
> specifically by avoiding directory traversal to find inodes. Hence
> it is a root-only operation because it bypasses directory based
> access restrictions and hence is really only useful for
> full-filesystem traversal operations like backups, defragmentation,
> etc.

Maybe it was a mistake to mention bulkstat in this discussion. The
stat() information would not actually be returned to memory in my
proposal, just the inodes would be loaded into memory in bulk so
that the later stat() (or unlink() or whatever) can go quickly.

Cheers, Andreas

> Bulkstat output also contains enough information to construct a
> valid file handle in userspace and so access to inodes found via
> bulkstat can be gained via the XFS open-by-handle interfaces. Again,
> this bypasses permissions checking and hence is a root-only
> operation. It does, however, avoid TOCTOU races because the open-by-handle
> will fail if the inode is unlinked and reallocated between the
> bulkstat call and the open-by-handle as the generation number in the
> handle will no longer match that of the inode.
>
> Cheers,
>
> Dave.
>
>
>
> --
> Dave Chinner
> [email protected]

2014-07-31 23:53:13

by Dave Chinner

[permalink] [raw]
Subject: Re: [RFC PATCH 0/2] dirreadahead system call

On Thu, Jul 31, 2014 at 01:19:45PM +0200, Andreas Dilger wrote:
> On Jul 31, 2014, at 6:49, Dave Chinner <[email protected]> wrote:
> >
> >> On Mon, Jul 28, 2014 at 03:19:31PM -0600, Andreas Dilger wrote:
> >>> On Jul 28, 2014, at 6:52 AM, Abhijith Das <[email protected]> wrote:
> >>> On July 26, 2014 12:27:19 AM "Andreas Dilger" <[email protected]> wrote:
> >>>> Is there a time when this doesn't get called to prefetch entries in
> >>>> readdir() order? It isn't clear to me what benefit there is of returning
> >>>> the entries to userspace instead of just doing the statahead implicitly
> >>>> in the kernel?
> >>>>
> >>>> The Lustre client has had what we call "statahead" for a while,
> >>>> and similar to regular file readahead it detects the sequential access
> >>>> pattern for readdir() + stat() in readdir() order (taking into account if
> >>>> ".*"
> >>>> entries are being processed or not) and starts fetching the inode
> >>>> attributes asynchronously with a worker thread.
> >>>
> >>> Does this heuristic work well in practice? In the use case we were trying to
> >>> address, a Samba server is aware beforehand if it is going to stat all the
> >>> inodes in a directory.
> >>
> >> Typically this works well for us, because this is done by the Lustre
> >> client, so the statahead is hiding the network latency of the RPCs to
> >> fetch attributes from the server. I imagine the same could be seen with
> >> GFS2. I don't know if this approach would help very much for local
> >> filesystems because the latency is low.
> >>
> >>>> This syscall might be more useful if userspace called readdir() to get
> >>>> the dirents and then passed the kernel the list of inode numbers
> >>>> to prefetch before starting on the stat() calls. That way, userspace
> >>>> could generate an arbitrary list of inodes (e.g. names matching a
> >>>> regexp) and the kernel doesn't need to guess if every inode is needed.
> >>>
> >>> Were you thinking arbitrary inodes across the filesystem or just a subset
> >>> from a directory? Arbitrary inodes may potentially throw up locking issues.
> >>
> >> I was thinking about inodes returned from readdir(), but the syscall
> >> would be much more useful if it could handle arbitrary inodes.
> >
> > I'm not sure we can do that. The only way to safely identify a
> > specific inode in the filesystem from userspace is via a filehandle.
> > Plain inode numbers are susceptible to TOCTOU race conditions that
> > the kernel cannot resolve. Also, lookup by inode number bypasses
> > directory access permissions, so is not something we would expose
> > to arbitrary unprivileged users.
>
> None of these issues are relevant in the API that I'm thinking about.
> The syscall just passes the list of inode numbers to be prefetched
> into kernel memory, and then stat() is used to actually get the data into
> userspace (or whatever other operation is to be done on them),
> so there is no danger if the wrong inode is prefetched. If the inode
> number is bad the filesystem can just ignore it.

Which means the filesystem has to treat the inode number as
potentially hostile, i.e. it cannot be trusted to be correct, and so
must take slow paths to validate the inode numbers. This adds
*significant* overhead to the readahead path for some filesystems:
readahead is only a win if it is low cost.

For example, on XFS every untrusted inode number lookup requires an
inode btree lookup to validate the inode is actually valid on disk
and that it is allocated and has references. That lookup serialises
against inode allocation/freeing as well as other lookups. In
comparison, when using a trusted inode number from a directory
lookup within the kernel, we only need to do a couple of shift and
mask operations to convert it to a disk address and we are good to
go.

i.e. the difference is at least 5 orders of magnitude higher CPU usage
for an "inode number readahead" syscall versus a "directory
readahead" syscall, it has significant serialisation issues and it
can stall other modification/lookups going on at the same time.
That's *horrible behaviour* for a speculative readahead operation,
but because the inode numbers are untrusted, we can't avoid it.

So, again, it's way more overhead than userspace just calling
stat() asynchronously on many files at once as readdir/getdents
returns dirents from the kernel to speed up cache population.

That's my main issue with this patchset - it's implementing
something in kernelspace that can *easily* be done generically in
userspace without introducing all sorts of nasty corner cases that
we have to handle in the kernel. We only add functionality to the kernel if there's a
compelling reason to do it in kernelspace, and right now I just
don't see any numbers that justify adding readdir+stat() readahead
or inode number based cache population in kernelspace.

Before we add *any* syscall for directory readahead, we need
comparison numbers against doing the dumb multithreaded
userspace readahead of stat() calls. If userspace can do this as
fast as the kernel can....
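
For concreteness, a minimal sketch of such a dumb multithreaded stat()
readahead is below. It is purely illustrative: NTHREADS is an arbitrary
choice and none of this has been benchmarked in this thread.

#define _GNU_SOURCE
#include <dirent.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

#define NTHREADS 16                     /* arbitrary for the example */

static char **names;
static size_t nnames;
static const char *dirpath;

/* Each worker stats every NTHREADS-th name; the results are thrown
 * away, the point is only to pull the inodes into the cache. */
static void *worker(void *arg)
{
        long stripe = (long)arg;
        char path[4096];
        struct stat st;
        size_t i;

        for (i = stripe; i < nnames; i += NTHREADS) {
                snprintf(path, sizeof(path), "%s/%s", dirpath, names[i]);
                stat(path, &st);
        }
        return NULL;
}

int main(int argc, char **argv)
{
        DIR *d;
        struct dirent *de;
        pthread_t tid[NTHREADS];
        long t;

        dirpath = argc > 1 ? argv[1] : ".";
        d = opendir(dirpath);
        if (!d)
                return 1;
        /* Collect all names first, then fan the stat() calls out. */
        while ((de = readdir(d)) != NULL) {
                names = realloc(names, (nnames + 1) * sizeof(*names));
                names[nnames++] = strdup(de->d_name);
        }
        closedir(d);

        for (t = 0; t < NTHREADS; t++)
                pthread_create(&tid[t], NULL, worker, (void *)t);
        for (t = 0; t < NTHREADS; t++)
                pthread_join(tid[t], NULL);
        return 0;
}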

Cheers,

Dave.
--
Dave Chinner
[email protected]

2014-08-01 02:11:16

by Abhi Das

[permalink] [raw]
Subject: Re: [RFC PATCH 0/2] dirreadahead system call



----- Original Message -----
> From: "Dave Chinner" <[email protected]>
> To: "Andreas Dilger" <[email protected]>
> Cc: "Abhijith Das" <[email protected]>, "LKML" <[email protected]>, "linux-fsdevel"
> <[email protected]>, [email protected]
> Sent: Thursday, July 31, 2014 6:53:06 PM
> Subject: Re: [RFC PATCH 0/2] dirreadahead system call
>
> On Thu, Jul 31, 2014 at 01:19:45PM +0200, Andreas Dilger wrote:
> > On Jul 31, 2014, at 6:49, Dave Chinner <[email protected]> wrote:
> > >
> > >> On Mon, Jul 28, 2014 at 03:19:31PM -0600, Andreas Dilger wrote:
> > >>> On Jul 28, 2014, at 6:52 AM, Abhijith Das <[email protected]> wrote:
> > >>> On July 26, 2014 12:27:19 AM "Andreas Dilger" <[email protected]> wrote:
> > >>>> Is there a time when this doesn't get called to prefetch entries in
> > >>>> readdir() order? It isn't clear to me what benefit there is of
> > >>>> returning
> > >>>> the entries to userspace instead of just doing the statahead
> > >>>> implicitly
> > >>>> in the kernel?
> > >>>>
> > >>>> The Lustre client has had what we call "statahead" for a while,
> > >>>> and similar to regular file readahead it detects the sequential access
> > >>>> pattern for readdir() + stat() in readdir() order (taking into account
> > >>>> if
> > >>>> ".*"
> > >>>> entries are being processed or not) and starts fetching the inode
> > >>>> attributes asynchronously with a worker thread.
> > >>>
> > >>> Does this heuristic work well in practice? In the use case we were
> > >>> trying to
> > >>> address, a Samba server is aware beforehand if it is going to stat all
> > >>> the
> > >>> inodes in a directory.
> > >>
> > >> Typically this works well for us, because this is done by the Lustre
> > >> client, so the statahead is hiding the network latency of the RPCs to
> > >> fetch attributes from the server. I imagine the same could be seen with
> > >> GFS2. I don't know if this approach would help very much for local
> > >> filesystems because the latency is low.
> > >>
> > >>>> This syscall might be more useful if userspace called readdir() to get
> > >>>> the dirents and then passed the kernel the list of inode numbers
> > >>>> to prefetch before starting on the stat() calls. That way, userspace
> > >>>> could generate an arbitrary list of inodes (e.g. names matching a
> > >>>> regexp) and the kernel doesn't need to guess if every inode is needed.
> > >>>
> > >>> Were you thinking arbitrary inodes across the filesystem or just a
> > >>> subset
> > >>> from a directory? Arbitrary inodes may potentially throw up locking
> > >>> issues.
> > >>
> > >> I was thinking about inodes returned from readdir(), but the syscall
> > >> would be much more useful if it could handle arbitrary inodes.
> > >
> > > I'm not sure we can do that. The only way to safely identify a
> > > specific inode in the filesystem from userspace is via a filehandle.
> > > Plain inode numbers are susceptible to TOCTOU race conditions that
> > > the kernel cannot resolve. Also, lookup by inode number bypasses
> > > directory access permissions, so is not something we would expose
> > > to arbitrary unprivileged users.
> >
> > None of these issues are relevant in the API that I'm thinking about.
> > The syscall just passes the list of inode numbers to be prefetched
> > into kernel memory, and then stat() is used to actually get the data into
> > userspace (or whatever other operation is to be done on them),
> > so there is no danger if the wrong inode is prefetched. If the inode
> > number is bad the filesystem can just ignore it.
>
> Which means the filesystem has to treat the inode number as
> potentially hostile. i.e. it can not be trusted to be correct and so
> must take slow paths to validate the inode numbers. This adds
> *significant* overhead to the readahead path for some filesystems:
> readahead is only a win if it is low cost.
>
> For example, on XFS every untrusted inode number lookup requires an
> inode btree lookup to validate the inode is actually valid on disk
> and that it is allocated and has references. That lookup serialises
> against inode allocation/freeing as well as other lookups. In
> comparison, when using a trusted inode number from a directory
> lookup within the kernel, we only need to do a couple of shift and
> mask operations to convert it to a disk address and we are good to
> go.
>
> i.e. the difference is at least 5 orders of magnitude higher CPU usage
> for an "inode number readahead" syscall versus a "directory
> readahead" syscall, it has significant serialisation issues and it
> can stall other modification/lookups going on at the same time.
> That's *horrible behaviour* for a speculative readahead operation,
> but because the inode numbers are untrusted, we can't avoid it.
>
> So, again, it's way more overhead than userspace just calling
> stat() asynchronously on many files at once as readdir/getdents
> returns dirents from the kernel to speed up cache population.
>
> That's my main issue with this patchset - it's implementing
> something in kernelspace that can *easily* be done generically in
> userspace without introducing all sorts of nasty corner cases that
> we have to handle in the kernel. We only add functionality to the kernel if
> there's a
> compelling reason to do it in kernelspace, and right now I just
> don't see any numbers that justify adding readdir+stat() readahead
> or inode number based cache population in kernelspace.
>
> Before we add *any* syscall for directory readahead, we need
> comparison numbers against doing the dumb multithreaded
> userspace readahead of stat() calls. If userspace can do this as
> fast as the kernel can....
>

This was next on my to-do list to see how an all-userspace solution compares
with doing this in the kernel. I'll post some results as soon as I can find
some time to play with this stuff again.

Cheers!
--Abhi

2014-08-01 02:21:18

by Abhi Das

[permalink] [raw]
Subject: Re: [RFC PATCH 0/2] dirreadahead system call



----- Original Message -----
> From: "NeilBrown" <[email protected]>
> To: "Abhi Das" <[email protected]>
> Cc: [email protected], [email protected], [email protected]
> Sent: Wednesday, July 30, 2014 10:18:05 PM
> Subject: Re: [RFC PATCH 0/2] dirreadahead system call
>
> On Fri, 25 Jul 2014 12:37:29 -0500 Abhi Das <[email protected]> wrote:
>
> > This system call takes 3 arguments:
> > fd - file descriptor of the directory being readahead
> > *offset - offset in dir from which to resume. This is updated
> > as we move along in the directory
> > count - The max number of entries to readahead
> >
> > The syscall is supposed to read upto 'count' entries starting at
> > '*offset' and cache the inodes corresponding to those entries. It
> > returns a negative error code or a positive number indicating
> > the number of inodes it has issued readaheads for. It also
> > updates the '*offset' value so that repeated calls to dirreadahead
> > can resume at the right location. Returns 0 when there are no more
> > entries left.
>
> Hi Abhi,
>
> I like the idea of enhanced read-ahead on a directory.
> It isn't clear to me why you have included these particular fields in the
> interface though.
>
> - why have an 'offset'? Why not just use the current offset of the
> directory 'fd'?

The idea was that we didn't want a syscall like readahead mucking with the
file pointer, since the same fd might be used to do getdents().

> - Why have a count? How would a program choose what count to give?

If a program knows that it's only going to work on a subset of files at a time,
it can use the count value to read ahead only a small number of inodes at once
instead of reading ahead the entire directory.
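
For illustration, batched use of the syscall as currently proposed might
look like the sketch below. It assumes a kernel carrying this RFC patch so
that __NR_dirreadahead exists; BATCH is an arbitrary number picked for the
example.

/* Sketch: prefetch a directory in batches with the proposed syscall. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#define BATCH 256

static long dirreadahead(int fd, loff_t *offset, unsigned int count)
{
        return syscall(__NR_dirreadahead, fd, offset, count);
}

int prefetch_dir(const char *path)
{
        int fd = open(path, O_RDONLY | O_DIRECTORY);
        loff_t off = 0;
        long n;

        if (fd < 0)
                return -1;
        /*
         * Each call reads ahead up to BATCH entries starting at 'off' and
         * advances 'off'; a return of 0 means no entries are left, so the
         * caller can do its real work on each batch in between calls.
         */
        while ((n = dirreadahead(fd, &off, BATCH)) > 0)
                ; /* ... operate on the batch that was just read ahead ... */
        close(fd);
        return n < 0 ? -1 : 0;
}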

That said, this interface is not set in stone, and we are exploring ways to
inform the kernel of the inodes we are interested in reading ahead.

>
> Maybe you imagine using 'getdents' first to get a list of names, then
> selectively calling 'dirreadahead' on the offsets of the names you are
> interested it? That would be racy as names can be added and removed which
> might change offsets. So maybe you have another reason?
>
> I would like to suggest an alternate interface (I love playing the API
> game....).
>
> 1/ Add a flag to 'fstatat' AT_EXPECT_MORE.
> If the pathname does not contain a '/', then the 'dirfd' is marked
> to indicate that stat information for all names returned by getdents will
> be wanted. The filesystem can choose to optimise that however it sees
> fit.
>
> 2/ Add a flag to 'fstatat' AT_NONBLOCK.
> This tells the filesystem that you want this information, so if it can
> return it immediately it should, and if not it should start pulling it
> into cache. Possibly this should be two flags: AT_NONBLOCK just avoids
> any IO, and AT_ASYNC instigates IO even if NONBLOCK is set.
>
> Then an "ls -l" could use AT_EXPECT_MORE and then just stat each name.
> An "ls -l *.c", might avoid AT_EXPECT_MORE, but would use AT_NONBLOCK
> against all names, then try again with all the names that returned
> EWOULDBLOCK the first time.
>
>
> I would really like to see the 'xstat' syscall too, but there is no point
> having both "xstat" and "fxstat". Follow the model of "fstatat" and provide
> just "fxstatat" which can do both.
> With fxstatat, AT_EXPECT_MORE would tell the dirfd exactly which attributes
> would be wanted so it can fetch only that which is desired.
>
> I'm not very keen on the xgetdents idea of including name information and
> stat information into the one syscall - I would prefer getdents and xstat be
> kept separate. Of course if a genuine performance cost of the separate can
> be demonstrated, I could well change my mind.
>
> It does, however, have the advantage that the kernel doesn't need to worry
> about how long read-ahead data needs to be kept, and the application doesn't
> need to worry about how soon to retry an fstatat which failed with
> EWOULDBLOCK.
>
> Thanks for raising this issue again. I hope it gets fixed one day...
>
> NeilBrown
>
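
To make the flags proposed above a little more concrete, here is a rough
sketch of the two 'ls -l' patterns NeilBrown describes. AT_EXPECT_MORE and
AT_NONBLOCK do not exist; the flag values below are placeholders invented
for the example and the kernel side is imaginary.

/* Sketch of the proposed (hypothetical) fstatat() flags. */
#include <dirent.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>

#define AT_EXPECT_MORE  0x10000         /* placeholder value */
#define AT_NONBLOCK     0x20000         /* placeholder value */

/* "ls -l": mark the dirfd as "stat of every name is coming", then the
 * plain stat() loop lets the filesystem prefetch however it sees fit. */
static void list_all(DIR *d)
{
        struct dirent *de;
        struct stat st;
        int flags = AT_EXPECT_MORE;     /* only needed on the first call */

        while ((de = readdir(d)) != NULL) {
                fstatat(dirfd(d), de->d_name, &st, flags);
                flags = 0;
        }
}

/* "ls -l *.c": first try with AT_NONBLOCK to kick off readahead without
 * waiting; names that come back EWOULDBLOCK get a blocking retry later. */
static int stat_selected(int dfd, const char *name, struct stat *st)
{
        if (fstatat(dfd, name, st, AT_NONBLOCK) == 0)
                return 0;
        if (errno != EWOULDBLOCK)
                return -1;
        return fstatat(dfd, name, st, 0);
}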

2014-10-21 05:22:37

by Abhi Das

[permalink] [raw]
Subject: Re: [RFC PATCH 0/2] dirreadahead system call



----- Original Message -----
> From: "Dave Chinner" <[email protected]>
> To: "Andreas Dilger" <[email protected]>
> Cc: "Abhijith Das" <[email protected]>, "LKML" <[email protected]>, "linux-fsdevel"
> <[email protected]>, [email protected]
> Sent: Thursday, July 31, 2014 6:53:06 PM
> Subject: Re: [RFC PATCH 0/2] dirreadahead system call
>
> On Thu, Jul 31, 2014 at 01:19:45PM +0200, Andreas Dilger wrote:
> > On Jul 31, 2014, at 6:49, Dave Chinner <[email protected]> wrote:
> > >
> > >> On Mon, Jul 28, 2014 at 03:19:31PM -0600, Andreas Dilger wrote:
> > >>> On Jul 28, 2014, at 6:52 AM, Abhijith Das <[email protected]> wrote:
> > >>> On July 26, 2014 12:27:19 AM "Andreas Dilger" <[email protected]> wrote:
> > >>>> Is there a time when this doesn't get called to prefetch entries in
> > >>>> readdir() order? It isn't clear to me what benefit there is of
> > >>>> returning
> > >>>> the entries to userspace instead of just doing the statahead
> > >>>> implicitly
> > >>>> in the kernel?
> > >>>>
> > >>>> The Lustre client has had what we call "statahead" for a while,
> > >>>> and similar to regular file readahead it detects the sequential access
> > >>>> pattern for readdir() + stat() in readdir() order (taking into account
> > >>>> if
> > >>>> ".*"
> > >>>> entries are being processed or not) and starts fetching the inode
> > >>>> attributes asynchronously with a worker thread.
> > >>>
> > >>> Does this heuristic work well in practice? In the use case we were
> > >>> trying to
> > >>> address, a Samba server is aware beforehand if it is going to stat all
> > >>> the
> > >>> inodes in a directory.
> > >>
> > >> Typically this works well for us, because this is done by the Lustre
> > >> client, so the statahead is hiding the network latency of the RPCs to
> > >> fetch attributes from the server. I imagine the same could be seen with
> > >> GFS2. I don't know if this approach would help very much for local
> > >> filesystems because the latency is low.
> > >>
> > >>>> This syscall might be more useful if userspace called readdir() to get
> > >>>> the dirents and then passed the kernel the list of inode numbers
> > >>>> to prefetch before starting on the stat() calls. That way, userspace
> > >>>> could generate an arbitrary list of inodes (e.g. names matching a
> > >>>> regexp) and the kernel doesn't need to guess if every inode is needed.
> > >>>
> > >>> Were you thinking arbitrary inodes across the filesystem or just a
> > >>> subset
> > >>> from a directory? Arbitrary inodes may potentially throw up locking
> > >>> issues.
> > >>
> > >> I was thinking about inodes returned from readdir(), but the syscall
> > >> would be much more useful if it could handle arbitrary inodes.
> > >
> > > I'm not sure we can do that. The only way to safely identify a
> > > specific inode in the filesystem from userspace is via a filehandle.
> > > Plain inode numbers are susceptible to TOCTOU race conditions that
> > > the kernel cannot resolve. Also, lookup by inode number bypasses
> > > directory access permissions, so is not something we would expose
> > > to arbitrary unprivileged users.
> >
> > None of these issues are relevant in the API that I'm thinking about.
> > The syscall just passes the list of inode numbers to be prefetched
> > into kernel memory, and then stat() is used to actually get the data into
> > userspace (or whatever other operation is to be done on them),
> > so there is no danger if the wrong inode is prefetched. If the inode
> > number is bad the filesystem can just ignore it.
>
> Which means the filesystem has to treat the inode number as
> potentially hostile. i.e. it can not be trusted to be correct and so
> must take slow paths to validate the inode numbers. This adds
> *significant* overhead to the readahead path for some filesystems:
> readahead is only a win if it is low cost.
>
> For example, on XFS every untrusted inode number lookup requires an
> inode btree lookup to validate the inode is actually valid on disk
> and that it is allocated and has references. That lookup serialises
> against inode allocation/freeing as well as other lookups. In
> comparison, when using a trusted inode number from a directory
> lookup within the kernel, we only need to do a couple of shift and
> mask operations to convert it to a disk address and we are good to
> go.
>
> i.e. the difference is at least 5 orders of magnitude higher CPU usage
> for an "inode number readahead" syscall versus a "directory
> readahead" syscall, it has significant serialisation issues and it
> can stall other modification/lookups going on at the same time.
> That's *horrible behaviour* for a speculative readahead operation,
> but because the inode numbers are untrusted, we can't avoid it.
>
> So, again, it's way more overhead than userspace just calling
> stat() asynchronously on many files at once as readdir/getdents
> returns dirents from the kernel to speed up cache population.
>
> That's my main issue with this patchset - it's implementing
> something in kernelspace that can *easily* be done generically in
> userspace without introducing all sorts of nasty corner cases that
> we have to handle in the kernel. We only add functionality to the kernel if
> there's a
> compelling reason to do it in kernelspace, and right now I just
> don't see any numbers that justify adding readdir+stat() readahead
> or inode number based cache population in kernelspace.
>
> Before we add *any* syscall for directory readahead, we need
> comparison numbers against doing the dumb multithreaded
> userspace readahead of stat() calls. If userspace can do this as
> fast as the kernel can....
>

Hi Dave/all,

I finally got around to playing with the multithreaded userspace readahead
idea and the results are quite promising. I tried to mimic what my kernel
readahead patch did with this userspace program (userspace_ra.c).
Source code here: https://www.dropbox.com/s/am9q26ndoiw1cdr/userspace_ra.c?dl=0

Each thread has an associated buffer into which a chunk of directory
entries is read using getdents(). Each thread then sorts its entries in
inode number order (for GFS2, this is also their disk block order) and
proceeds to pull those inodes into cache in that order by issuing open(2)
syscalls against them. In my tests, I backgrounded this program and issued
an 'ls -l' on the dir in question. I did the same following the kernel
dirreadahead syscall as well.
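
For reference, the per-thread core of that idea in simplified form is
below. This is a sketch, not the actual userspace_ra.c; the caller is
assumed to supply a dirfd positioned at this thread's chunk plus buffers
large enough for it.

/* Simplified sketch of one readahead thread: read a chunk of dirents
 * with getdents64(), sort them by inode number, then open()+close()
 * each entry to pull its inode into cache. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

struct linux_dirent64 {         /* matches the kernel's getdents64 layout */
        unsigned long long d_ino;
        long long          d_off;
        unsigned short     d_reclen;
        unsigned char      d_type;
        char               d_name[];
};

struct ent { unsigned long long ino; char name[256]; };

static int cmp_ino(const void *a, const void *b)
{
        const struct ent *x = a, *y = b;
        return (x->ino > y->ino) - (x->ino < y->ino);
}

/* 'ents' must have room for every entry that fits in 'buf'. */
static void readahead_chunk(int dirfd, char *buf, size_t bufsz,
                            struct ent *ents)
{
        long nread;
        int pos, n = 0, i;

        nread = syscall(SYS_getdents64, dirfd, buf, bufsz);
        if (nread <= 0)
                return;
        for (pos = 0; pos < nread; ) {
                struct linux_dirent64 *d = (void *)(buf + pos);

                ents[n].ino = d->d_ino;
                snprintf(ents[n].name, sizeof(ents[n].name), "%s", d->d_name);
                n++;
                pos += d->d_reclen;
        }
        /* Sort by inode number; on GFS2 this is also disk block order. */
        qsort(ents, n, sizeof(*ents), cmp_ino);
        for (i = 0; i < n; i++) {
                int fd = openat(dirfd, ents[i].name, O_RDONLY | O_NONBLOCK);
                if (fd >= 0)
                        close(fd);      /* inode is now in cache */
        }
}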

I did not manage to test very many parameter combinations for
userspace_ra and SYS_dirreadahead because the test matrix got pretty big and
time consuming. However, I did notice that without sorting, userspace_ra did
not perform as well in some of my tests. I haven't investigated that yet,
so the numbers shown here are all with sorting enabled.

For a directory with 100000 files,
a) simple 'ls -l' took 14m11s
b) SYS_dirreadahead + 'ls -l' took 3m9s, and
c) userspace_ra (1M buffer/thread, 32 threads) took 1m42s

https://www.dropbox.com/s/85na3hmo3qrtib1/ra_vs_u_ra_vs_ls.jpg?dl=0 is a graph
with a few more data points. Along with the data for 'ls -l' and
SYS_dirreadahead, it has six data series for userspace_ra at each directory
size (10K, 100K and 200K files), labelled u_ra:XXX,YYY, where XXX is the
buffer size (64K or 1M) and YYY is the number of threads (4, 16 or 32).

Cheers!
--Abhi

2014-11-10 03:41:31

by Abhi Das

[permalink] [raw]
Subject: Re: [RFC PATCH 0/2] dirreadahead system call


> >
>
> Hi Dave/all,
>
> I finally got around to playing with the multithreaded userspace readahead
> idea and the results are quite promising. I tried to mimic what my kernel
> readahead patch did with this userspace program (userspace_ra.c)
> Source code here:
> https://www.dropbox.com/s/am9q26ndoiw1cdr/userspace_ra.c?dl=0
>
> Each thread has an associated buffer into which a chunk of directory
> entries are read in using getdents(). Each thread then sorts the entries in
> inode number order (for GFS2, this is also their disk block order) and
> proceeds
> to cache in the inodes in that order by issuing open(2) syscalls against
> them.
> In my tests, I backgrounded this program and issued an 'ls -l' on the dir
> in question. I did the same following the kernel dirreadahead syscall as
> well.
>
> I did not manage to test out too many parameter combinations for both
> userspace_ra and SYS_dirreadahead because the test matrix got pretty big and
> time consuming. However, I did notice that without sorting, userspace_ra did
> not perform as well in some of my tests. I haven't investigated that yet,
> so the numbers shown here are all with sorting enabled.
>
> For a directory with 100000 files,
> a) simple 'ls -l' took 14m11s
> b) SYS_dirreadahead + 'ls -l' took 3m9s, and
> c) userspace_ra (1M buffer/thread, 32 threads) took 1m42s
>
> https://www.dropbox.com/s/85na3hmo3qrtib1/ra_vs_u_ra_vs_ls.jpg?dl=0 is a
> graph
> that contains a few more data points. In the graph, along with data for 'ls
> -l'
> and SYS_dirreadahead, there are six data series for userspace_ra for each
> directory size (10K, 100K and 200K files). i.e. u_ra:XXX,YYY, where XXX is
> one
> of (64K, 1M) buffer size and YYY is one of (4, 16, 32) threads.
>

Hi,

Here are some more numbers for larger directories and it seems like userspace
readahead scales well and is still a good option.

I've chosen the best-performing runs for kernel readahead and userspace readahead. I
have data for runs with different parameters (buffer size, number of threads, etc)
that I can provide, if anybody's interested.

The numbers here are total elapsed times, in seconds, for the readahead plus
'ls -l' operations to complete.

                                                        #files in testdir
                                                  50k    100k    200k    500k      1m
------------------------------------------------------------------------------------
Readdir 'ls -l'                                    11     849    1873    5024   10365
Kernel readahead + 'ls -l' (best case)              7     214     814    2330    4900
Userspace MT readahead + 'ls -l' (best case)       12      99     239    1351    4761

Cheers!
--Abhi

2014-11-10 22:23:09

by Andreas Dilger

[permalink] [raw]
Subject: Re: [RFC PATCH 0/2] dirreadahead system call

On Nov 9, 2014, at 8:41 PM, Abhijith Das <[email protected]> wrote:
>> Hi Dave/all,
>>
>> I finally got around to playing with the multithreaded userspace readahead
>> idea and the results are quite promising. I tried to mimic what my kernel
>> readahead patch did with this userspace program (userspace_ra.c)
>> Source code here:
>> https://www.dropbox.com/s/am9q26ndoiw1cdr/userspace_ra.c?dl=0
>>
>> Each thread has an associated buffer into which a chunk of directory
>> entries are read in using getdents(). Each thread then sorts the
>> entries in inode number order (for GFS2, this is also their disk
>> block order) and proceeds to cache in the inodes in that order by
>> issuing open(2) syscalls against them. In my tests, I backgrounded
>> this program and issued an 'ls -l' on the dir in question. I did the
>> same following the kernel dirreadahead syscall as well.
>>
>> I did not manage to test out too many parameter combinations for both
>> userspace_ra and SYS_dirreadahead because the test matrix got pretty
>> big and time consuming. However, I did notice that without sorting,
>> userspace_ra did not perform as well in some of my tests. I haven't
>> investigated that, so numbers shown here are all with sorting enabled.

One concern is for filesystems where inode order does not necessarily
match the on-disk order. I believe that filesystems like ext4 and XFS
have matching inode/disk order, but tree-based COW filesystems like
Btrfs do not necessarily preserve this order, so sorting in userspace
will not help and may in fact hurt readahead compared to readdir order.

What filesystem(s) have you tested this on, besides GFS2?

Cheers, Andreas

>> For a directory with 100000 files,
>> a) simple 'ls -l' took 14m11s
>> b) SYS_dirreadahead + 'ls -l' took 3m9s, and
>> c) userspace_ra (1M buffer/thread, 32 threads) took 1m42s
>>
>> https://www.dropbox.com/s/85na3hmo3qrtib1/ra_vs_u_ra_vs_ls.jpg?dl=0 is a
>> graph
>> that contains a few more data points. In the graph, along with data for 'ls
>> -l'
>> and SYS_dirreadahead, there are six data series for userspace_ra for each
>> directory size (10K, 100K and 200K files). i.e. u_ra:XXX,YYY, where XXX is
>> one
>> of (64K, 1M) buffer size and YYY is one of (4, 16, 32) threads.
>>
>
> Hi,
>
> Here are some more numbers for larger directories and it seems like
> userspace readahead scales well and is still a good option.
>
> I've chosen the best-performing runs for kernel readahead and userspace
> readahead. I have data for runs with different parameters (buffer size,
> number of threads, etc) that I can provide, if anybody's interested.
>
> The numbers here are total elapsed times for the readahead plus 'ls -l'
> operations to complete.
>
>                                                        #files in testdir
>                                                  50k    100k    200k    500k      1m
> ------------------------------------------------------------------------------------
> Readdir 'ls -l'                                    11     849    1873    5024   10365
> Kernel readahead + 'ls -l' (best case)              7     214     814    2330    4900
> Userspace MT readahead + 'ls -l' (best case)       12      99     239    1351    4761
>
> Cheers!
> --Abhi



2014-11-10 22:47:46

by Abhi Das

[permalink] [raw]
Subject: Re: [RFC PATCH 0/2] dirreadahead system call



----- Original Message -----
> From: "Andreas Dilger" <[email protected]>
> To: "Abhijith Das" <[email protected]>
> Cc: "Dave Chinner" <[email protected]>, "LKML" <[email protected]>, "linux-fsdevel"
> <[email protected]>, [email protected], "Steven Whitehouse" <[email protected]>
> Sent: Monday, November 10, 2014 4:23:25 PM
> Subject: Re: [RFC PATCH 0/2] dirreadahead system call
>
> On Nov 9, 2014, at 8:41 PM, Abhijith Das <[email protected]> wrote:
> >> Hi Dave/all,
> >>
> >> I finally got around to playing with the multithreaded userspace readahead
> >> idea and the results are quite promising. I tried to mimic what my kernel
> >> readahead patch did with this userspace program (userspace_ra.c)
> >> Source code here:
> >> https://www.dropbox.com/s/am9q26ndoiw1cdr/userspace_ra.c?dl=0
> >>
> >> Each thread has an associated buffer into which a chunk of directory
> >> entries are read in using getdents(). Each thread then sorts the
> >> entries in inode number order (for GFS2, this is also their disk
> >> block order) and proceeds to cache in the inodes in that order by
> >> issuing open(2) syscalls against them. In my tests, I backgrounded
> >> this program and issued an 'ls -l' on the dir in question. I did the
> >> same following the kernel dirreadahead syscall as well.
> >>
> >> I did not manage to test out too many parameter combinations for both
> >> userspace_ra and SYS_dirreadahead because the test matrix got pretty
> >> big and time consuming. However, I did notice that without sorting,
> >> userspace_ra did not perform as well in some of my tests. I haven't
> >> investigated that, so numbers shown here are all with sorting enabled.
>
> One concern is for filesystems where inode order does not necessarily
> match the on-disk order. I believe that filesystems like ext4 and XFS
> have matching inode/disk order, but tree-based COW filesystems like
> Btrfs do not necessarily preserve this order, so sorting in userspace
> will not help and may in fact hurt readahead compared to readdir order.

Yes. In my tests, with sorting turned off, the multithreaded user program
still performs better than plain readdir + 'ls -l', but not as well as the
kernel readahead. I'd imagine that if this ends up in glibc or something,
the sorting can be turned on and off based on whether the filesystem's
inode numbers and disk block order match.
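
One way such a switch could be implemented is to key off the directory's
filesystem type, roughly as below. Which filesystems belong on the "sort"
list is only an assumption on my part, not something that has been
measured.

/* Sketch: decide whether to sort dirents by inode number before
 * prefetching, based on the filesystem the directory lives on.  The
 * whitelist below is an assumption, not a measured result. */
#include <sys/vfs.h>
#include <linux/magic.h>

#ifndef GFS2_MAGIC
#define GFS2_MAGIC 0x01161970   /* not exported in linux/magic.h */
#endif

static int should_sort_by_ino(int dirfd)
{
        struct statfs sfs;

        if (fstatfs(dirfd, &sfs) != 0)
                return 0;       /* unknown: keep plain readdir() order */

        switch (sfs.f_type) {
        case GFS2_MAGIC:
        case EXT4_SUPER_MAGIC:          /* also covers ext2/ext3 */
        case XFS_SUPER_MAGIC:
                return 1;       /* inode numbers roughly track disk order */
        case BTRFS_SUPER_MAGIC:         /* COW/tree-based: order not kept */
        default:
                return 0;
        }
}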

>
> What filesystem(s) have you tested this besides GFS?

I did some brief runs on ext4, in addition to GFS2, with similar results,
but I had to repurpose my machines, so I haven't had a chance to try out
other filesystems. Hopefully I'll get around to it soon.

>
> Cheers, Andreas
>

Cheers!
--Abhi