From: Shaohua Li <[email protected]>
Hi,
Currently blktrace isn't cgroup aware. blktrace prints the task name of the
current context, but that task isn't always in the cgroup the BIO comes from,
so we can't use the task name to identify the IO cgroup. For example,
writeback BIOs always come from the flusher thread, but the BIOs belong to
different blk cgroups. Requests can be requeued and dispatched from completely
different tasks. MD/DM are other examples. This makes it hard to use blktrace
for performance tuning with cgroups enabled.
This patchset tries to close the gap. We print cgroup fhandle info in blktrace.
Userspace can use the open_by_handle_at() syscall to find the cgroup by fhandle,
or use the name_to_handle_at() syscall to find the fhandle for a cgroup and
use a BPF program to filter blktrace output for that cgroup.
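As a rough userspace illustration of the name_to_handle_at() direction, the sketch below fetches the handle for a cgroup directory and splits it into an inode number and a generation. The helper name cgroup_dir_to_id() is hypothetical, and the assumption that the handle payload is two 32-bit words (inode number first, then generation) is taken from the description in this series rather than from any documented ABI.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: fetch the (ino, gen) pair for a cgroup directory.
 * Assumes the kernfs handle carries two 32-bit words, inode number first.
 * Returns 0 on success, -1 on any failure.
 */
int cgroup_dir_to_id(const char *path, uint32_t *ino, uint32_t *gen)
{
	struct file_handle *fh;
	int mount_id;
	int ret = -1;

	assert(path && ino && gen);

	/* Room for the header plus the largest handle the kernel returns. */
	fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
	if (!fh)
		return -1;
	fh->handle_bytes = MAX_HANDLE_SZ;

	if (name_to_handle_at(AT_FDCWD, path, fh, &mount_id, 0) == 0 &&
	    fh->handle_bytes >= 2 * sizeof(uint32_t)) {
		memcpy(ino, fh->f_handle, sizeof(*ino));
		memcpy(gen, fh->f_handle + sizeof(*ino), sizeof(*gen));
		ret = 0;
	}
	free(fh);
	return ret;
}
```

A caller would pass the path of a cgroup directory under its cgroup mount point and compare the resulting pair with the id printed in the trace.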
The first 5 patches add export operation handlers for kernfs, so userspace can
use open_by_handle_at()/name_to_handle_at() on a kernfs file. Later patches make
blktrace output cgroup info.
Thanks,
Shaohua
Shaohua Li (11):
kernfs: implement i_generation
kernfs: use idr instead of ida to manage inode number
kernfs: add an API to get kernfs node from inode number
kernfs: don't set dentry->d_fsdata
kernfs: add exportfs operations
cgroup: export fhandle info for a cgroup
blktrace: export cgroup info in trace
block: always attach cgroup info into bio
block: call __bio_free in bio_endio
blktrace: add an option to allow displaying cgroup path
block: use standard blktrace API to output cgroup info for debug notes
arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 2 +-
block/bfq-iosched.h | 16 +-
block/bio-integrity.c | 1 +
block/bio.c | 1 +
block/blk-throttle.c | 13 +-
block/cfq-iosched.c | 15 +-
fs/kernfs/dir.c | 80 +++++++---
fs/kernfs/file.c | 6 +-
fs/kernfs/inode.c | 7 +-
fs/kernfs/kernfs-internal.h | 4 +
fs/kernfs/mount.c | 107 +++++++++++--
fs/kernfs/symlink.c | 6 +-
fs/sysfs/mount.c | 2 +-
include/linux/blk-cgroup.h | 16 +-
include/linux/blktrace_api.h | 12 +-
include/linux/cgroup-defs.h | 2 +
include/linux/cgroup.h | 14 ++
include/linux/kernfs.h | 23 ++-
include/uapi/linux/blktrace_api.h | 3 +
kernel/cgroup/cgroup.c | 18 ++-
kernel/trace/blktrace.c | 254 +++++++++++++++++++++----------
21 files changed, 430 insertions(+), 172 deletions(-)
--
2.9.3
From: Shaohua Li <[email protected]>
kernfs uses an ida to manage inode numbers. The problem is that an ida can't
map an inode number back to its kernfs_node. Switch to an idr; the next patch
will add an API to get a kernfs_node from an inode number.
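To make the ida/idr distinction concrete, here is a small userspace toy model. It is not the kernel idr API: the toy_idr type and its functions are made up for illustration. It shows the one property the patch relies on, which is that the allocator remembers a pointer per id and can hand it back later.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_IDS 64

/* Toy stand-in for an idr: slot i holds the pointer registered under id i. */
struct toy_idr {
	void *slot[TOY_MAX_IDS];
};

/* Allocate the lowest free id >= start and bind it to ptr; -1 if full. */
int toy_idr_alloc(struct toy_idr *idr, void *ptr, int start)
{
	assert(idr && ptr && start >= 0);
	for (int id = start; id < TOY_MAX_IDS; id++) {
		if (!idr->slot[id]) {
			idr->slot[id] = ptr;
			return id;
		}
	}
	return -1;
}

/* The operation an ida cannot offer: map an id back to its object. */
void *toy_idr_find(const struct toy_idr *idr, int id)
{
	if (id < 0 || id >= TOY_MAX_IDS)
		return NULL;
	return idr->slot[id];
}

/* Release an id so a later allocation can reuse it. */
void toy_idr_remove(struct toy_idr *idr, int id)
{
	if (id >= 0 && id < TOY_MAX_IDS)
		idr->slot[id] = NULL;
}
```

Note that a removed id can be handed out again, which is why an inode number alone is not enough to identify a node and a generation number is kept next to it.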
Signed-off-by: Shaohua Li <[email protected]>
---
fs/kernfs/dir.c | 17 ++++++++++++-----
include/linux/kernfs.h | 2 +-
2 files changed, 13 insertions(+), 6 deletions(-)
diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 09d093e..8e8545a 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -21,6 +21,7 @@
DEFINE_MUTEX(kernfs_mutex);
static DEFINE_SPINLOCK(kernfs_rename_lock); /* kn->parent and ->name */
static char kernfs_pr_cont_buf[PATH_MAX]; /* protected by rename_lock */
+static DEFINE_SPINLOCK(kernfs_idr_lock); /* root->ino_idr */
#define rb_to_kn(X) rb_entry((X), struct kernfs_node, rb)
@@ -533,7 +534,9 @@ void kernfs_put(struct kernfs_node *kn)
simple_xattrs_free(&kn->iattr->xattrs);
}
kfree(kn->iattr);
- ida_simple_remove(&root->ino_ida, kn->ino);
+ spin_lock(&kernfs_idr_lock);
+ idr_remove(&root->ino_idr, kn->ino);
+ spin_unlock(&kernfs_idr_lock);
kmem_cache_free(kernfs_node_cache, kn);
kn = parent;
@@ -542,7 +545,7 @@ void kernfs_put(struct kernfs_node *kn)
goto repeat;
} else {
/* just released the root kn, free @root too */
- ida_destroy(&root->ino_ida);
+ idr_destroy(&root->ino_idr);
kfree(root);
}
}
@@ -630,7 +633,11 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
if (!kn)
goto err_out1;
- ret = ida_simple_get(&root->ino_ida, 1, 0, GFP_KERNEL);
+ idr_preload(GFP_KERNEL);
+ spin_lock(&kernfs_idr_lock);
+ ret = idr_alloc(&root->ino_idr, kn, 1, 0, GFP_ATOMIC);
+ spin_unlock(&kernfs_idr_lock);
+ idr_preload_end();
if (ret < 0)
goto err_out2;
kn->ino = ret;
@@ -876,14 +883,14 @@ struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
if (!root)
return ERR_PTR(-ENOMEM);
- ida_init(&root->ino_ida);
+ idr_init(&root->ino_idr);
INIT_LIST_HEAD(&root->supers);
atomic_set(&root->next_generation, 0);
kn = __kernfs_new_node(root, "", S_IFDIR | S_IRUGO | S_IXUGO,
KERNFS_DIR);
if (!kn) {
- ida_destroy(&root->ino_ida);
+ idr_destroy(&root->ino_idr);
kfree(root);
return ERR_PTR(-ENOMEM);
}
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index c5f0fa7..61668d1 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -164,7 +164,7 @@ struct kernfs_root {
unsigned int flags; /* KERNFS_ROOT_* flags */
/* private fields, do not use outside kernfs proper */
- struct ida ino_ida;
+ struct idr ino_idr;
struct kernfs_syscall_ops *syscall_ops;
/* list of kernfs_super_info of this root, protected by kernfs_mutex */
--
2.9.3
From: Shaohua Li <[email protected]>
While adding exportfs operations to kernfs, I found it hard to initialize
dentry->d_fsdata in the exportfs operations; there seems to be no way to do
it without a race condition. Looking at the kernfs code closely, there is no
point in setting dentry->d_fsdata: inode->i_private already points to the
kernfs_node, and we can get the inode from a dentry. So this patch just
deletes the d_fsdata usage.
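The lookup chain this relies on can be sketched in userspace with simplified stand-in types. The toy_* names below are hypothetical and carry only the fields needed here. Unlike the kernfs_dentry_node() macro in the patch, which leaves the negative-dentry check to its callers, this toy folds the check into the helper.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel types; only the fields used here. */
struct toy_node   { const char *name; };
struct toy_inode  { void *i_private; };           /* points at the toy_node */
struct toy_dentry { struct toy_inode *d_inode; }; /* NULL when negative */

/* A negative dentry has no inode, hence no node to return. */
struct toy_node *toy_dentry_node(const struct toy_dentry *d)
{
	assert(d);
	if (!d->d_inode)
		return NULL;
	return d->d_inode->i_private;
}
```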
Signed-off-by: Shaohua Li <[email protected]>
---
fs/kernfs/dir.c | 26 ++++++++++----------------
fs/kernfs/file.c | 6 +++---
fs/kernfs/inode.c | 6 +++---
fs/kernfs/kernfs-internal.h | 2 ++
fs/kernfs/mount.c | 8 ++------
fs/kernfs/symlink.c | 6 +++---
6 files changed, 23 insertions(+), 31 deletions(-)
diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 4c86e4c..f5e9376 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -562,7 +562,7 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
if (d_really_is_negative(dentry))
goto out_bad_unlocked;
- kn = dentry->d_fsdata;
+ kn = kernfs_dentry_node(dentry);
mutex_lock(&kernfs_mutex);
/* The kernfs node has been deactivated */
@@ -570,7 +570,8 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
goto out_bad;
/* The kernfs node has been moved? */
- if (dentry->d_parent->d_fsdata != kn->parent)
+ if (d_really_is_negative(dentry->d_parent) ||
+ kernfs_dentry_node(dentry->d_parent) != kn->parent)
goto out_bad;
/* The kernfs node has been renamed */
@@ -590,14 +591,8 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
return 0;
}
-static void kernfs_dop_release(struct dentry *dentry)
-{
- kernfs_put(dentry->d_fsdata);
-}
-
const struct dentry_operations kernfs_dops = {
.d_revalidate = kernfs_dop_revalidate,
- .d_release = kernfs_dop_release,
};
/**
@@ -613,8 +608,9 @@ const struct dentry_operations kernfs_dops = {
*/
struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry)
{
- if (dentry->d_sb->s_op == &kernfs_sops)
- return dentry->d_fsdata;
+ if (dentry->d_sb->s_op == &kernfs_sops &&
+ !d_really_is_negative(dentry))
+ return kernfs_dentry_node(dentry);
return NULL;
}
@@ -1028,7 +1024,7 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
unsigned int flags)
{
struct dentry *ret;
- struct kernfs_node *parent = dentry->d_parent->d_fsdata;
+ struct kernfs_node *parent = dir->i_private;
struct kernfs_node *kn;
struct inode *inode;
const void *ns = NULL;
@@ -1045,8 +1041,6 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
ret = NULL;
goto out_unlock;
}
- kernfs_get(kn);
- dentry->d_fsdata = kn;
/* attach dentry and inode */
inode = kernfs_get_inode(dir->i_sb, kn);
@@ -1083,7 +1077,7 @@ static int kernfs_iop_mkdir(struct inode *dir, struct dentry *dentry,
static int kernfs_iop_rmdir(struct inode *dir, struct dentry *dentry)
{
- struct kernfs_node *kn = dentry->d_fsdata;
+ struct kernfs_node *kn = kernfs_dentry_node(dentry);
struct kernfs_syscall_ops *scops = kernfs_root(kn)->syscall_ops;
int ret;
@@ -1103,7 +1097,7 @@ static int kernfs_iop_rename(struct inode *old_dir, struct dentry *old_dentry,
struct inode *new_dir, struct dentry *new_dentry,
unsigned int flags)
{
- struct kernfs_node *kn = old_dentry->d_fsdata;
+ struct kernfs_node *kn = kernfs_dentry_node(old_dentry);
struct kernfs_node *new_parent = new_dir->i_private;
struct kernfs_syscall_ops *scops = kernfs_root(kn)->syscall_ops;
int ret;
@@ -1616,7 +1610,7 @@ static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
{
struct dentry *dentry = file->f_path.dentry;
- struct kernfs_node *parent = dentry->d_fsdata;
+ struct kernfs_node *parent = kernfs_dentry_node(dentry);
struct kernfs_node *pos = file->private_data;
const void *ns = NULL;
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index ac2dfe0..7f90d4d 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -616,7 +616,7 @@ static void kernfs_put_open_node(struct kernfs_node *kn,
static int kernfs_fop_open(struct inode *inode, struct file *file)
{
- struct kernfs_node *kn = file->f_path.dentry->d_fsdata;
+ struct kernfs_node *kn = inode->i_private;
struct kernfs_root *root = kernfs_root(kn);
const struct kernfs_ops *ops;
struct kernfs_open_file *of;
@@ -768,7 +768,7 @@ static void kernfs_release_file(struct kernfs_node *kn,
static int kernfs_fop_release(struct inode *inode, struct file *filp)
{
- struct kernfs_node *kn = filp->f_path.dentry->d_fsdata;
+ struct kernfs_node *kn = inode->i_private;
struct kernfs_open_file *of = kernfs_of(filp);
if (kn->flags & KERNFS_HAS_RELEASE) {
@@ -835,7 +835,7 @@ void kernfs_drain_open_files(struct kernfs_node *kn)
static unsigned int kernfs_fop_poll(struct file *filp, poll_table *wait)
{
struct kernfs_open_file *of = kernfs_of(filp);
- struct kernfs_node *kn = filp->f_path.dentry->d_fsdata;
+ struct kernfs_node *kn = kernfs_dentry_node(filp->f_path.dentry);
struct kernfs_open_node *on = kn->attr.open;
if (!kernfs_get_active(kn))
diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index 79cdae4..4c8b510 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -112,7 +112,7 @@ int kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr)
int kernfs_iop_setattr(struct dentry *dentry, struct iattr *iattr)
{
struct inode *inode = d_inode(dentry);
- struct kernfs_node *kn = dentry->d_fsdata;
+ struct kernfs_node *kn = inode->i_private;
int error;
if (!kn)
@@ -154,7 +154,7 @@ static int kernfs_node_setsecdata(struct kernfs_iattrs *attrs, void **secdata,
ssize_t kernfs_iop_listxattr(struct dentry *dentry, char *buf, size_t size)
{
- struct kernfs_node *kn = dentry->d_fsdata;
+ struct kernfs_node *kn = kernfs_dentry_node(dentry);
struct kernfs_iattrs *attrs;
attrs = kernfs_iattrs(kn);
@@ -203,8 +203,8 @@ static void kernfs_refresh_inode(struct kernfs_node *kn, struct inode *inode)
int kernfs_iop_getattr(const struct path *path, struct kstat *stat,
u32 request_mask, unsigned int query_flags)
{
- struct kernfs_node *kn = path->dentry->d_fsdata;
struct inode *inode = d_inode(path->dentry);
+ struct kernfs_node *kn = inode->i_private;
mutex_lock(&kernfs_mutex);
kernfs_refresh_inode(kn, inode);
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index 3534cfe..82e11fa 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -70,6 +70,8 @@ struct kernfs_super_info {
};
#define kernfs_info(SB) ((struct kernfs_super_info *)(SB->s_fs_info))
+#define kernfs_dentry_node(d) ((d_inode(d))->i_private)
+
extern const struct super_operations kernfs_sops;
extern struct kmem_cache *kernfs_node_cache;
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 343dfeb..462a40c 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -33,7 +33,7 @@ static int kernfs_sop_remount_fs(struct super_block *sb, int *flags, char *data)
static int kernfs_sop_show_options(struct seq_file *sf, struct dentry *dentry)
{
- struct kernfs_root *root = kernfs_root(dentry->d_fsdata);
+ struct kernfs_root *root = kernfs_root(kernfs_dentry_node(dentry));
struct kernfs_syscall_ops *scops = root->syscall_ops;
if (scops && scops->show_options)
@@ -43,7 +43,7 @@ static int kernfs_sop_show_options(struct seq_file *sf, struct dentry *dentry)
static int kernfs_sop_show_path(struct seq_file *sf, struct dentry *dentry)
{
- struct kernfs_node *node = dentry->d_fsdata;
+ struct kernfs_node *node = kernfs_dentry_node(dentry);
struct kernfs_root *root = kernfs_root(node);
struct kernfs_syscall_ops *scops = root->syscall_ops;
@@ -176,8 +176,6 @@ static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
pr_debug("%s: could not get root dentry!\n", __func__);
return -ENOMEM;
}
- kernfs_get(info->root->kn);
- root->d_fsdata = info->root->kn;
sb->s_root = root;
sb->s_d_op = &kernfs_dops;
return 0;
@@ -283,7 +281,6 @@ struct dentry *kernfs_mount_ns(struct file_system_type *fs_type, int flags,
void kernfs_kill_sb(struct super_block *sb)
{
struct kernfs_super_info *info = kernfs_info(sb);
- struct kernfs_node *root_kn = sb->s_root->d_fsdata;
mutex_lock(&kernfs_mutex);
list_del(&info->node);
@@ -295,7 +292,6 @@ void kernfs_kill_sb(struct super_block *sb)
*/
kill_anon_super(sb);
kfree(info);
- kernfs_put(root_kn);
}
/**
diff --git a/fs/kernfs/symlink.c b/fs/kernfs/symlink.c
index 1684af4..08ccabd 100644
--- a/fs/kernfs/symlink.c
+++ b/fs/kernfs/symlink.c
@@ -98,9 +98,9 @@ static int kernfs_get_target_path(struct kernfs_node *parent,
return 0;
}
-static int kernfs_getlink(struct dentry *dentry, char *path)
+static int kernfs_getlink(struct inode *inode, char *path)
{
- struct kernfs_node *kn = dentry->d_fsdata;
+ struct kernfs_node *kn = inode->i_private;
struct kernfs_node *parent = kn->parent;
struct kernfs_node *target = kn->symlink.target_kn;
int error;
@@ -124,7 +124,7 @@ static const char *kernfs_iop_get_link(struct dentry *dentry,
body = kzalloc(PAGE_SIZE, GFP_KERNEL);
if (!body)
return ERR_PTR(-ENOMEM);
- error = kernfs_getlink(dentry, body);
+ error = kernfs_getlink(inode, body);
if (unlikely(error < 0)) {
kfree(body);
return ERR_PTR(error);
--
2.9.3
From: Shaohua Li <[email protected]>
Add an API to export cgroup fhandle info. We don't export a full 'struct
file_handle', since it carries info we don't need. Specifically, a cgroup is
always a directory, so we don't need a 'FILEID_INO32_GEN_PARENT' type fhandle;
we only need to export the inode number and generation number, just like
what generic_fh_to_parent does. We also avoid the overhead of getting
an inode, since kernfs_node has all the info required.
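A userspace sketch of the encoding, using made-up toy_* types that mirror the two fields involved, may help show why both numbers are kept. The matching helper is an illustration of how a consumer could detect a stale id; it is not a function from this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the kernfs_node fields the encoder reads. */
struct toy_kn {
	uint32_t ino;
	uint32_t generation;
};

/* Same two fields the patch stores in struct kernfs_node_id. */
struct toy_node_id {
	uint32_t ino;
	uint32_t gen;
};

/* Copy the identifying pair straight from the node; no inode is needed. */
void toy_encode_node_id(const struct toy_kn *kn, struct toy_node_id *id)
{
	assert(kn && id);
	id->ino = kn->ino;
	id->gen = kn->generation;
}

/* An id is stale once the inode number has been reused by a newer node. */
int toy_node_id_matches(const struct toy_kn *kn, const struct toy_node_id *id)
{
	assert(kn && id);
	return kn->ino == id->ino && kn->generation == id->gen;
}
```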
Signed-off-by: Shaohua Li <[email protected]>
---
fs/kernfs/mount.c | 11 +++++++++++
include/linux/cgroup-defs.h | 2 ++
include/linux/cgroup.h | 8 ++++++++
include/linux/kernfs.h | 8 ++++++++
kernel/cgroup/cgroup.c | 3 +++
5 files changed, 32 insertions(+)
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 11c5aba..d24d816 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -65,6 +65,17 @@ const struct super_operations kernfs_sops = {
.show_path = kernfs_sop_show_path,
};
+/*
+ * A special version of export_encode_fh(). This avoids getting the inode and
+ * then doing the fhandle encoding. This function must match export_encode_fh()
+ * and the kernfs node should be a directory.
+ */
+void kernfs_encode_node_id(struct kernfs_node *kn, struct kernfs_node_id *id)
+{
+ id->ino = kn->ino;
+ id->gen = kn->generation;
+}
+
static struct inode *kernfs_fh_get_inode(struct super_block *sb,
u64 ino, u32 generation)
{
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 2174594..8b6d9e2 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -308,6 +308,8 @@ struct cgroup {
/* used to store eBPF programs */
struct cgroup_bpf bpf;
+ struct kernfs_node_id node_id;
+
/* ids of the ancestors at each level including self */
int ancestor_ids[];
};
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index ed2573e..c30dda8 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -589,6 +589,10 @@ static inline void cgroup_kthread_ready(void)
current->no_cgroup_migration = 0;
}
+static inline struct kernfs_node_id *cgroup_get_node_id(struct cgroup *cgrp)
+{
+ return &cgrp->node_id;
+}
#else /* !CONFIG_CGROUPS */
struct cgroup_subsys_state;
@@ -611,6 +615,10 @@ static inline int cgroup_init_early(void) { return 0; }
static inline int cgroup_init(void) { return 0; }
static inline void cgroup_init_kthreadd(void) {}
static inline void cgroup_kthread_ready(void) {}
+static inline struct kernfs_node_id *cgroup_get_node_id(struct cgroup *cgrp)
+{
+ return NULL;
+}
static inline bool task_under_cgroup_hierarchy(struct task_struct *task,
struct cgroup *ancestor)
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 15c805f..932d89f 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -247,6 +247,12 @@ struct kernfs_ops {
#endif
};
+/* match with 'struct fid' */
+struct kernfs_node_id {
+ u32 ino;
+ u32 gen;
+};
+
#ifdef CONFIG_KERNFS
static inline enum kernfs_node_type kernfs_type(struct kernfs_node *kn)
@@ -339,6 +345,8 @@ struct super_block *kernfs_pin_sb(struct kernfs_root *root, const void *ns);
void kernfs_init(void);
+void kernfs_encode_node_id(struct kernfs_node *kn, struct kernfs_node_id *id);
+
#else /* CONFIG_KERNFS */
static inline enum kernfs_node_type kernfs_type(struct kernfs_node *kn)
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 206d8df..489672d 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1692,6 +1692,7 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask, int ref_flags)
goto exit_root_id;
}
root_cgrp->kn = root->kf_root->kn;
+ kernfs_encode_node_id(root_cgrp->kn, &root_cgrp->node_id);
ret = css_populate_dir(&root_cgrp->self);
if (ret)
@@ -4209,6 +4210,8 @@ int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode)
/* let's create and online css's */
kernfs_activate(kn);
+ kernfs_encode_node_id(kn, &cgrp->node_id);
+
ret = 0;
goto out_unlock;
--
2.9.3
From: Shaohua Li <[email protected]>
Currently cfq/bfq/blk-throttle output cgroup info in the trace in their own
way. Now that we have a standard blktrace API for this, convert them to use
it.
Note, this changes the behavior a little bit. cgroup info isn't output
by default; we only output it with the 'blk_cgroup' option enabled. cgroup
info isn't output as a string by default either; we only do that with the
'blk_cgname' option enabled. Also, cgroup info is output at a different
position in the note string. I don't think these behavior changes are a big
issue (we actually make the trace data shorter, which is good), since the
blktrace note is solely for debugging.
Signed-off-by: Shaohua Li <[email protected]>
---
block/bfq-iosched.h | 16 ++++++----------
block/blk-throttle.c | 6 ++----
block/cfq-iosched.c | 15 ++++++---------
include/linux/blk-cgroup.h | 14 --------------
include/linux/blktrace_api.h | 12 ++++++++----
kernel/trace/blktrace.c | 10 +++++++---
6 files changed, 29 insertions(+), 44 deletions(-)
diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index ae783c0..e4d5b56 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -910,19 +910,15 @@ void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq);
struct bfq_group *bfqq_group(struct bfq_queue *bfqq);
#define bfq_log_bfqq(bfqd, bfqq, fmt, args...) do { \
- char __pbuf[128]; \
- \
- blkg_path(bfqg_to_blkg(bfqq_group(bfqq)), __pbuf, sizeof(__pbuf)); \
- blk_add_trace_msg((bfqd)->queue, "bfq%d%c %s " fmt, (bfqq)->pid, \
- bfq_bfqq_sync((bfqq)) ? 'S' : 'A', \
- __pbuf, ##args); \
+ blk_add_cgroup_trace_msg((bfqd)->queue, \
+ bfqg_to_blkg(bfqq_group(bfqq))->blkcg, \
+ "bfq%d%c " fmt, (bfqq)->pid, \
+ bfq_bfqq_sync((bfqq)) ? 'S' : 'A', ##args); \
} while (0)
#define bfq_log_bfqg(bfqd, bfqg, fmt, args...) do { \
- char __pbuf[128]; \
- \
- blkg_path(bfqg_to_blkg(bfqg), __pbuf, sizeof(__pbuf)); \
- blk_add_trace_msg((bfqd)->queue, "%s " fmt, __pbuf, ##args); \
+ blk_add_cgroup_trace_msg((bfqd)->queue, \
+ bfqg_to_blkg(bfqg)->blkcg, fmt, ##args); \
} while (0)
#else /* CONFIG_BFQ_GROUP_IOSCHED */
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 53d3e3d..46e85e0 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -352,10 +352,8 @@ static unsigned int tg_iops_limit(struct throtl_grp *tg, int rw)
if (likely(!blk_trace_note_message_enabled(__td->queue))) \
break; \
if ((__tg)) { \
- char __pbuf[128]; \
- \
- blkg_path(tg_to_blkg(__tg), __pbuf, sizeof(__pbuf)); \
- blk_add_trace_msg(__td->queue, "throtl %s " fmt, __pbuf, ##args); \
+ blk_add_cgroup_trace_msg(__td->queue, \
+ tg_to_blkg(__tg)->blkcg, "throtl " fmt, ##args);\
} else { \
blk_add_trace_msg(__td->queue, "throtl " fmt, ##args); \
} \
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index da69b07..5f59f37 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -652,20 +652,17 @@ static inline void cfqg_put(struct cfq_group *cfqg)
}
#define cfq_log_cfqq(cfqd, cfqq, fmt, args...) do { \
- char __pbuf[128]; \
- \
- blkg_path(cfqg_to_blkg((cfqq)->cfqg), __pbuf, sizeof(__pbuf)); \
- blk_add_trace_msg((cfqd)->queue, "cfq%d%c%c %s " fmt, (cfqq)->pid, \
+ blk_add_cgroup_trace_msg((cfqd)->queue, \
+ cfqg_to_blkg((cfqq)->cfqg)->blkcg, \
+ "cfq%d%c%c " fmt, (cfqq)->pid, \
cfq_cfqq_sync((cfqq)) ? 'S' : 'A', \
cfqq_type((cfqq)) == SYNC_NOIDLE_WORKLOAD ? 'N' : ' ',\
- __pbuf, ##args); \
+ ##args); \
} while (0)
#define cfq_log_cfqg(cfqd, cfqg, fmt, args...) do { \
- char __pbuf[128]; \
- \
- blkg_path(cfqg_to_blkg(cfqg), __pbuf, sizeof(__pbuf)); \
- blk_add_trace_msg((cfqd)->queue, "%s " fmt, __pbuf, ##args); \
+ blk_add_cgroup_trace_msg((cfqd)->queue, \
+ cfqg_to_blkg(cfqg)->blkcg, fmt, ##args); \
} while (0)
static inline void cfqg_stats_update_io_add(struct cfq_group *cfqg,
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index d176247..b06f107 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -334,19 +334,6 @@ static inline struct blkcg *cpd_to_blkcg(struct blkcg_policy_data *cpd)
}
/**
- * blkg_path - format cgroup path of blkg
- * @blkg: blkg of interest
- * @buf: target buffer
- * @buflen: target buffer length
- *
- * Format the path of the cgroup of @blkg into @buf.
- */
-static inline int blkg_path(struct blkcg_gq *blkg, char *buf, int buflen)
-{
- return cgroup_path(blkg->blkcg->css.cgroup, buf, buflen);
-}
-
-/**
* blkg_get - get a blkg reference
* @blkg: blkg to get
*
@@ -758,7 +745,6 @@ static inline struct blkcg *bio_blkcg(struct bio *bio) { return NULL; }
static inline struct blkg_policy_data *blkg_to_pd(struct blkcg_gq *blkg,
struct blkcg_policy *pol) { return NULL; }
static inline struct blkcg_gq *pd_to_blkg(struct blkg_policy_data *pd) { return NULL; }
-static inline char *blkg_path(struct blkcg_gq *blkg) { return NULL; }
static inline void blkg_get(struct blkcg_gq *blkg) { }
static inline void blkg_put(struct blkcg_gq *blkg) { }
diff --git a/include/linux/blktrace_api.h b/include/linux/blktrace_api.h
index d2e9085..cbfadd0 100644
--- a/include/linux/blktrace_api.h
+++ b/include/linux/blktrace_api.h
@@ -28,10 +28,12 @@ struct blk_trace {
atomic_t dropped;
};
+struct blkcg;
+
extern int blk_trace_ioctl(struct block_device *, unsigned, char __user *);
extern void blk_trace_shutdown(struct request_queue *);
-extern __printf(2, 3)
-void __trace_note_message(struct blk_trace *, const char *fmt, ...);
+extern __printf(3, 4)
+void __trace_note_message(struct blk_trace *, struct blkcg *blkcg, const char *fmt, ...);
/**
* blk_add_trace_msg - Add a (simple) message to the blktrace stream
@@ -46,12 +48,14 @@ void __trace_note_message(struct blk_trace *, const char *fmt, ...);
* NOTE: Can not use 'static inline' due to presence of var args...
*
**/
-#define blk_add_trace_msg(q, fmt, ...) \
+#define blk_add_cgroup_trace_msg(q, cg, fmt, ...) \
do { \
struct blk_trace *bt = (q)->blk_trace; \
if (unlikely(bt)) \
- __trace_note_message(bt, fmt, ##__VA_ARGS__); \
+ __trace_note_message(bt, cg, fmt, ##__VA_ARGS__);\
} while (0)
+#define blk_add_trace_msg(q, fmt, ...) \
+ blk_add_cgroup_trace_msg(q, NULL, fmt, ##__VA_ARGS__)
#define BLK_TN_MAX_MSG 128
static inline bool blk_trace_note_message_enabled(struct request_queue *q)
diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index 18cbc02..9e135dd 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -154,7 +154,8 @@ static void trace_note_time(struct blk_trace *bt)
local_irq_restore(flags);
}
-void __trace_note_message(struct blk_trace *bt, const char *fmt, ...)
+void __trace_note_message(struct blk_trace *bt, struct blkcg *blkcg,
+ const char *fmt, ...)
{
int n;
va_list args;
@@ -178,7 +179,10 @@ void __trace_note_message(struct blk_trace *bt, const char *fmt, ...)
n = vscnprintf(buf, BLK_TN_MAX_MSG, fmt, args);
va_end(args);
- trace_note(bt, 0, BLK_TN_MESSAGE, buf, n, NULL);
+ if (!(blk_tracer_flags.val & TRACE_BLK_OPT_CGROUP))
+ blkcg = NULL;
+ trace_note(bt, 0, BLK_TN_MESSAGE, buf, n,
+ blkcg ? cgroup_get_node_id(blkcg->css.cgroup) : NULL);
local_irq_restore(flags);
}
EXPORT_SYMBOL_GPL(__trace_note_message);
@@ -375,7 +379,7 @@ static ssize_t blk_msg_write(struct file *filp, const char __user *buffer,
return PTR_ERR(msg);
bt = filp->private_data;
- __trace_note_message(bt, "%s", msg);
+ __trace_note_message(bt, NULL, "%s", msg);
kfree(msg);
return count;
--
2.9.3
From: Shaohua Li <[email protected]>
By default we output the cgroup id in blktrace. This adds an option to
display the cgroup path instead. Since getting the cgroup path is a relatively
heavy operation, we don't enable it by default.
With the option enabled, blktrace will output something like this:
dd-1353 [007] d..2 293.015252: 8,0 /test/level D R 24 + 8 [dd]
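The choice between the two output forms can be sketched in userspace as follows. The format strings are copied from the blk_log_action() hunk in this patch, minus the trailing action and rwbs columns. The toy_* names and the hard-coded lookup table are hypothetical stand-ins for the real id-to-path resolution.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TOY_OPT_CGNAME 0x4  /* same bit value as TRACE_BLK_OPT_CGNAME */

struct toy_node_id { unsigned int ino, gen; };

/* Hypothetical resolver: returns a path for known ids, NULL otherwise. */
static const char *toy_lookup_path(const struct toy_node_id *id)
{
	if (id->ino == 7 && id->gen == 1)
		return "/test/level";
	return NULL;
}

/* Format the device and cgroup columns, as a path or as a raw ino,gen pair. */
int toy_log_prefix(char *out, size_t len, unsigned int flags,
		   int major, int minor, const struct toy_node_id *id)
{
	assert(out && id);
	if (flags & TOY_OPT_CGNAME) {
		const char *path = toy_lookup_path(id);

		/* "<...>" is the placeholder the patch keeps when lookup fails. */
		return snprintf(out, len, "%3d,%-3d %s ", major, minor,
				path ? path : "<...>");
	}
	return snprintf(out, len, "%3d,%-3d %8x,%-8x ", major, minor,
			id->ino, id->gen);
}
```

An id whose generation no longer matches falls back to the placeholder rather than printing the path of whichever node reused the inode number.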
Signed-off-by: Shaohua Li <[email protected]>
---
fs/kernfs/mount.c | 19 +++++++++++++++++++
include/linux/cgroup.h | 6 ++++++
include/linux/kernfs.h | 2 ++
kernel/cgroup/cgroup.c | 12 ++++++++++++
kernel/trace/blktrace.c | 14 +++++++++++++-
5 files changed, 52 insertions(+), 1 deletion(-)
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index d24d816..4c1636e 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -76,6 +76,25 @@ void kernfs_encode_node_id(struct kernfs_node *kn, struct kernfs_node_id *id)
id->gen = kn->generation;
}
+/*
+ * Similar to kernfs_fh_get_inode(), this gets a kernfs node from an inode
+ * number and generation
+ */
+struct kernfs_node *kernfs_get_node_by_id(struct kernfs_root *root,
+ const struct kernfs_node_id *id)
+{
+ struct kernfs_node *kn;
+
+ kn = kernfs_get_node_by_ino(root, id->ino);
+ if (!kn)
+ return NULL;
+ if (kn->generation != id->gen) {
+ kernfs_put(kn);
+ return NULL;
+ }
+ return kn;
+}
+
static struct inode *kernfs_fh_get_inode(struct super_block *sb,
u64 ino, u32 generation)
{
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index c30dda8..56390c8 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -593,6 +593,9 @@ static inline struct kernfs_node_id *cgroup_get_node_id(struct cgroup *cgrp)
{
return &cgrp->node_id;
}
+
+void cgroup_path_from_node_id(const struct kernfs_node_id *id,
+ char *buf, size_t buflen);
#else /* !CONFIG_CGROUPS */
struct cgroup_subsys_state;
@@ -625,6 +628,9 @@ static inline bool task_under_cgroup_hierarchy(struct task_struct *task,
{
return true;
}
+
+static inline void cgroup_path_from_node_id(const struct kernfs_node_id *id,
+ char *buf, size_t buflen) {}
#endif /* !CONFIG_CGROUPS */
/*
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 932d89f..1c9397d 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -346,6 +346,8 @@ struct super_block *kernfs_pin_sb(struct kernfs_root *root, const void *ns);
void kernfs_init(void);
void kernfs_encode_node_id(struct kernfs_node *kn, struct kernfs_node_id *id);
+struct kernfs_node *kernfs_get_node_by_id(struct kernfs_root *root,
+ const struct kernfs_node_id *id);
#else /* CONFIG_KERNFS */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 489672d..7f91ad1 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -4604,6 +4604,18 @@ static int __init cgroup_wq_init(void)
}
core_initcall(cgroup_wq_init);
+void cgroup_path_from_node_id(const struct kernfs_node_id *id,
+ char *buf, size_t buflen)
+{
+ struct kernfs_node *kn;
+
+ kn = kernfs_get_node_by_id(cgrp_dfl_root.kf_root, id);
+ if (!kn)
+ return;
+ kernfs_path(kn, buf, buflen);
+ kernfs_put(kn);
+}
+
/*
* proc_cgroup_show()
* - Print task's cgroup paths into seq_file, one line for each hierarchy
diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index e351837..18cbc02 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -48,12 +48,14 @@ static __cacheline_aligned_in_smp DEFINE_SPINLOCK(running_trace_lock);
/* Select an alternative, minimalistic output than the original one */
#define TRACE_BLK_OPT_CLASSIC 0x1
#define TRACE_BLK_OPT_CGROUP 0x2
+#define TRACE_BLK_OPT_CGNAME 0x4
static struct tracer_opt blk_tracer_opts[] = {
/* Default disable the minimalistic output */
{ TRACER_OPT(blk_classic, TRACE_BLK_OPT_CLASSIC) },
#ifdef CONFIG_BLK_CGROUP
{ TRACER_OPT(blk_cgroup, TRACE_BLK_OPT_CGROUP) },
+ { TRACER_OPT(blk_cgname, TRACE_BLK_OPT_CGNAME) },
#endif
{ }
};
@@ -1212,7 +1214,17 @@ static void blk_log_action(struct trace_iterator *iter, const char *act,
if (has_cg) {
const struct kernfs_node_id *id = cgid_start(iter->ent);
- trace_seq_printf(&iter->seq, "%3d,%-3d %8x,%-8x %2s %3s ",
+ if (blk_tracer_flags.val & TRACE_BLK_OPT_CGNAME) {
+ char blkcg_name_buf[NAME_MAX + 1] = "<...>";
+
+ cgroup_path_from_node_id(id, blkcg_name_buf,
+ sizeof(blkcg_name_buf));
+ trace_seq_printf(&iter->seq, "%3d,%-3d %s %2s %3s ",
+ MAJOR(t->device), MINOR(t->device),
+ blkcg_name_buf, act, rwbs);
+ } else
+ trace_seq_printf(&iter->seq,
+ "%3d,%-3d %8x,%-8x %2s %3s ",
MAJOR(t->device), MINOR(t->device),
id->ino, id->gen, act, rwbs);
} else
--
2.9.3
From: Shaohua Li <[email protected]>
Currently blktrace isn't cgroup aware. blktrace prints the task name of
the current context, but that task isn't always in the cgroup the BIO
comes from, so we can't use the task name to identify the IO cgroup. For
example, writeback BIOs always come from the flusher thread, but the BIOs
belong to different blk cgroups. Requests can be requeued and dispatched
from completely different tasks. MD/DM are other examples.
This patch tries to close the gap. We print cgroup fhandle info in
blktrace. Userspace can use the open_by_handle_at() syscall to find the
cgroup by fhandle, or use the name_to_handle_at() syscall to find the
fhandle for a cgroup and use a BPF program to filter blktrace output
for that cgroup.
We add a new 'blk_cgroup' trace option for the blk tracer. It is off by
default, so applications that don't know about the new option aren't
affected. When it's on, we output fhandle info right after blk_io_trace,
with an extra bit set in the event action. So from an application's point
of view, blktrace with the option enabled will output new actions.
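From the consumer side, checking for the extra bit could look like the sketch below. The header struct copies the field layout of struct blk_io_trace. The toy_* names are hypothetical, and the assumption that the id sits immediately after the header follows the comment this patch adds to the struct. How the id interacts with pdu_len is not modelled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_TA_CGROUP (1u << 8)  /* same value as __BLK_TA_CGROUP */

/* Field-for-field copy of struct blk_io_trace from the uapi header. */
struct toy_blk_io_trace {
	uint32_t magic;
	uint32_t sequence;
	uint64_t time;
	uint64_t sector;
	uint32_t bytes;
	uint32_t action;
	uint32_t pid;
	uint32_t device;
	uint32_t cpu;
	uint16_t error;
	uint16_t pdu_len;
};

struct toy_node_id { uint32_t ino, gen; };

/* Return a pointer to the cgroup id trailing the header, or NULL if absent. */
const struct toy_node_id *toy_record_cgid(const struct toy_blk_io_trace *t)
{
	assert(t);
	if (!(t->action & TOY_TA_CGROUP))
		return NULL;
	return (const struct toy_node_id *)(t + 1);
}
```

A parser that does not know about the bit would see an unfamiliar action value, which is the "new actions" effect described above.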
I haven't changed the blk trace events yet, since I'm not sure whether
changing the trace event output is an ABI issue. If not, I'll do it later.
Signed-off-by: Shaohua Li <[email protected]>
---
include/uapi/linux/blktrace_api.h | 3 +
kernel/trace/blktrace.c | 230 ++++++++++++++++++++++++++------------
2 files changed, 160 insertions(+), 73 deletions(-)
diff --git a/include/uapi/linux/blktrace_api.h b/include/uapi/linux/blktrace_api.h
index c590ca6..9cdaede 100644
--- a/include/uapi/linux/blktrace_api.h
+++ b/include/uapi/linux/blktrace_api.h
@@ -52,6 +52,7 @@ enum blktrace_act {
__BLK_TA_REMAP, /* bio was remapped */
__BLK_TA_ABORT, /* request aborted */
__BLK_TA_DRV_DATA, /* driver-specific binary data */
+ __BLK_TA_CGROUP = 1 << 8, /* from a cgroup */
};
/*
@@ -61,6 +62,7 @@ enum blktrace_notify {
__BLK_TN_PROCESS = 0, /* establish pid/name mapping */
__BLK_TN_TIMESTAMP, /* include system clock */
__BLK_TN_MESSAGE, /* Character string message */
+ __BLK_TN_CGROUP = __BLK_TA_CGROUP, /* from a cgroup */
};
@@ -107,6 +109,7 @@ struct blk_io_trace {
__u32 cpu; /* on what cpu did it happen */
__u16 error; /* completion error */
__u16 pdu_len; /* length of data after this trace */
+ /* cgroup id will be stored here if exists */
};
/*
diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index 193c5f5..e351837 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -27,6 +27,7 @@
#include <linux/time.h>
#include <linux/uaccess.h>
#include <linux/list.h>
+#include <linux/blk-cgroup.h>
#include "../../block/blk.h"
@@ -46,10 +47,14 @@ static __cacheline_aligned_in_smp DEFINE_SPINLOCK(running_trace_lock);
/* Select an alternative, minimalistic output than the original one */
#define TRACE_BLK_OPT_CLASSIC 0x1
+#define TRACE_BLK_OPT_CGROUP 0x2
static struct tracer_opt blk_tracer_opts[] = {
/* Default disable the minimalistic output */
{ TRACER_OPT(blk_classic, TRACE_BLK_OPT_CLASSIC) },
+#ifdef CONFIG_BLK_CGROUP
+ { TRACER_OPT(blk_cgroup, TRACE_BLK_OPT_CGROUP) },
+#endif
{ }
};
@@ -68,7 +73,8 @@ static void blk_unregister_tracepoints(void);
* Send out a notify message.
*/
static void trace_note(struct blk_trace *bt, pid_t pid, int action,
- const void *data, size_t len)
+ const void *data, size_t len,
+ struct kernfs_node_id *cgid)
{
struct blk_io_trace *t;
struct ring_buffer_event *event = NULL;
@@ -76,12 +82,13 @@ static void trace_note(struct blk_trace *bt, pid_t pid, int action,
int pc = 0;
int cpu = smp_processor_id();
bool blk_tracer = blk_tracer_enabled;
+ ssize_t cgid_len = cgid ? sizeof(*cgid) : 0;
if (blk_tracer) {
buffer = blk_tr->trace_buffer.buffer;
pc = preempt_count();
event = trace_buffer_lock_reserve(buffer, TRACE_BLK,
- sizeof(*t) + len,
+ sizeof(*t) + len + cgid_len,
0, pc);
if (!event)
return;
@@ -92,17 +99,19 @@ static void trace_note(struct blk_trace *bt, pid_t pid, int action,
if (!bt->rchan)
return;
- t = relay_reserve(bt->rchan, sizeof(*t) + len);
+ t = relay_reserve(bt->rchan, sizeof(*t) + len + cgid_len);
if (t) {
t->magic = BLK_IO_TRACE_MAGIC | BLK_IO_TRACE_VERSION;
t->time = ktime_to_ns(ktime_get());
record_it:
t->device = bt->dev;
- t->action = action;
+ t->action = action | (cgid ? __BLK_TN_CGROUP : 0);
t->pid = pid;
t->cpu = cpu;
- t->pdu_len = len;
- memcpy((void *) t + sizeof(*t), data, len);
+ t->pdu_len = len + cgid_len;
+ if (cgid)
+ memcpy((void *)t + sizeof(*t), cgid, cgid_len);
+ memcpy((void *) t + sizeof(*t) + cgid_len, data, len);
if (blk_tracer)
trace_buffer_unlock_commit(blk_tr, buffer, event, 0, pc);
@@ -122,7 +131,7 @@ static void trace_note_tsk(struct task_struct *tsk)
spin_lock_irqsave(&running_trace_lock, flags);
list_for_each_entry(bt, &running_trace_list, running_list) {
trace_note(bt, tsk->pid, BLK_TN_PROCESS, tsk->comm,
- sizeof(tsk->comm));
+ sizeof(tsk->comm), NULL);
}
spin_unlock_irqrestore(&running_trace_lock, flags);
}
@@ -139,7 +148,7 @@ static void trace_note_time(struct blk_trace *bt)
words[1] = now.tv_nsec;
local_irq_save(flags);
- trace_note(bt, 0, BLK_TN_TIMESTAMP, words, sizeof(words));
+ trace_note(bt, 0, BLK_TN_TIMESTAMP, words, sizeof(words), NULL);
local_irq_restore(flags);
}
@@ -167,7 +176,7 @@ void __trace_note_message(struct blk_trace *bt, const char *fmt, ...)
n = vscnprintf(buf, BLK_TN_MAX_MSG, fmt, args);
va_end(args);
- trace_note(bt, 0, BLK_TN_MESSAGE, buf, n);
+ trace_note(bt, 0, BLK_TN_MESSAGE, buf, n, NULL);
local_irq_restore(flags);
}
EXPORT_SYMBOL_GPL(__trace_note_message);
@@ -204,7 +213,7 @@ static const u32 ddir_act[2] = { BLK_TC_ACT(BLK_TC_READ),
*/
static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
int op, int op_flags, u32 what, int error, int pdu_len,
- void *pdu_data)
+ void *pdu_data, struct kernfs_node_id *cgid)
{
struct task_struct *tsk = current;
struct ring_buffer_event *event = NULL;
@@ -215,6 +224,7 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
pid_t pid;
int cpu, pc = 0;
bool blk_tracer = blk_tracer_enabled;
+ ssize_t cgid_len = cgid ? sizeof(*cgid) : 0;
if (unlikely(bt->trace_state != Blktrace_running && !blk_tracer))
return;
@@ -229,6 +239,8 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
what |= BLK_TC_ACT(BLK_TC_DISCARD);
if (op == REQ_OP_FLUSH)
what |= BLK_TC_ACT(BLK_TC_FLUSH);
+ if (cgid)
+ what |= __BLK_TA_CGROUP;
pid = tsk->pid;
if (act_log_check(bt, what, sector, pid))
@@ -241,7 +253,7 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
buffer = blk_tr->trace_buffer.buffer;
pc = preempt_count();
event = trace_buffer_lock_reserve(buffer, TRACE_BLK,
- sizeof(*t) + pdu_len,
+ sizeof(*t) + pdu_len + cgid_len,
0, pc);
if (!event)
return;
@@ -258,7 +270,7 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
* from coming in and stepping on our toes.
*/
local_irq_save(flags);
- t = relay_reserve(bt->rchan, sizeof(*t) + pdu_len);
+ t = relay_reserve(bt->rchan, sizeof(*t) + pdu_len + cgid_len);
if (t) {
sequence = per_cpu_ptr(bt->sequence, cpu);
@@ -280,10 +292,12 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
t->action = what;
t->device = bt->dev;
t->error = error;
- t->pdu_len = pdu_len;
+ t->pdu_len = pdu_len + cgid_len;
+ if (cgid_len)
+ memcpy((void *)t + sizeof(*t), cgid, cgid_len);
if (pdu_len)
- memcpy((void *) t + sizeof(*t), pdu_data, pdu_len);
+ memcpy((void *)t + sizeof(*t) + cgid_len, pdu_data, pdu_len);
if (blk_tracer) {
trace_buffer_unlock_commit(blk_tr, buffer, event, 0, pc);
@@ -684,6 +698,36 @@ void blk_trace_shutdown(struct request_queue *q)
}
}
+#ifdef CONFIG_BLK_CGROUP
+static struct kernfs_node_id *
+blk_trace_bio_get_cgid(struct request_queue *q, struct bio *bio)
+{
+ struct blk_trace *bt = q->blk_trace;
+
+ if (!bt || !(blk_tracer_flags.val & TRACE_BLK_OPT_CGROUP))
+ return NULL;
+
+ if (!bio->bi_css)
+ return NULL;
+ return cgroup_get_node_id(bio->bi_css->cgroup);
+}
+#else
+static struct kernfs_node_id *
+blk_trace_bio_get_cgid(struct request_queue *q, struct bio *bio)
+{
+ return NULL;
+}
+#endif
+
+static struct kernfs_node_id *
+blk_trace_request_get_cgid(struct request_queue *q, struct request *rq)
+{
+ if (!rq->bio)
+ return NULL;
+ /* Use the first bio */
+ return blk_trace_bio_get_cgid(q, rq->bio);
+}
+
/*
* blktrace probes
*/
@@ -700,7 +744,8 @@ void blk_trace_shutdown(struct request_queue *q)
*
**/
static void blk_add_trace_rq(struct request *rq, int error,
- unsigned int nr_bytes, u32 what)
+ unsigned int nr_bytes, u32 what,
+ struct kernfs_node_id *cgid)
{
struct blk_trace *bt = rq->q->blk_trace;
@@ -713,32 +758,36 @@ static void blk_add_trace_rq(struct request *rq, int error,
what |= BLK_TC_ACT(BLK_TC_FS);
__blk_add_trace(bt, blk_rq_trace_sector(rq), nr_bytes, req_op(rq),
- rq->cmd_flags, what, error, 0, NULL);
+ rq->cmd_flags, what, error, 0, NULL, cgid);
}
static void blk_add_trace_rq_insert(void *ignore,
struct request_queue *q, struct request *rq)
{
- blk_add_trace_rq(rq, 0, blk_rq_bytes(rq), BLK_TA_INSERT);
+ blk_add_trace_rq(rq, 0, blk_rq_bytes(rq), BLK_TA_INSERT,
+ blk_trace_request_get_cgid(q, rq));
}
static void blk_add_trace_rq_issue(void *ignore,
struct request_queue *q, struct request *rq)
{
- blk_add_trace_rq(rq, 0, blk_rq_bytes(rq), BLK_TA_ISSUE);
+ blk_add_trace_rq(rq, 0, blk_rq_bytes(rq), BLK_TA_ISSUE,
+ blk_trace_request_get_cgid(q, rq));
}
static void blk_add_trace_rq_requeue(void *ignore,
struct request_queue *q,
struct request *rq)
{
- blk_add_trace_rq(rq, 0, blk_rq_bytes(rq), BLK_TA_REQUEUE);
+ blk_add_trace_rq(rq, 0, blk_rq_bytes(rq), BLK_TA_REQUEUE,
+ blk_trace_request_get_cgid(q, rq));
}
static void blk_add_trace_rq_complete(void *ignore, struct request *rq,
int error, unsigned int nr_bytes)
{
- blk_add_trace_rq(rq, error, nr_bytes, BLK_TA_COMPLETE);
+ blk_add_trace_rq(rq, error, nr_bytes, BLK_TA_COMPLETE,
+ blk_trace_request_get_cgid(rq->q, rq));
}
/**
@@ -753,7 +802,7 @@ static void blk_add_trace_rq_complete(void *ignore, struct request *rq,
*
**/
static void blk_add_trace_bio(struct request_queue *q, struct bio *bio,
- u32 what, int error)
+ u32 what, int error, struct kernfs_node_id *cgid)
{
struct blk_trace *bt = q->blk_trace;
@@ -761,20 +810,22 @@ static void blk_add_trace_bio(struct request_queue *q, struct bio *bio,
return;
__blk_add_trace(bt, bio->bi_iter.bi_sector, bio->bi_iter.bi_size,
- bio_op(bio), bio->bi_opf, what, error, 0, NULL);
+ bio_op(bio), bio->bi_opf, what, error, 0, NULL, cgid);
}
static void blk_add_trace_bio_bounce(void *ignore,
struct request_queue *q, struct bio *bio)
{
- blk_add_trace_bio(q, bio, BLK_TA_BOUNCE, 0);
+ blk_add_trace_bio(q, bio, BLK_TA_BOUNCE, 0,
+ blk_trace_bio_get_cgid(q, bio));
}
static void blk_add_trace_bio_complete(void *ignore,
struct request_queue *q, struct bio *bio,
int error)
{
- blk_add_trace_bio(q, bio, BLK_TA_COMPLETE, error);
+ blk_add_trace_bio(q, bio, BLK_TA_COMPLETE, error,
+ blk_trace_bio_get_cgid(q, bio));
}
static void blk_add_trace_bio_backmerge(void *ignore,
@@ -782,7 +833,8 @@ static void blk_add_trace_bio_backmerge(void *ignore,
struct request *rq,
struct bio *bio)
{
- blk_add_trace_bio(q, bio, BLK_TA_BACKMERGE, 0);
+ blk_add_trace_bio(q, bio, BLK_TA_BACKMERGE, 0,
+ blk_trace_bio_get_cgid(q, bio));
}
static void blk_add_trace_bio_frontmerge(void *ignore,
@@ -790,13 +842,15 @@ static void blk_add_trace_bio_frontmerge(void *ignore,
struct request *rq,
struct bio *bio)
{
- blk_add_trace_bio(q, bio, BLK_TA_FRONTMERGE, 0);
+ blk_add_trace_bio(q, bio, BLK_TA_FRONTMERGE, 0,
+ blk_trace_bio_get_cgid(q, bio));
}
static void blk_add_trace_bio_queue(void *ignore,
struct request_queue *q, struct bio *bio)
{
- blk_add_trace_bio(q, bio, BLK_TA_QUEUE, 0);
+ blk_add_trace_bio(q, bio, BLK_TA_QUEUE, 0,
+ blk_trace_bio_get_cgid(q, bio));
}
static void blk_add_trace_getrq(void *ignore,
@@ -804,13 +858,14 @@ static void blk_add_trace_getrq(void *ignore,
struct bio *bio, int rw)
{
if (bio)
- blk_add_trace_bio(q, bio, BLK_TA_GETRQ, 0);
+ blk_add_trace_bio(q, bio, BLK_TA_GETRQ, 0,
+ blk_trace_bio_get_cgid(q, bio));
else {
struct blk_trace *bt = q->blk_trace;
if (bt)
__blk_add_trace(bt, 0, 0, rw, 0, BLK_TA_GETRQ, 0, 0,
- NULL);
+ NULL, NULL);
}
}
@@ -820,13 +875,14 @@ static void blk_add_trace_sleeprq(void *ignore,
struct bio *bio, int rw)
{
if (bio)
- blk_add_trace_bio(q, bio, BLK_TA_SLEEPRQ, 0);
+ blk_add_trace_bio(q, bio, BLK_TA_SLEEPRQ, 0,
+ blk_trace_bio_get_cgid(q, bio));
else {
struct blk_trace *bt = q->blk_trace;
if (bt)
__blk_add_trace(bt, 0, 0, rw, 0, BLK_TA_SLEEPRQ,
- 0, 0, NULL);
+ 0, 0, NULL, NULL);
}
}
@@ -835,7 +891,7 @@ static void blk_add_trace_plug(void *ignore, struct request_queue *q)
struct blk_trace *bt = q->blk_trace;
if (bt)
- __blk_add_trace(bt, 0, 0, 0, 0, BLK_TA_PLUG, 0, 0, NULL);
+ __blk_add_trace(bt, 0, 0, 0, 0, BLK_TA_PLUG, 0, 0, NULL, NULL);
}
static void blk_add_trace_unplug(void *ignore, struct request_queue *q,
@@ -852,7 +908,7 @@ static void blk_add_trace_unplug(void *ignore, struct request_queue *q,
else
what = BLK_TA_UNPLUG_TIMER;
- __blk_add_trace(bt, 0, 0, 0, 0, what, 0, sizeof(rpdu), &rpdu);
+ __blk_add_trace(bt, 0, 0, 0, 0, what, 0, sizeof(rpdu), &rpdu, NULL);
}
}
@@ -868,7 +924,7 @@ static void blk_add_trace_split(void *ignore,
__blk_add_trace(bt, bio->bi_iter.bi_sector,
bio->bi_iter.bi_size, bio_op(bio), bio->bi_opf,
BLK_TA_SPLIT, bio->bi_error, sizeof(rpdu),
- &rpdu);
+ &rpdu, blk_trace_bio_get_cgid(q, bio));
}
}
@@ -901,7 +957,7 @@ static void blk_add_trace_bio_remap(void *ignore,
__blk_add_trace(bt, bio->bi_iter.bi_sector, bio->bi_iter.bi_size,
bio_op(bio), bio->bi_opf, BLK_TA_REMAP, bio->bi_error,
- sizeof(r), &r);
+ sizeof(r), &r, blk_trace_bio_get_cgid(q, bio));
}
/**
@@ -934,7 +990,7 @@ static void blk_add_trace_rq_remap(void *ignore,
__blk_add_trace(bt, blk_rq_pos(rq), blk_rq_bytes(rq),
rq_data_dir(rq), 0, BLK_TA_REMAP, 0,
- sizeof(r), &r);
+ sizeof(r), &r, blk_trace_request_get_cgid(q, rq));
}
/**
@@ -958,7 +1014,8 @@ void blk_add_driver_data(struct request_queue *q,
return;
__blk_add_trace(bt, blk_rq_trace_sector(rq), blk_rq_bytes(rq), 0, 0,
- BLK_TA_DRV_DATA, 0, len, data);
+ BLK_TA_DRV_DATA, 0, len, data,
+ blk_trace_request_get_cgid(q, rq));
}
EXPORT_SYMBOL_GPL(blk_add_driver_data);
@@ -1031,7 +1088,7 @@ static void fill_rwbs(char *rwbs, const struct blk_io_trace *t)
int i = 0;
int tc = t->action >> BLK_TC_SHIFT;
- if (t->action == BLK_TN_MESSAGE) {
+ if ((t->action & ~__BLK_TN_CGROUP) == BLK_TN_MESSAGE) {
rwbs[i++] = 'N';
goto out;
}
@@ -1066,9 +1123,21 @@ const struct blk_io_trace *te_blk_io_trace(const struct trace_entry *ent)
return (const struct blk_io_trace *)ent;
}
-static inline const void *pdu_start(const struct trace_entry *ent)
+static inline const void *pdu_start(const struct trace_entry *ent, bool has_cg)
{
- return te_blk_io_trace(ent) + 1;
+ return (void *)(te_blk_io_trace(ent) + 1) +
+ (has_cg ? sizeof(struct kernfs_node_id) : 0);
+}
+
+static inline const void *cgid_start(const struct trace_entry *ent)
+{
+ return (void *)(te_blk_io_trace(ent) + 1);
+}
+
+static inline int pdu_real_len(const struct trace_entry *ent, bool has_cg)
+{
+ return te_blk_io_trace(ent)->pdu_len -
+ (has_cg ? sizeof(struct kernfs_node_id) : 0);
}
static inline u32 t_action(const struct trace_entry *ent)
@@ -1096,16 +1165,16 @@ static inline __u16 t_error(const struct trace_entry *ent)
return te_blk_io_trace(ent)->error;
}
-static __u64 get_pdu_int(const struct trace_entry *ent)
+static __u64 get_pdu_int(const struct trace_entry *ent, bool has_cg)
{
- const __u64 *val = pdu_start(ent);
+ const __u64 *val = pdu_start(ent, has_cg);
return be64_to_cpu(*val);
}
static void get_pdu_remap(const struct trace_entry *ent,
- struct blk_io_trace_remap *r)
+ struct blk_io_trace_remap *r, bool has_cg)
{
- const struct blk_io_trace_remap *__r = pdu_start(ent);
+ const struct blk_io_trace_remap *__r = pdu_start(ent, has_cg);
__u64 sector_from = __r->sector_from;
r->device_from = be32_to_cpu(__r->device_from);
@@ -1113,9 +1182,11 @@ static void get_pdu_remap(const struct trace_entry *ent,
r->sector_from = be64_to_cpu(sector_from);
}
-typedef void (blk_log_action_t) (struct trace_iterator *iter, const char *act);
+typedef void (blk_log_action_t) (struct trace_iterator *iter, const char *act,
+ bool has_cg);
-static void blk_log_action_classic(struct trace_iterator *iter, const char *act)
+static void blk_log_action_classic(struct trace_iterator *iter, const char *act,
+ bool has_cg)
{
char rwbs[RWBS_LEN];
unsigned long long ts = iter->ts;
@@ -1131,24 +1202,33 @@ static void blk_log_action_classic(struct trace_iterator *iter, const char *act)
secs, nsec_rem, iter->ent->pid, act, rwbs);
}
-static void blk_log_action(struct trace_iterator *iter, const char *act)
+static void blk_log_action(struct trace_iterator *iter, const char *act,
+ bool has_cg)
{
char rwbs[RWBS_LEN];
const struct blk_io_trace *t = te_blk_io_trace(iter->ent);
fill_rwbs(rwbs, t);
- trace_seq_printf(&iter->seq, "%3d,%-3d %2s %3s ",
- MAJOR(t->device), MINOR(t->device), act, rwbs);
+ if (has_cg) {
+ const struct kernfs_node_id *id = cgid_start(iter->ent);
+
+ trace_seq_printf(&iter->seq, "%3d,%-3d %8x,%-8x %2s %3s ",
+ MAJOR(t->device), MINOR(t->device),
+ id->ino, id->gen, act, rwbs);
+ } else
+ trace_seq_printf(&iter->seq, "%3d,%-3d %2s %3s ",
+ MAJOR(t->device), MINOR(t->device), act, rwbs);
}
-static void blk_log_dump_pdu(struct trace_seq *s, const struct trace_entry *ent)
+static void blk_log_dump_pdu(struct trace_seq *s,
+ const struct trace_entry *ent, bool has_cg)
{
const unsigned char *pdu_buf;
int pdu_len;
int i, end;
- pdu_buf = pdu_start(ent);
- pdu_len = te_blk_io_trace(ent)->pdu_len;
+ pdu_buf = pdu_start(ent, has_cg);
+ pdu_len = pdu_real_len(ent, has_cg);
if (!pdu_len)
return;
@@ -1179,7 +1259,7 @@ static void blk_log_dump_pdu(struct trace_seq *s, const struct trace_entry *ent)
trace_seq_puts(s, ") ");
}
-static void blk_log_generic(struct trace_seq *s, const struct trace_entry *ent)
+static void blk_log_generic(struct trace_seq *s, const struct trace_entry *ent, bool has_cg)
{
char cmd[TASK_COMM_LEN];
@@ -1187,7 +1267,7 @@ static void blk_log_generic(struct trace_seq *s, const struct trace_entry *ent)
if (t_action(ent) & BLK_TC_ACT(BLK_TC_PC)) {
trace_seq_printf(s, "%u ", t_bytes(ent));
- blk_log_dump_pdu(s, ent);
+ blk_log_dump_pdu(s, ent, has_cg);
trace_seq_printf(s, "[%s]\n", cmd);
} else {
if (t_sec(ent))
@@ -1199,10 +1279,10 @@ static void blk_log_generic(struct trace_seq *s, const struct trace_entry *ent)
}
static void blk_log_with_error(struct trace_seq *s,
- const struct trace_entry *ent)
+ const struct trace_entry *ent, bool has_cg)
{
if (t_action(ent) & BLK_TC_ACT(BLK_TC_PC)) {
- blk_log_dump_pdu(s, ent);
+ blk_log_dump_pdu(s, ent, has_cg);
trace_seq_printf(s, "[%d]\n", t_error(ent));
} else {
if (t_sec(ent))
@@ -1215,18 +1295,18 @@ static void blk_log_with_error(struct trace_seq *s,
}
}
-static void blk_log_remap(struct trace_seq *s, const struct trace_entry *ent)
+static void blk_log_remap(struct trace_seq *s, const struct trace_entry *ent, bool has_cg)
{
struct blk_io_trace_remap r = { .device_from = 0, };
- get_pdu_remap(ent, &r);
+ get_pdu_remap(ent, &r, has_cg);
trace_seq_printf(s, "%llu + %u <- (%d,%d) %llu\n",
t_sector(ent), t_sec(ent),
MAJOR(r.device_from), MINOR(r.device_from),
(unsigned long long)r.sector_from);
}
-static void blk_log_plug(struct trace_seq *s, const struct trace_entry *ent)
+static void blk_log_plug(struct trace_seq *s, const struct trace_entry *ent, bool has_cg)
{
char cmd[TASK_COMM_LEN];
@@ -1235,30 +1315,31 @@ static void blk_log_plug(struct trace_seq *s, const struct trace_entry *ent)
trace_seq_printf(s, "[%s]\n", cmd);
}
-static void blk_log_unplug(struct trace_seq *s, const struct trace_entry *ent)
+static void blk_log_unplug(struct trace_seq *s, const struct trace_entry *ent, bool has_cg)
{
char cmd[TASK_COMM_LEN];
trace_find_cmdline(ent->pid, cmd);
- trace_seq_printf(s, "[%s] %llu\n", cmd, get_pdu_int(ent));
+ trace_seq_printf(s, "[%s] %llu\n", cmd, get_pdu_int(ent, has_cg));
}
-static void blk_log_split(struct trace_seq *s, const struct trace_entry *ent)
+static void blk_log_split(struct trace_seq *s, const struct trace_entry *ent, bool has_cg)
{
char cmd[TASK_COMM_LEN];
trace_find_cmdline(ent->pid, cmd);
trace_seq_printf(s, "%llu / %llu [%s]\n", t_sector(ent),
- get_pdu_int(ent), cmd);
+ get_pdu_int(ent, has_cg), cmd);
}
-static void blk_log_msg(struct trace_seq *s, const struct trace_entry *ent)
+static void blk_log_msg(struct trace_seq *s, const struct trace_entry *ent,
+ bool has_cg)
{
- const struct blk_io_trace *t = te_blk_io_trace(ent);
- trace_seq_putmem(s, t + 1, t->pdu_len);
+ trace_seq_putmem(s, pdu_start(ent, has_cg),
+ pdu_real_len(ent, has_cg));
trace_seq_putc(s, '\n');
}
@@ -1298,7 +1379,8 @@ static void blk_tracer_reset(struct trace_array *tr)
static const struct {
const char *act[2];
- void (*print)(struct trace_seq *s, const struct trace_entry *ent);
+ void (*print)(struct trace_seq *s, const struct trace_entry *ent,
+ bool has_cg);
} what2act[] = {
[__BLK_TA_QUEUE] = {{ "Q", "queue" }, blk_log_generic },
[__BLK_TA_BACKMERGE] = {{ "M", "backmerge" }, blk_log_generic },
@@ -1326,23 +1408,25 @@ static enum print_line_t print_one_line(struct trace_iterator *iter,
u16 what;
bool long_act;
blk_log_action_t *log_action;
+ bool has_cg;
t = te_blk_io_trace(iter->ent);
- what = t->action & ((1 << BLK_TC_SHIFT) - 1);
+ what = (t->action & ((1 << BLK_TC_SHIFT) - 1)) & ~__BLK_TA_CGROUP;
long_act = !!(tr->trace_flags & TRACE_ITER_VERBOSE);
log_action = classic ? &blk_log_action_classic : &blk_log_action;
+ has_cg = t->action & __BLK_TA_CGROUP;
- if (t->action == BLK_TN_MESSAGE) {
- log_action(iter, long_act ? "message" : "m");
- blk_log_msg(s, iter->ent);
+ if ((t->action & ~__BLK_TN_CGROUP) == BLK_TN_MESSAGE) {
+ log_action(iter, long_act ? "message" : "m", has_cg);
+ blk_log_msg(s, iter->ent, has_cg);
return trace_handle_return(s);
}
if (unlikely(what == 0 || what >= ARRAY_SIZE(what2act)))
trace_seq_printf(s, "Unknown action %x\n", what);
else {
- log_action(iter, what2act[what].act[long_act]);
- what2act[what].print(s, iter->ent);
+ log_action(iter, what2act[what].act[long_act], has_cg);
+ what2act[what].print(s, iter->ent, has_cg);
}
return trace_handle_return(s);
--
2.9.3
From: Shaohua Li <[email protected]>
Now we have the facilities to implement exportfs operations. The idea is
that cgroup can export the fhandle info to userspace, and userspace then
uses the fhandle to find the cgroup name. Another example is that userspace
can get the fhandle for a cgroup and BPF uses the fhandle to filter info
for that cgroup.
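For the first use case, a userspace tool could turn an fhandle back into a cgroup path roughly as sketched below. The helper name cgroup_path_from_fhandle() is hypothetical, @mount_fd is assumed to be an open descriptor on the cgroup mount, and open_by_handle_at() requires CAP_DAC_READ_SEARCH.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* open_by_handle_at(), struct file_handle */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of this series): resolve @fh to a path
 * by opening it relative to @mount_fd and reading the /proc/self/fd
 * symlink. Returns 0 on success, -1 on error.
 */
int cgroup_path_from_fhandle(int mount_fd, struct file_handle *fh,
			     char *buf, size_t buflen)
{
	char link[64];
	ssize_t n;
	int fd;

	assert(fh && buf && buflen > 0);

	fd = open_by_handle_at(mount_fd, fh, O_RDONLY | O_DIRECTORY);
	if (fd < 0)
		return -1;

	snprintf(link, sizeof(link), "/proc/self/fd/%d", fd);
	n = readlink(link, buf, buflen - 1);	/* leave room for the NUL */
	close(fd);
	if (n < 0)
		return -1;

	buf[n] = '\0';
	return 0;
}
```

The returned path is whatever /proc reports for the open directory, so it is relative to the caller's view of the cgroup mount.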
Signed-off-by: Shaohua Li <[email protected]>
---
arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 2 +-
fs/kernfs/mount.c | 65 ++++++++++++++++++++++++++++++--
fs/sysfs/mount.c | 2 +-
include/linux/kernfs.h | 9 +++--
kernel/cgroup/cgroup.c | 3 +-
5 files changed, 71 insertions(+), 10 deletions(-)
diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
index f5af0cc..fee2126 100644
--- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
+++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
@@ -854,7 +854,7 @@ static struct dentry *rdt_mount(struct file_system_type *fs_type,
}
dentry = kernfs_mount(fs_type, flags, rdt_root,
- RDTGROUP_SUPER_MAGIC, NULL);
+ RDTGROUP_SUPER_MAGIC, NULL, false);
if (IS_ERR(dentry))
goto out_cdp;
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 462a40c..11c5aba 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -16,6 +16,7 @@
#include <linux/pagemap.h>
#include <linux/namei.h>
#include <linux/seq_file.h>
+#include <linux/exportfs.h>
#include "kernfs-internal.h"
@@ -64,6 +65,59 @@ const struct super_operations kernfs_sops = {
.show_path = kernfs_sop_show_path,
};
+static struct inode *kernfs_fh_get_inode(struct super_block *sb,
+ u64 ino, u32 generation)
+{
+ struct kernfs_super_info *info = kernfs_info(sb);
+ struct inode *inode;
+ struct kernfs_node *kn;
+
+ if (ino == 0)
+ return ERR_PTR(-ESTALE);
+
+ kn = kernfs_get_node_by_ino(info->root, ino);
+ if (!kn)
+ return ERR_PTR(-ESTALE);
+ inode = kernfs_get_inode(sb, kn);
+ kernfs_put(kn);
+ if (IS_ERR(inode))
+ return ERR_CAST(inode);
+
+ if (generation && inode->i_generation != generation) {
+ /* we didn't find the right inode.. */
+ iput(inode);
+ return ERR_PTR(-ESTALE);
+ }
+ return inode;
+}
+
+static struct dentry *kernfs_fh_to_dentry(struct super_block *sb, struct fid *fid,
+ int fh_len, int fh_type)
+{
+ return generic_fh_to_dentry(sb, fid, fh_len, fh_type,
+ kernfs_fh_get_inode);
+}
+
+static struct dentry *kernfs_fh_to_parent(struct super_block *sb, struct fid *fid,
+ int fh_len, int fh_type)
+{
+ return generic_fh_to_parent(sb, fid, fh_len, fh_type,
+ kernfs_fh_get_inode);
+}
+
+static struct dentry *kernfs_get_parent_dentry(struct dentry *child)
+{
+ struct kernfs_node *kn = kernfs_dentry_node(child);
+
+ return d_obtain_alias(kernfs_get_inode(child->d_sb, kn->parent));
+}
+
+static const struct export_operations kernfs_export_ops = {
+ .fh_to_dentry = kernfs_fh_to_dentry,
+ .fh_to_parent = kernfs_fh_to_parent,
+ .get_parent = kernfs_get_parent_dentry,
+};
+
/**
* kernfs_root_from_sb - determine kernfs_root associated with a super_block
* @sb: the super_block in question
@@ -145,7 +199,8 @@ struct dentry *kernfs_node_dentry(struct kernfs_node *kn,
} while (true);
}
-static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
+static int kernfs_fill_super(struct super_block *sb, unsigned long magic,
+ bool enable_expop)
{
struct kernfs_super_info *info = kernfs_info(sb);
struct inode *inode;
@@ -159,6 +214,8 @@ static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
sb->s_magic = magic;
sb->s_op = &kernfs_sops;
sb->s_xattr = kernfs_xattr_handlers;
+ if (enable_expop)
+ sb->s_export_op = &kernfs_export_ops;
sb->s_time_gran = 1;
/* get root inode, initialize and unlock it */
@@ -219,6 +276,7 @@ const void *kernfs_super_ns(struct super_block *sb)
* @magic: file system specific magic number
* @new_sb_created: tell the caller if we allocated a new superblock
* @ns: optional namespace tag of the mount
+ * @enable_expop: whether to enable fhandle (exportfs) support
*
* This is to be called from each kernfs user's file_system_type->mount()
* implementation, which should pass through the specified @fs_type and
@@ -229,7 +287,8 @@ const void *kernfs_super_ns(struct super_block *sb)
*/
struct dentry *kernfs_mount_ns(struct file_system_type *fs_type, int flags,
struct kernfs_root *root, unsigned long magic,
- bool *new_sb_created, const void *ns)
+ bool *new_sb_created, const void *ns,
+ bool enable_expop)
{
struct super_block *sb;
struct kernfs_super_info *info;
@@ -255,7 +314,7 @@ struct dentry *kernfs_mount_ns(struct file_system_type *fs_type, int flags,
if (!sb->s_root) {
struct kernfs_super_info *info = kernfs_info(sb);
- error = kernfs_fill_super(sb, magic);
+ error = kernfs_fill_super(sb, magic, enable_expop);
if (error) {
deactivate_locked_super(sb);
return ERR_PTR(error);
diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
index 20b8f82..d1a3336b 100644
--- a/fs/sysfs/mount.c
+++ b/fs/sysfs/mount.c
@@ -37,7 +37,7 @@ static struct dentry *sysfs_mount(struct file_system_type *fs_type,
ns = kobj_ns_grab_current(KOBJ_NS_TYPE_NET);
root = kernfs_mount_ns(fs_type, flags, sysfs_root,
- SYSFS_MAGIC, &new_sb, ns);
+ SYSFS_MAGIC, &new_sb, ns, false);
if (IS_ERR(root) || !new_sb)
kobj_ns_drop(KOBJ_NS_TYPE_NET, ns);
else if (new_sb)
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 61668d1..15c805f 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -332,7 +332,8 @@ void kernfs_notify(struct kernfs_node *kn);
const void *kernfs_super_ns(struct super_block *sb);
struct dentry *kernfs_mount_ns(struct file_system_type *fs_type, int flags,
struct kernfs_root *root, unsigned long magic,
- bool *new_sb_created, const void *ns);
+ bool *new_sb_created, const void *ns,
+ bool enable_expop);
void kernfs_kill_sb(struct super_block *sb);
struct super_block *kernfs_pin_sb(struct kernfs_root *root, const void *ns);
@@ -435,7 +436,7 @@ static inline const void *kernfs_super_ns(struct super_block *sb)
static inline struct dentry *
kernfs_mount_ns(struct file_system_type *fs_type, int flags,
struct kernfs_root *root, unsigned long magic,
- bool *new_sb_created, const void *ns)
+ bool *new_sb_created, const void *ns, bool enable_expop)
{ return ERR_PTR(-ENOSYS); }
static inline void kernfs_kill_sb(struct super_block *sb) { }
@@ -516,10 +517,10 @@ static inline int kernfs_rename(struct kernfs_node *kn,
static inline struct dentry *
kernfs_mount(struct file_system_type *fs_type, int flags,
struct kernfs_root *root, unsigned long magic,
- bool *new_sb_created)
+ bool *new_sb_created, bool enable_expop)
{
return kernfs_mount_ns(fs_type, flags, root,
- magic, new_sb_created, NULL);
+ magic, new_sb_created, NULL, enable_expop);
}
#endif /* __LINUX_KERNFS_H */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index c3c9a0e..206d8df 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1749,7 +1749,8 @@ struct dentry *cgroup_do_mount(struct file_system_type *fs_type, int flags,
struct dentry *dentry;
bool new_sb;
- dentry = kernfs_mount(fs_type, flags, root->kf_root, magic, &new_sb);
+ dentry = kernfs_mount(fs_type, flags, root->kf_root, magic, &new_sb,
+ true);
/*
* In non-init cgroup namespace, instead of root cgroup's dentry,
--
2.9.3
From: Shaohua Li <[email protected]>
bio_free isn't a good place to free cgroup/integrity info. There are many
cases where a bio is allocated in a special way (for example, on the stack)
and is never passed to bio_put, so bio_free is never called and we leak
memory. This patch moves the free to bio_endio, which should be called
anyway. The __bio_free call in bio_free is kept, in case bio_endio is never
called for the bio.
Signed-off-by: Shaohua Li <[email protected]>
---
block/bio-integrity.c | 1 +
block/bio.c | 1 +
2 files changed, 2 insertions(+)
diff --git a/block/bio-integrity.c b/block/bio-integrity.c
index 5384713..c6c5aed 100644
--- a/block/bio-integrity.c
+++ b/block/bio-integrity.c
@@ -120,6 +120,7 @@ void bio_integrity_free(struct bio *bio)
}
bio->bi_integrity = NULL;
+ bio->bi_opf &= ~REQ_INTEGRITY;
}
EXPORT_SYMBOL(bio_integrity_free);
diff --git a/block/bio.c b/block/bio.c
index 888e780..02556b3 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1803,6 +1803,7 @@ void bio_endio(struct bio *bio)
if (!bio_remaining_done(bio))
return;
+ __bio_free(bio);
/*
* Need to have a real endio function for chained bios, otherwise
* various corner cases will break (like stacking block devices that
--
2.9.3
From: Shaohua Li <[email protected]>
blkcg_bio_issue_check() already gets the blkcg for a BIO.
bio_associate_blkcg() uses a percpu refcounter, so it's a very cheap
operation. There is no reason not to attach the cgroup info to the bio in
blkcg_bio_issue_check(). This also makes blktrace output the correct cgroup
info.
Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-throttle.c | 7 +------
include/linux/blk-cgroup.h | 2 ++
2 files changed, 3 insertions(+), 6 deletions(-)
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index b78db2e..53d3e3d 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -2033,14 +2033,9 @@ static inline void throtl_update_latency_buckets(struct throtl_data *td)
static void blk_throtl_assoc_bio(struct throtl_grp *tg, struct bio *bio)
{
#ifdef CONFIG_BLK_DEV_THROTTLING_LOW
- int ret;
-
- ret = bio_associate_current(bio);
- if (ret == 0 || ret == -EBUSY)
+ if (bio->bi_css)
bio->bi_cg_private = tg;
blk_stat_set_issue(&bio->bi_issue_stat, bio_sectors(bio));
-#else
- bio_associate_current(bio);
#endif
}
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 01b62e7..d176247 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -691,6 +691,8 @@ static inline bool blkcg_bio_issue_check(struct request_queue *q,
rcu_read_lock();
blkcg = bio_blkcg(bio);
+ bio_associate_blkcg(bio, &blkcg->css);
+
blkg = blkg_lookup(blkcg, q);
if (unlikely(!blkg)) {
spin_lock_irq(q->queue_lock);
--
2.9.3
From: Shaohua Li <[email protected]>
Add an API to get a kernfs node from an inode number. We will need this to
implement exportfs operations.
To make the API lock free, the kernfs node is freed in RCU context, and we
depend on the kernfs_node count/ino number to filter out stale kernfs nodes.
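The count/ino filtering can be pictured with the userspace sketch below, which uses C11 atomics in place of the kernel primitives. It covers only the "take a reference unless the count is zero, then recheck the key" step. The RCU read-side section and the idr lookup are omitted, and the names struct node, get_unless_zero() and node_get_if_ino() are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the two kernfs_node fields that matter here. */
struct node {
	atomic_int count;	/* 0 means the node is on its way to being freed */
	unsigned int ino;	/* key the caller looked the node up by */
};

/* Take a reference only if the count has not already dropped to zero. */
static int get_unless_zero(struct node *n)
{
	int old;

	assert(n != NULL);
	old = atomic_load(&n->count);
	while (old != 0) {
		/* On failure, old is reloaded with the current count. */
		if (atomic_compare_exchange_weak(&n->count, &old, old + 1))
			return 1;
	}
	return 0;
}

/*
 * Return @n with an extra reference if it is alive and still carries
 * @ino. Return NULL if it is dying or its memory was reused for a node
 * with a different inode number.
 */
struct node *node_get_if_ino(struct node *n, unsigned int ino)
{
	if (!n || !get_unless_zero(n))
		return NULL;
	if (n->ino != ino) {
		atomic_fetch_sub(&n->count, 1);	/* drop the reference again */
		return NULL;
	}
	return n;
}
```

The recheck after taking the reference is what makes type-stable, RCU-freed memory safe to probe: a reused slot is alive, but its key no longer matches.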
Signed-off-by: Shaohua Li <[email protected]>
---
fs/kernfs/dir.c | 35 +++++++++++++++++++++++++++++++++++
fs/kernfs/kernfs-internal.h | 2 ++
fs/kernfs/mount.c | 4 +++-
3 files changed, 40 insertions(+), 1 deletion(-)
diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 8e8545a..4c86e4c 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -643,6 +643,7 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
kn->ino = ret;
kn->generation = atomic_inc_return(&root->next_generation);
+ /* set ino first. Above atomic_inc_return has a barrier */
atomic_set(&kn->count, 1);
atomic_set(&kn->active, KN_DEACTIVATED_BIAS);
RB_CLEAR_NODE(&kn->rb);
@@ -674,6 +675,40 @@ struct kernfs_node *kernfs_new_node(struct kernfs_node *parent,
return kn;
}
+/*
+ * kernfs_get_node_by_ino - get kernfs_node from inode number
+ * @root: the kernfs root
+ * @ino: inode number
+ *
+ * RETURNS:
+ * NULL on failure. Return a kernfs node with reference counter incremented
+ */
+struct kernfs_node *kernfs_get_node_by_ino(struct kernfs_root *root,
+ unsigned int ino)
+{
+ struct kernfs_node *kn;
+
+ rcu_read_lock();
+ kn = idr_find(&root->ino_idr, ino);
+ if (!kn)
+ goto out;
+ /* kernfs_put removes the ino after count is 0 */
+ if (!atomic_inc_not_zero(&kn->count)) {
+ kn = NULL;
+ goto out;
+ }
+ /* If this node is reused, __kernfs_new_node sets ino before count */
+ if (kn->ino != ino)
+ goto out;
+ rcu_read_unlock();
+
+ return kn;
+out:
+ rcu_read_unlock();
+ kernfs_put(kn);
+ return NULL;
+}
+
/**
* kernfs_add_one - add kernfs_node to parent without warning
* @kn: kernfs_node to be added
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index 2d5144a..3534cfe 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -98,6 +98,8 @@ int kernfs_add_one(struct kernfs_node *kn);
struct kernfs_node *kernfs_new_node(struct kernfs_node *parent,
const char *name, umode_t mode,
unsigned flags);
+struct kernfs_node *kernfs_get_node_by_ino(struct kernfs_root *root,
+ unsigned int ino);
/*
* file.c
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index d5b149a..343dfeb 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -332,5 +332,7 @@ void __init kernfs_init(void)
{
kernfs_node_cache = kmem_cache_create("kernfs_node_cache",
sizeof(struct kernfs_node),
- 0, SLAB_PANIC, NULL);
+ 0,
+ SLAB_PANIC | SLAB_TYPESAFE_BY_RCU,
+ NULL);
}
--
2.9.3
From: Shaohua Li <[email protected]>
Set i_generation for kernfs inode. This is required to implement exportfs
operations.
Note, the generation is 32-bit, so it's possible the generation wraps around
and we find stale files. The possibility is low, since fhandle matches
both inode number and generation. In most fs, the generation is 32-bit.
fhandle only exports a 32-bit generation for most fs. So unless we have
a solid reason, we'd live with the possible conflict.
Signed-off-by: Shaohua Li <[email protected]>
---
fs/kernfs/dir.c | 2 ++
fs/kernfs/inode.c | 1 +
include/linux/kernfs.h | 2 ++
3 files changed, 5 insertions(+)
diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index db5900aaa..09d093e 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -634,6 +634,7 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
if (ret < 0)
goto err_out2;
kn->ino = ret;
+ kn->generation = atomic_inc_return(&root->next_generation);
atomic_set(&kn->count, 1);
atomic_set(&kn->active, KN_DEACTIVATED_BIAS);
@@ -877,6 +878,7 @@ struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
ida_init(&root->ino_ida);
INIT_LIST_HEAD(&root->supers);
+ atomic_set(&root->next_generation, 0);
kn = __kernfs_new_node(root, "", S_IFDIR | S_IRUGO | S_IXUGO,
KERNFS_DIR);
diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index fb4b4a7..79cdae4 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -220,6 +220,7 @@ static void kernfs_init_inode(struct kernfs_node *kn, struct inode *inode)
inode->i_private = kn;
inode->i_mapping->a_ops = &kernfs_aops;
inode->i_op = &kernfs_iops;
+ inode->i_generation = kn->generation;
set_default_inode_attr(inode, kn->mode);
kernfs_refresh_inode(kn, inode);
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index a9b11b8..c5f0fa7 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -135,6 +135,7 @@ struct kernfs_node {
umode_t mode;
unsigned int ino;
struct kernfs_iattrs *iattr;
+ u32 generation;
};
/*
@@ -170,6 +171,7 @@ struct kernfs_root {
struct list_head supers;
wait_queue_head_t deactivate_waitq;
+ atomic_t next_generation;
};
struct kernfs_open_file {
--
2.9.3
On Fri, Jun 02, 2017 at 02:53:56PM -0700, Shaohua Li wrote:
> From: Shaohua Li <[email protected]>
>
> Add an API to get kernfs node from inode number. We will need this to
> implement exportfs operations.
>
> To make the API lock free, kernfs node is freed in RCU context. And we
> depend on kernfs_node count/ino number to filter stale kernfs nodes.
>
> Signed-off-by: Shaohua Li <[email protected]>
> ---
> fs/kernfs/dir.c | 35 +++++++++++++++++++++++++++++++++++
> fs/kernfs/kernfs-internal.h | 2 ++
> fs/kernfs/mount.c | 4 +++-
> 3 files changed, 40 insertions(+), 1 deletion(-)
>
> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> index 8e8545a..4c86e4c 100644
> --- a/fs/kernfs/dir.c
> +++ b/fs/kernfs/dir.c
> @@ -643,6 +643,7 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
> kn->ino = ret;
> kn->generation = atomic_inc_return(&root->next_generation);
>
> + /* set ino first. Above atomic_inc_return has a barrier */
> atomic_set(&kn->count, 1);
> atomic_set(&kn->active, KN_DEACTIVATED_BIAS);
> RB_CLEAR_NODE(&kn->rb);
> @@ -674,6 +675,40 @@ struct kernfs_node *kernfs_new_node(struct kernfs_node *parent,
> return kn;
> }
>
> +/*
> + * kernfs_get_node_by_ino - get kernfs_node from inode number
> + * @root: the kernfs root
> + * @ino: inode number
> + *
> + * RETURNS:
> + * NULL on failure. Return a kernfs node with reference counter incremented
> + */
Is the above supposed to be a valid kernel doc entry?
> +struct kernfs_node *kernfs_get_node_by_ino(struct kernfs_root *root,
> + unsigned int ino)
> +{
> + struct kernfs_node *kn;
> +
> + rcu_read_lock();
> + kn = idr_find(&root->ino_idr, ino);
> + if (!kn)
> + goto out;
> + /* kernfs_put removes the ino after count is 0 */
> + if (!atomic_inc_not_zero(&kn->count)) {
> + kn = NULL;
Why do you need to set kn to NULL?
> + goto out;
> + }
> + /* If this node is reused, __kernfs_new_node sets ino before count */
> + if (kn->ino != ino)
> + goto out;
> + rcu_read_unlock();
> +
> + return kn;
> +out:
> + rcu_read_unlock();
> + kernfs_put(kn);
> + return NULL;
> +}
> +
> /**
> * kernfs_add_one - add kernfs_node to parent without warning
> * @kn: kernfs_node to be added
> diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
> index 2d5144a..3534cfe 100644
> --- a/fs/kernfs/kernfs-internal.h
> +++ b/fs/kernfs/kernfs-internal.h
> @@ -98,6 +98,8 @@ int kernfs_add_one(struct kernfs_node *kn);
> struct kernfs_node *kernfs_new_node(struct kernfs_node *parent,
> const char *name, umode_t mode,
> unsigned flags);
> +struct kernfs_node *kernfs_get_node_by_ino(struct kernfs_root *root,
> + unsigned int ino);
>
> /*
> * file.c
> diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
> index d5b149a..343dfeb 100644
> --- a/fs/kernfs/mount.c
> +++ b/fs/kernfs/mount.c
> @@ -332,5 +332,7 @@ void __init kernfs_init(void)
> {
> kernfs_node_cache = kmem_cache_create("kernfs_node_cache",
> sizeof(struct kernfs_node),
> - 0, SLAB_PANIC, NULL);
> + 0,
> + SLAB_PANIC | SLAB_TYPESAFE_BY_RCU,
> + NULL);
> }
> --
> 2.9.3
>
>
--
All the best,
Eduardo Valentin
On Fri, Jun 02, 2017 at 03:03:45PM -0700, Eduardo Valentin wrote:
> On Fri, Jun 02, 2017 at 02:53:56PM -0700, Shaohua Li wrote:
> > From: Shaohua Li <[email protected]>
> >
> > Add an API to get kernfs node from inode number. We will need this to
> > implement exportfs operations.
> >
> > To make the API lock free, kernfs node is freed in RCU context. And we
> > depend on kernfs_node count/ino number to filter stale kernfs nodes.
> >
> > Signed-off-by: Shaohua Li <[email protected]>
> > ---
> > fs/kernfs/dir.c | 35 +++++++++++++++++++++++++++++++++++
> > fs/kernfs/kernfs-internal.h | 2 ++
> > fs/kernfs/mount.c | 4 +++-
> > 3 files changed, 40 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> > index 8e8545a..4c86e4c 100644
> > --- a/fs/kernfs/dir.c
> > +++ b/fs/kernfs/dir.c
> > @@ -643,6 +643,7 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
> > kn->ino = ret;
> > kn->generation = atomic_inc_return(&root->next_generation);
> >
> > + /* set ino first. Above atomic_inc_return has a barrier */
> > atomic_set(&kn->count, 1);
> > atomic_set(&kn->active, KN_DEACTIVATED_BIAS);
> > RB_CLEAR_NODE(&kn->rb);
> > @@ -674,6 +675,40 @@ struct kernfs_node *kernfs_new_node(struct kernfs_node *parent,
> > return kn;
> > }
> >
> > +/*
> > + * kernfs_get_node_by_ino - get kernfs_node from inode number
> > + * @root: the kernfs root
> > + * @ino: inode number
> > + *
> > + * RETURNS:
> > + * NULL on failure. Return a kernfs node with reference counter incremented
> > + */
>
> Is the above supposed to be a valid kernel doc entry?
What do you expect? The function name explains it very well actually.
> > +struct kernfs_node *kernfs_get_node_by_ino(struct kernfs_root *root,
> > + unsigned int ino)
> > +{
> > + struct kernfs_node *kn;
> > +
> > + rcu_read_lock();
> > + kn = idr_find(&root->ino_idr, ino);
> > + if (!kn)
> > + goto out;
> > + /* kernfs_put removes the ino after count is 0 */
> > + if (!atomic_inc_not_zero(&kn->count)) {
> > + kn = NULL;
>
> Why do you need to set kn to NULL?
I don't know what kind of explanation you expect. This is quite obvious
actually. If the count == 0, we don't increase the ref count, so we don't
decrease the ref count later (in kernfs_put).
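
To make the point concrete, here is a minimal userspace sketch of that error
path. It is not the kernfs code: struct node, node_tryget(), node_put() and
lookup() are hypothetical stand-ins, and C11 atomics are assumed in place of
the kernel's atomic_t helpers. The only property it relies on is that the put
helper treats NULL as a no-op, as kernfs_put() does.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct kernfs_node; only the refcount matters. */
struct node {
	atomic_int count;
};

/* Model of atomic_inc_not_zero(): take a reference unless count is 0. */
static bool node_tryget(struct node *n)
{
	int old = atomic_load(&n->count);

	while (old != 0) {
		/* A failed CAS reloads 'old', so the loop re-checks for 0. */
		if (atomic_compare_exchange_weak(&n->count, &old, old + 1))
			return true;
	}
	return false;
}

/* Model of kernfs_put(): a NULL argument is a no-op. */
static void node_put(struct node *n)
{
	if (!n)
		return;
	atomic_fetch_sub(&n->count, 1);
}

/* Same shape as the lookup's exit paths; 'still_wanted' models the ino check. */
static struct node *lookup(struct node *candidate, bool still_wanted)
{
	struct node *n = candidate;

	if (!n)
		goto out;
	if (!node_tryget(n)) {
		n = NULL;	/* no reference taken, so 'out' must not drop one */
		goto out;
	}
	if (!still_wanted)
		goto out;	/* reference taken, 'out' drops it again */
	return n;
out:
	node_put(n);
	return NULL;
}
```

Without the NULL assignment, the shared exit label would decrement a count
that was never incremented.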
> > + goto out;
> > + }
> > + /* If this node is reused, __kernfs_new_node sets ino before count */
> > + if (kn->ino != ino)
> > + goto out;
> > + rcu_read_unlock();
> > +
> > + return kn;
> > +out:
> > + rcu_read_unlock();
> > + kernfs_put(kn);
> > + return NULL;
> > +}
> > +
> > /**
> > * kernfs_add_one - add kernfs_node to parent without warning
> > * @kn: kernfs_node to be added
> > diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
> > index 2d5144a..3534cfe 100644
> > --- a/fs/kernfs/kernfs-internal.h
> > +++ b/fs/kernfs/kernfs-internal.h
> > @@ -98,6 +98,8 @@ int kernfs_add_one(struct kernfs_node *kn);
> > struct kernfs_node *kernfs_new_node(struct kernfs_node *parent,
> > const char *name, umode_t mode,
> > unsigned flags);
> > +struct kernfs_node *kernfs_get_node_by_ino(struct kernfs_root *root,
> > + unsigned int ino);
> >
> > /*
> > * file.c
> > diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
> > index d5b149a..343dfeb 100644
> > --- a/fs/kernfs/mount.c
> > +++ b/fs/kernfs/mount.c
> > @@ -332,5 +332,7 @@ void __init kernfs_init(void)
> > {
> > kernfs_node_cache = kmem_cache_create("kernfs_node_cache",
> > sizeof(struct kernfs_node),
> > - 0, SLAB_PANIC, NULL);
> > + 0,
> > + SLAB_PANIC | SLAB_TYPESAFE_BY_RCU,
> > + NULL);
> > }
> > --
> > 2.9.3
> >
> >
>
> --
> All the best,
> Eduardo Valentin
Hi Shaohua,
[auto build test ERROR on driver-core/driver-core-testing]
[also build test ERROR on v4.12-rc3 next-20170602]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Shaohua-Li/kernfs-implement-i_generation/20170603-083132
config: x86_64-randconfig-x010-201722 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64
All errors (new ones prefixed by >>):
kernel/trace/blktrace.c: In function '__trace_note_message':
>> kernel/trace/blktrace.c:185:35: error: 'struct blkcg' has no member named 'css'
blkcg ? cgroup_get_node_id(blkcg->css.cgroup) : NULL);
^~
vim +185 kernel/trace/blktrace.c
179 n = vscnprintf(buf, BLK_TN_MAX_MSG, fmt, args);
180 va_end(args);
181
182 if (!(blk_tracer_flags.val & TRACE_BLK_OPT_CGROUP))
183 blkcg = NULL;
184 trace_note(bt, 0, BLK_TN_MESSAGE, buf, n,
> 185 blkcg ? cgroup_get_node_id(blkcg->css.cgroup) : NULL);
186 local_irq_restore(flags);
187 }
188 EXPORT_SYMBOL_GPL(__trace_note_message);
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
Hi Shaohua,
[auto build test ERROR on driver-core/driver-core-testing]
[also build test ERROR on v4.12-rc3 next-20170602]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Shaohua-Li/kernfs-implement-i_generation/20170603-083132
config: x86_64-randconfig-x015-201722 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64
All error/warnings (new ones prefixed by >>):
block/blk-throttle.c: In function 'throtl_schedule_pending_timer':
>> block/blk-throttle.c:355:3: error: implicit declaration of function 'blk_add_cgroup_trace_msg' [-Werror=implicit-function-declaration]
blk_add_cgroup_trace_msg(__td->queue, \
^
>> block/blk-throttle.c:697:2: note: in expansion of macro 'throtl_log'
throtl_log(sq, "schedule timer. delay=%lu jiffies=%lu",
^~~~~~~~~~
cc1: some warnings being treated as errors
vim +/blk_add_cgroup_trace_msg +355 block/blk-throttle.c
349 struct throtl_data *__td = sq_to_td((sq)); \
350 \
351 (void)__td; \
352 if (likely(!blk_trace_note_message_enabled(__td->queue))) \
353 break; \
354 if ((__tg)) { \
> 355 blk_add_cgroup_trace_msg(__td->queue, \
356 tg_to_blkg(__tg)->blkcg, "throtl " fmt, ##args);\
357 } else { \
358 blk_add_trace_msg(__td->queue, "throtl " fmt, ##args); \
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
Hi Shaohua,
[auto build test WARNING on driver-core/driver-core-testing]
[also build test WARNING on v4.12-rc3 next-20170602]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/Shaohua-Li/kernfs-implement-i_generation/20170603-083132
reproduce: make htmldocs
All warnings (new ones prefixed by >>):
WARNING: convert(1) not found, for SVG to PDF conversion install ImageMagick (https://www.imagemagick.org)
arch/x86/include/asm/uaccess_32.h:1: warning: no structured comments found
>> kernel/trace/blktrace.c:750: warning: No description found for parameter 'cgid'
kernel/trace/blktrace.c:807: warning: No description found for parameter 'cgid'
include/linux/init.h:1: warning: no structured comments found
include/linux/mod_devicetable.h:686: warning: Excess struct/union/enum/typedef member 'ver_major' description in 'fsl_mc_device_id'
include/linux/mod_devicetable.h:686: warning: Excess struct/union/enum/typedef member 'ver_minor' description in 'fsl_mc_device_id'
kernel/sched/core.c:2088: warning: No description found for parameter 'rf'
kernel/sched/core.c:2088: warning: Excess function parameter 'cookie' description in 'try_to_wake_up_local'
include/linux/kthread.h:26: warning: Excess function parameter '...' description in 'kthread_create'
kernel/sys.c:1: warning: no structured comments found
include/linux/device.h:969: warning: No description found for parameter 'dma_ops'
drivers/dma-buf/seqno-fence.c:1: warning: no structured comments found
include/linux/iio/iio.h:597: warning: No description found for parameter 'trig_readonly'
include/linux/iio/trigger.h:151: warning: No description found for parameter 'indio_dev'
include/linux/iio/trigger.h:151: warning: No description found for parameter 'trig'
include/linux/device.h:970: warning: No description found for parameter 'dma_ops'
include/linux/usb/gadget.h:230: warning: No description found for parameter 'claimed'
include/linux/usb/gadget.h:230: warning: No description found for parameter 'enabled'
include/linux/usb/gadget.h:408: warning: No description found for parameter 'quirk_altset_not_supp'
include/linux/usb/gadget.h:408: warning: No description found for parameter 'quirk_stall_not_supp'
include/linux/usb/gadget.h:408: warning: No description found for parameter 'quirk_zlp_not_supp'
include/drm/drm_drv.h:524: warning: No description found for parameter 'set_busid'
include/drm/drm_drv.h:524: warning: No description found for parameter 'irq_handler'
include/drm/drm_drv.h:524: warning: No description found for parameter 'irq_preinstall'
include/drm/drm_drv.h:524: warning: No description found for parameter 'irq_postinstall'
include/drm/drm_drv.h:524: warning: No description found for parameter 'irq_uninstall'
include/drm/drm_drv.h:524: warning: No description found for parameter 'debugfs_init'
include/drm/drm_drv.h:524: warning: No description found for parameter 'gem_open_object'
include/drm/drm_drv.h:524: warning: No description found for parameter 'gem_close_object'
include/drm/drm_drv.h:524: warning: No description found for parameter 'prime_handle_to_fd'
include/drm/drm_drv.h:524: warning: No description found for parameter 'prime_fd_to_handle'
include/drm/drm_drv.h:524: warning: No description found for parameter 'gem_prime_export'
include/drm/drm_drv.h:524: warning: No description found for parameter 'gem_prime_import'
include/drm/drm_drv.h:524: warning: No description found for parameter 'gem_prime_pin'
include/drm/drm_drv.h:524: warning: No description found for parameter 'gem_prime_unpin'
include/drm/drm_drv.h:524: warning: No description found for parameter 'gem_prime_res_obj'
include/drm/drm_drv.h:524: warning: No description found for parameter 'gem_prime_get_sg_table'
include/drm/drm_drv.h:524: warning: No description found for parameter 'gem_prime_import_sg_table'
include/drm/drm_drv.h:524: warning: No description found for parameter 'gem_prime_vmap'
include/drm/drm_drv.h:524: warning: No description found for parameter 'gem_prime_vunmap'
include/drm/drm_drv.h:524: warning: No description found for parameter 'gem_prime_mmap'
include/drm/drm_drv.h:524: warning: No description found for parameter 'gem_vm_ops'
include/drm/drm_drv.h:524: warning: No description found for parameter 'major'
include/drm/drm_drv.h:524: warning: No description found for parameter 'minor'
include/drm/drm_drv.h:524: warning: No description found for parameter 'patchlevel'
include/drm/drm_drv.h:524: warning: No description found for parameter 'name'
include/drm/drm_drv.h:524: warning: No description found for parameter 'desc'
include/drm/drm_drv.h:524: warning: No description found for parameter 'date'
include/drm/drm_drv.h:524: warning: No description found for parameter 'driver_features'
include/drm/drm_drv.h:524: warning: No description found for parameter 'ioctls'
include/drm/drm_drv.h:524: warning: No description found for parameter 'num_ioctls'
include/drm/drm_drv.h:524: warning: No description found for parameter 'fops'
include/drm/drm_color_mgmt.h:1: warning: no structured comments found
drivers/gpu/drm/drm_plane_helper.c:403: warning: No description found for parameter 'ctx'
drivers/gpu/drm/drm_plane_helper.c:404: warning: No description found for parameter 'ctx'
drivers/gpu/drm/i915/intel_lpe_audio.c:355: warning: No description found for parameter 'dp_output'
drivers/gpu/drm/i915/intel_lpe_audio.c:355: warning: No description found for parameter 'link_rate'
drivers/gpu/drm/i915/intel_lpe_audio.c:356: warning: No description found for parameter 'dp_output'
drivers/gpu/drm/i915/intel_lpe_audio.c:356: warning: No description found for parameter 'link_rate'
Documentation/core-api/assoc_array.rst:13: WARNING: Enumerated list ends without a blank line; unexpected unindent.
Documentation/doc-guide/sphinx.rst:126: ERROR: Unknown target name: "sphinx c domain".
kernel/sched/fair.c:7650: WARNING: Inline emphasis start-string without end-string.
kernel/time/timer.c:1200: ERROR: Unexpected indentation.
kernel/time/timer.c:1202: ERROR: Unexpected indentation.
kernel/time/timer.c:1203: WARNING: Block quote ends without a blank line; unexpected unindent.
include/linux/wait.h:122: WARNING: Block quote ends without a blank line; unexpected unindent.
include/linux/wait.h:125: ERROR: Unexpected indentation.
include/linux/wait.h:127: WARNING: Block quote ends without a blank line; unexpected unindent.
kernel/time/hrtimer.c:990: WARNING: Block quote ends without a blank line; unexpected unindent.
kernel/signal.c:322: WARNING: Inline literal start-string without end-string.
include/linux/iio/iio.h:219: ERROR: Unexpected indentation.
include/linux/iio/iio.h:220: WARNING: Block quote ends without a blank line; unexpected unindent.
include/linux/iio/iio.h:226: WARNING: Definition list ends without a blank line; unexpected unindent.
drivers/iio/industrialio-core.c:638: ERROR: Unknown target name: "iio_val".
drivers/iio/industrialio-core.c:645: ERROR: Unknown target name: "iio_val".
drivers/message/fusion/mptbase.c:5051: WARNING: Definition list ends without a blank line; unexpected unindent.
drivers/tty/serial/serial_core.c:1898: WARNING: Definition list ends without a blank line; unexpected unindent.
drivers/pci/pci.c:3456: ERROR: Unexpected indentation.
include/linux/regulator/driver.h:271: ERROR: Unknown target name: "regulator_regmap_x_voltage".
include/linux/spi/spi.h:370: ERROR: Unexpected indentation.
drivers/gpu/drm/drm_scdc_helper.c:203: ERROR: Unexpected indentation.
drivers/gpu/drm/drm_scdc_helper.c:204: WARNING: Block quote ends without a blank line; unexpected unindent.
drivers/gpu/drm/drm_ioctl.c:690: WARNING: Definition list ends without a blank line; unexpected unindent.
Documentation/gpu/todo.rst:111: ERROR: Unknown target name: "drm_fb".
sound/soc/soc-core.c:2670: ERROR: Unknown target name: "snd_soc_daifmt".
sound/core/jack.c:312: ERROR: Unknown target name: "snd_jack_btn".
Documentation/userspace-api/unshare.rst:108: WARNING: Inline emphasis start-string without end-string.
Documentation/usb/typec.rst:: WARNING: document isn't included in any toctree
Documentation/usb/usb3-debug-port.rst:: WARNING: document isn't included in any toctree
Fontconfig warning: "/home/kbuild/.config/fontconfig/fonts.conf", line 43: Having multiple values in <test> isn't supported and may not work as expected
Fontconfig warning: "/home/kbuild/.config/fontconfig/fonts.conf", line 56: Having multiple values in <test> isn't supported and may not work as expected
Fontconfig warning: "/home/kbuild/.config/fontconfig/fonts.conf", line 69: Having multiple values in <test> isn't supported and may not work as expected
Fontconfig warning: "/home/kbuild/.config/fontconfig/fonts.conf", line 82: Having multiple values in <test> isn't supported and may not work as expected
Fontconfig warning: "/home/kbuild/.config/fontconfig/fonts.conf", line 96: Having multiple values in <test> isn't supported and may not work as expected
Fontconfig warning: "/home/kbuild/.config/fontconfig/fonts.conf", line 109: Having multiple values in <test> isn't supported and may not work as expected
Fontconfig warning: "/home/kbuild/.config/fontconfig/fonts.conf", line 122: Having multiple values in <test> isn't supported and may not work as expected
Fontconfig warning: "/home/kbuild/.config/fontconfig/fonts.conf", line 133: Having multiple values in <test> isn't supported and may not work as expected
Fontconfig warning: "/home/kbuild/.config/fontconfig/fonts.conf", line 164: Having multiple values in <test> isn't supported and may not work as expected
Fontconfig warning: "/home/kbuild/.config/fontconfig/fonts.conf", line 193: Having multiple values in <test> isn't supported and may not work as expected
Fontconfig warning: "~/.fonts.conf", line 43: Having multiple values in <test> isn't supported and may not work as expected
Fontconfig warning: "~/.fonts.conf", line 56: Having multiple values in <test> isn't supported and may not work as expected
Fontconfig warning: "~/.fonts.conf", line 69: Having multiple values in <test> isn't supported and may not work as expected
vim +/cgid +750 kernel/trace/blktrace.c
5f3ea37c block/blktrace.c Arnaldo Carvalho de Melo 2008-10-30 734
5f3ea37c block/blktrace.c Arnaldo Carvalho de Melo 2008-10-30 735 /**
5f3ea37c block/blktrace.c Arnaldo Carvalho de Melo 2008-10-30 736 * blk_add_trace_rq - Add a trace for a request oriented action
5f3ea37c block/blktrace.c Arnaldo Carvalho de Melo 2008-10-30 737 * @rq: the source request
caf7df12 kernel/trace/blktrace.c Christoph Hellwig 2017-04-20 738 * @error: return status to log
af5040da kernel/trace/blktrace.c Roman Pen 2014-03-04 739 * @nr_bytes: number of completed bytes
5f3ea37c block/blktrace.c Arnaldo Carvalho de Melo 2008-10-30 740 * @what: the action
5f3ea37c block/blktrace.c Arnaldo Carvalho de Melo 2008-10-30 741 *
5f3ea37c block/blktrace.c Arnaldo Carvalho de Melo 2008-10-30 742 * Description:
5f3ea37c block/blktrace.c Arnaldo Carvalho de Melo 2008-10-30 743 * Records an action against a request. Will log the bio offset + size.
5f3ea37c block/blktrace.c Arnaldo Carvalho de Melo 2008-10-30 744 *
5f3ea37c block/blktrace.c Arnaldo Carvalho de Melo 2008-10-30 745 **/
caf7df12 kernel/trace/blktrace.c Christoph Hellwig 2017-04-20 746 static void blk_add_trace_rq(struct request *rq, int error,
f46b1436 kernel/trace/blktrace.c Shaohua Li 2017-06-02 747 unsigned int nr_bytes, u32 what,
f46b1436 kernel/trace/blktrace.c Shaohua Li 2017-06-02 748 struct kernfs_node_id *cgid)
5f3ea37c block/blktrace.c Arnaldo Carvalho de Melo 2008-10-30 749 {
caf7df12 kernel/trace/blktrace.c Christoph Hellwig 2017-04-20 @750 struct blk_trace *bt = rq->q->blk_trace;
5f3ea37c block/blktrace.c Arnaldo Carvalho de Melo 2008-10-30 751
5f3ea37c block/blktrace.c Arnaldo Carvalho de Melo 2008-10-30 752 if (likely(!bt))
5f3ea37c block/blktrace.c Arnaldo Carvalho de Melo 2008-10-30 753 return;
5f3ea37c block/blktrace.c Arnaldo Carvalho de Melo 2008-10-30 754
57292b58 kernel/trace/blktrace.c Christoph Hellwig 2017-01-31 755 if (blk_rq_is_passthrough(rq))
5f3ea37c block/blktrace.c Arnaldo Carvalho de Melo 2008-10-30 756 what |= BLK_TC_ACT(BLK_TC_PC);
48b77ad6 kernel/trace/blktrace.c Christoph Hellwig 2017-01-27 757 else
5f3ea37c block/blktrace.c Arnaldo Carvalho de Melo 2008-10-30 758 what |= BLK_TC_ACT(BLK_TC_FS);
:::::: The code at line 750 was first introduced by commit
:::::: caf7df12272118e0274c8353bcfeaf60c7743a47 block: remove the errors field from struct request
:::::: TO: Christoph Hellwig <[email protected]>
:::::: CC: Jens Axboe <[email protected]>
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
Hello,
On Fri, Jun 02, 2017 at 02:53:55PM -0700, Shaohua Li wrote:
> @@ -630,7 +633,11 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
> if (!kn)
> goto err_out1;
>
> - ret = ida_simple_get(&root->ino_ida, 1, 0, GFP_KERNEL);
> + idr_preload(GFP_KERNEL);
> + spin_lock(&kernfs_idr_lock);
> + ret = idr_alloc(&root->ino_idr, kn, 1, 0, GFP_ATOMIC);
> + spin_unlock(&kernfs_idr_lock);
> + idr_preload_end();
So, this ends up populating the idr w/ a kn which isn't fully
initialized yet, which can lead to bugs which are difficult to hunt
down as we're gonna allow kn's to be looked up through ino / gen. We
probably should allocate with NULL here and populate the actual
pointer with idr_replace() after the kn is fully initialized / online.
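
A rough, single-threaded userspace sketch of the reserve-then-publish idea
follows. The table type and the table_*() helpers are hypothetical stand-ins
for the idr, written only to show the ordering; they are not the idr API, and
no locking is modelled.

```c
#include <assert.h>
#include <stddef.h>

#define TABLE_SIZE 8

struct node {
	unsigned int ino;
	int initialized;
};

/* Hypothetical stand-in for the idr: slot[id] holds the node for that id. */
struct table {
	struct node *slot[TABLE_SIZE];
	unsigned char reserved[TABLE_SIZE];
};

/* Plays the role of idr_alloc() with a NULL pointer: reserve an id only. */
static int table_reserve(struct table *t)
{
	for (int id = 1; id < TABLE_SIZE; id++) {
		if (!t->reserved[id]) {
			t->reserved[id] = 1;
			t->slot[id] = NULL;
			return id;
		}
	}
	return -1;
}

/* Plays the role of idr_replace(): install the pointer once it is ready. */
static void table_publish(struct table *t, int id, struct node *n)
{
	t->slot[id] = n;
}

/* Plays the role of idr_find(): NULL for free or merely reserved ids. */
static struct node *table_find(const struct table *t, int id)
{
	if (id < 1 || id >= TABLE_SIZE)
		return NULL;
	return t->slot[id];
}

static int node_create(struct table *t, struct node *n)
{
	int id = table_reserve(t);

	if (id < 0)
		return -1;
	/* A lookup at this point sees NULL, not a half-built node. */
	n->ino = (unsigned int)id;
	n->initialized = 1;
	table_publish(t, id, n);
	return id;
}
```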
Thanks.
--
tejun
Hello,
On Fri, Jun 02, 2017 at 02:53:56PM -0700, Shaohua Li wrote:
> --- a/fs/kernfs/dir.c
> +++ b/fs/kernfs/dir.c
> @@ -643,6 +643,7 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
> kn->ino = ret;
> kn->generation = atomic_inc_return(&root->next_generation);
>
> + /* set ino first. Above atomic_inc_return has a barrier */
> atomic_set(&kn->count, 1);
> atomic_set(&kn->active, KN_DEACTIVATED_BIAS);
> RB_CLEAR_NODE(&kn->rb);
Ah, you filter not-fully-alive ones here w/ kn->count. Hmm... this
definitely can use more documentation including what this is paired
with (the inc_not_zero in kernfs_get_node_by_ino()) and why we need
this.
> +/*
> + * kernfs_get_node_by_ino - get kernfs_node from inode number
> + * @root: the kernfs root
> + * @ino: inode number
> + *
> + * RETURNS:
> + * NULL on failure. Return a kernfs node with reference counter incremented
> + */
> +struct kernfs_node *kernfs_get_node_by_ino(struct kernfs_root *root,
> + unsigned int ino)
> +{
> + struct kernfs_node *kn;
> +
> + rcu_read_lock();
> + kn = idr_find(&root->ino_idr, ino);
> + if (!kn)
> + goto out;
> + /* kernfs_put removes the ino after count is 0 */
> + if (!atomic_inc_not_zero(&kn->count)) {
> + kn = NULL;
> + goto out;
> + }
> + /* If this node is reused, __kernfs_new_node sets ino before count */
> + if (kn->ino != ino)
> + goto out;
> + rcu_read_unlock();
> +
> + return kn;
> +out:
> + rcu_read_unlock();
> + kernfs_put(kn);
> + return NULL;
> +}
Yeah, I think this should work. I think we could have gone with
dumber "use the same lock for lookup" but this isn't too complicated
either and has obvious scalability benefits. That said, let's please
be more verbose on how the two paths interlock with each other.
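
As a starting point for that documentation, the pairing can be modelled in
userspace roughly as below. This is only a sketch under assumptions: C11
atomics replace the kernel primitives, the release store stands in for the
barrier the patch gets from atomic_inc_return(), and struct node plus the
node_*() helpers are hypothetical names. The 'found' argument models a pointer
read from the idr that may refer to recycled (type-stable) memory.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct node {
	unsigned int ino;
	atomic_int count;	/* 0 means dead or not yet initialized */
};

/* Creation side: write ino first, then make the node live via count. */
static void node_init(struct node *n, unsigned int ino)
{
	n->ino = ino;
	/* Release: a reader that observes count != 0 also observes ino. */
	atomic_store_explicit(&n->count, 1, memory_order_release);
}

/* Drop one reference; returns true if it was the last one. */
static bool node_put(struct node *n)
{
	return atomic_fetch_sub_explicit(&n->count, 1,
					 memory_order_acq_rel) == 1;
}

/* Model of atomic_inc_not_zero() with acquire ordering on success. */
static bool node_tryget(struct node *n)
{
	int old = atomic_load_explicit(&n->count, memory_order_relaxed);

	while (old != 0) {
		if (atomic_compare_exchange_weak_explicit(&n->count, &old,
				old + 1, memory_order_acquire,
				memory_order_relaxed))
			return true;
	}
	return false;
}

/* Lookup side: pin the node first, then verify it is still the right one. */
static struct node *node_get_by_ino(struct node *found, unsigned int ino)
{
	if (!found || !node_tryget(found))
		return NULL;
	if (found->ino != ino) {
		/* Memory was recycled for another node; undo our reference. */
		node_put(found);
		return NULL;
	}
	return found;
}
```

The count check filters nodes that are dead or not yet live, and the ino
re-check filters nodes whose memory was reused after the pointer was read.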
Thanks.
--
tejun
Hello,
On Fri, Jun 02, 2017 at 02:53:57PM -0700, Shaohua Li wrote:
> @@ -570,7 +570,8 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
> goto out_bad;
>
> /* The kernfs node has been moved? */
> - if (dentry->d_parent->d_fsdata != kn->parent)
> + if (d_really_is_negative(dentry->d_parent) ||
> + kernfs_dentry_node(dentry->d_parent) != kn->parent)
> goto out_bad;
Can we move d_really_is_negative() into kernfs_dentry_node()? That
might add an additional NULL test to some paths but I don't think that
will ever show up anywhere.
> diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
> index 3534cfe..82e11fa 100644
> --- a/fs/kernfs/kernfs-internal.h
> +++ b/fs/kernfs/kernfs-internal.h
> @@ -70,6 +70,8 @@ struct kernfs_super_info {
> };
> #define kernfs_info(SB) ((struct kernfs_super_info *)(SB->s_fs_info))
>
> +#define kernfs_dentry_node(d) ((d_inode(d))->i_private)
I know the prev one is a macro but let's make the new one an inline
function.
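
Something along these lines, with the negative-dentry test folded in. This is
a userspace sketch with mock types (the real struct dentry and struct inode
are far larger, and the kernel would use d_really_is_negative() and d_inode()
rather than touching d_inode directly), so treat it as an illustration of the
shape only.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal mock types, for illustration only. */
struct inode { void *i_private; };
struct dentry { struct inode *d_inode; };
struct kernfs_node { unsigned int ino; };

static inline struct kernfs_node *kernfs_dentry_node(const struct dentry *dentry)
{
	/* A negative dentry has no inode and therefore no kernfs node. */
	if (!dentry->d_inode)
		return NULL;
	return dentry->d_inode->i_private;
}
```

With the check inside the helper, a negative parent simply yields NULL, which
compares unequal to any non-NULL kn->parent in the revalidate path.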
Thanks.
--
tejun
Ooh, one more thing.
On Mon, Jun 12, 2017 at 02:20:28PM -0400, Tejun Heo wrote:
> > +struct kernfs_node *kernfs_get_node_by_ino(struct kernfs_root *root,
> > + unsigned int ino)
Can we name this kernfs_find_and_get_by_ino() for consistency? And
the RCU optimization does seem prominent compared to other find/get
functions which all just use kernfs_mutex (still not objecting).
Thanks.
--
tejun
On Fri, Jun 02, 2017 at 02:53:59PM -0700, Shaohua Li wrote:
> From: Shaohua Li <[email protected]>
>
> Add an API to export cgroup fhandle info. We don't export a full 'struct
> file_handle' because not all of its info is required. Specifically, cgroup is
> always a directory, so we don't need a 'FILEID_INO32_GEN_PARENT' type fhandle;
> we only need to export the inode number and generation number, just like
> what generic_fh_to_parent does. And we can avoid the overhead of getting
> an inode too, since kernfs_node has all the info required.
Can't we just make it an integral (optional) part of kernfs? So that
cgroup just needs to indicate that it wants to expose fhandles when
creating its kernfs instance?
Thanks.
--
tejun
On Fri, Jun 02, 2017 at 02:54:01PM -0700, Shaohua Li wrote:
> @@ -691,6 +691,8 @@ static inline bool blkcg_bio_issue_check(struct request_queue *q,
> rcu_read_lock();
> blkcg = bio_blkcg(bio);
>
> + bio_associate_blkcg(bio, &blkcg->css);
> +
Let's please note that this only establishes the fallback mapping when
the bio hasn't been associated yet and doesn't override the existing
association.
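
The semantics worth documenting look roughly like the sketch below. The types
and the bio_associate_fallback() name are hypothetical mocks for a userspace
illustration, not the block layer's definitions; the point is only that an
existing association wins and the call reports -EBUSY instead of replacing it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Mock types for illustration; the kernel structures are much richer. */
struct css { int refs; };
struct bio { struct css *bi_css; };

/* Associate only when nothing is attached yet; never override. */
static int bio_associate_fallback(struct bio *bio, struct css *css)
{
	if (bio->bi_css)
		return -EBUSY;	/* keep the existing association */
	css->refs++;		/* the bio now holds a reference */
	bio->bi_css = css;
	return 0;
}
```

So a bio tagged earlier (for example on behalf of the cgroup that dirtied the
page) keeps that tag even if the issuing task belongs to a different cgroup.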
Thanks.
--
tejun