2017-07-12 18:50:32

by Shaohua Li

[permalink] [raw]
Subject: [PATCH V5 00/11] blktrace: output cgroup info

From: Shaohua Li <[email protected]>

Hi,

Currently blktrace isn't cgroup aware. blktrace prints out task name of current
context, but the task of current context isn't always in the cgroup where the
BIO comes from. We can't use task name to find out IO cgroup. For example,
Writeback BIOs always comes from flusher thread but the BIOs are for different
blk cgroups. Request could be requeued and dispatched from completely different
tasks. MD/DM are another examples. This brings challenges if we want to use
blktrace for performance tunning with cgroup enabled.

This patchset try to fix the gap. We print out cgroup fhandle info in blktrace.
Userspace can use open_by_handle_at() syscall to find the cgroup by fhandle. Or
userspace can use name_to_handle_at() syscall to find fhandle for a cgroup and
use a BPF program to filter out blktrace for a specific cgroup.

The first 6 patches adds export operation handlers for kernfs, so userspace can
use open_by_handle_at/name_to_handle_at to a kernfs file. Later patches make
blktrace output cgroup info.

Thanks,
Shaohua

V4 -> V5:
- Add ack and refresh to latest -next

V3 -> V4:
- Address some issues pointed out by Tejun
- Fix build errors with latest -next tree

V2 -> V3:
- Uses 32 bits inode number
- Refresh to latest -next tree

V1 -> V2:
- Fix a bug in cgroup association
- Fix build errors reported by 0day
- Address some issues pointed out by Tejun

Shaohua Li (11):
kernfs: use idr instead of ida to manage inode number
kernfs: implement i_generation
kernfs: add an API to get kernfs node from inode number
kernfs: don't set dentry->d_fsdata
kernfs: introduce kernfs_node_id
kernfs: add exportfs operations
cgroup: export fhandle info for a cgroup
blktrace: export cgroup info in trace
block: always attach cgroup info into bio
blktrace: add an option to allow displaying cgroup path
block: use standard blktrace API to output cgroup info for debug notes

block/bfq-iosched.h | 13 +-
block/blk-throttle.c | 13 +-
block/cfq-iosched.c | 15 +--
fs/kernfs/dir.c | 111 ++++++++++++----
fs/kernfs/file.c | 10 +-
fs/kernfs/inode.c | 9 +-
fs/kernfs/kernfs-internal.h | 9 ++
fs/kernfs/mount.c | 94 ++++++++++++--
fs/kernfs/symlink.c | 6 +-
include/linux/blk-cgroup.h | 3 +
include/linux/blktrace_api.h | 13 +-
include/linux/cgroup.h | 16 ++-
include/linux/kernfs.h | 28 ++++-
include/trace/events/writeback.h | 2 +-
include/uapi/linux/blktrace_api.h | 3 +
kernel/cgroup/cgroup.c | 15 ++-
kernel/trace/blktrace.c | 259 ++++++++++++++++++++++++++------------
17 files changed, 467 insertions(+), 152 deletions(-)

--
2.9.3


2017-07-12 18:50:37

by Shaohua Li

[permalink] [raw]
Subject: [PATCH V5 06/11] kernfs: add exportfs operations

From: Shaohua Li <[email protected]>

Now we have the facilities to implement exportfs operations. The idea is
cgroup can export the fhandle info to userspace, then userspace uses
fhandle to find the cgroup name. Another example is userspace can get
fhandle for a cgroup and BPF uses the fhandle to filter info for the
cgroup.

Acked-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Shaohua Li <[email protected]>
---
fs/kernfs/mount.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/kernfs.h | 12 +++++++++++
kernel/cgroup/cgroup.c | 3 ++-
3 files changed, 70 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index acd5426..fa32358 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -16,6 +16,7 @@
#include <linux/pagemap.h>
#include <linux/namei.h>
#include <linux/seq_file.h>
+#include <linux/exportfs.h>

#include "kernfs-internal.h"

@@ -64,6 +65,59 @@ const struct super_operations kernfs_sops = {
.show_path = kernfs_sop_show_path,
};

+static struct inode *kernfs_fh_get_inode(struct super_block *sb,
+ u64 ino, u32 generation)
+{
+ struct kernfs_super_info *info = kernfs_info(sb);
+ struct inode *inode;
+ struct kernfs_node *kn;
+
+ if (ino == 0)
+ return ERR_PTR(-ESTALE);
+
+ kn = kernfs_find_and_get_node_by_ino(info->root, ino);
+ if (!kn)
+ return ERR_PTR(-ESTALE);
+ inode = kernfs_get_inode(sb, kn);
+ kernfs_put(kn);
+ if (IS_ERR(inode))
+ return ERR_CAST(inode);
+
+ if (generation && inode->i_generation != generation) {
+ /* we didn't find the right inode.. */
+ iput(inode);
+ return ERR_PTR(-ESTALE);
+ }
+ return inode;
+}
+
+static struct dentry *kernfs_fh_to_dentry(struct super_block *sb, struct fid *fid,
+ int fh_len, int fh_type)
+{
+ return generic_fh_to_dentry(sb, fid, fh_len, fh_type,
+ kernfs_fh_get_inode);
+}
+
+static struct dentry *kernfs_fh_to_parent(struct super_block *sb, struct fid *fid,
+ int fh_len, int fh_type)
+{
+ return generic_fh_to_parent(sb, fid, fh_len, fh_type,
+ kernfs_fh_get_inode);
+}
+
+static struct dentry *kernfs_get_parent_dentry(struct dentry *child)
+{
+ struct kernfs_node *kn = kernfs_dentry_node(child);
+
+ return d_obtain_alias(kernfs_get_inode(child->d_sb, kn->parent));
+}
+
+static const struct export_operations kernfs_export_ops = {
+ .fh_to_dentry = kernfs_fh_to_dentry,
+ .fh_to_parent = kernfs_fh_to_parent,
+ .get_parent = kernfs_get_parent_dentry,
+};
+
/**
* kernfs_root_from_sb - determine kernfs_root associated with a super_block
* @sb: the super_block in question
@@ -159,6 +213,8 @@ static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
sb->s_magic = magic;
sb->s_op = &kernfs_sops;
sb->s_xattr = kernfs_xattr_handlers;
+ if (info->root->flags & KERNFS_ROOT_SUPPORT_EXPORTOP)
+ sb->s_export_op = &kernfs_export_ops;
sb->s_time_gran = 1;

/* get root inode, initialize and unlock it */
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 06a0c59..d149361 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -69,6 +69,12 @@ enum kernfs_root_flag {
* following flag enables that behavior.
*/
KERNFS_ROOT_EXTRA_OPEN_PERM_CHECK = 0x0002,
+
+ /*
+ * The filesystem supports exportfs operation, so userspace can use
+ * fhandle to access nodes of the fs.
+ */
+ KERNFS_ROOT_SUPPORT_EXPORTOP = 0x0004,
};

/* type-specific structures for kernfs_node union members */
@@ -98,6 +104,12 @@ struct kernfs_elem_attr {
/* represent a kernfs node */
union kernfs_node_id {
struct {
+ /*
+ * blktrace will export this struct as a simplified 'struct
+ * fid' (which is a big data struction), so userspace can use
+ * it to find kernfs node. The layout must match the first two
+ * fields of 'struct fid' exactly.
+ */
u32 ino;
u32 generation;
};
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index cc53111..8f88aac 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1737,7 +1737,8 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask, int ref_flags)
&cgroup_kf_syscall_ops : &cgroup1_kf_syscall_ops;

root->kf_root = kernfs_create_root(kf_sops,
- KERNFS_ROOT_CREATE_DEACTIVATED,
+ KERNFS_ROOT_CREATE_DEACTIVATED |
+ KERNFS_ROOT_SUPPORT_EXPORTOP,
root_cgrp);
if (IS_ERR(root->kf_root)) {
ret = PTR_ERR(root->kf_root);
--
2.9.3

2017-07-12 18:50:40

by Shaohua Li

[permalink] [raw]
Subject: [PATCH V5 08/11] blktrace: export cgroup info in trace

From: Shaohua Li <[email protected]>

Currently blktrace isn't cgroup aware. blktrace prints out task name of
current context, but the task of current context isn't always in the
cgroup where the BIO comes from. We can't use task name to find out IO
cgroup. For example, Writeback BIOs always comes from flusher thread but
the BIOs are for different blk cgroups. Request could be requeued and
dispatched from completely different tasks. MD/DM are another examples.

This patch tries to fix the gap. We print out cgroup fhandle info in
blktrace. Userspace can use open_by_handle_at() syscall to find the
cgroup by fhandle. Or userspace can use name_to_handle_at() syscall to
find fhandle for a cgroup and use a BPF program to filter out blktrace
for a specific cgroup.

We add a new 'blk_cgroup' trace option for blk tracer. It's default off.
Application which doesn't know the new option isn't affected. When it's
on, we output fhandle info right after blk_io_trace with an extra bit
set in event action. So from application point of view, blktrace with
the option will output new actions.

I didn't change blk trace event yet, since I'm not sure if changing the
trace event output is an ABI issue. If not, I'll do it later.

Acked-by: Steven Rostedt (VMware) <[email protected]>
Signed-off-by: Shaohua Li <[email protected]>
---
include/uapi/linux/blktrace_api.h | 3 +
kernel/trace/blktrace.c | 231 ++++++++++++++++++++++++++------------
2 files changed, 161 insertions(+), 73 deletions(-)

diff --git a/include/uapi/linux/blktrace_api.h b/include/uapi/linux/blktrace_api.h
index c590ca6..9cdaede 100644
--- a/include/uapi/linux/blktrace_api.h
+++ b/include/uapi/linux/blktrace_api.h
@@ -52,6 +52,7 @@ enum blktrace_act {
__BLK_TA_REMAP, /* bio was remapped */
__BLK_TA_ABORT, /* request aborted */
__BLK_TA_DRV_DATA, /* driver-specific binary data */
+ __BLK_TA_CGROUP = 1 << 8, /* from a cgroup*/
};

/*
@@ -61,6 +62,7 @@ enum blktrace_notify {
__BLK_TN_PROCESS = 0, /* establish pid/name mapping */
__BLK_TN_TIMESTAMP, /* include system clock */
__BLK_TN_MESSAGE, /* Character string message */
+ __BLK_TN_CGROUP = __BLK_TA_CGROUP, /* from a cgroup */
};


@@ -107,6 +109,7 @@ struct blk_io_trace {
__u32 cpu; /* on what cpu did it happen */
__u16 error; /* completion error */
__u16 pdu_len; /* length of data after this trace */
+ /* cgroup id will be stored here if exists */
};

/*
diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index bc364f8..f393d7a 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -27,6 +27,7 @@
#include <linux/time.h>
#include <linux/uaccess.h>
#include <linux/list.h>
+#include <linux/blk-cgroup.h>

#include "../../block/blk.h"

@@ -46,10 +47,14 @@ static __cacheline_aligned_in_smp DEFINE_SPINLOCK(running_trace_lock);

/* Select an alternative, minimalistic output than the original one */
#define TRACE_BLK_OPT_CLASSIC 0x1
+#define TRACE_BLK_OPT_CGROUP 0x2

static struct tracer_opt blk_tracer_opts[] = {
/* Default disable the minimalistic output */
{ TRACER_OPT(blk_classic, TRACE_BLK_OPT_CLASSIC) },
+#ifdef CONFIG_BLK_CGROUP
+ { TRACER_OPT(blk_cgroup, TRACE_BLK_OPT_CGROUP) },
+#endif
{ }
};

@@ -68,7 +73,8 @@ static void blk_unregister_tracepoints(void);
* Send out a notify message.
*/
static void trace_note(struct blk_trace *bt, pid_t pid, int action,
- const void *data, size_t len)
+ const void *data, size_t len,
+ union kernfs_node_id *cgid)
{
struct blk_io_trace *t;
struct ring_buffer_event *event = NULL;
@@ -76,12 +82,13 @@ static void trace_note(struct blk_trace *bt, pid_t pid, int action,
int pc = 0;
int cpu = smp_processor_id();
bool blk_tracer = blk_tracer_enabled;
+ ssize_t cgid_len = cgid ? sizeof(*cgid) : 0;

if (blk_tracer) {
buffer = blk_tr->trace_buffer.buffer;
pc = preempt_count();
event = trace_buffer_lock_reserve(buffer, TRACE_BLK,
- sizeof(*t) + len,
+ sizeof(*t) + len + cgid_len,
0, pc);
if (!event)
return;
@@ -92,17 +99,19 @@ static void trace_note(struct blk_trace *bt, pid_t pid, int action,
if (!bt->rchan)
return;

- t = relay_reserve(bt->rchan, sizeof(*t) + len);
+ t = relay_reserve(bt->rchan, sizeof(*t) + len + cgid_len);
if (t) {
t->magic = BLK_IO_TRACE_MAGIC | BLK_IO_TRACE_VERSION;
t->time = ktime_to_ns(ktime_get());
record_it:
t->device = bt->dev;
- t->action = action;
+ t->action = action | (cgid ? __BLK_TN_CGROUP : 0);
t->pid = pid;
t->cpu = cpu;
- t->pdu_len = len;
- memcpy((void *) t + sizeof(*t), data, len);
+ t->pdu_len = len + cgid_len;
+ if (cgid)
+ memcpy((void *)t + sizeof(*t), cgid, cgid_len);
+ memcpy((void *) t + sizeof(*t) + cgid_len, data, len);

if (blk_tracer)
trace_buffer_unlock_commit(blk_tr, buffer, event, 0, pc);
@@ -122,7 +131,7 @@ static void trace_note_tsk(struct task_struct *tsk)
spin_lock_irqsave(&running_trace_lock, flags);
list_for_each_entry(bt, &running_trace_list, running_list) {
trace_note(bt, tsk->pid, BLK_TN_PROCESS, tsk->comm,
- sizeof(tsk->comm));
+ sizeof(tsk->comm), NULL);
}
spin_unlock_irqrestore(&running_trace_lock, flags);
}
@@ -139,7 +148,7 @@ static void trace_note_time(struct blk_trace *bt)
words[1] = now.tv_nsec;

local_irq_save(flags);
- trace_note(bt, 0, BLK_TN_TIMESTAMP, words, sizeof(words));
+ trace_note(bt, 0, BLK_TN_TIMESTAMP, words, sizeof(words), NULL);
local_irq_restore(flags);
}

@@ -167,7 +176,7 @@ void __trace_note_message(struct blk_trace *bt, const char *fmt, ...)
n = vscnprintf(buf, BLK_TN_MAX_MSG, fmt, args);
va_end(args);

- trace_note(bt, 0, BLK_TN_MESSAGE, buf, n);
+ trace_note(bt, 0, BLK_TN_MESSAGE, buf, n, NULL);
local_irq_restore(flags);
}
EXPORT_SYMBOL_GPL(__trace_note_message);
@@ -204,7 +213,7 @@ static const u32 ddir_act[2] = { BLK_TC_ACT(BLK_TC_READ),
*/
static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
int op, int op_flags, u32 what, int error, int pdu_len,
- void *pdu_data)
+ void *pdu_data, union kernfs_node_id *cgid)
{
struct task_struct *tsk = current;
struct ring_buffer_event *event = NULL;
@@ -215,6 +224,7 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
pid_t pid;
int cpu, pc = 0;
bool blk_tracer = blk_tracer_enabled;
+ ssize_t cgid_len = cgid ? sizeof(*cgid) : 0;

if (unlikely(bt->trace_state != Blktrace_running && !blk_tracer))
return;
@@ -229,6 +239,8 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
what |= BLK_TC_ACT(BLK_TC_DISCARD);
if (op == REQ_OP_FLUSH)
what |= BLK_TC_ACT(BLK_TC_FLUSH);
+ if (cgid)
+ what |= __BLK_TA_CGROUP;

pid = tsk->pid;
if (act_log_check(bt, what, sector, pid))
@@ -241,7 +253,7 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
buffer = blk_tr->trace_buffer.buffer;
pc = preempt_count();
event = trace_buffer_lock_reserve(buffer, TRACE_BLK,
- sizeof(*t) + pdu_len,
+ sizeof(*t) + pdu_len + cgid_len,
0, pc);
if (!event)
return;
@@ -258,7 +270,7 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
* from coming in and stepping on our toes.
*/
local_irq_save(flags);
- t = relay_reserve(bt->rchan, sizeof(*t) + pdu_len);
+ t = relay_reserve(bt->rchan, sizeof(*t) + pdu_len + cgid_len);
if (t) {
sequence = per_cpu_ptr(bt->sequence, cpu);

@@ -280,10 +292,12 @@ static void __blk_add_trace(struct blk_trace *bt, sector_t sector, int bytes,
t->action = what;
t->device = bt->dev;
t->error = error;
- t->pdu_len = pdu_len;
+ t->pdu_len = pdu_len + cgid_len;

+ if (cgid_len)
+ memcpy((void *)t + sizeof(*t), cgid, cgid_len);
if (pdu_len)
- memcpy((void *) t + sizeof(*t), pdu_data, pdu_len);
+ memcpy((void *)t + sizeof(*t) + cgid_len, pdu_data, pdu_len);

if (blk_tracer) {
trace_buffer_unlock_commit(blk_tr, buffer, event, 0, pc);
@@ -684,6 +698,36 @@ void blk_trace_shutdown(struct request_queue *q)
}
}

+#ifdef CONFIG_BLK_CGROUP
+static union kernfs_node_id *
+blk_trace_bio_get_cgid(struct request_queue *q, struct bio *bio)
+{
+ struct blk_trace *bt = q->blk_trace;
+
+ if (!bt || !(blk_tracer_flags.val & TRACE_BLK_OPT_CGROUP))
+ return NULL;
+
+ if (!bio->bi_css)
+ return NULL;
+ return cgroup_get_kernfs_id(bio->bi_css->cgroup);
+}
+#else
+static union kernfs_node_id *
+blk_trace_bio_get_cgid(struct request_queue *q, struct bio *bio)
+{
+ return NULL;
+}
+#endif
+
+static union kernfs_node_id *
+blk_trace_request_get_cgid(struct request_queue *q, struct request *rq)
+{
+ if (!rq->bio)
+ return NULL;
+ /* Use the first bio */
+ return blk_trace_bio_get_cgid(q, rq->bio);
+}
+
/*
* blktrace probes
*/
@@ -694,13 +738,15 @@ void blk_trace_shutdown(struct request_queue *q)
* @error: return status to log
* @nr_bytes: number of completed bytes
* @what: the action
+ * @cgid: the cgroup info
*
* Description:
* Records an action against a request. Will log the bio offset + size.
*
**/
static void blk_add_trace_rq(struct request *rq, int error,
- unsigned int nr_bytes, u32 what)
+ unsigned int nr_bytes, u32 what,
+ union kernfs_node_id *cgid)
{
struct blk_trace *bt = rq->q->blk_trace;

@@ -713,32 +759,36 @@ static void blk_add_trace_rq(struct request *rq, int error,
what |= BLK_TC_ACT(BLK_TC_FS);

__blk_add_trace(bt, blk_rq_trace_sector(rq), nr_bytes, req_op(rq),
- rq->cmd_flags, what, error, 0, NULL);
+ rq->cmd_flags, what, error, 0, NULL, cgid);
}

static void blk_add_trace_rq_insert(void *ignore,
struct request_queue *q, struct request *rq)
{
- blk_add_trace_rq(rq, 0, blk_rq_bytes(rq), BLK_TA_INSERT);
+ blk_add_trace_rq(rq, 0, blk_rq_bytes(rq), BLK_TA_INSERT,
+ blk_trace_request_get_cgid(q, rq));
}

static void blk_add_trace_rq_issue(void *ignore,
struct request_queue *q, struct request *rq)
{
- blk_add_trace_rq(rq, 0, blk_rq_bytes(rq), BLK_TA_ISSUE);
+ blk_add_trace_rq(rq, 0, blk_rq_bytes(rq), BLK_TA_ISSUE,
+ blk_trace_request_get_cgid(q, rq));
}

static void blk_add_trace_rq_requeue(void *ignore,
struct request_queue *q,
struct request *rq)
{
- blk_add_trace_rq(rq, 0, blk_rq_bytes(rq), BLK_TA_REQUEUE);
+ blk_add_trace_rq(rq, 0, blk_rq_bytes(rq), BLK_TA_REQUEUE,
+ blk_trace_request_get_cgid(q, rq));
}

static void blk_add_trace_rq_complete(void *ignore, struct request *rq,
int error, unsigned int nr_bytes)
{
- blk_add_trace_rq(rq, error, nr_bytes, BLK_TA_COMPLETE);
+ blk_add_trace_rq(rq, error, nr_bytes, BLK_TA_COMPLETE,
+ blk_trace_request_get_cgid(rq->q, rq));
}

/**
@@ -753,7 +803,7 @@ static void blk_add_trace_rq_complete(void *ignore, struct request *rq,
*
**/
static void blk_add_trace_bio(struct request_queue *q, struct bio *bio,
- u32 what, int error)
+ u32 what, int error, union kernfs_node_id *cgid)
{
struct blk_trace *bt = q->blk_trace;

@@ -761,20 +811,22 @@ static void blk_add_trace_bio(struct request_queue *q, struct bio *bio,
return;

__blk_add_trace(bt, bio->bi_iter.bi_sector, bio->bi_iter.bi_size,
- bio_op(bio), bio->bi_opf, what, error, 0, NULL);
+ bio_op(bio), bio->bi_opf, what, error, 0, NULL, cgid);
}

static void blk_add_trace_bio_bounce(void *ignore,
struct request_queue *q, struct bio *bio)
{
- blk_add_trace_bio(q, bio, BLK_TA_BOUNCE, 0);
+ blk_add_trace_bio(q, bio, BLK_TA_BOUNCE, 0,
+ blk_trace_bio_get_cgid(q, bio));
}

static void blk_add_trace_bio_complete(void *ignore,
struct request_queue *q, struct bio *bio,
int error)
{
- blk_add_trace_bio(q, bio, BLK_TA_COMPLETE, error);
+ blk_add_trace_bio(q, bio, BLK_TA_COMPLETE, error,
+ blk_trace_bio_get_cgid(q, bio));
}

static void blk_add_trace_bio_backmerge(void *ignore,
@@ -782,7 +834,8 @@ static void blk_add_trace_bio_backmerge(void *ignore,
struct request *rq,
struct bio *bio)
{
- blk_add_trace_bio(q, bio, BLK_TA_BACKMERGE, 0);
+ blk_add_trace_bio(q, bio, BLK_TA_BACKMERGE, 0,
+ blk_trace_bio_get_cgid(q, bio));
}

static void blk_add_trace_bio_frontmerge(void *ignore,
@@ -790,13 +843,15 @@ static void blk_add_trace_bio_frontmerge(void *ignore,
struct request *rq,
struct bio *bio)
{
- blk_add_trace_bio(q, bio, BLK_TA_FRONTMERGE, 0);
+ blk_add_trace_bio(q, bio, BLK_TA_FRONTMERGE, 0,
+ blk_trace_bio_get_cgid(q, bio));
}

static void blk_add_trace_bio_queue(void *ignore,
struct request_queue *q, struct bio *bio)
{
- blk_add_trace_bio(q, bio, BLK_TA_QUEUE, 0);
+ blk_add_trace_bio(q, bio, BLK_TA_QUEUE, 0,
+ blk_trace_bio_get_cgid(q, bio));
}

static void blk_add_trace_getrq(void *ignore,
@@ -804,13 +859,14 @@ static void blk_add_trace_getrq(void *ignore,
struct bio *bio, int rw)
{
if (bio)
- blk_add_trace_bio(q, bio, BLK_TA_GETRQ, 0);
+ blk_add_trace_bio(q, bio, BLK_TA_GETRQ, 0,
+ blk_trace_bio_get_cgid(q, bio));
else {
struct blk_trace *bt = q->blk_trace;

if (bt)
__blk_add_trace(bt, 0, 0, rw, 0, BLK_TA_GETRQ, 0, 0,
- NULL);
+ NULL, NULL);
}
}

@@ -820,13 +876,14 @@ static void blk_add_trace_sleeprq(void *ignore,
struct bio *bio, int rw)
{
if (bio)
- blk_add_trace_bio(q, bio, BLK_TA_SLEEPRQ, 0);
+ blk_add_trace_bio(q, bio, BLK_TA_SLEEPRQ, 0,
+ blk_trace_bio_get_cgid(q, bio));
else {
struct blk_trace *bt = q->blk_trace;

if (bt)
__blk_add_trace(bt, 0, 0, rw, 0, BLK_TA_SLEEPRQ,
- 0, 0, NULL);
+ 0, 0, NULL, NULL);
}
}

@@ -835,7 +892,7 @@ static void blk_add_trace_plug(void *ignore, struct request_queue *q)
struct blk_trace *bt = q->blk_trace;

if (bt)
- __blk_add_trace(bt, 0, 0, 0, 0, BLK_TA_PLUG, 0, 0, NULL);
+ __blk_add_trace(bt, 0, 0, 0, 0, BLK_TA_PLUG, 0, 0, NULL, NULL);
}

static void blk_add_trace_unplug(void *ignore, struct request_queue *q,
@@ -852,7 +909,7 @@ static void blk_add_trace_unplug(void *ignore, struct request_queue *q,
else
what = BLK_TA_UNPLUG_TIMER;

- __blk_add_trace(bt, 0, 0, 0, 0, what, 0, sizeof(rpdu), &rpdu);
+ __blk_add_trace(bt, 0, 0, 0, 0, what, 0, sizeof(rpdu), &rpdu, NULL);
}
}

@@ -868,7 +925,7 @@ static void blk_add_trace_split(void *ignore,
__blk_add_trace(bt, bio->bi_iter.bi_sector,
bio->bi_iter.bi_size, bio_op(bio), bio->bi_opf,
BLK_TA_SPLIT, bio->bi_status, sizeof(rpdu),
- &rpdu);
+ &rpdu, blk_trace_bio_get_cgid(q, bio));
}
}

@@ -901,7 +958,7 @@ static void blk_add_trace_bio_remap(void *ignore,

__blk_add_trace(bt, bio->bi_iter.bi_sector, bio->bi_iter.bi_size,
bio_op(bio), bio->bi_opf, BLK_TA_REMAP, bio->bi_status,
- sizeof(r), &r);
+ sizeof(r), &r, blk_trace_bio_get_cgid(q, bio));
}

/**
@@ -934,7 +991,7 @@ static void blk_add_trace_rq_remap(void *ignore,

__blk_add_trace(bt, blk_rq_pos(rq), blk_rq_bytes(rq),
rq_data_dir(rq), 0, BLK_TA_REMAP, 0,
- sizeof(r), &r);
+ sizeof(r), &r, blk_trace_request_get_cgid(q, rq));
}

/**
@@ -958,7 +1015,8 @@ void blk_add_driver_data(struct request_queue *q,
return;

__blk_add_trace(bt, blk_rq_trace_sector(rq), blk_rq_bytes(rq), 0, 0,
- BLK_TA_DRV_DATA, 0, len, data);
+ BLK_TA_DRV_DATA, 0, len, data,
+ blk_trace_request_get_cgid(q, rq));
}
EXPORT_SYMBOL_GPL(blk_add_driver_data);

@@ -1031,7 +1089,7 @@ static void fill_rwbs(char *rwbs, const struct blk_io_trace *t)
int i = 0;
int tc = t->action >> BLK_TC_SHIFT;

- if (t->action == BLK_TN_MESSAGE) {
+ if ((t->action & ~__BLK_TN_CGROUP) == BLK_TN_MESSAGE) {
rwbs[i++] = 'N';
goto out;
}
@@ -1066,9 +1124,21 @@ const struct blk_io_trace *te_blk_io_trace(const struct trace_entry *ent)
return (const struct blk_io_trace *)ent;
}

-static inline const void *pdu_start(const struct trace_entry *ent)
+static inline const void *pdu_start(const struct trace_entry *ent, bool has_cg)
{
- return te_blk_io_trace(ent) + 1;
+ return (void *)(te_blk_io_trace(ent) + 1) +
+ (has_cg ? sizeof(union kernfs_node_id) : 0);
+}
+
+static inline const void *cgid_start(const struct trace_entry *ent)
+{
+ return (void *)(te_blk_io_trace(ent) + 1);
+}
+
+static inline int pdu_real_len(const struct trace_entry *ent, bool has_cg)
+{
+ return te_blk_io_trace(ent)->pdu_len -
+ (has_cg ? sizeof(union kernfs_node_id) : 0);
}

static inline u32 t_action(const struct trace_entry *ent)
@@ -1096,16 +1166,16 @@ static inline __u16 t_error(const struct trace_entry *ent)
return te_blk_io_trace(ent)->error;
}

-static __u64 get_pdu_int(const struct trace_entry *ent)
+static __u64 get_pdu_int(const struct trace_entry *ent, bool has_cg)
{
- const __u64 *val = pdu_start(ent);
+ const __u64 *val = pdu_start(ent, has_cg);
return be64_to_cpu(*val);
}

static void get_pdu_remap(const struct trace_entry *ent,
- struct blk_io_trace_remap *r)
+ struct blk_io_trace_remap *r, bool has_cg)
{
- const struct blk_io_trace_remap *__r = pdu_start(ent);
+ const struct blk_io_trace_remap *__r = pdu_start(ent, has_cg);
__u64 sector_from = __r->sector_from;

r->device_from = be32_to_cpu(__r->device_from);
@@ -1113,9 +1183,11 @@ static void get_pdu_remap(const struct trace_entry *ent,
r->sector_from = be64_to_cpu(sector_from);
}

-typedef void (blk_log_action_t) (struct trace_iterator *iter, const char *act);
+typedef void (blk_log_action_t) (struct trace_iterator *iter, const char *act,
+ bool has_cg);

-static void blk_log_action_classic(struct trace_iterator *iter, const char *act)
+static void blk_log_action_classic(struct trace_iterator *iter, const char *act,
+ bool has_cg)
{
char rwbs[RWBS_LEN];
unsigned long long ts = iter->ts;
@@ -1131,24 +1203,33 @@ static void blk_log_action_classic(struct trace_iterator *iter, const char *act)
secs, nsec_rem, iter->ent->pid, act, rwbs);
}

-static void blk_log_action(struct trace_iterator *iter, const char *act)
+static void blk_log_action(struct trace_iterator *iter, const char *act,
+ bool has_cg)
{
char rwbs[RWBS_LEN];
const struct blk_io_trace *t = te_blk_io_trace(iter->ent);

fill_rwbs(rwbs, t);
- trace_seq_printf(&iter->seq, "%3d,%-3d %2s %3s ",
- MAJOR(t->device), MINOR(t->device), act, rwbs);
+ if (has_cg) {
+ const union kernfs_node_id *id = cgid_start(iter->ent);
+
+ trace_seq_printf(&iter->seq, "%3d,%-3d %x,%-x %2s %3s ",
+ MAJOR(t->device), MINOR(t->device),
+ id->ino, id->generation, act, rwbs);
+ } else
+ trace_seq_printf(&iter->seq, "%3d,%-3d %2s %3s ",
+ MAJOR(t->device), MINOR(t->device), act, rwbs);
}

-static void blk_log_dump_pdu(struct trace_seq *s, const struct trace_entry *ent)
+static void blk_log_dump_pdu(struct trace_seq *s,
+ const struct trace_entry *ent, bool has_cg)
{
const unsigned char *pdu_buf;
int pdu_len;
int i, end;

- pdu_buf = pdu_start(ent);
- pdu_len = te_blk_io_trace(ent)->pdu_len;
+ pdu_buf = pdu_start(ent, has_cg);
+ pdu_len = pdu_real_len(ent, has_cg);

if (!pdu_len)
return;
@@ -1179,7 +1260,7 @@ static void blk_log_dump_pdu(struct trace_seq *s, const struct trace_entry *ent)
trace_seq_puts(s, ") ");
}

-static void blk_log_generic(struct trace_seq *s, const struct trace_entry *ent)
+static void blk_log_generic(struct trace_seq *s, const struct trace_entry *ent, bool has_cg)
{
char cmd[TASK_COMM_LEN];

@@ -1187,7 +1268,7 @@ static void blk_log_generic(struct trace_seq *s, const struct trace_entry *ent)

if (t_action(ent) & BLK_TC_ACT(BLK_TC_PC)) {
trace_seq_printf(s, "%u ", t_bytes(ent));
- blk_log_dump_pdu(s, ent);
+ blk_log_dump_pdu(s, ent, has_cg);
trace_seq_printf(s, "[%s]\n", cmd);
} else {
if (t_sec(ent))
@@ -1199,10 +1280,10 @@ static void blk_log_generic(struct trace_seq *s, const struct trace_entry *ent)
}

static void blk_log_with_error(struct trace_seq *s,
- const struct trace_entry *ent)
+ const struct trace_entry *ent, bool has_cg)
{
if (t_action(ent) & BLK_TC_ACT(BLK_TC_PC)) {
- blk_log_dump_pdu(s, ent);
+ blk_log_dump_pdu(s, ent, has_cg);
trace_seq_printf(s, "[%d]\n", t_error(ent));
} else {
if (t_sec(ent))
@@ -1215,18 +1296,18 @@ static void blk_log_with_error(struct trace_seq *s,
}
}

-static void blk_log_remap(struct trace_seq *s, const struct trace_entry *ent)
+static void blk_log_remap(struct trace_seq *s, const struct trace_entry *ent, bool has_cg)
{
struct blk_io_trace_remap r = { .device_from = 0, };

- get_pdu_remap(ent, &r);
+ get_pdu_remap(ent, &r, has_cg);
trace_seq_printf(s, "%llu + %u <- (%d,%d) %llu\n",
t_sector(ent), t_sec(ent),
MAJOR(r.device_from), MINOR(r.device_from),
(unsigned long long)r.sector_from);
}

-static void blk_log_plug(struct trace_seq *s, const struct trace_entry *ent)
+static void blk_log_plug(struct trace_seq *s, const struct trace_entry *ent, bool has_cg)
{
char cmd[TASK_COMM_LEN];

@@ -1235,30 +1316,31 @@ static void blk_log_plug(struct trace_seq *s, const struct trace_entry *ent)
trace_seq_printf(s, "[%s]\n", cmd);
}

-static void blk_log_unplug(struct trace_seq *s, const struct trace_entry *ent)
+static void blk_log_unplug(struct trace_seq *s, const struct trace_entry *ent, bool has_cg)
{
char cmd[TASK_COMM_LEN];

trace_find_cmdline(ent->pid, cmd);

- trace_seq_printf(s, "[%s] %llu\n", cmd, get_pdu_int(ent));
+ trace_seq_printf(s, "[%s] %llu\n", cmd, get_pdu_int(ent, has_cg));
}

-static void blk_log_split(struct trace_seq *s, const struct trace_entry *ent)
+static void blk_log_split(struct trace_seq *s, const struct trace_entry *ent, bool has_cg)
{
char cmd[TASK_COMM_LEN];

trace_find_cmdline(ent->pid, cmd);

trace_seq_printf(s, "%llu / %llu [%s]\n", t_sector(ent),
- get_pdu_int(ent), cmd);
+ get_pdu_int(ent, has_cg), cmd);
}

-static void blk_log_msg(struct trace_seq *s, const struct trace_entry *ent)
+static void blk_log_msg(struct trace_seq *s, const struct trace_entry *ent,
+ bool has_cg)
{
- const struct blk_io_trace *t = te_blk_io_trace(ent);

- trace_seq_putmem(s, t + 1, t->pdu_len);
+ trace_seq_putmem(s, pdu_start(ent, has_cg),
+ pdu_real_len(ent, has_cg));
trace_seq_putc(s, '\n');
}

@@ -1298,7 +1380,8 @@ static void blk_tracer_reset(struct trace_array *tr)

static const struct {
const char *act[2];
- void (*print)(struct trace_seq *s, const struct trace_entry *ent);
+ void (*print)(struct trace_seq *s, const struct trace_entry *ent,
+ bool has_cg);
} what2act[] = {
[__BLK_TA_QUEUE] = {{ "Q", "queue" }, blk_log_generic },
[__BLK_TA_BACKMERGE] = {{ "M", "backmerge" }, blk_log_generic },
@@ -1326,23 +1409,25 @@ static enum print_line_t print_one_line(struct trace_iterator *iter,
u16 what;
bool long_act;
blk_log_action_t *log_action;
+ bool has_cg;

t = te_blk_io_trace(iter->ent);
- what = t->action & ((1 << BLK_TC_SHIFT) - 1);
+ what = (t->action & ((1 << BLK_TC_SHIFT) - 1)) & ~__BLK_TA_CGROUP;
long_act = !!(tr->trace_flags & TRACE_ITER_VERBOSE);
log_action = classic ? &blk_log_action_classic : &blk_log_action;
+ has_cg = t->action & __BLK_TA_CGROUP;

- if (t->action == BLK_TN_MESSAGE) {
- log_action(iter, long_act ? "message" : "m");
- blk_log_msg(s, iter->ent);
+ if ((t->action & ~__BLK_TN_CGROUP) == BLK_TN_MESSAGE) {
+ log_action(iter, long_act ? "message" : "m", has_cg);
+ blk_log_msg(s, iter->ent, has_cg);
return trace_handle_return(s);
}

if (unlikely(what == 0 || what >= ARRAY_SIZE(what2act)))
trace_seq_printf(s, "Unknown action %x\n", what);
else {
- log_action(iter, what2act[what].act[long_act]);
- what2act[what].print(s, iter->ent);
+ log_action(iter, what2act[what].act[long_act], has_cg);
+ what2act[what].print(s, iter->ent, has_cg);
}

return trace_handle_return(s);
--
2.9.3

2017-07-12 18:50:53

by Shaohua Li

[permalink] [raw]
Subject: [PATCH V5 07/11] cgroup: export fhandle info for a cgroup

From: Shaohua Li <[email protected]>

Add an API to export cgroup fhandle info. We don't export a full 'struct
file_handle', there are unrequired info. Sepcifically, cgroup is always
a directory, so we don't need a 'FILEID_INO32_GEN_PARENT' type fhandle,
we only need export the inode number and generation number just like
what generic_fh_to_dentry does. And we can avoid the overhead of getting
an inode too, since kernfs_node_id (ino and generation) has all the info
required.

Acked-by: Tejun Heo <[email protected]>
Signed-off-by: Shaohua Li <[email protected]>
---
include/linux/cgroup.h | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 30c6877..52ef9a6 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -609,6 +609,10 @@ static inline void cgroup_kthread_ready(void)
current->no_cgroup_migration = 0;
}

+static inline union kernfs_node_id *cgroup_get_kernfs_id(struct cgroup *cgrp)
+{
+ return &cgrp->kn->id;
+}
#else /* !CONFIG_CGROUPS */

struct cgroup_subsys_state;
@@ -631,6 +635,10 @@ static inline int cgroup_init_early(void) { return 0; }
static inline int cgroup_init(void) { return 0; }
static inline void cgroup_init_kthreadd(void) {}
static inline void cgroup_kthread_ready(void) {}
+static inline union kernfs_node_id *cgroup_get_kernfs_id(struct cgroup *cgrp)
+{
+ return NULL;
+}

static inline bool task_under_cgroup_hierarchy(struct task_struct *task,
struct cgroup *ancestor)
--
2.9.3

2017-07-12 18:51:11

by Shaohua Li

[permalink] [raw]
Subject: [PATCH V5 11/11] block: use standard blktrace API to output cgroup info for debug notes

From: Shaohua Li <[email protected]>

Currently cfq/bfq/blk-throttle output cgroup info in trace in their own
way. Now we have standard blktrace API for this, so convert them to use
it.

Note, this changes the behavior a little bit. cgroup info isn't output
by default, we only do this with 'blk_cgroup' option enabled. cgroup
info isn't output as a string by default too, we only do this with
'blk_cgname' option enabled. Also cgroup info is output in different
position of the note string. I think these behavior changes aren't a big
issue (actually we make trace data shorter which is good), since the
blktrace note is solely for debugging.

Signed-off-by: Shaohua Li <[email protected]>
---
block/bfq-iosched.h | 13 ++++++++-----
block/blk-throttle.c | 6 ++----
block/cfq-iosched.c | 15 ++++++---------
include/linux/blktrace_api.h | 13 +++++++++----
kernel/trace/blktrace.c | 12 ++++++++++--
5 files changed, 35 insertions(+), 24 deletions(-)

diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h
index 8fd83b8..bf69d13 100644
--- a/block/bfq-iosched.h
+++ b/block/bfq-iosched.h
@@ -917,13 +917,16 @@ void bfq_add_bfqq_busy(struct bfq_data *bfqd, struct bfq_queue *bfqq);
struct bfq_group *bfqq_group(struct bfq_queue *bfqq);

#define bfq_log_bfqq(bfqd, bfqq, fmt, args...) do { \
- blk_add_trace_msg((bfqd)->queue, "bfq%d%c %s " fmt, (bfqq)->pid,\
- bfq_bfqq_sync((bfqq)) ? 'S' : 'A', \
- bfqq_group(bfqq)->blkg_path, ##args); \
+ blk_add_cgroup_trace_msg((bfqd)->queue, \
+ bfqg_to_blkg(bfqq_group(bfqq))->blkcg, \
+ "bfq%d%c " fmt, (bfqq)->pid, \
+ bfq_bfqq_sync((bfqq)) ? 'S' : 'A', ##args); \
} while (0)

-#define bfq_log_bfqg(bfqd, bfqg, fmt, args...) \
- blk_add_trace_msg((bfqd)->queue, "%s " fmt, (bfqg)->blkg_path, ##args)
+#define bfq_log_bfqg(bfqd, bfqg, fmt, args...) do { \
+ blk_add_cgroup_trace_msg((bfqd)->queue, \
+ bfqg_to_blkg(bfqg)->blkcg, fmt, ##args); \
+} while (0)

#else /* CONFIG_BFQ_GROUP_IOSCHED */

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index a6ebd2b..6a4c4c4 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -373,10 +373,8 @@ static unsigned int tg_iops_limit(struct throtl_grp *tg, int rw)
if (likely(!blk_trace_note_message_enabled(__td->queue))) \
break; \
if ((__tg)) { \
- char __pbuf[128]; \
- \
- blkg_path(tg_to_blkg(__tg), __pbuf, sizeof(__pbuf)); \
- blk_add_trace_msg(__td->queue, "throtl %s " fmt, __pbuf, ##args); \
+ blk_add_cgroup_trace_msg(__td->queue, \
+ tg_to_blkg(__tg)->blkcg, "throtl " fmt, ##args);\
} else { \
blk_add_trace_msg(__td->queue, "throtl " fmt, ##args); \
} \
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 3d5c289..0fb78fb 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -656,20 +656,17 @@ static inline void cfqg_put(struct cfq_group *cfqg)
}

#define cfq_log_cfqq(cfqd, cfqq, fmt, args...) do { \
- char __pbuf[128]; \
- \
- blkg_path(cfqg_to_blkg((cfqq)->cfqg), __pbuf, sizeof(__pbuf)); \
- blk_add_trace_msg((cfqd)->queue, "cfq%d%c%c %s " fmt, (cfqq)->pid, \
+ blk_add_cgroup_trace_msg((cfqd)->queue, \
+ cfqg_to_blkg((cfqq)->cfqg)->blkcg, \
+ "cfq%d%c%c " fmt, (cfqq)->pid, \
cfq_cfqq_sync((cfqq)) ? 'S' : 'A', \
cfqq_type((cfqq)) == SYNC_NOIDLE_WORKLOAD ? 'N' : ' ',\
- __pbuf, ##args); \
+ ##args); \
} while (0)

#define cfq_log_cfqg(cfqd, cfqg, fmt, args...) do { \
- char __pbuf[128]; \
- \
- blkg_path(cfqg_to_blkg(cfqg), __pbuf, sizeof(__pbuf)); \
- blk_add_trace_msg((cfqd)->queue, "%s " fmt, __pbuf, ##args); \
+ blk_add_cgroup_trace_msg((cfqd)->queue, \
+ cfqg_to_blkg(cfqg)->blkcg, fmt, ##args); \
} while (0)

static inline void cfqg_stats_update_io_add(struct cfq_group *cfqg,
diff --git a/include/linux/blktrace_api.h b/include/linux/blktrace_api.h
index d2e9085..67b4d4d 100644
--- a/include/linux/blktrace_api.h
+++ b/include/linux/blktrace_api.h
@@ -28,10 +28,12 @@ struct blk_trace {
atomic_t dropped;
};

+struct blkcg;
+
extern int blk_trace_ioctl(struct block_device *, unsigned, char __user *);
extern void blk_trace_shutdown(struct request_queue *);
-extern __printf(2, 3)
-void __trace_note_message(struct blk_trace *, const char *fmt, ...);
+extern __printf(3, 4)
+void __trace_note_message(struct blk_trace *, struct blkcg *blkcg, const char *fmt, ...);

/**
* blk_add_trace_msg - Add a (simple) message to the blktrace stream
@@ -46,12 +48,14 @@ void __trace_note_message(struct blk_trace *, const char *fmt, ...);
* NOTE: Can not use 'static inline' due to presence of var args...
*
**/
-#define blk_add_trace_msg(q, fmt, ...) \
+#define blk_add_cgroup_trace_msg(q, cg, fmt, ...) \
do { \
struct blk_trace *bt = (q)->blk_trace; \
if (unlikely(bt)) \
- __trace_note_message(bt, fmt, ##__VA_ARGS__); \
+ __trace_note_message(bt, cg, fmt, ##__VA_ARGS__);\
} while (0)
+#define blk_add_trace_msg(q, fmt, ...) \
+ blk_add_cgroup_trace_msg(q, NULL, fmt, ##__VA_ARGS__)
#define BLK_TN_MAX_MSG 128

static inline bool blk_trace_note_message_enabled(struct request_queue *q)
@@ -82,6 +86,7 @@ extern struct attribute_group blk_trace_attr_group;
# define blk_trace_startstop(q, start) (-ENOTTY)
# define blk_trace_remove(q) (-ENOTTY)
# define blk_add_trace_msg(q, fmt, ...) do { } while (0)
+# define blk_add_cgroup_trace_msg(q, cg, fmt, ...) do { } while (0)
# define blk_trace_remove_sysfs(dev) do { } while (0)
# define blk_trace_note_message_enabled(q) (false)
static inline int blk_trace_init_sysfs(struct device *dev)
diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index e90974e..7724de1 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -154,7 +154,8 @@ static void trace_note_time(struct blk_trace *bt)
local_irq_restore(flags);
}

-void __trace_note_message(struct blk_trace *bt, const char *fmt, ...)
+void __trace_note_message(struct blk_trace *bt, struct blkcg *blkcg,
+ const char *fmt, ...)
{
int n;
va_list args;
@@ -178,7 +179,14 @@ void __trace_note_message(struct blk_trace *bt, const char *fmt, ...)
n = vscnprintf(buf, BLK_TN_MAX_MSG, fmt, args);
va_end(args);

+ if (!(blk_tracer_flags.val & TRACE_BLK_OPT_CGROUP))
+ blkcg = NULL;
+#ifdef CONFIG_BLK_CGROUP
+ trace_note(bt, 0, BLK_TN_MESSAGE, buf, n,
+ blkcg ? cgroup_get_kernfs_id(blkcg->css.cgroup) : NULL);
+#else
trace_note(bt, 0, BLK_TN_MESSAGE, buf, n, NULL);
+#endif
local_irq_restore(flags);
}
EXPORT_SYMBOL_GPL(__trace_note_message);
@@ -375,7 +383,7 @@ static ssize_t blk_msg_write(struct file *filp, const char __user *buffer,
return PTR_ERR(msg);

bt = filp->private_data;
- __trace_note_message(bt, "%s", msg);
+ __trace_note_message(bt, NULL, "%s", msg);
kfree(msg);

return count;
--
2.9.3

2017-07-12 18:51:30

by Shaohua Li

[permalink] [raw]
Subject: [PATCH V5 09/11] block: always attach cgroup info into bio

From: Shaohua Li <[email protected]>

blkcg_bio_issue_check() already gets blkcg for a BIO.
bio_associate_blkcg() uses a percpu refcounter, so it's a very cheap
operation. There is no point we don't attach the cgroup info into bio at
blkcg_bio_issue_check. This also makes blktrace outputs correct cgroup
info.

Acked-by: Tejun Heo <[email protected]>
Signed-off-by: Shaohua Li <[email protected]>
---
block/blk-throttle.c | 7 +------
include/linux/blk-cgroup.h | 3 +++
2 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index a7285bf..a6ebd2b 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -2104,14 +2104,9 @@ static inline void throtl_update_latency_buckets(struct throtl_data *td)
static void blk_throtl_assoc_bio(struct throtl_grp *tg, struct bio *bio)
{
#ifdef CONFIG_BLK_DEV_THROTTLING_LOW
- int ret;
-
- ret = bio_associate_current(bio);
- if (ret == 0 || ret == -EBUSY)
+ if (bio->bi_css)
bio->bi_cg_private = tg;
blk_stat_set_issue(&bio->bi_issue_stat, bio_sectors(bio));
-#else
- bio_associate_current(bio);
#endif
}

diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 7104bea..9d92153 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -691,6 +691,9 @@ static inline bool blkcg_bio_issue_check(struct request_queue *q,
rcu_read_lock();
blkcg = bio_blkcg(bio);

+ /* associate blkcg if bio hasn't attached one */
+ bio_associate_blkcg(bio, &blkcg->css);
+
blkg = blkg_lookup(blkcg, q);
if (unlikely(!blkg)) {
spin_lock_irq(q->queue_lock);
--
2.9.3

2017-07-12 18:51:28

by Shaohua Li

[permalink] [raw]
Subject: [PATCH V5 10/11] blktrace: add an option to allow displaying cgroup path

From: Shaohua Li <[email protected]>

By default we output cgroup id in blktrace. This adds an option to
display cgroup path. Since get cgroup path is a relativly heavy
operation, we don't enable it by default.

with the option enabled, blktrace will output something like this:
dd-1353 [007] d..2 293.015252: 8,0 /test/level D R 24 + 8 [dd]

Signed-off-by: Shaohua Li <[email protected]>
---
fs/kernfs/mount.c | 19 +++++++++++++++++++
include/linux/cgroup.h | 6 ++++++
include/linux/kernfs.h | 2 ++
kernel/cgroup/cgroup.c | 12 ++++++++++++
kernel/trace/blktrace.c | 14 +++++++++++++-
5 files changed, 52 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index fa32358..7c452f4 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -65,6 +65,25 @@ const struct super_operations kernfs_sops = {
.show_path = kernfs_sop_show_path,
};

+/*
+ * Similar to kernfs_fh_get_inode, this one gets kernfs node from inode
+ * number and generation
+ */
+struct kernfs_node *kernfs_get_node_by_id(struct kernfs_root *root,
+ const union kernfs_node_id *id)
+{
+ struct kernfs_node *kn;
+
+ kn = kernfs_find_and_get_node_by_ino(root, id->ino);
+ if (!kn)
+ return NULL;
+ if (kn->id.generation != id->generation) {
+ kernfs_put(kn);
+ return NULL;
+ }
+ return kn;
+}
+
static struct inode *kernfs_fh_get_inode(struct super_block *sb,
u64 ino, u32 generation)
{
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 52ef9a6..6144fe9 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -613,6 +613,9 @@ static inline union kernfs_node_id *cgroup_get_kernfs_id(struct cgroup *cgrp)
{
return &cgrp->kn->id;
}
+
+void cgroup_path_from_kernfs_id(const union kernfs_node_id *id,
+ char *buf, size_t buflen);
#else /* !CONFIG_CGROUPS */

struct cgroup_subsys_state;
@@ -645,6 +648,9 @@ static inline bool task_under_cgroup_hierarchy(struct task_struct *task,
{
return true;
}
+
+static inline void cgroup_path_from_kernfs_id(const union kernfs_node_id *id,
+ char *buf, size_t buflen) {}
#endif /* !CONFIG_CGROUPS */

/*
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index d149361..ab25c8b 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -358,6 +358,8 @@ struct super_block *kernfs_pin_sb(struct kernfs_root *root, const void *ns);

void kernfs_init(void);

+struct kernfs_node *kernfs_get_node_by_id(struct kernfs_root *root,
+ const union kernfs_node_id *id);
#else /* CONFIG_KERNFS */

static inline enum kernfs_node_type kernfs_type(struct kernfs_node *kn)
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 8f88aac..c7fae93 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -4705,6 +4705,18 @@ static int __init cgroup_wq_init(void)
}
core_initcall(cgroup_wq_init);

+void cgroup_path_from_kernfs_id(const union kernfs_node_id *id,
+ char *buf, size_t buflen)
+{
+ struct kernfs_node *kn;
+
+ kn = kernfs_get_node_by_id(cgrp_dfl_root.kf_root, id);
+ if (!kn)
+ return;
+ kernfs_path(kn, buf, buflen);
+ kernfs_put(kn);
+}
+
/*
* proc_cgroup_show()
* - Print task's cgroup paths into seq_file, one line for each hierarchy
diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index f393d7a..e90974e 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -48,12 +48,14 @@ static __cacheline_aligned_in_smp DEFINE_SPINLOCK(running_trace_lock);
/* Select an alternative, minimalistic output than the original one */
#define TRACE_BLK_OPT_CLASSIC 0x1
#define TRACE_BLK_OPT_CGROUP 0x2
+#define TRACE_BLK_OPT_CGNAME 0x4

static struct tracer_opt blk_tracer_opts[] = {
/* Default disable the minimalistic output */
{ TRACER_OPT(blk_classic, TRACE_BLK_OPT_CLASSIC) },
#ifdef CONFIG_BLK_CGROUP
{ TRACER_OPT(blk_cgroup, TRACE_BLK_OPT_CGROUP) },
+ { TRACER_OPT(blk_cgname, TRACE_BLK_OPT_CGNAME) },
#endif
{ }
};
@@ -1213,7 +1215,17 @@ static void blk_log_action(struct trace_iterator *iter, const char *act,
if (has_cg) {
const union kernfs_node_id *id = cgid_start(iter->ent);

- trace_seq_printf(&iter->seq, "%3d,%-3d %x,%-x %2s %3s ",
+ if (blk_tracer_flags.val & TRACE_BLK_OPT_CGNAME) {
+ char blkcg_name_buf[NAME_MAX + 1] = "<...>";
+
+ cgroup_path_from_kernfs_id(id, blkcg_name_buf,
+ sizeof(blkcg_name_buf));
+ trace_seq_printf(&iter->seq, "%3d,%-3d %s %2s %3s ",
+ MAJOR(t->device), MINOR(t->device),
+ blkcg_name_buf, act, rwbs);
+ } else
+ trace_seq_printf(&iter->seq,
+ "%3d,%-3d %x,%-x %2s %3s ",
MAJOR(t->device), MINOR(t->device),
id->ino, id->generation, act, rwbs);
} else
--
2.9.3

2017-07-12 18:50:35

by Shaohua Li

[permalink] [raw]
Subject: [PATCH V5 02/11] kernfs: implement i_generation

From: Shaohua Li <[email protected]>

Set i_generation for kernfs inode. This is required to implement
exportfs operations. The generation is 32-bit, so it's possible the
generation wraps up and we find stale files. To reduce the posssibility,
we don't reuse inode numer immediately. When the inode number allocation
wraps, we increase generation number. In this way generation/inode
number consist of a 64-bit number which is unlikely duplicated. This
does make the idr tree more sparse and waste some memory. Since idr
manages 32-bit keys, idr uses a 6-level radix tree, each level covers 6
bits of the key. In a 100k inode kernfs, the worst case will have around
300k radix tree node. Each node is 576bytes, so the tree will use about
~150M memory. Sounds not too bad, if this really is a problem, we should
find better data structure.

Acked-by: Tejun Heo <[email protected]>
Acked-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Shaohua Li <[email protected]>
---
fs/kernfs/dir.c | 10 +++++++++-
fs/kernfs/inode.c | 1 +
include/linux/kernfs.h | 2 ++
3 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 8ad7a17..33f711f 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -623,6 +623,8 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
unsigned flags)
{
struct kernfs_node *kn;
+ u32 gen;
+ int cursor;
int ret;

name = kstrdup_const(name, GFP_KERNEL);
@@ -635,12 +637,17 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,

idr_preload(GFP_KERNEL);
spin_lock(&kernfs_idr_lock);
- ret = idr_alloc(&root->ino_idr, kn, 1, 0, GFP_ATOMIC);
+ cursor = idr_get_cursor(&root->ino_idr);
+ ret = idr_alloc_cyclic(&root->ino_idr, kn, 1, 0, GFP_ATOMIC);
+ if (ret >= 0 && ret < cursor)
+ root->next_generation++;
+ gen = root->next_generation;
spin_unlock(&kernfs_idr_lock);
idr_preload_end();
if (ret < 0)
goto err_out2;
kn->ino = ret;
+ kn->generation = gen;

atomic_set(&kn->count, 1);
atomic_set(&kn->active, KN_DEACTIVATED_BIAS);
@@ -884,6 +891,7 @@ struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,

idr_init(&root->ino_idr);
INIT_LIST_HEAD(&root->supers);
+ root->next_generation = 1;

kn = __kernfs_new_node(root, "", S_IFDIR | S_IRUGO | S_IXUGO,
KERNFS_DIR);
diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index fb4b4a7..79cdae4 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -220,6 +220,7 @@ static void kernfs_init_inode(struct kernfs_node *kn, struct inode *inode)
inode->i_private = kn;
inode->i_mapping->a_ops = &kernfs_aops;
inode->i_op = &kernfs_iops;
+ inode->i_generation = kn->generation;

set_default_inode_attr(inode, kn->mode);
kernfs_refresh_inode(kn, inode);
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 5f5d602..8c00d28 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -135,6 +135,7 @@ struct kernfs_node {
umode_t mode;
unsigned int ino;
struct kernfs_iattrs *iattr;
+ u32 generation;
};

/*
@@ -164,6 +165,7 @@ struct kernfs_root {

/* private fields, do not use outside kernfs proper */
struct idr ino_idr;
+ u32 next_generation;
struct kernfs_syscall_ops *syscall_ops;

/* list of kernfs_super_info of this root, protected by kernfs_mutex */
--
2.9.3

2017-07-12 18:52:07

by Shaohua Li

[permalink] [raw]
Subject: [PATCH V5 05/11] kernfs: introduce kernfs_node_id

From: Shaohua Li <[email protected]>

inode number and generation can identify a kernfs node. We are going to
export the identification by exportfs operations, so put ino and
generation into a separate structure. It's convenient when later patches
use the identification.

Acked-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Shaohua Li <[email protected]>
---
fs/kernfs/dir.c | 10 +++++-----
fs/kernfs/file.c | 4 ++--
fs/kernfs/inode.c | 4 ++--
include/linux/cgroup.h | 2 +-
include/linux/kernfs.h | 12 ++++++++++--
include/trace/events/writeback.h | 2 +-
6 files changed, 21 insertions(+), 13 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index b61a7ef..89d1dc1 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -539,7 +539,7 @@ void kernfs_put(struct kernfs_node *kn)
}
kfree(kn->iattr);
spin_lock(&kernfs_idr_lock);
- idr_remove(&root->ino_idr, kn->ino);
+ idr_remove(&root->ino_idr, kn->id.ino);
spin_unlock(&kernfs_idr_lock);
kmem_cache_free(kernfs_node_cache, kn);

@@ -645,8 +645,8 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
idr_preload_end();
if (ret < 0)
goto err_out2;
- kn->ino = ret;
- kn->generation = gen;
+ kn->id.ino = ret;
+ kn->id.generation = gen;

/*
* set ino first. This barrier is paired with atomic_inc_not_zero in
@@ -721,7 +721,7 @@ struct kernfs_node *kernfs_find_and_get_node_by_ino(struct kernfs_root *root,
* before 'count'. So if 'count' is uptodate, 'ino' should be uptodate,
* hence we can use 'ino' to filter stale node.
*/
- if (kn->ino != ino)
+ if (kn->id.ino != ino)
goto out;
rcu_read_unlock();

@@ -1654,7 +1654,7 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
const char *name = pos->name;
unsigned int type = dt_type(pos);
int len = strlen(name);
- ino_t ino = pos->ino;
+ ino_t ino = pos->id.ino;

ctx->pos = pos->hash;
file->private_data = pos;
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 7f90d4d..7441925 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -895,7 +895,7 @@ static void kernfs_notify_workfn(struct work_struct *work)
* have the matching @file available. Look up the inodes
* and generate the events manually.
*/
- inode = ilookup(info->sb, kn->ino);
+ inode = ilookup(info->sb, kn->id.ino);
if (!inode)
continue;

@@ -903,7 +903,7 @@ static void kernfs_notify_workfn(struct work_struct *work)
if (parent) {
struct inode *p_inode;

- p_inode = ilookup(info->sb, parent->ino);
+ p_inode = ilookup(info->sb, parent->id.ino);
if (p_inode) {
fsnotify(p_inode, FS_MODIFY | FS_EVENT_ON_CHILD,
inode, FSNOTIFY_EVENT_INODE, kn->name, 0);
diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index 4c8b510..a343039 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -220,7 +220,7 @@ static void kernfs_init_inode(struct kernfs_node *kn, struct inode *inode)
inode->i_private = kn;
inode->i_mapping->a_ops = &kernfs_aops;
inode->i_op = &kernfs_iops;
- inode->i_generation = kn->generation;
+ inode->i_generation = kn->id.generation;

set_default_inode_attr(inode, kn->mode);
kernfs_refresh_inode(kn, inode);
@@ -266,7 +266,7 @@ struct inode *kernfs_get_inode(struct super_block *sb, struct kernfs_node *kn)
{
struct inode *inode;

- inode = iget_locked(sb, kn->ino);
+ inode = iget_locked(sb, kn->id.ino);
if (inode && (inode->i_state & I_NEW))
kernfs_init_inode(kn, inode);

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 710a005..30c6877 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -543,7 +543,7 @@ static inline bool cgroup_is_populated(struct cgroup *cgrp)
/* returns ino associated with a cgroup */
static inline ino_t cgroup_ino(struct cgroup *cgrp)
{
- return cgrp->kn->ino;
+ return cgrp->kn->id.ino;
}

/* cft/css accessors for cftype->write() operation */
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 8c00d28..06a0c59 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -95,6 +95,15 @@ struct kernfs_elem_attr {
struct kernfs_node *notify_next; /* for kernfs_notify() */
};

+/* represent a kernfs node */
+union kernfs_node_id {
+ struct {
+ u32 ino;
+ u32 generation;
+ };
+ u64 id;
+};
+
/*
* kernfs_node - the building block of kernfs hierarchy. Each and every
* kernfs node is represented by single kernfs_node. Most fields are
@@ -131,11 +140,10 @@ struct kernfs_node {

void *priv;

+ union kernfs_node_id id;
unsigned short flags;
umode_t mode;
- unsigned int ino;
struct kernfs_iattrs *iattr;
- u32 generation;
};

/*
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 7bd8783..9b57f01 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -136,7 +136,7 @@ DEFINE_EVENT(writeback_dirty_inode_template, writeback_dirty_inode,

static inline unsigned int __trace_wb_assign_cgroup(struct bdi_writeback *wb)
{
- return wb->memcg_css->cgroup->kn->ino;
+ return wb->memcg_css->cgroup->kn->id.ino;
}

static inline unsigned int __trace_wbc_assign_cgroup(struct writeback_control *wbc)
--
2.9.3

2017-07-12 18:52:28

by Shaohua Li

[permalink] [raw]
Subject: [PATCH V5 04/11] kernfs: don't set dentry->d_fsdata

From: Shaohua Li <[email protected]>

When working on adding exportfs operations in kernfs, I found it's hard
to initialize dentry->d_fsdata in the exportfs operations. Looks there
is no way to do it without race condition. Look at the kernfs code
closely, there is no point to set dentry->d_fsdata. inode->i_private
already points to kernfs_node, and we can get inode from a dentry. So
this patch just delete the d_fsdata usage.

Acked-by: Tejun Heo <[email protected]>
Acked-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Shaohua Li <[email protected]>
---
fs/kernfs/dir.c | 25 +++++++++----------------
fs/kernfs/file.c | 6 +++---
fs/kernfs/inode.c | 6 +++---
fs/kernfs/kernfs-internal.h | 7 +++++++
fs/kernfs/mount.c | 8 ++------
fs/kernfs/symlink.c | 6 +++---
6 files changed, 27 insertions(+), 31 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 7be37c8..b61a7ef 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -566,7 +566,7 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
if (d_really_is_negative(dentry))
goto out_bad_unlocked;

- kn = dentry->d_fsdata;
+ kn = kernfs_dentry_node(dentry);
mutex_lock(&kernfs_mutex);

/* The kernfs node has been deactivated */
@@ -574,7 +574,7 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
goto out_bad;

/* The kernfs node has been moved? */
- if (dentry->d_parent->d_fsdata != kn->parent)
+ if (kernfs_dentry_node(dentry->d_parent) != kn->parent)
goto out_bad;

/* The kernfs node has been renamed */
@@ -594,14 +594,8 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
return 0;
}

-static void kernfs_dop_release(struct dentry *dentry)
-{
- kernfs_put(dentry->d_fsdata);
-}
-
const struct dentry_operations kernfs_dops = {
.d_revalidate = kernfs_dop_revalidate,
- .d_release = kernfs_dop_release,
};

/**
@@ -617,8 +611,9 @@ const struct dentry_operations kernfs_dops = {
*/
struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry)
{
- if (dentry->d_sb->s_op == &kernfs_sops)
- return dentry->d_fsdata;
+ if (dentry->d_sb->s_op == &kernfs_sops &&
+ !d_really_is_negative(dentry))
+ return kernfs_dentry_node(dentry);
return NULL;
}

@@ -1056,7 +1051,7 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
unsigned int flags)
{
struct dentry *ret;
- struct kernfs_node *parent = dentry->d_parent->d_fsdata;
+ struct kernfs_node *parent = dir->i_private;
struct kernfs_node *kn;
struct inode *inode;
const void *ns = NULL;
@@ -1073,8 +1068,6 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
ret = NULL;
goto out_unlock;
}
- kernfs_get(kn);
- dentry->d_fsdata = kn;

/* attach dentry and inode */
inode = kernfs_get_inode(dir->i_sb, kn);
@@ -1111,7 +1104,7 @@ static int kernfs_iop_mkdir(struct inode *dir, struct dentry *dentry,

static int kernfs_iop_rmdir(struct inode *dir, struct dentry *dentry)
{
- struct kernfs_node *kn = dentry->d_fsdata;
+ struct kernfs_node *kn = kernfs_dentry_node(dentry);
struct kernfs_syscall_ops *scops = kernfs_root(kn)->syscall_ops;
int ret;

@@ -1131,7 +1124,7 @@ static int kernfs_iop_rename(struct inode *old_dir, struct dentry *old_dentry,
struct inode *new_dir, struct dentry *new_dentry,
unsigned int flags)
{
- struct kernfs_node *kn = old_dentry->d_fsdata;
+ struct kernfs_node *kn = kernfs_dentry_node(old_dentry);
struct kernfs_node *new_parent = new_dir->i_private;
struct kernfs_syscall_ops *scops = kernfs_root(kn)->syscall_ops;
int ret;
@@ -1644,7 +1637,7 @@ static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
{
struct dentry *dentry = file->f_path.dentry;
- struct kernfs_node *parent = dentry->d_fsdata;
+ struct kernfs_node *parent = kernfs_dentry_node(dentry);
struct kernfs_node *pos = file->private_data;
const void *ns = NULL;

diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index ac2dfe0..7f90d4d 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -616,7 +616,7 @@ static void kernfs_put_open_node(struct kernfs_node *kn,

static int kernfs_fop_open(struct inode *inode, struct file *file)
{
- struct kernfs_node *kn = file->f_path.dentry->d_fsdata;
+ struct kernfs_node *kn = inode->i_private;
struct kernfs_root *root = kernfs_root(kn);
const struct kernfs_ops *ops;
struct kernfs_open_file *of;
@@ -768,7 +768,7 @@ static void kernfs_release_file(struct kernfs_node *kn,

static int kernfs_fop_release(struct inode *inode, struct file *filp)
{
- struct kernfs_node *kn = filp->f_path.dentry->d_fsdata;
+ struct kernfs_node *kn = inode->i_private;
struct kernfs_open_file *of = kernfs_of(filp);

if (kn->flags & KERNFS_HAS_RELEASE) {
@@ -835,7 +835,7 @@ void kernfs_drain_open_files(struct kernfs_node *kn)
static unsigned int kernfs_fop_poll(struct file *filp, poll_table *wait)
{
struct kernfs_open_file *of = kernfs_of(filp);
- struct kernfs_node *kn = filp->f_path.dentry->d_fsdata;
+ struct kernfs_node *kn = kernfs_dentry_node(filp->f_path.dentry);
struct kernfs_open_node *on = kn->attr.open;

if (!kernfs_get_active(kn))
diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index 79cdae4..4c8b510 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -112,7 +112,7 @@ int kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr)
int kernfs_iop_setattr(struct dentry *dentry, struct iattr *iattr)
{
struct inode *inode = d_inode(dentry);
- struct kernfs_node *kn = dentry->d_fsdata;
+ struct kernfs_node *kn = inode->i_private;
int error;

if (!kn)
@@ -154,7 +154,7 @@ static int kernfs_node_setsecdata(struct kernfs_iattrs *attrs, void **secdata,

ssize_t kernfs_iop_listxattr(struct dentry *dentry, char *buf, size_t size)
{
- struct kernfs_node *kn = dentry->d_fsdata;
+ struct kernfs_node *kn = kernfs_dentry_node(dentry);
struct kernfs_iattrs *attrs;

attrs = kernfs_iattrs(kn);
@@ -203,8 +203,8 @@ static void kernfs_refresh_inode(struct kernfs_node *kn, struct inode *inode)
int kernfs_iop_getattr(const struct path *path, struct kstat *stat,
u32 request_mask, unsigned int query_flags)
{
- struct kernfs_node *kn = path->dentry->d_fsdata;
struct inode *inode = d_inode(path->dentry);
+ struct kernfs_node *kn = inode->i_private;

mutex_lock(&kernfs_mutex);
kernfs_refresh_inode(kn, inode);
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index e9c226f..0f260dc 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -70,6 +70,13 @@ struct kernfs_super_info {
};
#define kernfs_info(SB) ((struct kernfs_super_info *)(SB->s_fs_info))

+static inline struct kernfs_node *kernfs_dentry_node(struct dentry *dentry)
+{
+ if (d_really_is_negative(dentry))
+ return NULL;
+ return d_inode(dentry)->i_private;
+}
+
extern const struct super_operations kernfs_sops;
extern struct kmem_cache *kernfs_node_cache;

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 69c48be..acd5426 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -33,7 +33,7 @@ static int kernfs_sop_remount_fs(struct super_block *sb, int *flags, char *data)

static int kernfs_sop_show_options(struct seq_file *sf, struct dentry *dentry)
{
- struct kernfs_root *root = kernfs_root(dentry->d_fsdata);
+ struct kernfs_root *root = kernfs_root(kernfs_dentry_node(dentry));
struct kernfs_syscall_ops *scops = root->syscall_ops;

if (scops && scops->show_options)
@@ -43,7 +43,7 @@ static int kernfs_sop_show_options(struct seq_file *sf, struct dentry *dentry)

static int kernfs_sop_show_path(struct seq_file *sf, struct dentry *dentry)
{
- struct kernfs_node *node = dentry->d_fsdata;
+ struct kernfs_node *node = kernfs_dentry_node(dentry);
struct kernfs_root *root = kernfs_root(node);
struct kernfs_syscall_ops *scops = root->syscall_ops;

@@ -176,8 +176,6 @@ static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
pr_debug("%s: could not get root dentry!\n", __func__);
return -ENOMEM;
}
- kernfs_get(info->root->kn);
- root->d_fsdata = info->root->kn;
sb->s_root = root;
sb->s_d_op = &kernfs_dops;
return 0;
@@ -283,7 +281,6 @@ struct dentry *kernfs_mount_ns(struct file_system_type *fs_type, int flags,
void kernfs_kill_sb(struct super_block *sb)
{
struct kernfs_super_info *info = kernfs_info(sb);
- struct kernfs_node *root_kn = sb->s_root->d_fsdata;

mutex_lock(&kernfs_mutex);
list_del(&info->node);
@@ -295,7 +292,6 @@ void kernfs_kill_sb(struct super_block *sb)
*/
kill_anon_super(sb);
kfree(info);
- kernfs_put(root_kn);
}

/**
diff --git a/fs/kernfs/symlink.c b/fs/kernfs/symlink.c
index 1684af4..08ccabd 100644
--- a/fs/kernfs/symlink.c
+++ b/fs/kernfs/symlink.c
@@ -98,9 +98,9 @@ static int kernfs_get_target_path(struct kernfs_node *parent,
return 0;
}

-static int kernfs_getlink(struct dentry *dentry, char *path)
+static int kernfs_getlink(struct inode *inode, char *path)
{
- struct kernfs_node *kn = dentry->d_fsdata;
+ struct kernfs_node *kn = inode->i_private;
struct kernfs_node *parent = kn->parent;
struct kernfs_node *target = kn->symlink.target_kn;
int error;
@@ -124,7 +124,7 @@ static const char *kernfs_iop_get_link(struct dentry *dentry,
body = kzalloc(PAGE_SIZE, GFP_KERNEL);
if (!body)
return ERR_PTR(-ENOMEM);
- error = kernfs_getlink(dentry, body);
+ error = kernfs_getlink(inode, body);
if (unlikely(error < 0)) {
kfree(body);
return ERR_PTR(error);
--
2.9.3

2017-07-12 18:52:29

by Shaohua Li

[permalink] [raw]
Subject: [PATCH V5 03/11] kernfs: add an API to get kernfs node from inode number

From: Shaohua Li <[email protected]>

Add an API to get kernfs node from inode number. We will need this to
implement exportfs operations.

This API will be used in blktrace too later, so it should be as fast as
possible. To make the API lock free, kernfs node is freed in RCU
context. And we depend on kernfs_node count/ino number to filter out
stale kernfs nodes.

Acked-by: Tejun Heo <[email protected]>
Acked-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Shaohua Li <[email protected]>
---
fs/kernfs/dir.c | 57 +++++++++++++++++++++++++++++++++++++++++++++
fs/kernfs/kernfs-internal.h | 2 ++
fs/kernfs/mount.c | 11 ++++++++-
3 files changed, 69 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 33f711f..7be37c8 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -508,6 +508,10 @@ void kernfs_put(struct kernfs_node *kn)
struct kernfs_node *parent;
struct kernfs_root *root;

+ /*
+ * kernfs_node is freed with ->count 0, kernfs_find_and_get_node_by_ino
+ * depends on this to filter reused stale node
+ */
if (!kn || !atomic_dec_and_test(&kn->count))
return;
root = kernfs_root(kn);
@@ -649,6 +653,11 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
kn->ino = ret;
kn->generation = gen;

+ /*
+ * set ino first. This barrier is paired with atomic_inc_not_zero in
+ * kernfs_find_and_get_node_by_ino
+ */
+ smp_mb__before_atomic();
atomic_set(&kn->count, 1);
atomic_set(&kn->active, KN_DEACTIVATED_BIAS);
RB_CLEAR_NODE(&kn->rb);
@@ -680,6 +689,54 @@ struct kernfs_node *kernfs_new_node(struct kernfs_node *parent,
return kn;
}

+/*
+ * kernfs_find_and_get_node_by_ino - get kernfs_node from inode number
+ * @root: the kernfs root
+ * @ino: inode number
+ *
+ * RETURNS:
+ * NULL on failure. Return a kernfs node with reference counter incremented
+ */
+struct kernfs_node *kernfs_find_and_get_node_by_ino(struct kernfs_root *root,
+ unsigned int ino)
+{
+ struct kernfs_node *kn;
+
+ rcu_read_lock();
+ kn = idr_find(&root->ino_idr, ino);
+ if (!kn)
+ goto out;
+
+ /*
+ * Since kernfs_node is freed in RCU, it's possible an old node for ino
+ * is freed, but reused before RCU grace period. But a freed node (see
+ * kernfs_put) or an incompletedly initialized node (see
+ * __kernfs_new_node) should have 'count' 0. We can use this fact to
+ * filter out such node.
+ */
+ if (!atomic_inc_not_zero(&kn->count)) {
+ kn = NULL;
+ goto out;
+ }
+
+ /*
+ * The node could be a new node or a reused node. If it's a new node,
+ * we are ok. If it's reused because of RCU (because of
+ * SLAB_TYPESAFE_BY_RCU), the __kernfs_new_node always sets its 'ino'
+ * before 'count'. So if 'count' is uptodate, 'ino' should be uptodate,
+ * hence we can use 'ino' to filter stale node.
+ */
+ if (kn->ino != ino)
+ goto out;
+ rcu_read_unlock();
+
+ return kn;
+out:
+ rcu_read_unlock();
+ kernfs_put(kn);
+ return NULL;
+}
+
/**
* kernfs_add_one - add kernfs_node to parent without warning
* @kn: kernfs_node to be added
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index 2d5144a..e9c226f 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -98,6 +98,8 @@ int kernfs_add_one(struct kernfs_node *kn);
struct kernfs_node *kernfs_new_node(struct kernfs_node *parent,
const char *name, umode_t mode,
unsigned flags);
+struct kernfs_node *kernfs_find_and_get_node_by_ino(struct kernfs_root *root,
+ unsigned int ino);

/*
* file.c
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index d5b149a..69c48be 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -330,7 +330,16 @@ struct super_block *kernfs_pin_sb(struct kernfs_root *root, const void *ns)

void __init kernfs_init(void)
{
+
+ /*
+ * the slab is freed in RCU context, so kernfs_find_and_get_node_by_ino
+ * can access the slab lock free. This could introduce stale nodes,
+ * please see how kernfs_find_and_get_node_by_ino filters out stale
+ * nodes.
+ */
kernfs_node_cache = kmem_cache_create("kernfs_node_cache",
sizeof(struct kernfs_node),
- 0, SLAB_PANIC, NULL);
+ 0,
+ SLAB_PANIC | SLAB_TYPESAFE_BY_RCU,
+ NULL);
}
--
2.9.3

2017-07-12 18:52:59

by Shaohua Li

[permalink] [raw]
Subject: [PATCH V5 01/11] kernfs: use idr instead of ida to manage inode number

From: Shaohua Li <[email protected]>

kernfs uses ida to manage inode number. The problem is we can't get
kernfs_node from inode number with ida. Switching to use idr, next patch
will add an API to get kernfs_node from inode number.

Acked-by: Tejun Heo <[email protected]>
Acked-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Shaohua Li <[email protected]>
---
fs/kernfs/dir.c | 17 ++++++++++++-----
include/linux/kernfs.h | 2 +-
2 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index db5900aaa..8ad7a17 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -21,6 +21,7 @@
DEFINE_MUTEX(kernfs_mutex);
static DEFINE_SPINLOCK(kernfs_rename_lock); /* kn->parent and ->name */
static char kernfs_pr_cont_buf[PATH_MAX]; /* protected by rename_lock */
+static DEFINE_SPINLOCK(kernfs_idr_lock); /* root->ino_idr */

#define rb_to_kn(X) rb_entry((X), struct kernfs_node, rb)

@@ -533,7 +534,9 @@ void kernfs_put(struct kernfs_node *kn)
simple_xattrs_free(&kn->iattr->xattrs);
}
kfree(kn->iattr);
- ida_simple_remove(&root->ino_ida, kn->ino);
+ spin_lock(&kernfs_idr_lock);
+ idr_remove(&root->ino_idr, kn->ino);
+ spin_unlock(&kernfs_idr_lock);
kmem_cache_free(kernfs_node_cache, kn);

kn = parent;
@@ -542,7 +545,7 @@ void kernfs_put(struct kernfs_node *kn)
goto repeat;
} else {
/* just released the root kn, free @root too */
- ida_destroy(&root->ino_ida);
+ idr_destroy(&root->ino_idr);
kfree(root);
}
}
@@ -630,7 +633,11 @@ static struct kernfs_node *__kernfs_new_node(struct kernfs_root *root,
if (!kn)
goto err_out1;

- ret = ida_simple_get(&root->ino_ida, 1, 0, GFP_KERNEL);
+ idr_preload(GFP_KERNEL);
+ spin_lock(&kernfs_idr_lock);
+ ret = idr_alloc(&root->ino_idr, kn, 1, 0, GFP_ATOMIC);
+ spin_unlock(&kernfs_idr_lock);
+ idr_preload_end();
if (ret < 0)
goto err_out2;
kn->ino = ret;
@@ -875,13 +882,13 @@ struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
if (!root)
return ERR_PTR(-ENOMEM);

- ida_init(&root->ino_ida);
+ idr_init(&root->ino_idr);
INIT_LIST_HEAD(&root->supers);

kn = __kernfs_new_node(root, "", S_IFDIR | S_IRUGO | S_IXUGO,
KERNFS_DIR);
if (!kn) {
- ida_destroy(&root->ino_ida);
+ idr_destroy(&root->ino_idr);
kfree(root);
return ERR_PTR(-ENOMEM);
}
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index a9b11b8..5f5d602 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -163,7 +163,7 @@ struct kernfs_root {
unsigned int flags; /* KERNFS_ROOT_* flags */

/* private fields, do not use outside kernfs proper */
- struct ida ino_ida;
+ struct idr ino_idr;
struct kernfs_syscall_ops *syscall_ops;

/* list of kernfs_super_info of this root, protected by kernfs_mutex */
--
2.9.3

2017-07-12 22:51:01

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH V5 00/11] blktrace: output cgroup info

On 07/12/2017 12:49 PM, Shaohua Li wrote:
> From: Shaohua Li <[email protected]>
>
> Hi,
>
> Currently blktrace isn't cgroup aware. blktrace prints out task name of current
> context, but the task of current context isn't always in the cgroup where the
> BIO comes from. We can't use task name to find out IO cgroup. For example,
> Writeback BIOs always comes from flusher thread but the BIOs are for different
> blk cgroups. Request could be requeued and dispatched from completely different
> tasks. MD/DM are another examples. This brings challenges if we want to use
> blktrace for performance tunning with cgroup enabled.
>
> This patchset try to fix the gap. We print out cgroup fhandle info in blktrace.
> Userspace can use open_by_handle_at() syscall to find the cgroup by fhandle. Or
> userspace can use name_to_handle_at() syscall to find fhandle for a cgroup and
> use a BPF program to filter out blktrace for a specific cgroup.
>
> The first 6 patches adds export operation handlers for kernfs, so userspace can
> use open_by_handle_at/name_to_handle_at to a kernfs file. Later patches make
> blktrace output cgroup info.

Added for 4.14, thanks Shaohua.

--
Jens Axboe