2023-10-04 20:50:57

by Steven Rostedt

[permalink] [raw]
Subject: [PATCH v5] eventfs: Remove eventfs_file and just use eventfs_inode

From: "Steven Rostedt (Google)" <[email protected]>

Instead of having a descriptor for every file represented in the eventfs
directory, only have the directory itself represented. Change the API to
send in a list of entries that represent all the files in the directory
(but not other directories). The entry list contains a name and a callback
function that will be used to create the files when they are accessed.

struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry *parent,
const struct eventfs_entry *entries,
int size, void *data);

is used for the top level eventfs directory, and returns an eventfs_inode
that will be used by:

struct eventfs_inode *eventfs_create_dir(const char *name, struct eventfs_inode *parent,
const struct eventfs_entry *entries,
int size, void *data);

where both of the above take an array of struct eventfs_entry entries for
every file that is in the directory.

The entries are defined by:

typedef int (*eventfs_callback)(const char *name, umode_t *mode, void **data,
const struct file_operations **fops);

struct eventfs_entry {
const char *name;
eventfs_callback callback;
};

Where the name is the name of the file and the callback gets called when
the file is being created. The callback passes in the name (in case the
same callback is used for multiple files), a pointer to the mode, data and
fops. The data will be pointing to the data that was passed in
eventfs_create_dir() or eventfs_create_events_dir() but may be overridden
to point to something else, as it will be used to point to the
inode->i_private that is created. The information passed back from the
callback is used to create the dentry/inode.

If the callback fills the data and the file should be created, it must
return a positive number. On zero or negative, the file is ignored.

This logic may also be used as a prototype to convert entire pseudo file
systems into just-in-time allocation.

The "show_events_dentry" file has been updated to show the directories,
and any files they have.

With just the eventfs_file allocations:

Before after deltas for meminfo (in kB):

MemFree: -14360
MemAvailable: -14260
Buffers: 40
Cached: 24
Active: 44
Inactive: 48
Inactive(anon): 28
Active(file): 44
Inactive(file): 20
Dirty: -4
AnonPages: 28
Mapped: 4
KReclaimable: 132
Slab: 1604
SReclaimable: 132
SUnreclaim: 1472
Committed_AS: 12

Before after deltas for slabinfo:

<slab>: <objects> [ * <size> = <total>]

ext4_inode_cache 27 [* 1184 = 31968 ]
extent_status 102 [* 40 = 4080 ]
tracefs_inode_cache 144 [* 656 = 94464 ]
buffer_head 39 [* 104 = 4056 ]
shmem_inode_cache 49 [* 800 = 39200 ]
filp -53 [* 256 = -13568 ]
dentry 251 [* 192 = 48192 ]
lsm_file_cache 277 [* 32 = 8864 ]
vm_area_struct -14 [* 184 = -2576 ]
trace_event_file 1748 [* 88 = 153824 ]
kmalloc-1k 35 [* 1024 = 35840 ]
kmalloc-256 49 [* 256 = 12544 ]
kmalloc-192 -28 [* 192 = -5376 ]
kmalloc-128 -30 [* 128 = -3840 ]
kmalloc-96 10581 [* 96 = 1015776 ]
kmalloc-64 3056 [* 64 = 195584 ]
kmalloc-32 1291 [* 32 = 41312 ]
kmalloc-16 2310 [* 16 = 36960 ]
kmalloc-8 9216 [* 8 = 73728 ]

Free memory dropped by 14,360 kB
Available memory dropped by 14,260 kB
Total slab additions in size: 1,771,032 bytes

With this change:

Before after deltas for meminfo (in kB):

MemFree: -12084
MemAvailable: -11976
Buffers: 32
Cached: 32
Active: 72
Inactive: 168
Inactive(anon): 176
Active(file): 72
Inactive(file): -8
Dirty: 24
AnonPages: 196
Mapped: 8
KReclaimable: 148
Slab: 836
SReclaimable: 148
SUnreclaim: 688
Committed_AS: 324

Before after deltas for slabinfo:

<slab>: <objects> [ * <size> = <total>]

tracefs_inode_cache 144 [* 656 = 94464 ]
shmem_inode_cache -23 [* 800 = -18400 ]
filp -92 [* 256 = -23552 ]
dentry 179 [* 192 = 34368 ]
lsm_file_cache -3 [* 32 = -96 ]
vm_area_struct -13 [* 184 = -2392 ]
trace_event_file 1748 [* 88 = 153824 ]
kmalloc-1k -49 [* 1024 = -50176 ]
kmalloc-256 -27 [* 256 = -6912 ]
kmalloc-128 1864 [* 128 = 238592 ]
kmalloc-64 4685 [* 64 = 299840 ]
kmalloc-32 -72 [* 32 = -2304 ]
kmalloc-16 256 [* 16 = 4096 ]
total = 721352

Free memory dropped by 12,084 kB
Available memory dropped by 11,976 kB
Total slab additions in size: 721,352 bytes

That's over 2 MB in savings per instance for free and available memory,
and over 1 MB in savings per instance of slab memory.

Link: https://lore.kernel.org/linux-trace-kernel/[email protected]

Cc: Masami Hiramatsu <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Ajay Kaher <[email protected]>
Signed-off-by: Steven Rostedt (Google) <[email protected]>
---
Changes since v4: https://lore.kernel.org/linux-trace-kernel/[email protected]/

- Get the ei->dentry within the eventfs_mutex to keep consistency during the lookup.

fs/tracefs/event_inode.c | 847 ++++++++++++++++++-----------------
fs/tracefs/inode.c | 2 +-
fs/tracefs/internal.h | 37 +-
include/linux/trace_events.h | 2 +-
include/linux/tracefs.h | 29 +-
kernel/trace/trace.c | 7 +-
kernel/trace/trace.h | 4 +-
kernel/trace/trace_events.c | 313 +++++++++----
8 files changed, 705 insertions(+), 536 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 8c8d64e76103..eab18b157ef5 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -2,8 +2,9 @@
/*
* event_inode.c - part of tracefs, a pseudo file system for activating tracing
*
- * Copyright (C) 2020-23 VMware Inc, author: Steven Rostedt (VMware) <[email protected]>
+ * Copyright (C) 2020-23 VMware Inc, author: Steven Rostedt <[email protected]>
* Copyright (C) 2020-23 VMware Inc, author: Ajay Kaher <[email protected]>
+ * Copyright (C) 2023 Google, author: Steven Rostedt <[email protected]>
*
* eventfs is used to dynamically create inodes and dentries based on the
* meta data provided by the tracing system.
@@ -23,46 +24,6 @@
#include <linux/delay.h>
#include "internal.h"

-struct eventfs_inode {
- struct list_head e_top_files;
-};
-
-/*
- * struct eventfs_file - hold the properties of the eventfs files and
- * directories.
- * @name: the name of the file or directory to create
- * @d_parent: holds parent's dentry
- * @dentry: once accessed holds dentry
- * @list: file or directory to be added to parent directory
- * @ei: list of files and directories within directory
- * @fop: file_operations for file or directory
- * @iop: inode_operations for file or directory
- * @data: something that the caller will want to get to later on
- * @mode: the permission that the file or directory should have
- */
-struct eventfs_file {
- const char *name;
- struct dentry *d_parent;
- struct dentry *dentry;
- struct list_head list;
- struct eventfs_inode *ei;
- const struct file_operations *fop;
- const struct inode_operations *iop;
- /*
- * Union - used for deletion
- * @del_list: list of eventfs_file to delete
- * @rcu: eventfs_file to delete in RCU
- * @is_freed: node is freed if one of the above is set
- */
- union {
- struct list_head del_list;
- struct rcu_head rcu;
- unsigned long is_freed;
- };
- void *data;
- umode_t mode;
-};
-
static DEFINE_MUTEX(eventfs_mutex);
DEFINE_STATIC_SRCU(eventfs_srcu);

@@ -93,16 +54,9 @@ static const struct file_operations eventfs_file_operations = {
* @data: something that the caller will want to get to later on.
* @fop: struct file_operations that should be used for this file.
*
- * This is the basic "create a file" function for tracefs. It allows for a
- * wide range of flexibility in creating a file.
- *
- * This function will return a pointer to a dentry if it succeeds. This
- * pointer must be passed to the tracefs_remove() function when the file is
- * to be removed (no automatic cleanup happens if your module is unloaded,
- * you are responsible here.) If an error occurs, %NULL will be returned.
- *
- * If tracefs is not enabled in the kernel, the value -%ENODEV will be
- * returned.
+ * This function creates a dentry that represents a file in the eventsfs_inode
+ * directory. The inode.i_private pointer will point to @data in the open()
+ * call.
*/
static struct dentry *create_file(const char *name, umode_t mode,
struct dentry *parent, void *data,
@@ -118,6 +72,7 @@ static struct dentry *create_file(const char *name, umode_t mode,
if (WARN_ON_ONCE(!S_ISREG(mode)))
return NULL;

+ WARN_ON_ONCE(!parent);
dentry = eventfs_start_creating(name, parent);

if (IS_ERR(dentry))
@@ -142,20 +97,11 @@ static struct dentry *create_file(const char *name, umode_t mode,
* create_dir - create a dir in the tracefs filesystem
* @name: the name of the file to create.
* @parent: parent dentry for this file.
- * @data: something that the caller will want to get to later on.
- *
- * This is the basic "create a dir" function for eventfs. It allows for a
- * wide range of flexibility in creating a dir.
- *
- * This function will return a pointer to a dentry if it succeeds. This
- * pointer must be passed to the tracefs_remove() function when the file is
- * to be removed (no automatic cleanup happens if your module is unloaded,
- * you are responsible here.) If an error occurs, %NULL will be returned.
*
- * If tracefs is not enabled in the kernel, the value -%ENODEV will be
- * returned.
+ * This function will create a dentry for a directory represented by
+ * a eventfs_inode.
*/
-static struct dentry *create_dir(const char *name, struct dentry *parent, void *data)
+static struct dentry *create_dir(const char *name, struct dentry *parent)
{
struct tracefs_inode *ti;
struct dentry *dentry;
@@ -172,7 +118,6 @@ static struct dentry *create_dir(const char *name, struct dentry *parent, void *
inode->i_mode = S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO;
inode->i_op = &eventfs_root_dir_inode_operations;
inode->i_fop = &eventfs_file_operations;
- inode->i_private = data;

ti = get_tracefs(inode);
ti->flags |= TRACEFS_EVENT_INODE;
@@ -185,18 +130,18 @@ static struct dentry *create_dir(const char *name, struct dentry *parent, void *
}

/**
- * eventfs_set_ef_status_free - set the ef->status to free
+ * eventfs_set_ei_status_free - remove the dentry reference from an eventfs_inode
* @ti: the tracefs_inode of the dentry
- * @dentry: dentry who's status to be freed
+ * @dentry: dentry which has the reference to remove.
*
- * eventfs_set_ef_status_free will be called if no more
- * references remain
+ * Remove the association between a dentry from an eventfs_inode.
*/
-void eventfs_set_ef_status_free(struct tracefs_inode *ti, struct dentry *dentry)
+void eventfs_set_ei_status_free(struct tracefs_inode *ti, struct dentry *dentry)
{
struct tracefs_inode *ti_parent;
+ struct eventfs_inode *ei_child, *tmp;
struct eventfs_inode *ei;
- struct eventfs_file *ef, *tmp;
+ int i;

/* The top level events directory may be freed by this */
if (unlikely(ti->flags & TRACEFS_EVENT_TOP_INODE)) {
@@ -207,9 +152,9 @@ void eventfs_set_ef_status_free(struct tracefs_inode *ti, struct dentry *dentry)
ei = ti->private;

/* Record all the top level files */
- list_for_each_entry_srcu(ef, &ei->e_top_files, list,
+ list_for_each_entry_srcu(ei_child, &ei->children, list,
lockdep_is_held(&eventfs_mutex)) {
- list_add_tail(&ef->del_list, &ef_del_list);
+ list_add_tail(&ei_child->del_list, &ef_del_list);
}

/* Nothing should access this, but just in case! */
@@ -218,11 +163,13 @@ void eventfs_set_ef_status_free(struct tracefs_inode *ti, struct dentry *dentry)
mutex_unlock(&eventfs_mutex);

/* Now safely free the top level files and their children */
- list_for_each_entry_safe(ef, tmp, &ef_del_list, del_list) {
- list_del(&ef->del_list);
- eventfs_remove(ef);
+ list_for_each_entry_safe(ei_child, tmp, &ef_del_list, del_list) {
+ list_del(&ei_child->del_list);
+ eventfs_remove_dir(ei_child);
}

+ kfree_const(ei->name);
+ kfree(ei->d_children);
kfree(ei);
return;
}
@@ -233,68 +180,162 @@ void eventfs_set_ef_status_free(struct tracefs_inode *ti, struct dentry *dentry)
if (!ti_parent || !(ti_parent->flags & TRACEFS_EVENT_INODE))
goto out;

- ef = dentry->d_fsdata;
- if (!ef)
+ ei = dentry->d_fsdata;
+ if (!ei)
goto out;

/*
- * If ef was freed, then the LSB bit is set for d_fsdata.
+ * If ei was freed, then the LSB bit is set for d_fsdata.
* But this should not happen, as it should still have a
* ref count that prevents it. Warn in case it does.
*/
- if (WARN_ON_ONCE((unsigned long)ef & 1))
+ if (WARN_ON_ONCE((unsigned long)ei & 1))
goto out;

+ /* This could belong to one of the files of the ei */
+ if (ei->dentry != dentry) {
+ for (i = 0; i < ei->nr_entries; i++) {
+ if (ei->d_children[i] == dentry)
+ break;
+ }
+ if (WARN_ON_ONCE(i == ei->nr_entries))
+ goto out;
+ ei->d_children[i] = NULL;
+ } else {
+ ei->dentry = NULL;
+ }
+
dentry->d_fsdata = NULL;
- ef->dentry = NULL;
-out:
+ out:
mutex_unlock(&eventfs_mutex);
}

+/**
+ * create_file_dentry - create a dentry for a file of an eventfs_inode
+ * @ei: the eventfs_inode that the file will be created under
+ * @e_dentry: a pointer to the d_children[] of the @ei
+ * @parent: The parent dentry of the created file.
+ * @name: The name of the file to create
+ * @mode: The mode of the file.
+ * @data: The data to use to set the inode of the file with on open()
+ * @fops: The fops of the file to be created.
+ * @lookup: If called by the lookup routine, in which case, dput() the created dentry.
+ *
+ * Create a dentry for a file of an eventfs_inode @ei and place it into the
+ * address located at @e_dentry. If the @e_dentry already has a dentry, then
+ * just do a dget() on it and return. Otherwise create the dentry and attach it.
+ */
+static struct dentry *
+create_file_dentry(struct eventfs_inode *ei, struct dentry **e_dentry,
+ struct dentry *parent, const char *name, umode_t mode, void *data,
+ const struct file_operations *fops, bool lookup)
+{
+ struct dentry *dentry;
+ bool invalidate = false;
+
+ mutex_lock(&eventfs_mutex);
+ /* If the e_dentry already has a dentry, use it */
+ if (*e_dentry) {
+ /* lookup does not need to up the ref count */
+ if (!lookup)
+ dget(*e_dentry);
+ mutex_unlock(&eventfs_mutex);
+ return *e_dentry;
+ }
+ mutex_unlock(&eventfs_mutex);
+
+ /* The lookup already has the parent->d_inode locked */
+ if (!lookup)
+ inode_lock(parent->d_inode);
+
+ dentry = create_file(name, mode, parent, data, fops);
+
+ if (!lookup)
+ inode_unlock(parent->d_inode);
+
+ mutex_lock(&eventfs_mutex);
+
+ if (IS_ERR_OR_NULL(dentry)) {
+ /*
+ * When the mutex was released, something else could have
+ * created the dentry for this e_dentry. In which case
+ * use that one.
+ *
+ * Note, with the mutex held, the e_dentry cannot have content
+ * and the ei->is_freed be true at the same time.
+ */
+ WARN_ON_ONCE(ei->is_freed);
+ dentry = *e_dentry;
+ /* The lookup does not need to up the dentry refcount */
+ if (dentry && !lookup)
+ dget(dentry);
+ mutex_unlock(&eventfs_mutex);
+ return dentry;
+ }
+
+ if (!*e_dentry && !ei->is_freed) {
+ *e_dentry = dentry;
+ dentry->d_fsdata = ei;
+ } else {
+ /*
+ * Should never happen unless we get here due to being freed.
+ * Otherwise it means two dentries exist with the same name.
+ */
+ WARN_ON_ONCE(!ei->is_freed);
+ invalidate = true;
+ }
+ mutex_unlock(&eventfs_mutex);
+
+ if (invalidate)
+ d_invalidate(dentry);
+
+ if (lookup || invalidate)
+ dput(dentry);
+
+ return invalidate ? NULL : dentry;
+}
+
/**
* eventfs_post_create_dir - post create dir routine
- * @ef: eventfs_file of recently created dir
+ * @ei: eventfs_inode of recently created dir
*
* Map the meta-data of files within an eventfs dir to their parent dentry
*/
-static void eventfs_post_create_dir(struct eventfs_file *ef)
+static void eventfs_post_create_dir(struct eventfs_inode *ei)
{
- struct eventfs_file *ef_child;
+ struct eventfs_inode *ei_child;
struct tracefs_inode *ti;

/* srcu lock already held */
/* fill parent-child relation */
- list_for_each_entry_srcu(ef_child, &ef->ei->e_top_files, list,
+ list_for_each_entry_srcu(ei_child, &ei->children, list,
srcu_read_lock_held(&eventfs_srcu)) {
- ef_child->d_parent = ef->dentry;
+ ei_child->d_parent = ei->dentry;
}

- ti = get_tracefs(ef->dentry->d_inode);
- ti->private = ef->ei;
+ ti = get_tracefs(ei->dentry->d_inode);
+ ti->private = ei;
}

/**
- * create_dentry - helper function to create dentry
- * @ef: eventfs_file of file or directory to create
- * @parent: parent dentry
- * @lookup: true if called from lookup routine
+ * create_dir_dentry - Create a directory dentry for the eventfs_inode
+ * @ei: The eventfs_inode to create the directory for
+ * @parent: The dentry of the parent of this directory
+ * @lookup: True if this is called by the lookup code
*
- * Used to create a dentry for file/dir, executes post dentry creation routine
+ * This creates and attaches a directory dentry to the eventfs_inode @ei.
*/
static struct dentry *
-create_dentry(struct eventfs_file *ef, struct dentry *parent, bool lookup)
+create_dir_dentry(struct eventfs_inode *ei, struct dentry *parent, bool lookup)
{
bool invalidate = false;
- struct dentry *dentry;
+ struct dentry *dentry = NULL;

mutex_lock(&eventfs_mutex);
- if (ef->is_freed) {
- mutex_unlock(&eventfs_mutex);
- return NULL;
- }
- if (ef->dentry) {
- dentry = ef->dentry;
- /* On dir open, up the ref count */
+ if (ei->dentry) {
+ /* If the dentry already has a dentry, use it */
+ dentry = ei->dentry;
+ /* lookup does not need to up the ref count */
if (!lookup)
dget(dentry);
mutex_unlock(&eventfs_mutex);
@@ -302,42 +343,44 @@ create_dentry(struct eventfs_file *ef, struct dentry *parent, bool lookup)
}
mutex_unlock(&eventfs_mutex);

+ /* The lookup already has the parent->d_inode locked */
if (!lookup)
inode_lock(parent->d_inode);

- if (ef->ei)
- dentry = create_dir(ef->name, parent, ef->data);
- else
- dentry = create_file(ef->name, ef->mode, parent,
- ef->data, ef->fop);
+ dentry = create_dir(ei->name, parent);

if (!lookup)
inode_unlock(parent->d_inode);

mutex_lock(&eventfs_mutex);
- if (IS_ERR_OR_NULL(dentry)) {
- /* If the ef was already updated get it */
- dentry = ef->dentry;
+
+ if (IS_ERR_OR_NULL(dentry) && !ei->is_freed) {
+ /*
+ * When the mutex was released, something else could have
+ * created the dentry for this e_dentry. In which case
+ * use that one.
+ *
+ * Note, with the mutex held, the e_dentry cannot have content
+ * and the ei->is_freed be true at the same time.
+ */
+ dentry = ei->dentry;
if (dentry && !lookup)
dget(dentry);
mutex_unlock(&eventfs_mutex);
return dentry;
}

- if (!ef->dentry && !ef->is_freed) {
- ef->dentry = dentry;
- if (ef->ei)
- eventfs_post_create_dir(ef);
- dentry->d_fsdata = ef;
+ if (!ei->dentry && !ei->is_freed) {
+ ei->dentry = dentry;
+ eventfs_post_create_dir(ei);
+ dentry->d_fsdata = ei;
} else {
- /* A race here, should try again (unless freed) */
- invalidate = true;
-
/*
* Should never happen unless we get here due to being freed.
* Otherwise it means two dentries exist with the same name.
*/
- WARN_ON_ONCE(!ef->is_freed);
+ WARN_ON_ONCE(!ei->is_freed);
+ invalidate = true;
}
mutex_unlock(&eventfs_mutex);
if (invalidate)
@@ -349,50 +392,85 @@ create_dentry(struct eventfs_file *ef, struct dentry *parent, bool lookup)
return invalidate ? NULL : dentry;
}

-static bool match_event_file(struct eventfs_file *ef, const char *name)
-{
- bool ret;
-
- mutex_lock(&eventfs_mutex);
- ret = !ef->is_freed && strcmp(ef->name, name) == 0;
- mutex_unlock(&eventfs_mutex);
-
- return ret;
-}
-
/**
* eventfs_root_lookup - lookup routine to create file/dir
* @dir: in which a lookup is being done
* @dentry: file/dir dentry
- * @flags: to pass as flags parameter to simple lookup
+ * @flags: Just passed to simple_lookup()
*
- * Used to create a dynamic file/dir within @dir. Use the eventfs_inode
- * list of meta data to find the information needed to create the file/dir.
+ * Used to create dynamic file/dir with-in @dir, search with-in @ei
+ * list, if @dentry found go ahead and create the file/dir
*/
+
static struct dentry *eventfs_root_lookup(struct inode *dir,
struct dentry *dentry,
unsigned int flags)
{
+ const struct file_operations *fops;
+ const struct eventfs_entry *entry;
+ struct eventfs_inode *ei_child;
struct tracefs_inode *ti;
struct eventfs_inode *ei;
- struct eventfs_file *ef;
+ struct dentry *ei_dentry = NULL;
struct dentry *ret = NULL;
+ const char *name = dentry->d_name.name;
+ bool created = false;
+ umode_t mode;
+ void *data;
int idx;
+ int i;
+ int r;

ti = get_tracefs(dir);
if (!(ti->flags & TRACEFS_EVENT_INODE))
return NULL;

- ei = ti->private;
+ /* Grab srcu to prevent the ei from going away */
idx = srcu_read_lock(&eventfs_srcu);
- list_for_each_entry_srcu(ef, &ei->e_top_files, list,
+
+ /*
+ * Grab the eventfs_mutex to consistent value from ti->private.
+ * This s
+ */
+ mutex_lock(&eventfs_mutex);
+ ei = READ_ONCE(ti->private);
+ if (ei)
+ ei_dentry = READ_ONCE(ei->dentry);
+ mutex_unlock(&eventfs_mutex);
+
+ if (!ei || !ei_dentry)
+ goto out;
+
+ data = ei->data;
+
+ list_for_each_entry_srcu(ei_child, &ei->children, list,
srcu_read_lock_held(&eventfs_srcu)) {
- if (!match_event_file(ef, dentry->d_name.name))
+ if (strcmp(ei_child->name, name) != 0)
continue;
ret = simple_lookup(dir, dentry, flags);
- create_dentry(ef, ef->d_parent, true);
+ create_dir_dentry(ei_child, ei_dentry, true);
+ created = true;
break;
}
+
+ if (created)
+ goto out;
+
+ for (i = 0; i < ei->nr_entries; i++) {
+ entry = &ei->entries[i];
+ if (strcmp(name, entry->name) == 0) {
+ void *cdata = data;
+ r = entry->callback(name, &mode, &cdata, &fops);
+ if (r <= 0)
+ continue;
+ ret = simple_lookup(dir, dentry, flags);
+ create_file_dentry(ei, &ei->d_children[i],
+ ei_dentry, name, mode, cdata,
+ fops, true);
+ break;
+ }
+ }
+ out:
srcu_read_unlock(&eventfs_srcu, idx);
return ret;
}
@@ -432,29 +510,48 @@ static int eventfs_release(struct inode *inode, struct file *file)
return dcache_dir_close(inode, file);
}

+static int add_dentries(struct dentry ***dentries, struct dentry *d, int cnt)
+{
+ struct dentry **tmp;
+
+ tmp = krealloc(*dentries, sizeof(d) * (cnt + 2), GFP_KERNEL);
+ if (!tmp)
+ return -1;
+ tmp[cnt] = d;
+ tmp[cnt + 1] = NULL;
+ *dentries = tmp;
+ return 0;
+}
+
/**
* dcache_dir_open_wrapper - eventfs open wrapper
* @inode: not used
- * @file: dir to be opened (to create its child)
+ * @file: dir to be opened (to create it's children)
*
- * Used to dynamically create the file/dir within @file. @file is really a
- * directory and all the files/dirs of the children within @file will be
- * created. If any of the files/dirs have already been created, their
- * reference count will be incremented.
+ * Used to dynamic create file/dir with-in @file, all the
+ * file/dir will be created. If already created then references
+ * will be increased
*/
static int dcache_dir_open_wrapper(struct inode *inode, struct file *file)
{
+ const struct file_operations *fops;
+ const struct eventfs_entry *entry;
+ struct eventfs_inode *ei_child;
struct tracefs_inode *ti;
struct eventfs_inode *ei;
- struct eventfs_file *ef;
struct dentry_list *dlist;
struct dentry **dentries = NULL;
- struct dentry *dentry = file_dentry(file);
+ struct dentry *parent = file_dentry(file);
struct dentry *d;
struct inode *f_inode = file_inode(file);
+ const char *name = parent->d_name.name;
+ umode_t mode;
+ void *data;
int cnt = 0;
int idx;
int ret;
+ int i;
+ int r;

ti = get_tracefs(f_inode);
if (!(ti->flags & TRACEFS_EVENT_INODE))
@@ -463,25 +560,51 @@ static int dcache_dir_open_wrapper(struct inode *inode, struct file *file)
if (WARN_ON_ONCE(file->private_data))
return -EINVAL;

+ idx = srcu_read_lock(&eventfs_srcu);
+
+ mutex_lock(&eventfs_mutex);
+ ei = READ_ONCE(ti->private);
+ mutex_unlock(&eventfs_mutex);
+
+ if (!ei) {
+ srcu_read_unlock(&eventfs_srcu, idx);
+ return -EINVAL;
+ }
+
+
+ data = ei->data;
+
dlist = kmalloc(sizeof(*dlist), GFP_KERNEL);
- if (!dlist)
+ if (!dlist) {
+ srcu_read_unlock(&eventfs_srcu, idx);
return -ENOMEM;
+ }

- ei = ti->private;
- idx = srcu_read_lock(&eventfs_srcu);
- list_for_each_entry_srcu(ef, &ei->e_top_files, list,
+ list_for_each_entry_srcu(ei_child, &ei->children, list,
srcu_read_lock_held(&eventfs_srcu)) {
- d = create_dentry(ef, dentry, false);
+ d = create_dir_dentry(ei_child, parent, false);
if (d) {
- struct dentry **tmp;
+ ret = add_dentries(&dentries, d, cnt);
+ if (ret < 0)
+ break;
+ cnt++;
+ }
+ }

- tmp = krealloc(dentries, sizeof(d) * (cnt + 2), GFP_KERNEL);
- if (!tmp)
+ for (i = 0; i < ei->nr_entries; i++) {
+ void *cdata = data;
+ entry = &ei->entries[i];
+ name = entry->name;
+ r = entry->callback(name, &mode, &cdata, &fops);
+ if (r <= 0)
+ continue;
+ d = create_file_dentry(ei, &ei->d_children[i],
+ parent, name, mode, cdata, fops, false);
+ if (d) {
+ ret = add_dentries(&dentries, d, cnt);
+ if (ret < 0)
break;
- tmp[cnt] = d;
- tmp[cnt + 1] = NULL;
cnt++;
- dentries = tmp;
}
}
srcu_read_unlock(&eventfs_srcu, idx);
@@ -514,63 +637,90 @@ static int dcache_readdir_wrapper(struct file *file, struct dir_context *ctx)
}

/**
- * eventfs_prepare_ef - helper function to prepare eventfs_file
- * @name: the name of the file/directory to create.
- * @mode: the permission that the file should have.
- * @fop: struct file_operations that should be used for this file/directory.
- * @iop: struct inode_operations that should be used for this file/directory.
- * @data: something that the caller will want to get to later on. The
- * inode.i_private pointer will point to this value on the open() call.
+ * eventfs_create_dir - Create the eventfs_inode for this directory
+ * @name: The name of the directory to create.
+ * @parent: The eventfs_inode of the parent directory.
+ * @entries: A list of entries that represent the files under this directory
+ * @size: The number of @entries
+ * @data: The default data to pass to the files (an entry may override it).
+ *
+ * This function creates the descriptor to represent a directory in the
+ * eventfs. This descriptor is an eventfs_inode, and it is returned to be
+ * used to create other children underneath.
+ *
+ * The @entries is an array of eventfs_entry structures which has:
+ * const char *name
+ * eventfs_callback callback;
+ *
+ * The name is the name of the file, and the callback is a pointer to a function
+ * that will be called when the file is reference (either by lookup or by
+ * reading a directory). The callback is of the prototype:
*
- * This function allocates and fills the eventfs_file structure.
+ * int callback(const char *name, umode_t *mode, void **data,
+ * const struct file_operations **fops);
+ *
+ * When a file needs to be created, this callback will be called with
+ * name = the name of the file being created (so that the same callback
+ * may be used for multiple files).
+ * mode = a place to set the file's mode
+ * data = A pointer to @data, and the callback may replace it, which will
+ * cause the file created to pass the new data to the open() call.
+ * fops = the fops to use for the created file.
*/
-static struct eventfs_file *eventfs_prepare_ef(const char *name, umode_t mode,
- const struct file_operations *fop,
- const struct inode_operations *iop,
- void *data)
+struct eventfs_inode *eventfs_create_dir(const char *name, struct eventfs_inode *parent,
+ const struct eventfs_entry *entries,
+ int size, void *data)
{
- struct eventfs_file *ef;
+ struct eventfs_inode *ei;

- ef = kzalloc(sizeof(*ef), GFP_KERNEL);
- if (!ef)
+ if (!parent)
+ return ERR_PTR(-EINVAL);
+
+ ei = kzalloc(sizeof(*ei), GFP_KERNEL);
+ if (!ei)
return ERR_PTR(-ENOMEM);

- ef->name = kstrdup(name, GFP_KERNEL);
- if (!ef->name) {
- kfree(ef);
+ ei->name = kstrdup_const(name, GFP_KERNEL);
+ if (!ei->name) {
+ kfree(ei);
return ERR_PTR(-ENOMEM);
}

- if (S_ISDIR(mode)) {
- ef->ei = kzalloc(sizeof(*ef->ei), GFP_KERNEL);
- if (!ef->ei) {
- kfree(ef->name);
- kfree(ef);
+ if (size) {
+ ei->d_children = kzalloc(sizeof(*ei->d_children) * size, GFP_KERNEL);
+ if (!ei->d_children) {
+ kfree_const(ei->name);
+ kfree(ei);
return ERR_PTR(-ENOMEM);
}
- INIT_LIST_HEAD(&ef->ei->e_top_files);
- } else {
- ef->ei = NULL;
}

- ef->iop = iop;
- ef->fop = fop;
- ef->mode = mode;
- ef->data = data;
- return ef;
+ ei->entries = entries;
+ ei->nr_entries = size;
+ ei->data = data;
+ INIT_LIST_HEAD(&ei->children);
+
+ mutex_lock(&eventfs_mutex);
+ list_add_tail(&ei->list, &parent->children);
+ ei->d_parent = parent->dentry;
+ mutex_unlock(&eventfs_mutex);
+
+ return ei;
}

/**
- * eventfs_create_events_dir - create the trace event structure
- * @name: the name of the directory to create.
- * @parent: parent dentry for this file. This should be a directory dentry
- * if set. If this parameter is NULL, then the directory will be
- * created in the root of the tracefs filesystem.
+ * eventfs_create_events_dir - create the top level events directory
+ * @name: The name of the top level directory to create.
+ * @parent: Parent dentry for this file in the tracefs directory.
+ * @entries: A list of entries that represent the files under this directory
+ * @size: The number of @entries
+ * @data: The default data to pass to the files (an entry may override it).
*
* This function creates the top of the trace event directory.
*/
-struct dentry *eventfs_create_events_dir(const char *name,
- struct dentry *parent)
+struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry *parent,
+ const struct eventfs_entry *entries,
+ int size, void *data)
{
struct dentry *dentry = tracefs_start_creating(name, parent);
struct eventfs_inode *ei;
@@ -581,19 +731,32 @@ struct dentry *eventfs_create_events_dir(const char *name,
return NULL;

if (IS_ERR(dentry))
- return dentry;
+ return (struct eventfs_inode *)dentry;

ei = kzalloc(sizeof(*ei), GFP_KERNEL);
if (!ei)
- return ERR_PTR(-ENOMEM);
+ goto fail;
+
inode = tracefs_get_inode(dentry->d_sb);
- if (unlikely(!inode)) {
- kfree(ei);
- tracefs_failed_creating(dentry);
- return ERR_PTR(-ENOMEM);
+ if (unlikely(!inode))
+ goto fail;
+
+ if (size) {
+ ei->d_children = kzalloc(sizeof(*ei->d_children) * size, GFP_KERNEL);
+ if (!ei->d_children)
+ goto fail;
}

- INIT_LIST_HEAD(&ei->e_top_files);
+ ei->dentry = dentry;
+ ei->entries = entries;
+ ei->nr_entries = size;
+ ei->data = data;
+ ei->name = kstrdup_const(name, GFP_KERNEL);
+ if (!ei->name)
+ goto fail;
+
+ INIT_LIST_HEAD(&ei->children);
+ INIT_LIST_HEAD(&ei->list);

ti = get_tracefs(inode);
ti->flags |= TRACEFS_EVENT_INODE | TRACEFS_EVENT_TOP_INODE;
@@ -608,193 +771,41 @@ struct dentry *eventfs_create_events_dir(const char *name,
d_instantiate(dentry, inode);
inc_nlink(dentry->d_parent->d_inode);
fsnotify_mkdir(dentry->d_parent->d_inode, dentry);
- return tracefs_end_creating(dentry);
-}
+ tracefs_end_creating(dentry);

-/**
- * eventfs_add_subsystem_dir - add eventfs subsystem_dir to list to create later
- * @name: the name of the file to create.
- * @parent: parent dentry for this dir.
- *
- * This function adds eventfs subsystem dir to list.
- * And all these dirs are created on the fly when they are looked up,
- * and the dentry and inodes will be removed when they are done.
- */
-struct eventfs_file *eventfs_add_subsystem_dir(const char *name,
- struct dentry *parent)
-{
- struct tracefs_inode *ti_parent;
- struct eventfs_inode *ei_parent;
- struct eventfs_file *ef;
+ /* Will call dput when the directory is removed */
+ dget(dentry);

- if (security_locked_down(LOCKDOWN_TRACEFS))
- return NULL;
-
- if (!parent)
- return ERR_PTR(-EINVAL);
-
- ti_parent = get_tracefs(parent->d_inode);
- ei_parent = ti_parent->private;
+ return ei;

- ef = eventfs_prepare_ef(name, S_IFDIR, NULL, NULL, NULL);
- if (IS_ERR(ef))
- return ef;
-
- mutex_lock(&eventfs_mutex);
- list_add_tail(&ef->list, &ei_parent->e_top_files);
- ef->d_parent = parent;
- mutex_unlock(&eventfs_mutex);
- return ef;
+ fail:
+ kfree(ei->d_children);
+ kfree(ei);
+ tracefs_failed_creating(dentry);
+ return ERR_PTR(-ENOMEM);
}

-/**
- * eventfs_add_dir - add eventfs dir to list to create later
- * @name: the name of the file to create.
- * @ef_parent: parent eventfs_file for this dir.
- *
- * This function adds eventfs dir to list.
- * And all these dirs are created on the fly when they are looked up,
- * and the dentry and inodes will be removed when they are done.
- */
-struct eventfs_file *eventfs_add_dir(const char *name,
- struct eventfs_file *ef_parent)
+static void free_ei(struct rcu_head *head)
{
- struct eventfs_file *ef;
-
- if (security_locked_down(LOCKDOWN_TRACEFS))
- return NULL;
+ struct eventfs_inode *ei = container_of(head, struct eventfs_inode, rcu);

- if (!ef_parent)
- return ERR_PTR(-EINVAL);
-
- ef = eventfs_prepare_ef(name, S_IFDIR, NULL, NULL, NULL);
- if (IS_ERR(ef))
- return ef;
-
- mutex_lock(&eventfs_mutex);
- list_add_tail(&ef->list, &ef_parent->ei->e_top_files);
- ef->d_parent = ef_parent->dentry;
- mutex_unlock(&eventfs_mutex);
- return ef;
-}
-
-/**
- * eventfs_add_events_file - add the data needed to create a file for later reference
- * @name: the name of the file to create.
- * @mode: the permission that the file should have.
- * @parent: parent dentry for this file.
- * @data: something that the caller will want to get to later on.
- * @fop: struct file_operations that should be used for this file.
- *
- * This function is used to add the information needed to create a
- * dentry/inode within the top level events directory. The file created
- * will have the @mode permissions. The @data will be used to fill the
- * inode.i_private when the open() call is done. The dentry and inodes are
- * all created when they are referenced, and removed when they are no
- * longer referenced.
- */
-int eventfs_add_events_file(const char *name, umode_t mode,
- struct dentry *parent, void *data,
- const struct file_operations *fop)
-{
- struct tracefs_inode *ti;
- struct eventfs_inode *ei;
- struct eventfs_file *ef;
-
- if (security_locked_down(LOCKDOWN_TRACEFS))
- return -ENODEV;
-
- if (!parent)
- return -EINVAL;
-
- if (!(mode & S_IFMT))
- mode |= S_IFREG;
-
- if (!parent->d_inode)
- return -EINVAL;
-
- ti = get_tracefs(parent->d_inode);
- if (!(ti->flags & TRACEFS_EVENT_INODE))
- return -EINVAL;
-
- ei = ti->private;
- ef = eventfs_prepare_ef(name, mode, fop, NULL, data);
-
- if (IS_ERR(ef))
- return -ENOMEM;
-
- mutex_lock(&eventfs_mutex);
- list_add_tail(&ef->list, &ei->e_top_files);
- ef->d_parent = parent;
- mutex_unlock(&eventfs_mutex);
- return 0;
-}
-
-/**
- * eventfs_add_file - add eventfs file to list to create later
- * @name: the name of the file to create.
- * @mode: the permission that the file should have.
- * @ef_parent: parent eventfs_file for this file.
- * @data: something that the caller will want to get to later on.
- * @fop: struct file_operations that should be used for this file.
- *
- * This function is used to add the information needed to create a
- * file within a subdirectory of the events directory. The file created
- * will have the @mode permissions. The @data will be used to fill the
- * inode.i_private when the open() call is done. The dentry and inodes are
- * all created when they are referenced, and removed when they are no
- * longer referenced.
- */
-int eventfs_add_file(const char *name, umode_t mode,
- struct eventfs_file *ef_parent,
- void *data,
- const struct file_operations *fop)
-{
- struct eventfs_file *ef;
-
- if (security_locked_down(LOCKDOWN_TRACEFS))
- return -ENODEV;
-
- if (!ef_parent)
- return -EINVAL;
-
- if (!(mode & S_IFMT))
- mode |= S_IFREG;
-
- ef = eventfs_prepare_ef(name, mode, fop, NULL, data);
- if (IS_ERR(ef))
- return -ENOMEM;
-
- mutex_lock(&eventfs_mutex);
- list_add_tail(&ef->list, &ef_parent->ei->e_top_files);
- ef->d_parent = ef_parent->dentry;
- mutex_unlock(&eventfs_mutex);
- return 0;
-}
-
-static void free_ef(struct rcu_head *head)
-{
- struct eventfs_file *ef = container_of(head, struct eventfs_file, rcu);
-
- kfree(ef->name);
- kfree(ef->ei);
- kfree(ef);
+ kfree_const(ei->name);
+ kfree(ei->d_children);
+ kfree(ei);
}

/**
* eventfs_remove_rec - remove eventfs dir or file from list
- * @ef: eventfs_file to be removed.
- * @head: to create list of eventfs_file to be deleted
- * @level: to check recursion depth
+ * @ei: eventfs_inode to be removed.
*
- * The helper function eventfs_remove_rec() is used to clean up and free the
- * associated data from eventfs for both of the added functions.
+ * This function recursively remove eventfs_inode which
+ * contains info of file or dir.
*/
-static void eventfs_remove_rec(struct eventfs_file *ef, struct list_head *head, int level)
+static void eventfs_remove_rec(struct eventfs_inode *ei, struct list_head *head, int level)
{
- struct eventfs_file *ef_child;
+ struct eventfs_inode *ei_child;

- if (!ef)
+ if (!ei)
return;
/*
* Check recursion depth. It should never be greater than 3:
@@ -806,62 +817,68 @@ static void eventfs_remove_rec(struct eventfs_file *ef, struct list_head *head,
if (WARN_ON_ONCE(level > 3))
return;

- if (ef->ei) {
- /* search for nested folders or files */
- list_for_each_entry_srcu(ef_child, &ef->ei->e_top_files, list,
- lockdep_is_held(&eventfs_mutex)) {
- eventfs_remove_rec(ef_child, head, level + 1);
- }
+ /* search for nested folders or files */
+ list_for_each_entry_srcu(ei_child, &ei->children, list,
+ lockdep_is_held(&eventfs_mutex)) {
+ eventfs_remove_rec(ei_child, head, level + 1);
}

- list_del_rcu(&ef->list);
- list_add_tail(&ef->del_list, head);
+ list_del_rcu(&ei->list);
+ list_add_tail(&ei->del_list, head);
}

+static void unhook_dentry(struct dentry **dentry, struct dentry **list)
+{
+ if (*dentry) {
+ unsigned long ptr = (unsigned long)*list;
+
+ /* Keep the dentry from being freed yet */
+ dget(*dentry);
+
+ /*
+ * Paranoid: The dget() above should prevent the dentry
+ * from being freed and calling eventfs_set_ei_status_free().
+ * But just in case, set the link list LSB pointer to 1
+ * and have eventfs_set_ei_status_free() check that to
+ * make sure that if it does happen, it will not think
+ * the d_fsdata is an eventfs_inode.
+ *
+ * For this to work, no eventfs_inode should be allocated
+ * on a odd space, as the ef should always be allocated
+ * to be at least word aligned. Check for that too.
+ */
+ WARN_ON_ONCE(ptr & 1);
+
+ (*dentry)->d_fsdata = (void *)(ptr | 1);
+ *list = *dentry;
+ *dentry = NULL;
+ }
+}
/**
* eventfs_remove - remove eventfs dir or file from list
- * @ef: eventfs_file to be removed.
+ * @ei: eventfs_inode to be removed.
*
* This function acquire the eventfs_mutex lock and call eventfs_remove_rec()
*/
-void eventfs_remove(struct eventfs_file *ef)
+void eventfs_remove_dir(struct eventfs_inode *ei)
{
- struct eventfs_file *tmp;
- LIST_HEAD(ef_del_list);
+ struct eventfs_inode *tmp;
+ LIST_HEAD(ei_del_list);
struct dentry *dentry_list = NULL;
struct dentry *dentry;
+ int i;

- if (!ef)
+ if (!ei)
return;

mutex_lock(&eventfs_mutex);
- eventfs_remove_rec(ef, &ef_del_list, 0);
- list_for_each_entry_safe(ef, tmp, &ef_del_list, del_list) {
- if (ef->dentry) {
- unsigned long ptr = (unsigned long)dentry_list;
-
- /* Keep the dentry from being freed yet */
- dget(ef->dentry);
-
- /*
- * Paranoid: The dget() above should prevent the dentry
- * from being freed and calling eventfs_set_ef_status_free().
- * But just in case, set the link list LSB pointer to 1
- * and have eventfs_set_ef_status_free() check that to
- * make sure that if it does happen, it will not think
- * the d_fsdata is an event_file.
- *
- * For this to work, no event_file should be allocated
- * on a odd space, as the ef should always be allocated
- * to be at least word aligned. Check for that too.
- */
- WARN_ON_ONCE(ptr & 1);
-
- ef->dentry->d_fsdata = (void *)(ptr | 1);
- dentry_list = ef->dentry;
- ef->dentry = NULL;
- }
- call_srcu(&eventfs_srcu, &ef->rcu, free_ef);
+ eventfs_remove_rec(ei, &ei_del_list, 0);
+
+ list_for_each_entry_safe(ei, tmp, &ei_del_list, del_list) {
+ for (i = 0; i < ei->nr_entries; i++)
+ unhook_dentry(&ei->d_children[i], &dentry_list);
+ unhook_dentry(&ei->dentry, &dentry_list);
+ call_srcu(&eventfs_srcu, &ei->rcu, free_ei);
}
mutex_unlock(&eventfs_mutex);

@@ -876,8 +893,8 @@ void eventfs_remove(struct eventfs_file *ef)
mutex_lock(&eventfs_mutex);
/* dentry should now have at least a single reference */
WARN_ONCE((int)d_count(dentry) < 1,
- "dentry %p less than one reference (%d) after invalidate\n",
- dentry, d_count(dentry));
+ "dentry %px (%s) less than one reference (%d) after invalidate\n",
+ dentry, dentry->d_name.name, d_count(dentry));
mutex_unlock(&eventfs_mutex);
dput(dentry);
}
diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index 891653ba9cf3..34ffb2f8114e 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -385,7 +385,7 @@ static void tracefs_dentry_iput(struct dentry *dentry, struct inode *inode)

ti = get_tracefs(inode);
if (ti && ti->flags & TRACEFS_EVENT_INODE)
- eventfs_set_ef_status_free(ti, dentry);
+ eventfs_set_ei_status_free(ti, dentry);
iput(inode);
}

diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h
index 4f2e49e2197b..298d3ecaf621 100644
--- a/fs/tracefs/internal.h
+++ b/fs/tracefs/internal.h
@@ -13,6 +13,41 @@ struct tracefs_inode {
struct inode vfs_inode;
};

+/*
+ * struct eventfs_inode - hold the properties of the eventfs directories.
+ * @list: link list into the parent directory
+ * @entries: the array of entries representing the files in the directory
+ * @name: the name of the directory to create
+ * @children: link list into the child eventfs_inode
+ * @dentry: the dentry of the directory
+ * @d_parent: pointer to the parent's dentry
+ * @d_children: The array of dentries to represent the files when created
+ * @data: The private data to pass to the callbacks
+ * @nr_entries: The number of items in @entries
+ */
+struct eventfs_inode {
+ struct list_head list;
+ const struct eventfs_entry *entries;
+ const char *name;
+ struct list_head children;
+ struct dentry *dentry;
+ struct dentry *d_parent;
+ struct dentry **d_children;
+ void *data;
+ /*
+ * Union - used for deletion
+ * @del_list: list of eventfs_inode to delete
+ * @rcu: eventfs_indoe to delete in RCU
+ * @is_freed: node is freed if one of the above is set
+ */
+ union {
+ struct list_head del_list;
+ struct rcu_head rcu;
+ unsigned long is_freed;
+ };
+ int nr_entries;
+};
+
static inline struct tracefs_inode *get_tracefs(const struct inode *inode)
{
return container_of(inode, struct tracefs_inode, vfs_inode);
@@ -25,6 +60,6 @@ struct inode *tracefs_get_inode(struct super_block *sb);
struct dentry *eventfs_start_creating(const char *name, struct dentry *parent);
struct dentry *eventfs_failed_creating(struct dentry *dentry);
struct dentry *eventfs_end_creating(struct dentry *dentry);
-void eventfs_set_ef_status_free(struct tracefs_inode *ti, struct dentry *dentry);
+void eventfs_set_ei_status_free(struct tracefs_inode *ti, struct dentry *dentry);

#endif /* _TRACEFS_INTERNAL_H */
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 21ae37e49319..12207dc6722d 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -649,7 +649,7 @@ struct trace_event_file {
struct list_head list;
struct trace_event_call *event_call;
struct event_filter __rcu *filter;
- struct eventfs_file *ef;
+ struct eventfs_inode *ei;
struct trace_array *tr;
struct trace_subsystem_dir *system;
struct list_head triggers;
diff --git a/include/linux/tracefs.h b/include/linux/tracefs.h
index 009072792fa3..0c39704455d9 100644
--- a/include/linux/tracefs.h
+++ b/include/linux/tracefs.h
@@ -23,26 +23,25 @@ struct file_operations;

struct eventfs_file;

-struct dentry *eventfs_create_events_dir(const char *name,
- struct dentry *parent);
+typedef int (*eventfs_callback)(const char *name, umode_t *mode, void **data,
+ const struct file_operations **fops);

-struct eventfs_file *eventfs_add_subsystem_dir(const char *name,
- struct dentry *parent);
+struct eventfs_entry {
+ const char *name;
+ eventfs_callback callback;
+};

-struct eventfs_file *eventfs_add_dir(const char *name,
- struct eventfs_file *ef_parent);
+struct eventfs_inode;

-int eventfs_add_file(const char *name, umode_t mode,
- struct eventfs_file *ef_parent, void *data,
- const struct file_operations *fops);
+struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry *parent,
+ const struct eventfs_entry *entries,
+ int size, void *data);

-int eventfs_add_events_file(const char *name, umode_t mode,
- struct dentry *parent, void *data,
- const struct file_operations *fops);
+struct eventfs_inode *eventfs_create_dir(const char *name, struct eventfs_inode *parent,
+ const struct eventfs_entry *entries,
+ int size, void *data);

-void eventfs_remove(struct eventfs_file *ef);
-
-void eventfs_remove_events_dir(struct dentry *dentry);
+void eventfs_remove_dir(struct eventfs_inode *ei);

struct dentry *tracefs_create_file(const char *name, umode_t mode,
struct dentry *parent, void *data,
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index dd6395692ff9..4383be8fa1b0 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -9764,7 +9764,6 @@ static __init void create_trace_instances(struct dentry *d_tracer)
static void
init_tracer_tracefs(struct trace_array *tr, struct dentry *d_tracer)
{
- struct trace_event_file *file;
int cpu;

trace_create_file("available_tracers", TRACE_MODE_READ, d_tracer,
@@ -9797,11 +9796,7 @@ init_tracer_tracefs(struct trace_array *tr, struct dentry *d_tracer)
trace_create_file("trace_marker", 0220, d_tracer,
tr, &tracing_mark_fops);

- file = __find_event_file(tr, "ftrace", "print");
- if (file && file->ef)
- eventfs_add_file("trigger", TRACE_MODE_WRITE, file->ef,
- file, &event_trigger_fops);
- tr->trace_marker_file = file;
+ tr->trace_marker_file = __find_event_file(tr, "ftrace", "print");

trace_create_file("trace_marker_raw", 0220, d_tracer,
tr, &tracing_mark_raw_fops);
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index e92cb9c1292f..0e1405abf4f7 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -381,7 +381,7 @@ struct trace_array {
struct dentry *dir;
struct dentry *options;
struct dentry *percpu_dir;
- struct dentry *event_dir;
+ struct eventfs_inode *event_dir;
struct trace_options *topts;
struct list_head systems;
struct list_head events;
@@ -1349,7 +1349,7 @@ struct trace_subsystem_dir {
struct list_head list;
struct event_subsystem *subsystem;
struct trace_array *tr;
- struct eventfs_file *ef;
+ struct eventfs_inode *ei;
int ref_count;
int nr_events;
};
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 099b9b6bbdc4..a3b9d9423824 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -984,7 +984,7 @@ static void remove_subsystem(struct trace_subsystem_dir *dir)
return;

if (!--dir->nr_events) {
- eventfs_remove(dir->ef);
+ eventfs_remove_dir(dir->ei);
list_del(&dir->list);
__put_system_dir(dir);
}
@@ -992,7 +992,7 @@ static void remove_subsystem(struct trace_subsystem_dir *dir)

static void remove_event_file_dir(struct trace_event_file *file)
{
- eventfs_remove(file->ef);
+ eventfs_remove_dir(file->ei);
list_del(&file->list);
remove_subsystem(file->system);
free_event_filter(file->filter);
@@ -2282,14 +2282,40 @@ create_new_subsystem(const char *name)
return NULL;
}

-static struct eventfs_file *
+int system_callback(const char *name, umode_t *mode, void **data,
+ const struct file_operations **fops)
+{
+ if (strcmp(name, "filter") == 0)
+ *fops = &ftrace_subsystem_filter_fops;
+
+ else if (strcmp(name, "enable") == 0)
+ *fops = &ftrace_system_enable_fops;
+
+ else
+ return 0;
+
+ *mode = TRACE_MODE_WRITE;
+ return 1;
+}
+
+static struct eventfs_inode *
event_subsystem_dir(struct trace_array *tr, const char *name,
- struct trace_event_file *file, struct dentry *parent)
+ struct trace_event_file *file, struct eventfs_inode *parent)
{
struct event_subsystem *system, *iter;
struct trace_subsystem_dir *dir;
- struct eventfs_file *ef;
- int res;
+ struct eventfs_inode *ei;
+ int nr_entries;
+ static struct eventfs_entry system_entries[] = {
+ {
+ .name = "filter",
+ .callback = system_callback,
+ },
+ {
+ .name = "enable",
+ .callback = system_callback,
+ }
+ };

/* First see if we did not already create this dir */
list_for_each_entry(dir, &tr->systems, list) {
@@ -2297,7 +2323,7 @@ event_subsystem_dir(struct trace_array *tr, const char *name,
if (strcmp(system->name, name) == 0) {
dir->nr_events++;
file->system = dir;
- return dir->ef;
+ return dir->ei;
}
}

@@ -2321,39 +2347,29 @@ event_subsystem_dir(struct trace_array *tr, const char *name,
} else
__get_system(system);

- ef = eventfs_add_subsystem_dir(name, parent);
- if (IS_ERR(ef)) {
+ /* ftrace only has directories no files */
+ if (strcmp(name, "ftrace") == 0)
+ nr_entries = 0;
+ else
+ nr_entries = ARRAY_SIZE(system_entries);
+
+ ei = eventfs_create_dir(name, parent, system_entries, nr_entries, dir);
+ if (!ei) {
pr_warn("Failed to create system directory %s\n", name);
__put_system(system);
goto out_free;
}

- dir->ef = ef;
+ dir->ei = ei;
dir->tr = tr;
dir->ref_count = 1;
dir->nr_events = 1;
dir->subsystem = system;
file->system = dir;

- /* the ftrace system is special, do not create enable or filter files */
- if (strcmp(name, "ftrace") != 0) {
-
- res = eventfs_add_file("filter", TRACE_MODE_WRITE,
- dir->ef, dir,
- &ftrace_subsystem_filter_fops);
- if (res) {
- kfree(system->filter);
- system->filter = NULL;
- pr_warn("Could not create tracefs '%s/filter' entry\n", name);
- }
-
- eventfs_add_file("enable", TRACE_MODE_WRITE, dir->ef, dir,
- &ftrace_system_enable_fops);
- }
-
list_add(&dir->list, &tr->systems);

- return dir->ef;
+ return dir->ei;

out_free:
kfree(dir);
@@ -2402,15 +2418,134 @@ event_define_fields(struct trace_event_call *call)
return ret;
}

+static int event_callback(const char *name, umode_t *mode, void **data,
+ const struct file_operations **fops)
+{
+ struct trace_event_file *file = *data;
+ struct trace_event_call *call = file->event_call;
+
+ if (strcmp(name, "format") == 0) {
+ *mode = TRACE_MODE_READ;
+ *fops = &ftrace_event_format_fops;
+ *data = call;
+ return 1;
+ }
+
+ /*
+ * Only event directories that can be enabled should have
+ * triggers or filters, with the exception of the "print"
+ * event that can have a "trigger" file.
+ */
+ if (!(call->flags & TRACE_EVENT_FL_IGNORE_ENABLE)) {
+ if (call->class->reg && strcmp(name, "enable") == 0) {
+ *mode = TRACE_MODE_WRITE;
+ *fops = &ftrace_enable_fops;
+ return 1;
+ }
+
+ if (strcmp(name, "filter") == 0) {
+ *mode = TRACE_MODE_WRITE;
+ *fops = &ftrace_event_filter_fops;
+ return 1;
+ }
+ }
+
+ if (!(call->flags & TRACE_EVENT_FL_IGNORE_ENABLE) ||
+ strcmp(trace_event_name(call), "print") == 0) {
+ if (strcmp(name, "trigger") == 0) {
+ *mode = TRACE_MODE_WRITE;
+ *fops = &event_trigger_fops;
+ return 1;
+ }
+ }
+
+#ifdef CONFIG_PERF_EVENTS
+ if (call->event.type && call->class->reg &&
+ strcmp(name, "id") == 0) {
+ *mode = TRACE_MODE_READ;
+ *data = (void *)(long)call->event.type;
+ *fops = &ftrace_event_id_fops;
+ return 1;
+ }
+#endif
+
+#ifdef CONFIG_HIST_TRIGGERS
+ if (strcmp(name, "hist") == 0) {
+ *mode = TRACE_MODE_READ;
+ *fops = &event_hist_fops;
+ return 1;
+ }
+#endif
+#ifdef CONFIG_HIST_TRIGGERS_DEBUG
+ if (strcmp(name, "hist_debug") == 0) {
+ *mode = TRACE_MODE_READ;
+ *fops = &event_hist_debug_fops;
+ return 1;
+ }
+#endif
+#ifdef CONFIG_TRACE_EVENT_INJECT
+ if (call->event.type && call->class->reg &&
+ strcmp(name, "inject") == 0) {
+ *mode = 0200;
+ *fops = &event_inject_fops;
+ return 1;
+ }
+#endif
+ return 0;
+}
+
static int
-event_create_dir(struct dentry *parent, struct trace_event_file *file)
+event_create_dir(struct eventfs_inode *parent, struct trace_event_file *file)
{
struct trace_event_call *call = file->event_call;
- struct eventfs_file *ef_subsystem = NULL;
struct trace_array *tr = file->tr;
- struct eventfs_file *ef;
+ struct eventfs_inode *e_events;
+ struct eventfs_inode *ei;
const char *name;
+ int nr_entries;
int ret;
+ static struct eventfs_entry event_entries[] = {
+ {
+ .name = "enable",
+ .callback = event_callback,
+ },
+ {
+ .name = "filter",
+ .callback = event_callback,
+ },
+ {
+ .name = "trigger",
+ .callback = event_callback,
+ },
+ {
+ .name = "format",
+ .callback = event_callback,
+ },
+#ifdef CONFIG_PERF_EVENTS
+ {
+ .name = "id",
+ .callback = event_callback,
+ },
+#endif
+#ifdef CONFIG_HIST_TRIGGERS
+ {
+ .name = "hist",
+ .callback = event_callback,
+ },
+#endif
+#ifdef CONFIG_HIST_TRIGGERS_DEBUG
+ {
+ .name = "hist_debug",
+ .callback = event_callback,
+ },
+#endif
+#ifdef CONFIG_TRACE_EVENT_INJECT
+ {
+ .name = "inject",
+ .callback = event_callback,
+ },
+#endif
+ };

/*
* If the trace point header did not define TRACE_SYSTEM
@@ -2420,29 +2555,20 @@ event_create_dir(struct dentry *parent, struct trace_event_file *file)
if (WARN_ON_ONCE(strcmp(call->class->system, TRACE_SYSTEM) == 0))
return -ENODEV;

- ef_subsystem = event_subsystem_dir(tr, call->class->system, file, parent);
- if (!ef_subsystem)
+ e_events = event_subsystem_dir(tr, call->class->system, file, parent);
+ if (!e_events)
return -ENOMEM;

+ nr_entries = ARRAY_SIZE(event_entries);
+
name = trace_event_name(call);
- ef = eventfs_add_dir(name, ef_subsystem);
- if (IS_ERR(ef)) {
+ ei = eventfs_create_dir(name, e_events, event_entries, nr_entries, file);
+ if (IS_ERR(ei)) {
pr_warn("Could not create tracefs '%s' directory\n", name);
return -1;
}

- file->ef = ef;
-
- if (call->class->reg && !(call->flags & TRACE_EVENT_FL_IGNORE_ENABLE))
- eventfs_add_file("enable", TRACE_MODE_WRITE, file->ef, file,
- &ftrace_enable_fops);
-
-#ifdef CONFIG_PERF_EVENTS
- if (call->event.type && call->class->reg)
- eventfs_add_file("id", TRACE_MODE_READ, file->ef,
- (void *)(long)call->event.type,
- &ftrace_event_id_fops);
-#endif
+ file->ei = ei;

ret = event_define_fields(call);
if (ret < 0) {
@@ -2450,35 +2576,6 @@ event_create_dir(struct dentry *parent, struct trace_event_file *file)
return ret;
}

- /*
- * Only event directories that can be enabled should have
- * triggers or filters.
- */
- if (!(call->flags & TRACE_EVENT_FL_IGNORE_ENABLE)) {
- eventfs_add_file("filter", TRACE_MODE_WRITE, file->ef,
- file, &ftrace_event_filter_fops);
-
- eventfs_add_file("trigger", TRACE_MODE_WRITE, file->ef,
- file, &event_trigger_fops);
- }
-
-#ifdef CONFIG_HIST_TRIGGERS
- eventfs_add_file("hist", TRACE_MODE_READ, file->ef, file,
- &event_hist_fops);
-#endif
-#ifdef CONFIG_HIST_TRIGGERS_DEBUG
- eventfs_add_file("hist_debug", TRACE_MODE_READ, file->ef, file,
- &event_hist_debug_fops);
-#endif
- eventfs_add_file("format", TRACE_MODE_READ, file->ef, call,
- &ftrace_event_format_fops);
-
-#ifdef CONFIG_TRACE_EVENT_INJECT
- if (call->event.type && call->class->reg)
- eventfs_add_file("inject", 0200, file->ef, file,
- &event_inject_fops);
-#endif
-
return 0;
}

@@ -3623,30 +3720,65 @@ static __init int setup_trace_event(char *str)
}
__setup("trace_event=", setup_trace_event);

+static int events_callback(const char *name, umode_t *mode, void **data,
+ const struct file_operations **fops)
+{
+ if (strcmp(name, "enable") == 0) {
+ *mode = TRACE_MODE_WRITE;
+ *fops = &ftrace_tr_enable_fops;
+ return 1;
+ }
+
+ if (strcmp(name, "header_page") == 0)
+ *data = ring_buffer_print_page_header;
+
+ else if (strcmp(name, "header_event") == 0)
+ *data = ring_buffer_print_entry_header;
+
+ else
+ return 0;
+
+ *mode = TRACE_MODE_READ;
+ *fops = &ftrace_show_header_fops;
+ return 1;
+}
+
/* Expects to have event_mutex held when called */
static int
create_event_toplevel_files(struct dentry *parent, struct trace_array *tr)
{
- struct dentry *d_events;
+ struct eventfs_inode *e_events;
struct dentry *entry;
- int error = 0;
+ int nr_entries;
+ static struct eventfs_entry events_entries[] = {
+ {
+ .name = "enable",
+ .callback = events_callback,
+ },
+ {
+ .name = "header_page",
+ .callback = events_callback,
+ },
+ {
+ .name = "header_event",
+ .callback = events_callback,
+ },
+ };

entry = trace_create_file("set_event", TRACE_MODE_WRITE, parent,
tr, &ftrace_set_event_fops);
if (!entry)
return -ENOMEM;

- d_events = eventfs_create_events_dir("events", parent);
- if (IS_ERR(d_events)) {
+ nr_entries = ARRAY_SIZE(events_entries);
+
+ e_events = eventfs_create_events_dir("events", parent, events_entries,
+ nr_entries, tr);
+ if (IS_ERR(e_events)) {
pr_warn("Could not create tracefs 'events' directory\n");
return -ENOMEM;
}

- error = eventfs_add_events_file("enable", TRACE_MODE_WRITE, d_events,
- tr, &ftrace_tr_enable_fops);
- if (error)
- return -ENOMEM;
-
/* There are not as crucial, just warn if they are not created */

trace_create_file("set_event_pid", TRACE_MODE_WRITE, parent,
@@ -3656,16 +3788,7 @@ create_event_toplevel_files(struct dentry *parent, struct trace_array *tr)
TRACE_MODE_WRITE, parent, tr,
&ftrace_set_event_notrace_pid_fops);

- /* ring buffer internal formats */
- eventfs_add_events_file("header_page", TRACE_MODE_READ, d_events,
- ring_buffer_print_page_header,
- &ftrace_show_header_fops);
-
- eventfs_add_events_file("header_event", TRACE_MODE_READ, d_events,
- ring_buffer_print_entry_header,
- &ftrace_show_header_fops);
-
- tr->event_dir = d_events;
+ tr->event_dir = e_events;

return 0;
}
@@ -3749,7 +3872,7 @@ int event_trace_del_tracer(struct trace_array *tr)

down_write(&trace_event_sem);
__trace_remove_event_dirs(tr);
- eventfs_remove_events_dir(tr->event_dir);
+ eventfs_remove_dir(tr->event_dir);
up_write(&trace_event_sem);

tr->event_dir = NULL;
--
2.40.1


2023-11-17 14:24:48

by Heiko Carstens

[permalink] [raw]
Subject: Re: [PATCH v5] eventfs: Remove eventfs_file and just use eventfs_inode

Hi Steven,

On Wed, Oct 04, 2023 at 04:50:07PM -0400, Steven Rostedt wrote:
> From: "Steven Rostedt (Google)" <[email protected]>
>
> Instead of having a descriptor for every file represented in the eventfs
> directory, only have the directory itself represented. Change the API to
> send in a list of entries that represent all the files in the directory
> (but not other directories). The entry list contains a name and a callback
> function that will be used to create the files when they are accessed.
...
> Cc: Masami Hiramatsu <[email protected]>
> Cc: Mark Rutland <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Ajay Kaher <[email protected]>
> Signed-off-by: Steven Rostedt (Google) <[email protected]>
> ---
> Changes since v4: https://lore.kernel.org/linux-trace-kernel/[email protected]/
>
> - Get the ei->dentry within the eventfs_mutex to keep consistency during the lookup.
>
> fs/tracefs/event_inode.c | 847 ++++++++++++++++++-----------------
> fs/tracefs/inode.c | 2 +-
> fs/tracefs/internal.h | 37 +-
> include/linux/trace_events.h | 2 +-
> include/linux/tracefs.h | 29 +-
> kernel/trace/trace.c | 7 +-
> kernel/trace/trace.h | 4 +-
> kernel/trace/trace_events.c | 313 +++++++++----
> 8 files changed, 705 insertions(+), 536 deletions(-)

I think this patch causes from time to time crashes when running ftrace
selftests. In particular I guess there is a bug wrt error handling in this
function (see below for call trace):

> +static struct dentry *
> +create_file_dentry(struct eventfs_inode *ei, struct dentry **e_dentry,
> + struct dentry *parent, const char *name, umode_t mode, void *data,
> + const struct file_operations *fops, bool lookup)
> +{
> + struct dentry *dentry;
> + bool invalidate = false;
> +
> + mutex_lock(&eventfs_mutex);
> + /* If the e_dentry already has a dentry, use it */
> + if (*e_dentry) {
> + /* lookup does not need to up the ref count */
> + if (!lookup)
> + dget(*e_dentry);
> + mutex_unlock(&eventfs_mutex);
> + return *e_dentry;
> + }
> + mutex_unlock(&eventfs_mutex);
> +
> + /* The lookup already has the parent->d_inode locked */
> + if (!lookup)
> + inode_lock(parent->d_inode);
> +
> + dentry = create_file(name, mode, parent, data, fops);
> +
> + if (!lookup)
> + inode_unlock(parent->d_inode);
> +
> + mutex_lock(&eventfs_mutex);
> +
> + if (IS_ERR_OR_NULL(dentry)) {
> + /*
> + * When the mutex was released, something else could have
> + * created the dentry for this e_dentry. In which case
> + * use that one.
> + *
> + * Note, with the mutex held, the e_dentry cannot have content
> + * and the ei->is_freed be true at the same time.
> + */
> + WARN_ON_ONCE(ei->is_freed);
> + dentry = *e_dentry;
> + /* The lookup does not need to up the dentry refcount */
> + if (dentry && !lookup)
> + dget(dentry);
> + mutex_unlock(&eventfs_mutex);
> + return dentry;
> + }
> +
> + if (!*e_dentry && !ei->is_freed) {
> + *e_dentry = dentry;
> + dentry->d_fsdata = ei;
> + } else {
> + /*
> + * Should never happen unless we get here due to being freed.
> + * Otherwise it means two dentries exist with the same name.
> + */
> + WARN_ON_ONCE(!ei->is_freed);
> + invalidate = true;
> + }
> + mutex_unlock(&eventfs_mutex);
> +
> + if (invalidate)
> + d_invalidate(dentry);
> +
> + if (lookup || invalidate)
> + dput(dentry);
> +
> + return invalidate ? NULL : dentry;
> +}

We sometimes see crashes like this:

specification exception: 0006 ilc:2 [#1] SMP
CPU: 6 PID: 38815 Comm: ls Not tainted 6.7.0-20231116.rc1.git1.a7e756a5bb26.300.vr.fc38.s390x #1
Hardware name: IBM 3906 M04 704 (z/VM 7.1.0)
Krnl PSW : 0704c00180000000 000001682304bb00 (d_invalidate+0x30/0x110)
R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
Krnl GPRS: ffffffffffffffff 000000e200000000 0000000000000047 000000e200000007
0000000000000000 ffffff7c197bf000 000000e2f13b0b20 000000e25bfae180
000000e2f2536000 ffffffffffffffef 0000000000000000 ffffffffffffffef
000003ff95cacf98 000000e2f29323f0 000000e827c1fa18 000000e827c1f9d0
Krnl Code: 000001682304baf4: a7180000 lhi %r1,0
000001682304baf8: 583003ac l %r3,940
#000001682304bafc: ba13b058 cs %r1,%r3,88(%r11)
>000001682304bb00: ec16006b007e cij %r1,0,6,000001682304bbd6
000001682304bb06: e310b0100002 ltg %r1,16(%r11)
000001682304bb0c: a784004e brc 8,000001682304bba8
000001682304bb10: b904002b lgr %r2,%r11
000001682304bb14: c0e5ffffe67e brasl %r14,0000016823048810
Call Trace:
[<000001682304bb00>] d_invalidate+0x30/0x110
[<000001682329147a>] create_dir_dentry+0xe2/0x200
[<000001682329190a>] dcache_dir_open_wrapper+0x102/0x3e8
[<000001682301fb8a>] do_dentry_open+0x24a/0x568
[<0000016823038836>] do_open+0x2de/0x448
[<000001682303cb58>] path_openat+0x110/0x2b0
[<000001682303d688>] do_filp_open+0x90/0x130
[<0000016823022960>] do_sys_openat2+0xa8/0xd8
[<0000016823022b50>] do_sys_open+0x58/0x90
[<00000168239c9edc>] __do_syscall+0x1d4/0x200
[<00000168239db1f8>] system_call+0x70/0x98
Last Breaking-Event-Address:
[<0000016823291474>] create_dir_dentry+0xdc/0x200
Kernel panic - not syncing: Fatal exception: panic_on_oops

Note that the compare and swap instruction within d_invalidate() generates
a specification exception because it operates on an invalid address
(0xffffffffffffffef), which happens to be -EEXIST. So my assumption is that
create_dir_dentry() has incorrect error handling and passes -EEXIST instead
of a valid dentry pointer to d_invalidate().

But I leave it up to you to figure this out :)

2023-11-17 14:39:23

by Heiko Carstens

[permalink] [raw]
Subject: Re: [PATCH v5] eventfs: Remove eventfs_file and just use eventfs_inode

On Fri, Nov 17, 2023 at 03:23:35PM +0100, Heiko Carstens wrote:
> I think this patch causes from time to time crashes when running ftrace
> selftests. In particular I guess there is a bug wrt error handling in this
> function (see below for call trace):
>
> > +static struct dentry *
> > +create_file_dentry(struct eventfs_inode *ei, struct dentry **e_dentry,
> > + struct dentry *parent, const char *name, umode_t mode, void *data,
> > + const struct file_operations *fops, bool lookup)
> > +{
...
> Note that the compare and swap instruction within d_invalidate() generates
> a specification exception because it operates on an invalid address
> (0xffffffffffffffef), which happens to be -EEXIST. So my assumption is that
> create_dir_dentry() has incorrect error handling and passes -EEXIST instead
> of a valid dentry pointer to d_invalidate().
>
> But I leave it up to you to figure this out :)

Ok, wrong function quoted of course. But the rest of my statement
should be correct.

2023-11-23 11:27:04

by Heiko Carstens

[permalink] [raw]
Subject: Re: [PATCH v5] eventfs: Remove eventfs_file and just use eventfs_inode

On Fri, Nov 17, 2023 at 03:38:29PM +0100, Heiko Carstens wrote:
> On Fri, Nov 17, 2023 at 03:23:35PM +0100, Heiko Carstens wrote:
> > I think this patch causes from time to time crashes when running ftrace
> > selftests. In particular I guess there is a bug wrt error handling in this
> > function (see below for call trace):
> >
> > > +static struct dentry *
> > > +create_file_dentry(struct eventfs_inode *ei, struct dentry **e_dentry,
> > > + struct dentry *parent, const char *name, umode_t mode, void *data,
> > > + const struct file_operations *fops, bool lookup)
> > > +{
> ...
> > Note that the compare and swap instruction within d_invalidate() generates
> > a specification exception because it operates on an invalid address
> > (0xffffffffffffffef), which happens to be -EEXIST. So my assumption is that
> > create_dir_dentry() has incorrect error handling and passes -EEXIST instead
> > of a valid dentry pointer to d_invalidate().
> >
> > But I leave it up to you to figure this out :)
>
> Ok, wrong function quoted of course. But the rest of my statement
> should be correct.

So, if it helps (this still happens with Linus' master branch):

create_dir_dentry() is called with a "struct eventfs_inode *ei" (second
parameter), which points to a data structure where "is_freed" is 1. Then it
looks like create_dir() returned "-EEXIST". And looking at the code this
combination then must lead to d_invalidate() incorrectly being called with
"-EEXIST" as dentry pointer.

Now, I have no idea how the code should work, but it is quite obvious that
something is broken :)

Here the dump of the struct eventfs_inode that was passed to
create_file_dentry() when the crash happened:

crash> struct eventfs_inode 00000000eada7680
struct eventfs_inode {
list = {
next = 0x10f802da0,
prev = 0x122
},
entries = 0x12c031328 <event_entries>,
name = 0x12b90bbac <__tpstrtab_xfs_alloc_vextent_exact_bno> "xfs_alloc_vextent_exact_bno",
children = {
next = 0xeada76a0,
prev = 0xeada76a0
},
dentry = 0x0,
d_parent = 0x107c75d40,
d_children = 0xeada5700,
entry_attrs = 0x0,
attr = {
mode = 0,
uid = {
val = 0
},
gid = {
val = 0
}
},
data = 0xeada6660,
{
llist = {
next = 0xeada7668
},
rcu = {
next = 0xeada7668,
func = 0x12ad2a5b8 <free_rcu_ei>
}
},
is_freed = 1,
nr_entries = 6
}

2023-11-23 12:34:59

by Ajay Kaher

[permalink] [raw]
Subject: Re: [PATCH v5] eventfs: Remove eventfs_file and just use eventfs_inode



> On 23-Nov-2023, at 4:55 PM, Heiko Carstens <[email protected]> wrote:
>
> !! External Email
>
> On Fri, Nov 17, 2023 at 03:38:29PM +0100, Heiko Carstens wrote:
>> On Fri, Nov 17, 2023 at 03:23:35PM +0100, Heiko Carstens wrote:
>>> I think this patch causes from time to time crashes when running ftrace
>>> selftests. In particular I guess there is a bug wrt error handling in this
>>> function (see below for call trace):
>>>
>>>> +static struct dentry *
>>>> +create_file_dentry(struct eventfs_inode *ei, struct dentry **e_dentry,
>>>> + struct dentry *parent, const char *name, umode_t mode, void *data,
>>>> + const struct file_operations *fops, bool lookup)
>>>> +{
>> ...
>>> Note that the compare and swap instruction within d_invalidate() generates
>>> a specification exception because it operates on an invalid address
>>> (0xffffffffffffffef), which happens to be -EEXIST. So my assumption is that
>>> create_dir_dentry() has incorrect error handling and passes -EEXIST instead
>>> of a valid dentry pointer to d_invalidate().
>>>
>>> But I leave it up to you to figure this out :)
>>
>> Ok, wrong function quoted of course. But the rest of my statement
>> should be correct.
>
> So, if it helps (this still happens with Linus' master branch):
>
> create_dir_dentry() is called with a "struct eventfs_inode *ei" (second
> parameter), which points to a data structure where "is_freed" is 1. Then it
> looks like create_dir() returned "-EEXIST". And looking at the code this
> combination then must lead to d_invalidate() incorrectly being called with
> "-EEXIST" as dentry pointer.
>
> Now, I have no idea how the code should work, but it is quite obvious that
> something is broken :)
>
> Here the dump of the struct eventfs_inode that was passed to
> create_file_dentry() when the crash happened:
>
> crash> struct eventfs_inode 00000000eada7680
> struct eventfs_inode {
> list = {
> next = 0x10f802da0,
> prev = 0x122
> },
> entries = 0x12c031328 <event_entries>,
> name = 0x12b90bbac <__tpstrtab_xfs_alloc_vextent_exact_bno> "xfs_alloc_vextent_exact_bno",
> children = {
> next = 0xeada76a0,
> prev = 0xeada76a0
> },
> dentry = 0x0,
> d_parent = 0x107c75d40,
> d_children = 0xeada5700,
> entry_attrs = 0x0,
> attr = {
> mode = 0,
> uid = {
> val = 0
> },
> gid = {
> val = 0
> }
> },
> data = 0xeada6660,
> {
> llist = {
> next = 0xeada7668
> },
> rcu = {
> next = 0xeada7668,
> func = 0x12ad2a5b8 <free_rcu_ei>
> }
> },
> is_freed = 1,
> nr_entries = 6
> }

Heiko, your analysis looks good to me. Seems -EEXIST is from:
https://elixir.bootlin.com/linux/v6.7-rc2/source/fs/tracefs/inode.c#L533

Steve, as per me error handling should be same for create_dir_dentry()
and create_file_dentry() or am I missing something.

-Ajay

2023-11-23 15:24:26

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH v5] eventfs: Remove eventfs_file and just use eventfs_inode

On Thu, 23 Nov 2023 12:25:48 +0100
Heiko Carstens <[email protected]> wrote:

> So, if it helps (this still happens with Linus' master branch):
>
> create_dir_dentry() is called with a "struct eventfs_inode *ei" (second
> parameter), which points to a data structure where "is_freed" is 1. Then it
> looks like create_dir() returned "-EEXIST". And looking at the code this
> combination then must lead to d_invalidate() incorrectly being called with
> "-EEXIST" as dentry pointer.

I haven't looked too much at the error codes, let me do that on Monday
(it's currently Turkey weekend here in the US).

But could you test this branch:

git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git trace/core

I have a bunch of fixes in that branch that may fix your issue. I just
finished testing it and plan on pushing it to Linus before the next rc
release.

Thanks!

-- Steve


>
> Now, I have no idea how the code should work, but it is quite obvious that
> something is broken :)
>
> Here the dump of the struct eventfs_inode that was passed to
> create_file_dentry() when the crash happened:

2023-11-23 16:06:53

by Heiko Carstens

[permalink] [raw]
Subject: Re: [PATCH v5] eventfs: Remove eventfs_file and just use eventfs_inode

On Thu, Nov 23, 2023 at 10:23:49AM -0500, Steven Rostedt wrote:
> On Thu, 23 Nov 2023 12:25:48 +0100
> Heiko Carstens <[email protected]> wrote:
>
> > So, if it helps (this still happens with Linus' master branch):
> >
> > create_dir_dentry() is called with a "struct eventfs_inode *ei" (second
> > parameter), which points to a data structure where "is_freed" is 1. Then it
> > looks like create_dir() returned "-EEXIST". And looking at the code this
> > combination then must lead to d_invalidate() incorrectly being called with
> > "-EEXIST" as dentry pointer.
>
> I haven't looked too much at the error codes, let me do that on Monday
> (it's currently Turkey weekend here in the US).
>
> But could you test this branch:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git trace/core
>
> I have a bunch of fixes in that branch that may fix your issue. I just
> finished testing it and plan on pushing it to Linus before the next rc
> release.

This is not that easy to reproduce, however you branch contains commit
71cade82f2b5 ("eventfs: Do not invalidate dentry in create_file/dir_dentry()")
which removes the d_invalidate() call.
The crash I reported cannot happen anymore with that commit. I'll consider
this fixed, and report again if this (or something else) still causes
problems.

Thanks!