kernfs_dir_next_pos() overlooks the situation that the dentry
corresponding to a given pos object has already been inactive. Hence,
when kernfs_dir_pos() returns the dentry with a hash value larger than
the original one, kernfs_dir_next_pos() returns the dentry next to the
one returned by kernfs_dir_pos(). As a result, the dentry returned by
kernfs_dir_pos() is skipped.
To fix this issue, try to find a next node only when the returned
object is less than or equal to the original one.
Note that this implementation focuses on getting guarantee that the
existing nodes are never skipped, not interested in the other nodes
that are added or removed during the process.
We found this issue during a stress test that repeatedly reads
/sys/module directory to get a list of the currently loaded kernel
modules while repeatedly loading and unloading kernel modules
simultaneously.
v2: Fix the case where nodes with the same hash but with the name
larger than the original node could still be skipped. Use
kernfs_sd_compare() to compare kernfs_node objects. Imporove patch
description.
Signed-off-by: HATAYAMA Daisuke <[email protected]>
Suggested-by: Toshiyuki Okajima <[email protected]>
Cc: Eric W. Biederman <[email protected]>
---
fs/kernfs/dir.c | 4 +++-
1 files changed, 3 insertions(+), 1 deletions(-)
diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 89d1dc1..3aeeb7a 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -1621,8 +1621,10 @@ static int kernfs_dir_fop_release(struct inode *inode, struct file *filp)
static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos)
{
+ struct kernfs_node *orig = pos;
+
pos = kernfs_dir_pos(ns, parent, ino, pos);
- if (pos) {
+ if (pos && kernfs_sd_compare(pos, orig) <= 0) {
do {
struct rb_node *node = rb_next(&pos->rb);
if (!node)
--
1.7.1
Hi,
I attach a reproducer for this issue.
Expand repro.tar.gz and run two commands:
# make
# ./repro.sh
Then, repro.sh will terminate when the issue is reproduced.
In this reproducer, we prepare AtvbAC0jwH.ko and U1cN9ZbARQ.ko having the same hash value.
Then, install U1cN9ZbARQ.ko and then repeating installing and removing AtvbAC0jwH.ko over and over.
Note that AtvbAC0jwH is smaller than U1cN9ZbARQ in the string comparison.
The issue is that output of ls -1 /sys/module contains NO U1cN9ZbARQ.ko.
I found a pair of AtvbAC0jwH and U1cN9ZbARQ using kernhash.c, retrieved from fs/kernfs/dirs.c,
contained in repro.tar.gz like:
# ./kernfshash
AtvbAC0jwH U1cN9ZbARQ 6b2609c5
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Hatayama, Daisuke
> Sent: Monday, May 28, 2018 9:54 PM
> To: '[email protected]' <[email protected]>;
> '[email protected]' <[email protected]>
> Cc: Okajima, Toshiyuki <[email protected]>;
> [email protected]; '[email protected]'
> <[email protected]>
> Subject: [RESEND PATCH v2] kernfs: fix dentry unexpected skip
>
> kernfs_dir_next_pos() overlooks the situation that the dentry
> corresponding to a given pos object has already been inactive. Hence,
> when kernfs_dir_pos() returns the dentry with a hash value larger than
> the original one, kernfs_dir_next_pos() returns the dentry next to the
> one returned by kernfs_dir_pos(). As a result, the dentry returned by
> kernfs_dir_pos() is skipped.
>
> To fix this issue, try to find a next node only when the returned
> object is less than or equal to the original one.
>
> Note that this implementation focuses on getting guarantee that the
> existing nodes are never skipped, not interested in the other nodes
> that are added or removed during the process.
>
> We found this issue during a stress test that repeatedly reads
> /sys/module directory to get a list of the currently loaded kernel
> modules while repeatedly loading and unloading kernel modules
> simultaneously.
>
> v2: Fix the case where nodes with the same hash but with the name
> larger than the original node could still be skipped. Use
> kernfs_sd_compare() to compare kernfs_node objects. Imporove patch
> description.
>
> Signed-off-by: HATAYAMA Daisuke <[email protected]>
> Suggested-by: Toshiyuki Okajima <[email protected]>
> Cc: Eric W. Biederman <[email protected]>
> ---
> fs/kernfs/dir.c | 4 +++-
> 1 files changed, 3 insertions(+), 1 deletions(-)
>
> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> index 89d1dc1..3aeeb7a 100644
> --- a/fs/kernfs/dir.c
> +++ b/fs/kernfs/dir.c
> @@ -1621,8 +1621,10 @@ static int kernfs_dir_fop_release(struct inode *inode,
> struct file *filp)
> static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
> struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos)
> {
> + struct kernfs_node *orig = pos;
> +
> pos = kernfs_dir_pos(ns, parent, ino, pos);
> - if (pos) {
> + if (pos && kernfs_sd_compare(pos, orig) <= 0) {
> do {
> struct rb_node *node = rb_next(&pos->rb);
> if (!node)
> --
> 1.7.1
>
>
>
>
>
>
Hello,
On Mon, May 28, 2018 at 12:54:03PM +0000, Hatayama, Daisuke wrote:
> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> index 89d1dc1..3aeeb7a 100644
> --- a/fs/kernfs/dir.c
> +++ b/fs/kernfs/dir.c
> @@ -1621,8 +1621,10 @@ static int kernfs_dir_fop_release(struct inode *inode, struct file *filp)
> static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
> struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos)
> {
> + struct kernfs_node *orig = pos;
> +
> pos = kernfs_dir_pos(ns, parent, ino, pos);
> - if (pos) {
> + if (pos && kernfs_sd_compare(pos, orig) <= 0) {
Hmm... the code seems a bit unintuitive to me and I wonder whether
it's because there are two identical skipping loops in
kernfs_dir_pos() and kernfs_dir_next_pos() and we're now trying to
selectively disable one of them. Wouldn't it make more sense to get
rid of it from kernfs_dir_pos() and skip explicitly only when
necessary?
Thanks.
--
tejun
> -----Original Message-----
> From: Tejun Heo [mailto:[email protected]] On Behalf Of '[email protected]'
> Sent: Wednesday, May 30, 2018 1:26 AM
> To: Hatayama, Daisuke <[email protected]>
> Cc: '[email protected]' <[email protected]>; Okajima,
> Toshiyuki <[email protected]>;
> [email protected]; '[email protected]'
> <[email protected]>
> Subject: Re: [RESEND PATCH v2] kernfs: fix dentry unexpected skip
>
> Hello,
>
> On Mon, May 28, 2018 at 12:54:03PM +0000, Hatayama, Daisuke wrote:
> > diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> > index 89d1dc1..3aeeb7a 100644
> > --- a/fs/kernfs/dir.c
> > +++ b/fs/kernfs/dir.c
> > @@ -1621,8 +1621,10 @@ static int kernfs_dir_fop_release(struct inode *inode,
> struct file *filp)
> > static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
> > struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos)
> > {
> > + struct kernfs_node *orig = pos;
> > +
> > pos = kernfs_dir_pos(ns, parent, ino, pos);
> > - if (pos) {
> > + if (pos && kernfs_sd_compare(pos, orig) <= 0) {
>
> Hmm... the code seems a bit unintuitive to me and I wonder whether
> it's because there are two identical skipping loops in
> kernfs_dir_pos() and kernfs_dir_next_pos() and we're now trying to
> selectively disable one of them. Wouldn't it make more sense to get
> rid of it from kernfs_dir_pos() and skip explicitly only when
> necessary?
>
kernfs_dir_pos() checks if a kernfs_node object given as one of its
arguments is still active and if so returns it, or returns a
kernfs_node object that is most equal (possibly smaller and larger) to
the given object.
kernfs_dir_next_pos() returns a kernfs_node object that is next to the
object given by kernfs_dir_pos().
Two functions does different things and both need to skip inactive
nodes. I don't think it natural to remove the skip only from
kernfs_dir_pos().
OTOH, throughout getdents(), there is no case that the kernfs_node
object given to kernfs_dir_pos() is used afterwards in the
processing. So, is it enough to provide kernfs_dir_next_pos() only?
Then, the skip code is now not duplicated.
The patch below is my thought. How do you think?
But note that this patch has some bug so that system boot get hang
without detecting root filesystem disk :) I'm debugging this now.
diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 89d1dc1..140706f 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -1584,9 +1584,11 @@ static int kernfs_dir_fop_release(struct inode *inode, struct file *filp)
return 0;
}
-static struct kernfs_node *kernfs_dir_pos(const void *ns,
+static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
struct kernfs_node *parent, loff_t hash, struct kernfs_node *pos)
{
+ struct kernfs_node *orig = pos;
+
if (pos) {
int valid = kernfs_active(pos) &&
pos->parent == parent && hash == pos->hash;
@@ -1608,7 +1610,9 @@ static struct kernfs_node *kernfs_dir_pos(const void *ns,
}
}
/* Skip over entries which are dying/dead or in the wrong namespace */
- while (pos && (!kernfs_active(pos) || pos->ns != ns)) {
+ while (pos && (!kernfs_active(pos) ||
+ (!orig || kernfs_sd_compare(pos, orig) <= 0) ||
+ pos->ns != ns)) {
struct rb_node *node = rb_next(&pos->rb);
if (!node)
pos = NULL;
@@ -1618,22 +1622,6 @@ static struct kernfs_node *kernfs_dir_pos(const void *ns,
return pos;
}
-static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
- struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos)
-{
- pos = kernfs_dir_pos(ns, parent, ino, pos);
- if (pos) {
- do {
- struct rb_node *node = rb_next(&pos->rb);
- if (!node)
- pos = NULL;
- else
- pos = rb_to_kn(node);
- } while (pos && (!kernfs_active(pos) || pos->ns != ns));
- }
- return pos;
-}
-
static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
{
struct dentry *dentry = file->f_path.dentry;
@@ -1648,7 +1636,7 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
if (kernfs_ns_enabled(parent))
ns = kernfs_info(dentry->d_sb)->ns;
- for (pos = kernfs_dir_pos(ns, parent, ctx->pos, pos);
+ for (pos = kernfs_dir_next_pos(ns, parent, ctx->pos, pos);
pos;
pos = kernfs_dir_next_pos(ns, parent, ctx->pos, pos)) {
const char *name = pos->name;
Thanks.
HATAYAMA, Daisuke
Hello,
On Fri, Jun 01, 2018 at 09:25:32AM +0000, Hatayama, Daisuke wrote:
> kernfs_dir_pos() checks if a kernfs_node object given as one of its
> arguments is still active and if so returns it, or returns a
> kernfs_node object that is most equal (possibly smaller and larger) to
> the given object.
Sometimes they're duplicate operations tho, which is exactly the bug
the posted patch is trying to fix. What I'm suggesting is instead of
leaving both instances and skipping one conditionally, put them in one
place and trigger only when necessary. The sequence of operations
would be exactly the same. The only difference is how the code is
organized.
> kernfs_dir_next_pos() returns a kernfs_node object that is next to the
> object given by kernfs_dir_pos().
>
> Two functions does different things and both need to skip inactive
> nodes. I don't think it natural to remove the skip only from
> kernfs_dir_pos().
>
> OTOH, throughout getdents(), there is no case that the kernfs_node
> object given to kernfs_dir_pos() is used afterwards in the
> processing. So, is it enough to provide kernfs_dir_next_pos() only?
> Then, the skip code is now not duplicated.
>
> The patch below is my thought. How do you think?
>
> But note that this patch has some bug so that system boot get hang
> without detecting root filesystem disk :) I'm debugging this now.
I haven't looked into the code that closely but given that we had
cases where both skippings were fine and not, the condition is likely
gonna be a bit tricker?
Thanks.
--
tejun
"Hatayama, Daisuke" <[email protected]> writes:
> kernfs_dir_next_pos() overlooks the situation that the dentry
> corresponding to a given pos object has already been inactive. Hence,
> when kernfs_dir_pos() returns the dentry with a hash value larger than
> the original one, kernfs_dir_next_pos() returns the dentry next to the
> one returned by kernfs_dir_pos(). As a result, the dentry returned by
> kernfs_dir_pos() is skipped.
>
> To fix this issue, try to find a next node only when the returned
> object is less than or equal to the original one.
>
> Note that this implementation focuses on getting guarantee that the
> existing nodes are never skipped, not interested in the other nodes
> that are added or removed during the process.
>
> We found this issue during a stress test that repeatedly reads
> /sys/module directory to get a list of the currently loaded kernel
> modules while repeatedly loading and unloading kernel modules
> simultaneously.
>
> v2: Fix the case where nodes with the same hash but with the name
> larger than the original node could still be skipped. Use
> kernfs_sd_compare() to compare kernfs_node objects. Imporove patch
> description.
>
Semantically this looks like the right fix.
The deep bug is that kernfs_dir_pos can always advance the position,
and the code has never looked for or handled that case. AKA just the
rbtree walk is enough to advance the position.
That said your implementation is buggy. It is not safe to call
kernfs_sd_compare on orig as kernfs_put has already been called on orig
and thus orig may already be free.
I suggest moving the kernfs_put for orig into kernfs_fop_readdir,
just before the mutex_unlock calls. That makes your kernfs_sd_compare
safe.
That would then allow moving the code to skip the next entry to also
be vmoed into kernfs_fop_readir.
Perhaps something like this:
diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 89d1dc19340b..2d0f71ffb539 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -1584,13 +1584,12 @@ static int kernfs_dir_fop_release(struct inode *inode, struct file *filp)
return 0;
}
-static struct kernfs_node *kernfs_dir_pos(const void *ns,
+static struct kernfs_node *kernfs_dir_pos(
struct kernfs_node *parent, loff_t hash, struct kernfs_node *pos)
{
if (pos) {
int valid = kernfs_active(pos) &&
pos->parent == parent && hash == pos->hash;
- kernfs_put(pos);
if (!valid)
pos = NULL;
}
@@ -1607,8 +1606,14 @@ static struct kernfs_node *kernfs_dir_pos(const void *ns,
break;
}
}
- /* Skip over entries which are dying/dead or in the wrong namespace */
- while (pos && (!kernfs_active(pos) || pos->ns != ns)) {
+ return pos;
+}
+
+static struct kernfs_node *kernfs_dir_next_pos(
+ struct kernfs_node *parent, ino_t ino, struct kernfs_node *start)
+{
+ struct kernfs_node *pos = kernfs_dir_pos(parent, ino, start);
+ if (pos && (kernfs_sd_compare(pos, start) <= 0)) {
struct rb_node *node = rb_next(&pos->rb);
if (!node)
pos = NULL;
@@ -1618,27 +1623,11 @@ static struct kernfs_node *kernfs_dir_pos(const void *ns,
return pos;
}
-static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
- struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos)
-{
- pos = kernfs_dir_pos(ns, parent, ino, pos);
- if (pos) {
- do {
- struct rb_node *node = rb_next(&pos->rb);
- if (!node)
- pos = NULL;
- else
- pos = rb_to_kn(node);
- } while (pos && (!kernfs_active(pos) || pos->ns != ns));
- }
- return pos;
-}
-
static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
{
struct dentry *dentry = file->f_path.dentry;
struct kernfs_node *parent = kernfs_dentry_node(dentry);
- struct kernfs_node *pos = file->private_data;
+ struct kernfs_node *pos, *saved = file->private_data;
const void *ns = NULL;
if (!dir_emit_dots(file, ctx))
@@ -1648,23 +1637,30 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
if (kernfs_ns_enabled(parent))
ns = kernfs_info(dentry->d_sb)->ns;
- for (pos = kernfs_dir_pos(ns, parent, ctx->pos, pos);
+ for (pos = kernfs_dir_pos(parent, ctx->pos, saved);
pos;
- pos = kernfs_dir_next_pos(ns, parent, ctx->pos, pos)) {
+ pos = kernfs_dir_next_pos(parent, ctx->pos, pos)) {
const char *name = pos->name;
unsigned int type = dt_type(pos);
int len = strlen(name);
ino_t ino = pos->id.ino;
- ctx->pos = pos->hash;
- file->private_data = pos;
- kernfs_get(pos);
+ /* Skip entries not fit for userspace */
+ if (!kernfs_active(pos) || !(pos->ns != ns))
+ continue;
+
+ kernfs_put(saved);
+ saved = pos;
+ ctx->pos = saved->hash;
+ file->private_data = saved;
+ kernfs_get(saved);
mutex_unlock(&kernfs_mutex);
if (!dir_emit(ctx, name, len, ino, type))
return 0;
mutex_lock(&kernfs_mutex);
}
+ kernfs_put(saved);
mutex_unlock(&kernfs_mutex);
file->private_data = NULL;
ctx->pos = INT_MAX;
Over the years two bugs have crept into the code that handled the last
returned kernfs directory being deleted, and handled general seeking
in kernfs directories. The result is that reading the /sys/module
directory while moduled are loading and unloading will sometimes
skip over a module that was present for the entire duration of
the directory read.
The first bug is that kernfs_dir_next_pos was advancing the position
in the directory even when kernfs_dir_pos had found the starting
kernfs directory entry was not usable. In this case kernfs_dir_pos is
guaranteed to return the directory entry past the starting directory
entry, making it so that advancing the directory position is wrong.
The second bug is that kernfs_dir_pos when the saved kernfs_node
did not validate, was not always returning a directory after
the the target position, but was sometimes returning the directory
entry just before the target position.
To correct these issues and to make the code easily readable and
comprehensible several cleanups are made.
- A function kernfs_dir_next is factored out to make it straight-forward
to find the next node in a kernfs directory.
- A function kernfs_dir_skip_pos is factored out to skip over directory
entries that should not be shown to userspace. This function is called
from kernfs_fop_readdir directly removing the complication of calling
it from the other directory advancement functions.
- The kernfs_put of the saved directory entry is moved from kernfs_dir_pos
to kernfs_fop_readdir. Keeping it with the kernfs_get when it is saved
and ensuring the directory position advancment functions can reference
the saved node without complications.
- The function kernfs_dir_next_pos is modified to only advance the directory
position in the common case when kernfs_dir_pos validates and returns
the starting kernfs node. For all other cases kernfs_dir_pos is guaranteed
to return a directory position in advance of the starting directory position.
- kernfs_dir_pos is given a significant make over. It's essense remains the
same but some siginficant changes were made.
+ The offset is validated to be a valid hash value at the very beginning.
The validation is updated to handle the fact that neither 0 nor 1 are
valid hash values.
+ An extra test is added at the end of the rbtree lookup to ensure
the returned position is at or beyond the target position.
+ kernfs_name_compare is used during the rbtree lookup instead of just comparing
the hash. This allows the the passed namespace parameter to be used, and
if it is available from the saved entry the old saved name as well.
+ The validation of the saved kernfs node now checks for two cases.
If the saved node is returnable.
If the name of the saved node is usable during lookup.
In net this should result in a easier to understand, and more correct
version of directory traversal for kernfs directories.
Reported-by: "Hatayama, Daisuke" <[email protected]>
Fixes: a406f75840e1 ("sysfs: use rb-tree for inode number lookup")
Fixes: 1e5289c97bba ("sysfs: Cache the last sysfs_dirent to improve readdir scalability v2")
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
Can you test this and please verify it fixes your issue?
fs/kernfs/dir.c | 109 ++++++++++++++++++++++++++++++++++----------------------
1 file changed, 67 insertions(+), 42 deletions(-)
diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 89d1dc19340b..8148b5fec48d 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -1584,53 +1584,75 @@ static int kernfs_dir_fop_release(struct inode *inode, struct file *filp)
return 0;
}
+static struct kernfs_node *kernfs_dir_next(struct kernfs_node *pos)
+{
+ struct rb_node *node = rb_next(&pos->rb);
+ return node ? rb_to_kn(node) : NULL;
+}
+
static struct kernfs_node *kernfs_dir_pos(const void *ns,
- struct kernfs_node *parent, loff_t hash, struct kernfs_node *pos)
+ struct kernfs_node *parent, loff_t off, struct kernfs_node *saved)
{
- if (pos) {
- int valid = kernfs_active(pos) &&
- pos->parent == parent && hash == pos->hash;
- kernfs_put(pos);
- if (!valid)
- pos = NULL;
- }
- if (!pos && (hash > 1) && (hash < INT_MAX)) {
- struct rb_node *node = parent->dir.children.rb_node;
- while (node) {
- pos = rb_to_kn(node);
-
- if (hash < pos->hash)
- node = node->rb_left;
- else if (hash > pos->hash)
- node = node->rb_right;
- else
- break;
+ struct kernfs_node *pos;
+ struct rb_node *node;
+ unsigned int hash;
+ const char *name = "";
+
+ /* Is off a valid name hash? */
+ if ((off < 2) || (off >= INT_MAX))
+ return NULL;
+ hash = off;
+
+ /* Is the saved position usable? */
+ if (saved) {
+ /* Proper parent and hash? */
+ if ((parent != saved->parent) || (saved->hash != hash)) {
+ saved = NULL;
+ } else {
+ if (kernfs_active(saved))
+ return saved;
+ name = saved->name;
}
}
- /* Skip over entries which are dying/dead or in the wrong namespace */
- while (pos && (!kernfs_active(pos) || pos->ns != ns)) {
- struct rb_node *node = rb_next(&pos->rb);
- if (!node)
- pos = NULL;
+
+ /* Find the closest pos to the hash we are looking for */
+ pos = NULL;
+ node = parent->dir.children.rb_node;
+ while (node) {
+ int result;
+
+ pos = rb_to_kn(node);
+ result = kernfs_name_compare(hash, name, ns, pos);
+ if (result < 0)
+ node = node->rb_left;
+ else if (result > 0)
+ node = node->rb_right;
else
- pos = rb_to_kn(node);
+ break;
}
+
+ /* Ensure pos is at or beyond the target position */
+ if (pos && (kernfs_name_compare(hash, name, ns, pos) < 0))
+ pos = kernfs_dir_next(pos);
+
return pos;
}
static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
- struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos)
+ struct kernfs_node *parent, loff_t off, struct kernfs_node *start)
{
- pos = kernfs_dir_pos(ns, parent, ino, pos);
- if (pos) {
- do {
- struct rb_node *node = rb_next(&pos->rb);
- if (!node)
- pos = NULL;
- else
- pos = rb_to_kn(node);
- } while (pos && (!kernfs_active(pos) || pos->ns != ns));
- }
+ struct kernfs_node *pos = kernfs_dir_pos(ns, parent, off, start);
+ if (likely(pos == start))
+ pos = kernfs_dir_next(pos);
+ return pos;
+}
+
+static struct kernfs_node *kernfs_dir_skip_pos(const void *ns,
+ struct kernfs_node *pos)
+{
+ /* Skip entries we don't want to return to userspace */
+ while (pos && !(kernfs_active(pos) && (pos->ns == ns)))
+ pos = kernfs_dir_next(pos);
return pos;
}
@@ -1638,7 +1660,7 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
{
struct dentry *dentry = file->f_path.dentry;
struct kernfs_node *parent = kernfs_dentry_node(dentry);
- struct kernfs_node *pos = file->private_data;
+ struct kernfs_node *pos, *saved = file->private_data;
const void *ns = NULL;
if (!dir_emit_dots(file, ctx))
@@ -1648,23 +1670,26 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
if (kernfs_ns_enabled(parent))
ns = kernfs_info(dentry->d_sb)->ns;
- for (pos = kernfs_dir_pos(ns, parent, ctx->pos, pos);
- pos;
+ for (pos = kernfs_dir_pos(ns, parent, ctx->pos, saved);
+ (pos = kernfs_dir_skip_pos(ns, pos));
pos = kernfs_dir_next_pos(ns, parent, ctx->pos, pos)) {
const char *name = pos->name;
unsigned int type = dt_type(pos);
int len = strlen(name);
ino_t ino = pos->id.ino;
- ctx->pos = pos->hash;
- file->private_data = pos;
- kernfs_get(pos);
+ kernfs_put(saved);
+ saved = pos;
+ ctx->pos = saved->hash;
+ file->private_data = saved;
+ kernfs_get(saved);
mutex_unlock(&kernfs_mutex);
if (!dir_emit(ctx, name, len, ino, type))
return 0;
mutex_lock(&kernfs_mutex);
}
+ kernfs_put(saved);
mutex_unlock(&kernfs_mutex);
file->private_data = NULL;
ctx->pos = INT_MAX;
--
2.14.1
Hello,
Thanks for this patch, Eric.
> -----Original Message-----
> From: Eric W. Biederman [mailto:[email protected]]
> Sent: Monday, June 4, 2018 3:51 AM
> To: Hatayama, Daisuke <[email protected]>
> Cc: '[email protected]' <[email protected]>;
> '[email protected]' <[email protected]>; Okajima, Toshiyuki
> <[email protected]>; [email protected];
> '[email protected]' <[email protected]>
> Subject: [CFT][PATCH] kernfs: Correct kernfs directory seeks.
>
>
> Over the years two bugs have crept into the code that handled the last
> returned kernfs directory being deleted, and handled general seeking
> in kernfs directories. The result is that reading the /sys/module
> directory while moduled are loading and unloading will sometimes
> skip over a module that was present for the entire duration of
> the directory read.
>
> The first bug is that kernfs_dir_next_pos was advancing the position
> in the directory even when kernfs_dir_pos had found the starting
> kernfs directory entry was not usable. In this case kernfs_dir_pos is
> guaranteed to return the directory entry past the starting directory
> entry, making it so that advancing the directory position is wrong.
>
> The second bug is that kernfs_dir_pos when the saved kernfs_node
> did not validate, was not always returning a directory after
> the the target position, but was sometimes returning the directory
> entry just before the target position.
>
> To correct these issues and to make the code easily readable and
> comprehensible several cleanups are made.
>
> - A function kernfs_dir_next is factored out to make it straight-forward
> to find the next node in a kernfs directory.
>
> - A function kernfs_dir_skip_pos is factored out to skip over directory
> entries that should not be shown to userspace. This function is called
> from kernfs_fop_readdir directly removing the complication of calling
> it from the other directory advancement functions.
>
> - The kernfs_put of the saved directory entry is moved from kernfs_dir_pos
> to kernfs_fop_readdir. Keeping it with the kernfs_get when it is saved
> and ensuring the directory position advancment functions can reference
> the saved node without complications.
>
> - The function kernfs_dir_next_pos is modified to only advance the directory
> position in the common case when kernfs_dir_pos validates and returns
> the starting kernfs node. For all other cases kernfs_dir_pos is guaranteed
> to return a directory position in advance of the starting directory position.
>
> - kernfs_dir_pos is given a significant make over. It's essense remains the
> same but some siginficant changes were made.
>
> + The offset is validated to be a valid hash value at the very beginning.
> The validation is updated to handle the fact that neither 0 nor 1 are
> valid hash values.
>
> + An extra test is added at the end of the rbtree lookup to ensure
> the returned position is at or beyond the target position.
>
> + kernfs_name_compare is used during the rbtree lookup instead of just
> comparing
> the hash. This allows the the passed namespace parameter to be used, and
> if it is available from the saved entry the old saved name as well.
>
> + The validation of the saved kernfs node now checks for two cases.
> If the saved node is returnable.
> If the name of the saved node is usable during lookup.
>
> In net this should result in a easier to understand, and more correct
> version of directory traversal for kernfs directories.
>
> Reported-by: "Hatayama, Daisuke" <[email protected]>
> Fixes: a406f75840e1 ("sysfs: use rb-tree for inode number lookup")
> Fixes: 1e5289c97bba ("sysfs: Cache the last sysfs_dirent to improve readdir
> scalability v2")
> Signed-off-by: "Eric W. Biederman" <[email protected]>
> ---
>
> Can you test this and please verify it fixes your issue?
>
I tried this patch on top of v4.17 but the system fails to boot
without detecting root disks by dracut like this:
[ 196.121294] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
[ 196.672175] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
[ 197.219519] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
[ 197.768997] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
[ 198.312498] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
[ 198.856841] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
[ 199.403190] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
[ 199.945999] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
[ 200.490281] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
[ 201.071177] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
[ 201.074958] dracut-initqueue[324]: Warning: Could not boot.
Starting Setup Virtual Console...
[ OK ] Started Setup Virtual Console.
[ 201.245921] audit: type=1130 audit(1528132266.260:12): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=systemd-vconsole-setup comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
[ 201.256378] audit: type=1131 audit(1528132266.260:13): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=systemd-vconsole-setup comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Starting Dracut Emergency Shell...
Warning: /dev/fedora/root does not exist
Warning: /dev/fedora/swap does not exist
Warning: /dev/mapper/fedora-root does not exist
Generating "/run/initramfs/rdsosreport.txt"
Entering emergency mode. Exit the shell to continue.
Type "journalctl" to view system logs.
You might want to save "/run/initramfs/rdsosreport.txt" to a USB stick or /boot
after mounting them and attach it to a bug report.
dracut:/# ls /sys/module
8139cp dm_bufio kernel psmouse ttm
8139too dm_mirror keyboard pstore uhci_hcd
8250 dm_mod kgdboc qemu_fw_cfg usbcore
acpi dm_snapshot kgdbts qxl usbhid
acpiphp drm libahci random uv_nmi
ahci drm_kms_helper libata rcupdate virtio
ata_generic dynamic_debug md_mod rcutree virtio_console
ata_piix edac_core mii rng_core virtio_pci
battery ehci_hcd module scsi_mod virtio_ring
block fb mousedev serio_raw vt
button firmware_class pata_acpi sg watchdog
configfs fscrypto pci_hotplug slab_common workqueue
cpufreq hid pcie_aspm spurious xhci_hcd
cpuidle hid_magicmouse pciehp sr_mod xz_dec
crc32c_intel hid_ntrig pcmcia srcutree zswap
cryptomgr i8042 pcmcia_core suspend
debug_core intel_idle printk sysrq
devres ipv6 processor tcp_cubic
dracut:/# lsmod
Module Size Used by
qxl 77824 1
drm_kms_helper 196608 1
ttm 126976 1
drm 454656 4 qxl,ttm
virtio_console 32768 0
ata_generic 16384 0
crc32c_intel 24576 0
8139too 40960 0
virtio_pci 28672 0
8139cp 32768 0
virtio_ring 28672 2 virtio_pci
serio_raw 16384 0
pata_acpi 16384 0
virtio 16384 2 virtio_pci
mii 16384 2 8139too
qemu_fw_cfg 16384 0
dracut:/#
OTOH, there's no issue on the pure v4.17 kernel.
As above, ls /sys/module looks apparently good. But I guess any part of
behavior of getdentries() on sysfs must have changed, affecting the disk
detection...
and,
> fs/kernfs/dir.c | 109
> ++++++++++++++++++++++++++++++++++----------------------
> 1 file changed, 67 insertions(+), 42 deletions(-)
>
> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> index 89d1dc19340b..8148b5fec48d 100644
> --- a/fs/kernfs/dir.c
> +++ b/fs/kernfs/dir.c
> @@ -1584,53 +1584,75 @@ static int kernfs_dir_fop_release(struct inode *inode,
> struct file *filp)
> return 0;
> }
>
> +static struct kernfs_node *kernfs_dir_next(struct kernfs_node *pos)
> +{
> + struct rb_node *node = rb_next(&pos->rb);
> + return node ? rb_to_kn(node) : NULL;
> +}
> +
> static struct kernfs_node *kernfs_dir_pos(const void *ns,
> - struct kernfs_node *parent, loff_t hash, struct kernfs_node *pos)
> + struct kernfs_node *parent, loff_t off, struct kernfs_node *saved)
> {
> - if (pos) {
> - int valid = kernfs_active(pos) &&
> - pos->parent == parent && hash == pos->hash;
> - kernfs_put(pos);
> - if (!valid)
> - pos = NULL;
> - }
> - if (!pos && (hash > 1) && (hash < INT_MAX)) {
> - struct rb_node *node = parent->dir.children.rb_node;
> - while (node) {
> - pos = rb_to_kn(node);
> -
> - if (hash < pos->hash)
> - node = node->rb_left;
> - else if (hash > pos->hash)
> - node = node->rb_right;
> - else
> - break;
> + struct kernfs_node *pos;
> + struct rb_node *node;
> + unsigned int hash;
> + const char *name = "";
> +
> + /* Is off a valid name hash? */
> + if ((off < 2) || (off >= INT_MAX))
> + return NULL;
> + hash = off;
> +
> + /* Is the saved position usable? */
> + if (saved) {
> + /* Proper parent and hash? */
> + if ((parent != saved->parent) || (saved->hash != hash)) {
> + saved = NULL;
name is uninitialized in this path.
> + } else {
> + if (kernfs_active(saved))
> + return saved;
> + name = saved->name;
> }
> }
> - /* Skip over entries which are dying/dead or in the wrong namespace
> */
> - while (pos && (!kernfs_active(pos) || pos->ns != ns)) {
> - struct rb_node *node = rb_next(&pos->rb);
> - if (!node)
> - pos = NULL;
> +
> + /* Find the closest pos to the hash we are looking for */
> + pos = NULL;
> + node = parent->dir.children.rb_node;
> + while (node) {
> + int result;
> +
> + pos = rb_to_kn(node);
> + result = kernfs_name_compare(hash, name, ns, pos);
> + if (result < 0)
> + node = node->rb_left;
> + else if (result > 0)
> + node = node->rb_right;
> else
> - pos = rb_to_kn(node);
> + break;
> }
> +
> + /* Ensure pos is at or beyond the target position */
> + if (pos && (kernfs_name_compare(hash, name, ns, pos) < 0))
> + pos = kernfs_dir_next(pos);
> +
Why not trying to find the final target object here?
Looking at the code, I think the operation needed here is to find the
smallest active object in the same namespace from the objects larger
than the saved object given as argument. The saved object appears to
be used as cache. I think dividing the process into kernfs_dir_pos()
is not necessary.
I mean like this:
# git diff
diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 8148b5f..eeffc8c 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -1605,13 +1605,13 @@ static int kernfs_dir_fop_release(struct inode *inode, struct file *filp)
/* Is the saved position usable? */
if (saved) {
+ name = saved->name;
/* Proper parent and hash? */
if ((parent != saved->parent) || (saved->hash != hash)) {
saved = NULL;
- } else {
- if (kernfs_active(saved))
- return saved;
- name = saved->name;
+ } else if (kernfs_active(saved)) {
+ pos = saved;
+ goto skip;
}
}
@@ -1631,22 +1631,14 @@ static int kernfs_dir_fop_release(struct inode *inode, struct file *filp)
break;
}
+skip:
/* Ensure pos is at or beyond the target position */
- if (pos && (kernfs_name_compare(hash, name, ns, pos) < 0))
+ if (pos && (kernfs_name_compare(hash, name, ns, pos) <= 0))
pos = kernfs_dir_next(pos);
return pos;
}
-static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
- struct kernfs_node *parent, loff_t off, struct kernfs_node *start)
-{
- struct kernfs_node *pos = kernfs_dir_pos(ns, parent, off, start);
- if (likely(pos == start))
- pos = kernfs_dir_next(pos);
- return pos;
-}
-
static struct kernfs_node *kernfs_dir_skip_pos(const void *ns,
struct kernfs_node *pos)
{
@@ -1672,7 +1664,7 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
for (pos = kernfs_dir_pos(ns, parent, ctx->pos, saved);
(pos = kernfs_dir_skip_pos(ns, pos));
- pos = kernfs_dir_next_pos(ns, parent, ctx->pos, pos)) {
+ pos = kernfs_dir_pos(ns, parent, ctx->pos, pos)) {
const char *name = pos->name;
unsigned int type = dt_type(pos);
int len = strlen(name);
> return pos;
> }
>
> static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
> - struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos)
> + struct kernfs_node *parent, loff_t off, struct kernfs_node *start)
> {
> - pos = kernfs_dir_pos(ns, parent, ino, pos);
> - if (pos) {
> - do {
> - struct rb_node *node = rb_next(&pos->rb);
> - if (!node)
> - pos = NULL;
> - else
> - pos = rb_to_kn(node);
> - } while (pos && (!kernfs_active(pos) || pos->ns != ns));
> - }
> + struct kernfs_node *pos = kernfs_dir_pos(ns, parent, off, start);
> + if (likely(pos == start))
> + pos = kernfs_dir_next(pos);
> + return pos;
> +}
> +
> +static struct kernfs_node *kernfs_dir_skip_pos(const void *ns,
> + struct kernfs_node *pos)
> +{
> + /* Skip entries we don't want to return to userspace */
> + while (pos && !(kernfs_active(pos) && (pos->ns == ns)))
> + pos = kernfs_dir_next(pos);
> return pos;
> }
>
> @@ -1638,7 +1660,7 @@ static int kernfs_fop_readdir(struct file *file, struct
> dir_context *ctx)
> {
> struct dentry *dentry = file->f_path.dentry;
> struct kernfs_node *parent = kernfs_dentry_node(dentry);
> - struct kernfs_node *pos = file->private_data;
> + struct kernfs_node *pos, *saved = file->private_data;
> const void *ns = NULL;
>
> if (!dir_emit_dots(file, ctx))
> @@ -1648,23 +1670,26 @@ static int kernfs_fop_readdir(struct file *file,
> struct dir_context *ctx)
> if (kernfs_ns_enabled(parent))
> ns = kernfs_info(dentry->d_sb)->ns;
>
> - for (pos = kernfs_dir_pos(ns, parent, ctx->pos, pos);
> - pos;
> + for (pos = kernfs_dir_pos(ns, parent, ctx->pos, saved);
> + (pos = kernfs_dir_skip_pos(ns, pos));
> pos = kernfs_dir_next_pos(ns, parent, ctx->pos, pos)) {
> const char *name = pos->name;
> unsigned int type = dt_type(pos);
> int len = strlen(name);
> ino_t ino = pos->id.ino;
>
> - ctx->pos = pos->hash;
> - file->private_data = pos;
> - kernfs_get(pos);
> + kernfs_put(saved);
> + saved = pos;
> + ctx->pos = saved->hash;
> + file->private_data = saved;
> + kernfs_get(saved);
>
> mutex_unlock(&kernfs_mutex);
> if (!dir_emit(ctx, name, len, ino, type))
> return 0;
> mutex_lock(&kernfs_mutex);
> }
> + kernfs_put(saved);
> mutex_unlock(&kernfs_mutex);
> file->private_data = NULL;
> ctx->pos = INT_MAX;
> --
> 2.14.1
>
>
> -----Original Message-----
> From: Tejun Heo [mailto:[email protected]] On Behalf Of '[email protected]'
> Sent: Saturday, June 2, 2018 2:07 AM
> To: Hatayama, Daisuke <[email protected]>
> Cc: '[email protected]' <[email protected]>; Okajima,
> Toshiyuki <[email protected]>;
> [email protected]; '[email protected]'
> <[email protected]>
> Subject: Re: [RESEND PATCH v2] kernfs: fix dentry unexpected skip
>
> Hello,
>
> On Fri, Jun 01, 2018 at 09:25:32AM +0000, Hatayama, Daisuke wrote:
> > kernfs_dir_pos() checks if a kernfs_node object given as one of its
> > arguments is still active and if so returns it, or returns a
> > kernfs_node object that is most equal (possibly smaller and larger) to
> > the given object.
>
> Sometimes they're duplicate operations tho, which is exactly the bug
> the posted patch is trying to fix. What I'm suggesting is instead of
> leaving both instances and skipping one conditionally, put them in one
> place and trigger only when necessary. The sequence of operations
> would be exactly the same. The only difference is how the code is
> organized.
>
I see and I think Eric's patch is written as you suggest very well.
> > kernfs_dir_next_pos() returns a kernfs_node object that is next to the
> > object given by kernfs_dir_pos().
> >
> > Two functions does different things and both need to skip inactive
> > nodes. I don't think it natural to remove the skip only from
> > kernfs_dir_pos().
> >
> > OTOH, throughout getdents(), there is no case that the kernfs_node
> > object given to kernfs_dir_pos() is used afterwards in the
> > processing. So, is it enough to provide kernfs_dir_next_pos() only?
> > Then, the skip code is now not duplicated.
> >
> > The patch below is my thought. How do you think?
> >
> > But note that this patch has some bug so that system boot get hang
> > without detecting root filesystem disk :) I'm debugging this now.
>
> I haven't looked into the code that closely but given that we had
> cases where both skippings were fine and not, the condition is likely
> gonna be a bit tricker?
I agree to this version looks a bit tricker. But I think once the skipping
code is separated as Eric's patch, it would be resolved naturally.
>
> Thanks.
>
> --
> tejun
>
"Hatayama, Daisuke" <[email protected]> writes:
> Hello,
>
> Thanks for this patch, Eric.
>
>> -----Original Message-----
>> From: Eric W. Biederman [mailto:[email protected]]
>> Sent: Monday, June 4, 2018 3:51 AM
>> To: Hatayama, Daisuke <[email protected]>
>> Cc: '[email protected]' <[email protected]>;
>> '[email protected]' <[email protected]>; Okajima, Toshiyuki
>> <[email protected]>; [email protected];
>> '[email protected]' <[email protected]>
>> Subject: [CFT][PATCH] kernfs: Correct kernfs directory seeks.
>>
>>
>> Over the years two bugs have crept into the code that handled the last
>> returned kernfs directory being deleted, and handled general seeking
>> in kernfs directories. The result is that reading the /sys/module
>> directory while moduled are loading and unloading will sometimes
>> skip over a module that was present for the entire duration of
>> the directory read.
>>
>> The first bug is that kernfs_dir_next_pos was advancing the position
>> in the directory even when kernfs_dir_pos had found the starting
>> kernfs directory entry was not usable. In this case kernfs_dir_pos is
>> guaranteed to return the directory entry past the starting directory
>> entry, making it so that advancing the directory position is wrong.
>>
>> The second bug is that kernfs_dir_pos when the saved kernfs_node
>> did not validate, was not always returning a directory after
>> the the target position, but was sometimes returning the directory
>> entry just before the target position.
>>
>> To correct these issues and to make the code easily readable and
>> comprehensible several cleanups are made.
>>
>> - A function kernfs_dir_next is factored out to make it straight-forward
>> to find the next node in a kernfs directory.
>>
>> - A function kernfs_dir_skip_pos is factored out to skip over directory
>> entries that should not be shown to userspace. This function is called
>> from kernfs_fop_readdir directly removing the complication of calling
>> it from the other directory advancement functions.
>>
>> - The kernfs_put of the saved directory entry is moved from kernfs_dir_pos
>> to kernfs_fop_readdir. Keeping it with the kernfs_get when it is saved
>> and ensuring the directory position advancment functions can reference
>> the saved node without complications.
>>
>> - The function kernfs_dir_next_pos is modified to only advance the directory
>> position in the common case when kernfs_dir_pos validates and returns
>> the starting kernfs node. For all other cases kernfs_dir_pos is guaranteed
>> to return a directory position in advance of the starting directory position.
>>
>> - kernfs_dir_pos is given a significant make over. It's essense remains the
>> same but some siginficant changes were made.
>>
>> + The offset is validated to be a valid hash value at the very beginning.
>> The validation is updated to handle the fact that neither 0 nor 1 are
>> valid hash values.
>>
>> + An extra test is added at the end of the rbtree lookup to ensure
>> the returned position is at or beyond the target position.
>>
>> + kernfs_name_compare is used during the rbtree lookup instead of just
>> comparing
>> the hash. This allows the the passed namespace parameter to be used, and
>> if it is available from the saved entry the old saved name as well.
>>
>> + The validation of the saved kernfs node now checks for two cases.
>> If the saved node is returnable.
>> If the name of the saved node is usable during lookup.
>>
>> In net this should result in a easier to understand, and more correct
>> version of directory traversal for kernfs directories.
>>
>> Reported-by: "Hatayama, Daisuke" <[email protected]>
>> Fixes: a406f75840e1 ("sysfs: use rb-tree for inode number lookup")
>> Fixes: 1e5289c97bba ("sysfs: Cache the last sysfs_dirent to improve readdir
>> scalability v2")
>> Signed-off-by: "Eric W. Biederman" <[email protected]>
>> ---
>>
>> Can you test this and please verify it fixes your issue?
>>
>
> I tried this patch on top of v4.17 but the system fails to boot
> without detecting root disks by dracut like this:
>
> [ 196.121294] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
> [ 196.672175] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
> [ 197.219519] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
> [ 197.768997] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
> [ 198.312498] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
> [ 198.856841] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
> [ 199.403190] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
> [ 199.945999] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
> [ 200.490281] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
> [ 201.071177] dracut-initqueue[324]: Warning: dracut-initqueue timeout - starting timeout scripts
> [ 201.074958] dracut-initqueue[324]: Warning: Could not boot.
> Starting Setup Virtual Console...
> [ OK ] Started Setup Virtual Console.
> [ 201.245921] audit: type=1130 audit(1528132266.260:12): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=systemd-vconsole-setup comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
> [ 201.256378] audit: type=1131 audit(1528132266.260:13): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='unit=systemd-vconsole-setup comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
> Starting Dracut Emergency Shell...
> Warning: /dev/fedora/root does not exist
> Warning: /dev/fedora/swap does not exist
> Warning: /dev/mapper/fedora-root does not exist
>
> Generating "/run/initramfs/rdsosreport.txt"
>
>
> Entering emergency mode. Exit the shell to continue.
> Type "journalctl" to view system logs.
> You might want to save "/run/initramfs/rdsosreport.txt" to a USB stick or /boot
> after mounting them and attach it to a bug report.
>
>
> dracut:/# ls /sys/module
> 8139cp dm_bufio kernel psmouse ttm
> 8139too dm_mirror keyboard pstore uhci_hcd
> 8250 dm_mod kgdboc qemu_fw_cfg usbcore
> acpi dm_snapshot kgdbts qxl usbhid
> acpiphp drm libahci random uv_nmi
> ahci drm_kms_helper libata rcupdate virtio
> ata_generic dynamic_debug md_mod rcutree virtio_console
> ata_piix edac_core mii rng_core virtio_pci
> battery ehci_hcd module scsi_mod virtio_ring
> block fb mousedev serio_raw vt
> button firmware_class pata_acpi sg watchdog
> configfs fscrypto pci_hotplug slab_common workqueue
> cpufreq hid pcie_aspm spurious xhci_hcd
> cpuidle hid_magicmouse pciehp sr_mod xz_dec
> crc32c_intel hid_ntrig pcmcia srcutree zswap
> cryptomgr i8042 pcmcia_core suspend
> debug_core intel_idle printk sysrq
> devres ipv6 processor tcp_cubic
> dracut:/# lsmod
> Module Size Used by
> qxl 77824 1
> drm_kms_helper 196608 1
> ttm 126976 1
> drm 454656 4 qxl,ttm
> virtio_console 32768 0
> ata_generic 16384 0
> crc32c_intel 24576 0
> 8139too 40960 0
> virtio_pci 28672 0
> 8139cp 32768 0
> virtio_ring 28672 2 virtio_pci
> serio_raw 16384 0
> pata_acpi 16384 0
> virtio 16384 2 virtio_pci
> mii 16384 2 8139too
> qemu_fw_cfg 16384 0
> dracut:/#
>
> OTOH, there's no issue on the pure v4.17 kernel.
>
> As above, ls /sys/module looks apparently good. But I guess any part of
> behavior of getdentries() on sysfs must have changed, affecting the disk
> detection...
I haven't been able to reproduce this yet. My test system boots fine.
Which fedora are you testing on?
>> fs/kernfs/dir.c | 109
>> ++++++++++++++++++++++++++++++++++----------------------
>> 1 file changed, 67 insertions(+), 42 deletions(-)
>>
>> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
>> index 89d1dc19340b..8148b5fec48d 100644
>> --- a/fs/kernfs/dir.c
>> +++ b/fs/kernfs/dir.c
>> @@ -1584,53 +1584,75 @@ static int kernfs_dir_fop_release(struct inode *inode,
>> struct file *filp)
>> return 0;
>> }
>>
>> +static struct kernfs_node *kernfs_dir_next(struct kernfs_node *pos)
>> +{
>> + struct rb_node *node = rb_next(&pos->rb);
>> + return node ? rb_to_kn(node) : NULL;
>> +}
>> +
>> static struct kernfs_node *kernfs_dir_pos(const void *ns,
>> - struct kernfs_node *parent, loff_t hash, struct kernfs_node *pos)
>> + struct kernfs_node *parent, loff_t off, struct kernfs_node *saved)
>> {
>> - if (pos) {
>> - int valid = kernfs_active(pos) &&
>> - pos->parent == parent && hash == pos->hash;
>> - kernfs_put(pos);
>> - if (!valid)
>> - pos = NULL;
>> - }
>> - if (!pos && (hash > 1) && (hash < INT_MAX)) {
>> - struct rb_node *node = parent->dir.children.rb_node;
>> - while (node) {
>> - pos = rb_to_kn(node);
>> -
>> - if (hash < pos->hash)
>> - node = node->rb_left;
>> - else if (hash > pos->hash)
>> - node = node->rb_right;
>> - else
>> - break;
>> + struct kernfs_node *pos;
>> + struct rb_node *node;
>> + unsigned int hash;
>> + const char *name = "";
>> +
>> + /* Is off a valid name hash? */
>> + if ((off < 2) || (off >= INT_MAX))
>> + return NULL;
>> + hash = off;
>> +
>> + /* Is the saved position usable? */
>> + if (saved) {
>> + /* Proper parent and hash? */
>> + if ((parent != saved->parent) || (saved->hash != hash)) {
>> + saved = NULL;
>
> name is uninitialized in this path.
It is. name is initialized to "" see above.
>> + } else {
>> + if (kernfs_active(saved))
>> + return saved;
>> + name = saved->name;
>> }
>> }
>> - /* Skip over entries which are dying/dead or in the wrong namespace
>> */
>> - while (pos && (!kernfs_active(pos) || pos->ns != ns)) {
>> - struct rb_node *node = rb_next(&pos->rb);
>> - if (!node)
>> - pos = NULL;
>> +
>> + /* Find the closest pos to the hash we are looking for */
>> + pos = NULL;
>> + node = parent->dir.children.rb_node;
>> + while (node) {
>> + int result;
>> +
>> + pos = rb_to_kn(node);
>> + result = kernfs_name_compare(hash, name, ns, pos);
>> + if (result < 0)
>> + node = node->rb_left;
>> + else if (result > 0)
>> + node = node->rb_right;
>> else
>> - pos = rb_to_kn(node);
>> + break;
>> }
>> +
>> + /* Ensure pos is at or beyond the target position */
>> + if (pos && (kernfs_name_compare(hash, name, ns, pos) < 0))
>> + pos = kernfs_dir_next(pos);
>> +
>
> Why not trying to find the final target object here?
>
> Looking at the code, I think the operation needed here is to find the
> smallest active object in the same namespace from the objects larger
> than the saved object given as argument. The saved object appears to
> be used as cache. I think dividing the process into kernfs_dir_pos()
> is not necessary.
What you are suggesting below is wrong when restarting readdir.
Ordinarily saved is the next entry we are going to return from readdir.
When dir_emit does not succeed we stop returnning entries and when
we restart readdir we need to attempt dir_emit on saved again.
Always advancing past saved will cause us to skip saved in the
event a single readdir is not enough.
The restart (kernfs_dir_pos) must return saved or if saved is now gone,
the first entry past saved.
Saved dying in the middle is what I believe caused the original issue.
Making all of this clear is part of the reason I moved the
kernfs_dir_skip_pos logic into it's own function.
Eric
>
> I mean like this:
>
> # git diff
> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> index 8148b5f..eeffc8c 100644
> --- a/fs/kernfs/dir.c
> +++ b/fs/kernfs/dir.c
> @@ -1605,13 +1605,13 @@ static int kernfs_dir_fop_release(struct inode *inode, struct file *filp)
>
> /* Is the saved position usable? */
> if (saved) {
> + name = saved->name;
> /* Proper parent and hash? */
> if ((parent != saved->parent) || (saved->hash != hash)) {
> saved = NULL;
> - } else {
> - if (kernfs_active(saved))
> - return saved;
> - name = saved->name;
> + } else if (kernfs_active(saved)) {
> + pos = saved;
> + goto skip;
> }
> }
>
> @@ -1631,22 +1631,14 @@ static int kernfs_dir_fop_release(struct inode *inode, struct file *filp)
> break;
> }
>
> +skip:
> /* Ensure pos is at or beyond the target position */
> - if (pos && (kernfs_name_compare(hash, name, ns, pos) < 0))
> + if (pos && (kernfs_name_compare(hash, name, ns, pos) <= 0))
> pos = kernfs_dir_next(pos);
>
> return pos;
> }
>
> -static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
> - struct kernfs_node *parent, loff_t off, struct kernfs_node *start)
> -{
> - struct kernfs_node *pos = kernfs_dir_pos(ns, parent, off, start);
> - if (likely(pos == start))
> - pos = kernfs_dir_next(pos);
> - return pos;
> -}
> -
> static struct kernfs_node *kernfs_dir_skip_pos(const void *ns,
> struct kernfs_node *pos)
> {
> @@ -1672,7 +1664,7 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
>
> for (pos = kernfs_dir_pos(ns, parent, ctx->pos, saved);
> (pos = kernfs_dir_skip_pos(ns, pos));
> - pos = kernfs_dir_next_pos(ns, parent, ctx->pos, pos)) {
> + pos = kernfs_dir_pos(ns, parent, ctx->pos, pos)) {
> const char *name = pos->name;
> unsigned int type = dt_type(pos);
> int len = strlen(name);
>
>> return pos;
>> }
>>
>> static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
>> - struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos)
>> + struct kernfs_node *parent, loff_t off, struct kernfs_node *start)
>> {
>> - pos = kernfs_dir_pos(ns, parent, ino, pos);
>> - if (pos) {
>> - do {
>> - struct rb_node *node = rb_next(&pos->rb);
>> - if (!node)
>> - pos = NULL;
>> - else
>> - pos = rb_to_kn(node);
>> - } while (pos && (!kernfs_active(pos) || pos->ns != ns));
>> - }
>> + struct kernfs_node *pos = kernfs_dir_pos(ns, parent, off, start);
>> + if (likely(pos == start))
>> + pos = kernfs_dir_next(pos);
>> + return pos;
>> +}
>> +
>> +static struct kernfs_node *kernfs_dir_skip_pos(const void *ns,
>> + struct kernfs_node *pos)
>> +{
>> + /* Skip entries we don't want to return to userspace */
>> + while (pos && !(kernfs_active(pos) && (pos->ns == ns)))
>> + pos = kernfs_dir_next(pos);
>> return pos;
>> }
>>
>> @@ -1638,7 +1660,7 @@ static int kernfs_fop_readdir(struct file *file, struct
>> dir_context *ctx)
>> {
>> struct dentry *dentry = file->f_path.dentry;
>> struct kernfs_node *parent = kernfs_dentry_node(dentry);
>> - struct kernfs_node *pos = file->private_data;
>> + struct kernfs_node *pos, *saved = file->private_data;
>> const void *ns = NULL;
>>
>> if (!dir_emit_dots(file, ctx))
>> @@ -1648,23 +1670,26 @@ static int kernfs_fop_readdir(struct file *file,
>> struct dir_context *ctx)
>> if (kernfs_ns_enabled(parent))
>> ns = kernfs_info(dentry->d_sb)->ns;
>>
>> - for (pos = kernfs_dir_pos(ns, parent, ctx->pos, pos);
>> - pos;
>> + for (pos = kernfs_dir_pos(ns, parent, ctx->pos, saved);
>> + (pos = kernfs_dir_skip_pos(ns, pos));
>> pos = kernfs_dir_next_pos(ns, parent, ctx->pos, pos)) {
>> const char *name = pos->name;
>> unsigned int type = dt_type(pos);
>> int len = strlen(name);
>> ino_t ino = pos->id.ino;
>>
>> - ctx->pos = pos->hash;
>> - file->private_data = pos;
>> - kernfs_get(pos);
>> + kernfs_put(saved);
>> + saved = pos;
>> + ctx->pos = saved->hash;
>> + file->private_data = saved;
>> + kernfs_get(saved);
>>
>> mutex_unlock(&kernfs_mutex);
>> if (!dir_emit(ctx, name, len, ino, type))
>> return 0;
>> mutex_lock(&kernfs_mutex);
>> }
>> + kernfs_put(saved);
>> mutex_unlock(&kernfs_mutex);
>> file->private_data = NULL;
>> ctx->pos = INT_MAX;
>> --
>> 2.14.1
>>
>>
[email protected] (Eric W. Biederman) writes:
> "Hatayama, Daisuke" <[email protected]> writes:
>
>>> Can you test this and please verify it fixes your issue?
>>
>> I tried this patch on top of v4.17 but the system fails to boot
>> without detecting root disks by dracut like this:
[snip]
>> OTOH, there's no issue on the pure v4.17 kernel.
>>
>> As above, ls /sys/module looks apparently good. But I guess any part of
>> behavior of getdentries() on sysfs must have changed, affecting the disk
>> detection...
>
> I haven't been able to reproduce this yet. My test system boots fine.
> Which fedora are you testing on?
I reproduced something similar and fedora 28. So I think I have found
and fixed the issue. I believe I simply reversed the test at the end of
kernfs_dir_pos. AKA "<" instead of ">".
I am going to see if I can test my changes more throughly on this side
and then repost.
>>> fs/kernfs/dir.c | 109
>>> ++++++++++++++++++++++++++++++++++----------------------
>>> 1 file changed, 67 insertions(+), 42 deletions(-)
>>>
>>> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
>>> index 89d1dc19340b..8148b5fec48d 100644
>>> --- a/fs/kernfs/dir.c
>>> +++ b/fs/kernfs/dir.c
>>> @@ -1584,53 +1584,75 @@ static int kernfs_dir_fop_release(struct inode *inode,
>>> struct file *filp)
>>> return 0;
>>> }
>>>
>>> +static struct kernfs_node *kernfs_dir_next(struct kernfs_node *pos)
>>> +{
>>> + struct rb_node *node = rb_next(&pos->rb);
>>> + return node ? rb_to_kn(node) : NULL;
>>> +}
>>> +
>>> static struct kernfs_node *kernfs_dir_pos(const void *ns,
>>> - struct kernfs_node *parent, loff_t hash, struct kernfs_node *pos)
>>> + struct kernfs_node *parent, loff_t off, struct kernfs_node *saved)
>>> {
>>> - if (pos) {
>>> - int valid = kernfs_active(pos) &&
>>> - pos->parent == parent && hash == pos->hash;
>>> - kernfs_put(pos);
>>> - if (!valid)
>>> - pos = NULL;
>>> - }
>>> - if (!pos && (hash > 1) && (hash < INT_MAX)) {
>>> - struct rb_node *node = parent->dir.children.rb_node;
>>> - while (node) {
>>> - pos = rb_to_kn(node);
>>> -
>>> - if (hash < pos->hash)
>>> - node = node->rb_left;
>>> - else if (hash > pos->hash)
>>> - node = node->rb_right;
>>> - else
>>> - break;
>>> + struct kernfs_node *pos;
>>> + struct rb_node *node;
>>> + unsigned int hash;
>>> + const char *name = "";
>>> +
>>> + /* Is off a valid name hash? */
>>> + if ((off < 2) || (off >= INT_MAX))
>>> + return NULL;
>>> + hash = off;
>>> +
>>> + /* Is the saved position usable? */
>>> + if (saved) {
>>> + /* Proper parent and hash? */
>>> + if ((parent != saved->parent) || (saved->hash != hash)) {
>>> + saved = NULL;
>>
>> name is uninitialized in this path.
>
> It is. name is initialized to "" see above.
>
>>> + } else {
>>> + if (kernfs_active(saved))
>>> + return saved;
>>> + name = saved->name;
>>> }
>>> }
>>> - /* Skip over entries which are dying/dead or in the wrong namespace
>>> */
>>> - while (pos && (!kernfs_active(pos) || pos->ns != ns)) {
>>> - struct rb_node *node = rb_next(&pos->rb);
>>> - if (!node)
>>> - pos = NULL;
>>> +
>>> + /* Find the closest pos to the hash we are looking for */
>>> + pos = NULL;
>>> + node = parent->dir.children.rb_node;
>>> + while (node) {
>>> + int result;
>>> +
>>> + pos = rb_to_kn(node);
>>> + result = kernfs_name_compare(hash, name, ns, pos);
>>> + if (result < 0)
>>> + node = node->rb_left;
>>> + else if (result > 0)
>>> + node = node->rb_right;
>>> else
>>> - pos = rb_to_kn(node);
>>> + break;
>>> }
>>> +
>>> + /* Ensure pos is at or beyond the target position */
>>> + if (pos && (kernfs_name_compare(hash, name, ns, pos) < 0))
^^^^^^^^^^^^^^^^
should be > 0
>>> + pos = kernfs_dir_next(pos);
>>> +
Eric
> -----Original Message-----
> From: Eric W. Biederman [mailto:[email protected]]
> Sent: Monday, June 4, 2018 11:45 PM
> To: Hatayama, Daisuke <[email protected]>
> Cc: '[email protected]' <[email protected]>;
> '[email protected]' <[email protected]>; Okajima, Toshiyuki
> <[email protected]>; [email protected];
> '[email protected]' <[email protected]>
> Subject: Re: [CFT][PATCH] kernfs: Correct kernfs directory seeks.
>
> "Hatayama, Daisuke" <[email protected]> writes:
>
> > Hello,
> >
> > Thanks for this patch, Eric.
> >
> >> -----Original Message-----
> >> From: Eric W. Biederman [mailto:[email protected]]
> >> Sent: Monday, June 4, 2018 3:51 AM
> >> To: Hatayama, Daisuke <[email protected]>
> >> Cc: '[email protected]' <[email protected]>;
> >> '[email protected]' <[email protected]>; Okajima, Toshiyuki
> >> <[email protected]>; [email protected];
> >> '[email protected]' <[email protected]>
> >> Subject: [CFT][PATCH] kernfs: Correct kernfs directory seeks.
> >>
> >>
> >> Over the years two bugs have crept into the code that handled the last
> >> returned kernfs directory being deleted, and handled general seeking
> >> in kernfs directories. The result is that reading the /sys/module
> >> directory while moduled are loading and unloading will sometimes
> >> skip over a module that was present for the entire duration of
> >> the directory read.
> >>
> >> The first bug is that kernfs_dir_next_pos was advancing the position
> >> in the directory even when kernfs_dir_pos had found the starting
> >> kernfs directory entry was not usable. In this case kernfs_dir_pos is
> >> guaranteed to return the directory entry past the starting directory
> >> entry, making it so that advancing the directory position is wrong.
> >>
> >> The second bug is that kernfs_dir_pos when the saved kernfs_node
> >> did not validate, was not always returning a directory after
> >> the the target position, but was sometimes returning the directory
> >> entry just before the target position.
> >>
> >> To correct these issues and to make the code easily readable and
> >> comprehensible several cleanups are made.
> >>
> >> - A function kernfs_dir_next is factored out to make it straight-forward
> >> to find the next node in a kernfs directory.
> >>
> >> - A function kernfs_dir_skip_pos is factored out to skip over directory
> >> entries that should not be shown to userspace. This function is called
> >> from kernfs_fop_readdir directly removing the complication of calling
> >> it from the other directory advancement functions.
> >>
> >> - The kernfs_put of the saved directory entry is moved from kernfs_dir_pos
> >> to kernfs_fop_readdir. Keeping it with the kernfs_get when it is saved
> >> and ensuring the directory position advancment functions can reference
> >> the saved node without complications.
> >>
> >> - The function kernfs_dir_next_pos is modified to only advance the directory
> >> position in the common case when kernfs_dir_pos validates and returns
> >> the starting kernfs node. For all other cases kernfs_dir_pos is
> guaranteed
> >> to return a directory position in advance of the starting directory
> position.
> >>
> >> - kernfs_dir_pos is given a significant make over. It's essense remains
> the
> >> same but some siginficant changes were made.
> >>
> >> + The offset is validated to be a valid hash value at the very beginning.
> >> The validation is updated to handle the fact that neither 0 nor 1 are
> >> valid hash values.
> >>
> >> + An extra test is added at the end of the rbtree lookup to ensure
> >> the returned position is at or beyond the target position.
> >>
> >> + kernfs_name_compare is used during the rbtree lookup instead of just
> >> comparing
> >> the hash. This allows the the passed namespace parameter to be used,
> and
> >> if it is available from the saved entry the old saved name as well.
> >>
> >> + The validation of the saved kernfs node now checks for two cases.
> >> If the saved node is returnable.
> >> If the name of the saved node is usable during lookup.
> >>
> >> In net this should result in a easier to understand, and more correct
> >> version of directory traversal for kernfs directories.
> >>
> >> Reported-by: "Hatayama, Daisuke" <[email protected]>
> >> Fixes: a406f75840e1 ("sysfs: use rb-tree for inode number lookup")
> >> Fixes: 1e5289c97bba ("sysfs: Cache the last sysfs_dirent to improve readdir
> >> scalability v2")
> >> Signed-off-by: "Eric W. Biederman" <[email protected]>
> >> ---
> >>
> >> Can you test this and please verify it fixes your issue?
> >>
> >
> > I tried this patch on top of v4.17 but the system fails to boot
> > without detecting root disks by dracut like this:
> >
> > [ 196.121294] dracut-initqueue[324]: Warning: dracut-initqueue
> timeout - starting timeout scripts
> > [ 196.672175] dracut-initqueue[324]: Warning: dracut-initqueue
> timeout - starting timeout scripts
> > [ 197.219519] dracut-initqueue[324]: Warning: dracut-initqueue
> timeout - starting timeout scripts
> > [ 197.768997] dracut-initqueue[324]: Warning: dracut-initqueue
> timeout - starting timeout scripts
> > [ 198.312498] dracut-initqueue[324]: Warning: dracut-initqueue
> timeout - starting timeout scripts
> > [ 198.856841] dracut-initqueue[324]: Warning: dracut-initqueue
> timeout - starting timeout scripts
> > [ 199.403190] dracut-initqueue[324]: Warning: dracut-initqueue
> timeout - starting timeout scripts
> > [ 199.945999] dracut-initqueue[324]: Warning: dracut-initqueue
> timeout - starting timeout scripts
> > [ 200.490281] dracut-initqueue[324]: Warning: dracut-initqueue
> timeout - starting timeout scripts
> > [ 201.071177] dracut-initqueue[324]: Warning: dracut-initqueue
> timeout - starting timeout scripts
> > [ 201.074958] dracut-initqueue[324]: Warning: Could not boot.
> > Starting Setup Virtual Console...
> > [ OK ] Started Setup Virtual Console.
> > [ 201.245921] audit: type=1130 audit(1528132266.260:12): pid=1 uid=0
> auid=4294967295 ses=4294967295 subj=kernel msg='unit=systemd-vconsole-setup
> comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=?
> res=success'
> > [ 201.256378] audit: type=1131 audit(1528132266.260:13):
> pid=1 uid=0 auid=4294967295 ses=4294967295 subj=kernel
> msg='unit=systemd-vconsole-setup comm="systemd"
> exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
> > Starting Dracut Emergency Shell...
> > Warning: /dev/fedora/root does not exist
> > Warning: /dev/fedora/swap does not exist
> > Warning: /dev/mapper/fedora-root does not exist
> >
> > Generating "/run/initramfs/rdsosreport.txt"
> >
> >
> > Entering emergency mode. Exit the shell to continue.
> > Type "journalctl" to view system logs.
> > You might want to save "/run/initramfs/rdsosreport.txt" to a USB stick
> or /boot
> > after mounting them and attach it to a bug report.
> >
> >
> > dracut:/# ls /sys/module
> > 8139cp dm_bufio kernel psmouse ttm
> > 8139too dm_mirror keyboard pstore uhci_hcd
> > 8250 dm_mod kgdboc qemu_fw_cfg usbcore
> > acpi dm_snapshot kgdbts qxl usbhid
> > acpiphp drm libahci random uv_nmi
> > ahci drm_kms_helper libata rcupdate virtio
> > ata_generic dynamic_debug md_mod rcutree
> virtio_console
> > ata_piix edac_core mii rng_core virtio_pci
> > battery ehci_hcd module scsi_mod virtio_ring
> > block fb mousedev serio_raw vt
> > button firmware_class pata_acpi sg watchdog
> > configfs fscrypto pci_hotplug slab_common workqueue
> > cpufreq hid pcie_aspm spurious xhci_hcd
> > cpuidle hid_magicmouse pciehp sr_mod xz_dec
> > crc32c_intel hid_ntrig pcmcia srcutree zswap
> > cryptomgr i8042 pcmcia_core suspend
> > debug_core intel_idle printk sysrq
> > devres ipv6 processor tcp_cubic
> > dracut:/# lsmod
> > Module Size Used by
> > qxl 77824 1
> > drm_kms_helper 196608 1
> > ttm 126976 1
> > drm 454656 4 qxl,ttm
> > virtio_console 32768 0
> > ata_generic 16384 0
> > crc32c_intel 24576 0
> > 8139too 40960 0
> > virtio_pci 28672 0
> > 8139cp 32768 0
> > virtio_ring 28672 2 virtio_pci
> > serio_raw 16384 0
> > pata_acpi 16384 0
> > virtio 16384 2 virtio_pci
> > mii 16384 2 8139too
> > qemu_fw_cfg 16384 0
> > dracut:/#
> >
> > OTOH, there's no issue on the pure v4.17 kernel.
> >
> > As above, ls /sys/module looks apparently good. But I guess any part of
> > behavior of getdentries() on sysfs must have changed, affecting the disk
> > detection...
>
> I haven't been able to reproduce this yet. My test system boots fine.
> Which fedora are you testing on?
>
>
> >> fs/kernfs/dir.c | 109
> >> ++++++++++++++++++++++++++++++++++----------------------
> >> 1 file changed, 67 insertions(+), 42 deletions(-)
> >>
> >> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> >> index 89d1dc19340b..8148b5fec48d 100644
> >> --- a/fs/kernfs/dir.c
> >> +++ b/fs/kernfs/dir.c
> >> @@ -1584,53 +1584,75 @@ static int kernfs_dir_fop_release(struct inode
> *inode,
> >> struct file *filp)
> >> return 0;
> >> }
> >>
> >> +static struct kernfs_node *kernfs_dir_next(struct kernfs_node *pos)
> >> +{
> >> + struct rb_node *node = rb_next(&pos->rb);
> >> + return node ? rb_to_kn(node) : NULL;
> >> +}
> >> +
> >> static struct kernfs_node *kernfs_dir_pos(const void *ns,
> >> - struct kernfs_node *parent, loff_t hash, struct kernfs_node *pos)
> >> + struct kernfs_node *parent, loff_t off, struct kernfs_node *saved)
> >> {
> >> - if (pos) {
> >> - int valid = kernfs_active(pos) &&
> >> - pos->parent == parent && hash == pos->hash;
> >> - kernfs_put(pos);
> >> - if (!valid)
> >> - pos = NULL;
> >> - }
> >> - if (!pos && (hash > 1) && (hash < INT_MAX)) {
> >> - struct rb_node *node = parent->dir.children.rb_node;
> >> - while (node) {
> >> - pos = rb_to_kn(node);
> >> -
> >> - if (hash < pos->hash)
> >> - node = node->rb_left;
> >> - else if (hash > pos->hash)
> >> - node = node->rb_right;
> >> - else
> >> - break;
> >> + struct kernfs_node *pos;
> >> + struct rb_node *node;
> >> + unsigned int hash;
> >> + const char *name = "";
> >> +
> >> + /* Is off a valid name hash? */
> >> + if ((off < 2) || (off >= INT_MAX))
> >> + return NULL;
> >> + hash = off;
> >> +
> >> + /* Is the saved position usable? */
> >> + if (saved) {
> >> + /* Proper parent and hash? */
> >> + if ((parent != saved->parent) || (saved->hash != hash)) {
> >> + saved = NULL;
> >
> > name is uninitialized in this path.
>
> It is. name is initialized to "" see above.
>
Or when either of the conditions is true, it has resulted in some inconsistent state, right?
So, why not terminating this session of readdir() immediately by returning NULL just as when off is turned out to be invalid?
> >> + } else {
> >> + if (kernfs_active(saved))
> >> + return saved;
> >> + name = saved->name;
> >> }
> >> }
> >> - /* Skip over entries which are dying/dead or in the wrong namespace
> >> */
> >> - while (pos && (!kernfs_active(pos) || pos->ns != ns)) {
> >> - struct rb_node *node = rb_next(&pos->rb);
> >> - if (!node)
> >> - pos = NULL;
> >> +
> >> + /* Find the closest pos to the hash we are looking for */
> >> + pos = NULL;
> >> + node = parent->dir.children.rb_node;
> >> + while (node) {
> >> + int result;
> >> +
> >> + pos = rb_to_kn(node);
> >> + result = kernfs_name_compare(hash, name, ns, pos);
> >> + if (result < 0)
> >> + node = node->rb_left;
> >> + else if (result > 0)
> >> + node = node->rb_right;
> >> else
> >> - pos = rb_to_kn(node);
> >> + break;
> >> }
> >> +
> >> + /* Ensure pos is at or beyond the target position */
> >> + if (pos && (kernfs_name_compare(hash, name, ns, pos) < 0))
> >> + pos = kernfs_dir_next(pos);
> >> +
> >
> > Why not trying to find the final target object here?
> >
> > Looking at the code, I think the operation needed here is to find the
> > smallest active object in the same namespace from the objects larger
> > than the saved object given as argument. The saved object appears to
> > be used as cache. I think dividing the process into kernfs_dir_pos()
> > is not necessary.
>
> What you are suggesting below is wrong when restarting readdir.
>
> Ordinarily saved is the next entry we are going to return from readdir.
> When dir_emit does not succeed we stop returnning entries and when
> we restart readdir we need to attempt dir_emit on saved again.
>
> Always advancing past saved will cause us to skip saved in the
> event a single readdir is not enough.
>
> The restart (kernfs_dir_pos) must return saved or if saved is now gone,
> the first entry past saved.
>
> Saved dying in the middle is what I believe caused the original issue.
>
> Making all of this clear is part of the reason I moved the
> kernfs_dir_skip_pos logic into it's own function.
>
> Eric
Thanks for the explanation. I'm sure why my suggestion doesn't work.
>
> >
> > I mean like this:
> >
> > # git diff
> > diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> > index 8148b5f..eeffc8c 100644
> > --- a/fs/kernfs/dir.c
> > +++ b/fs/kernfs/dir.c
> > @@ -1605,13 +1605,13 @@ static int kernfs_dir_fop_release(struct inode
> *inode, struct file *filp)
> >
> > /* Is the saved position usable? */
> > if (saved) {
> > + name = saved->name;
> > /* Proper parent and hash? */
> > if ((parent != saved->parent) || (saved->hash != hash)) {
> > saved = NULL;
> > - } else {
> > - if (kernfs_active(saved))
> > - return saved;
> > - name = saved->name;
> > + } else if (kernfs_active(saved)) {
> > + pos = saved;
> > + goto skip;
> > }
> > }
> >
> > @@ -1631,22 +1631,14 @@ static int kernfs_dir_fop_release(struct inode
> *inode, struct file *filp)
> > break;
> > }
> >
> > +skip:
> > /* Ensure pos is at or beyond the target position */
> > - if (pos && (kernfs_name_compare(hash, name, ns, pos) < 0))
> > + if (pos && (kernfs_name_compare(hash, name, ns, pos) <= 0))
> > pos = kernfs_dir_next(pos);
> >
> > return pos;
> > }
> >
> > -static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
> > - struct kernfs_node *parent, loff_t off, struct kernfs_node *start)
> > -{
> > - struct kernfs_node *pos = kernfs_dir_pos(ns, parent, off, start);
> > - if (likely(pos == start))
> > - pos = kernfs_dir_next(pos);
> > - return pos;
> > -}
> > -
> > static struct kernfs_node *kernfs_dir_skip_pos(const void *ns,
> > struct kernfs_node *pos)
> > {
> > @@ -1672,7 +1664,7 @@ static int kernfs_fop_readdir(struct file *file, struct
> dir_context *ctx)
> >
> > for (pos = kernfs_dir_pos(ns, parent, ctx->pos, saved);
> > (pos = kernfs_dir_skip_pos(ns, pos));
> > - pos = kernfs_dir_next_pos(ns, parent, ctx->pos, pos)) {
> > + pos = kernfs_dir_pos(ns, parent, ctx->pos, pos)) {
> > const char *name = pos->name;
> > unsigned int type = dt_type(pos);
> > int len = strlen(name);
> >
> >> return pos;
> >> }
> >>
> >> static struct kernfs_node *kernfs_dir_next_pos(const void *ns,
> >> - struct kernfs_node *parent, ino_t ino, struct kernfs_node *pos)
> >> + struct kernfs_node *parent, loff_t off, struct kernfs_node *start)
> >> {
> >> - pos = kernfs_dir_pos(ns, parent, ino, pos);
> >> - if (pos) {
> >> - do {
> >> - struct rb_node *node = rb_next(&pos->rb);
> >> - if (!node)
> >> - pos = NULL;
> >> - else
> >> - pos = rb_to_kn(node);
> >> - } while (pos && (!kernfs_active(pos) || pos->ns != ns));
> >> - }
> >> + struct kernfs_node *pos = kernfs_dir_pos(ns, parent, off, start);
> >> + if (likely(pos == start))
> >> + pos = kernfs_dir_next(pos);
> >> + return pos;
> >> +}
> >> +
> >> +static struct kernfs_node *kernfs_dir_skip_pos(const void *ns,
> >> + struct kernfs_node *pos)
> >> +{
> >> + /* Skip entries we don't want to return to userspace */
> >> + while (pos && !(kernfs_active(pos) && (pos->ns == ns)))
> >> + pos = kernfs_dir_next(pos);
> >> return pos;
> >> }
> >>
> >> @@ -1638,7 +1660,7 @@ static int kernfs_fop_readdir(struct file *file,
> struct
> >> dir_context *ctx)
> >> {
> >> struct dentry *dentry = file->f_path.dentry;
> >> struct kernfs_node *parent = kernfs_dentry_node(dentry);
> >> - struct kernfs_node *pos = file->private_data;
> >> + struct kernfs_node *pos, *saved = file->private_data;
> >> const void *ns = NULL;
> >>
> >> if (!dir_emit_dots(file, ctx))
> >> @@ -1648,23 +1670,26 @@ static int kernfs_fop_readdir(struct file *file,
> >> struct dir_context *ctx)
> >> if (kernfs_ns_enabled(parent))
> >> ns = kernfs_info(dentry->d_sb)->ns;
> >>
> >> - for (pos = kernfs_dir_pos(ns, parent, ctx->pos, pos);
> >> - pos;
> >> + for (pos = kernfs_dir_pos(ns, parent, ctx->pos, saved);
> >> + (pos = kernfs_dir_skip_pos(ns, pos));
> >> pos = kernfs_dir_next_pos(ns, parent, ctx->pos, pos)) {
> >> const char *name = pos->name;
> >> unsigned int type = dt_type(pos);
> >> int len = strlen(name);
> >> ino_t ino = pos->id.ino;
> >>
> >> - ctx->pos = pos->hash;
> >> - file->private_data = pos;
> >> - kernfs_get(pos);
> >> + kernfs_put(saved);
> >> + saved = pos;
> >> + ctx->pos = saved->hash;
> >> + file->private_data = saved;
> >> + kernfs_get(saved);
> >>
> >> mutex_unlock(&kernfs_mutex);
> >> if (!dir_emit(ctx, name, len, ino, type))
> >> return 0;
> >> mutex_lock(&kernfs_mutex);
> >> }
> >> + kernfs_put(saved);
> >> mutex_unlock(&kernfs_mutex);
> >> file->private_data = NULL;
> >> ctx->pos = INT_MAX;
> >> --
> >> 2.14.1
> >>
> >>
>
> -----Original Message-----
> From: Eric W. Biederman [mailto:[email protected]]
> Sent: Tuesday, June 5, 2018 11:03 AM
> To: Hatayama, Daisuke <[email protected]>
> Cc: '[email protected]' <[email protected]>;
> '[email protected]' <[email protected]>; Okajima, Toshiyuki
> <[email protected]>; [email protected];
> '[email protected]' <[email protected]>
> Subject: Re: [CFT][PATCH] kernfs: Correct kernfs directory seeks.
>
> [email protected] (Eric W. Biederman) writes:
>
> > "Hatayama, Daisuke" <[email protected]> writes:
> >
> >>> Can you test this and please verify it fixes your issue?
> >>
> >> I tried this patch on top of v4.17 but the system fails to boot
> >> without detecting root disks by dracut like this:
> [snip]
>
> >> OTOH, there's no issue on the pure v4.17 kernel.
> >>
> >> As above, ls /sys/module looks apparently good. But I guess any part of
> >> behavior of getdentries() on sysfs must have changed, affecting the disk
> >> detection...
> >
> > I haven't been able to reproduce this yet. My test system boots fine.
> > Which fedora are you testing on?
>
> I reproduced something similar and fedora 28. So I think I have found
> and fixed the issue. I believe I simply reversed the test at the end of
> kernfs_dir_pos. AKA "<" instead of ">".
Though too late, I used fedora 28 and RHEL7.5.
I applied this fix to your patch and the system boot was successfully done.
I'll start testing your patch from now on.
>
> I am going to see if I can test my changes more throughly on this side
> and then repost.
>
>
>
> >>> fs/kernfs/dir.c | 109
> >>> ++++++++++++++++++++++++++++++++++----------------------
> >>> 1 file changed, 67 insertions(+), 42 deletions(-)
> >>>
> >>> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> >>> index 89d1dc19340b..8148b5fec48d 100644
> >>> --- a/fs/kernfs/dir.c
> >>> +++ b/fs/kernfs/dir.c
> >>> @@ -1584,53 +1584,75 @@ static int kernfs_dir_fop_release(struct inode
> *inode,
> >>> struct file *filp)
> >>> return 0;
> >>> }
> >>>
> >>> +static struct kernfs_node *kernfs_dir_next(struct kernfs_node *pos)
> >>> +{
> >>> + struct rb_node *node = rb_next(&pos->rb);
> >>> + return node ? rb_to_kn(node) : NULL;
> >>> +}
> >>> +
> >>> static struct kernfs_node *kernfs_dir_pos(const void *ns,
> >>> - struct kernfs_node *parent, loff_t hash, struct kernfs_node *pos)
> >>> + struct kernfs_node *parent, loff_t off, struct kernfs_node *saved)
> >>> {
> >>> - if (pos) {
> >>> - int valid = kernfs_active(pos) &&
> >>> - pos->parent == parent && hash == pos->hash;
> >>> - kernfs_put(pos);
> >>> - if (!valid)
> >>> - pos = NULL;
> >>> - }
> >>> - if (!pos && (hash > 1) && (hash < INT_MAX)) {
> >>> - struct rb_node *node = parent->dir.children.rb_node;
> >>> - while (node) {
> >>> - pos = rb_to_kn(node);
> >>> -
> >>> - if (hash < pos->hash)
> >>> - node = node->rb_left;
> >>> - else if (hash > pos->hash)
> >>> - node = node->rb_right;
> >>> - else
> >>> - break;
> >>> + struct kernfs_node *pos;
> >>> + struct rb_node *node;
> >>> + unsigned int hash;
> >>> + const char *name = "";
> >>> +
> >>> + /* Is off a valid name hash? */
> >>> + if ((off < 2) || (off >= INT_MAX))
> >>> + return NULL;
> >>> + hash = off;
> >>> +
> >>> + /* Is the saved position usable? */
> >>> + if (saved) {
> >>> + /* Proper parent and hash? */
> >>> + if ((parent != saved->parent) || (saved->hash != hash)) {
> >>> + saved = NULL;
> >>
> >> name is uninitialized in this path.
> >
> > It is. name is initialized to "" see above.
> >
> >>> + } else {
> >>> + if (kernfs_active(saved))
> >>> + return saved;
> >>> + name = saved->name;
> >>> }
> >>> }
> >>> - /* Skip over entries which are dying/dead or in the wrong namespace
> >>> */
> >>> - while (pos && (!kernfs_active(pos) || pos->ns != ns)) {
> >>> - struct rb_node *node = rb_next(&pos->rb);
> >>> - if (!node)
> >>> - pos = NULL;
> >>> +
> >>> + /* Find the closest pos to the hash we are looking for */
> >>> + pos = NULL;
> >>> + node = parent->dir.children.rb_node;
> >>> + while (node) {
> >>> + int result;
> >>> +
> >>> + pos = rb_to_kn(node);
> >>> + result = kernfs_name_compare(hash, name, ns, pos);
> >>> + if (result < 0)
> >>> + node = node->rb_left;
> >>> + else if (result > 0)
> >>> + node = node->rb_right;
> >>> else
> >>> - pos = rb_to_kn(node);
> >>> + break;
> >>> }
> >>> +
> >>> + /* Ensure pos is at or beyond the target position */
> >>> + if (pos && (kernfs_name_compare(hash, name, ns, pos) < 0))
> ^^^^^^^^^^^^^^^^
> should be > 0
> >>> + pos = kernfs_dir_next(pos);
> >>> +
>
> Eric
>
"Hatayama, Daisuke" <[email protected]> writes:
>> >> +
>> >> + /* Is the saved position usable? */
>> >> + if (saved) {
>> >> + /* Proper parent and hash? */
>> >> + if ((parent != saved->parent) || (saved->hash != hash)) {
>> >> + saved = NULL;
>> >
>> > name is uninitialized in this path.
>>
>> It is. name is initialized to "" see above.
>>
>
> Or when either of the conditions is true, it has resulted in some inconsistent state, right?
> So, why not terminating this session of readdir() immediately by
> returning NULL just as when off is turned out to be invalid?
What I have above is not the clearest, and in fact the logic could be
better.
The fundamental challenge is because hash collisions are possible a file
offset does not hold complete position information in a directory.
So the kernfs node that is to be read/displayed next is saved in the
struct file. The it is tested if the saved kernfs node is usable
for finding the location in the directory. Several things may have
gone wrong.
- Someone may have called seekdir.
- The saved kernfs node may have been renamed.
- The saved kernfs node may have been moved to a different directory in
kernfs.
- the saved kernfs node may have been deleted.
If any of those are true the code needs to do the rbtree lookup.
If the kernfs node has been deleted or moved to a different directory we
can safely use it's name while performing the rbtree lookup. Which in
the event of a hash collision will be more accurate in finding our old
location, and preventing the same directory entry being returned
multiple times.
Which is completely different than if the directory offset is an invalid
value that will never point to any directory entries.
Eric
Hello,
On Tue, Jun 05, 2018 at 10:31:36AM -0500, Eric W. Biederman wrote:
> What I have above is not the clearest, and in fact the logic could be
> better.
>
> The fundamental challenge is because hash collisions are possible a file
> offset does not hold complete position information in a directory.
>
> So the kernfs node that is to be read/displayed next is saved in the
> struct file. The it is tested if the saved kernfs node is usable
> for finding the location in the directory. Several things may have
> gone wrong.
>
> - Someone may have called seekdir.
> - The saved kernfs node may have been renamed.
> - The saved kernfs node may have been moved to a different directory in
> kernfs.
> - the saved kernfs node may have been deleted.
>
> If any of those are true the code needs to do the rbtree lookup.
So, given that the whole thing is protected by a mutex which protects
modifications, it could be an option to simply keep track of who's
iterating what and shift them on removals. IOW, always keep cursor
pointing to the next thing to visit and if that gets removed shift the
cursor to the next one.
Thanks.
--
tejun
"'[email protected]'" <[email protected]> writes:
> Hello,
>
> On Tue, Jun 05, 2018 at 10:31:36AM -0500, Eric W. Biederman wrote:
>> What I have above is not the clearest, and in fact the logic could be
>> better.
>>
>> The fundamental challenge is because hash collisions are possible a file
>> offset does not hold complete position information in a directory.
>>
>> So the kernfs node that is to be read/displayed next is saved in the
>> struct file. The it is tested if the saved kernfs node is usable
>> for finding the location in the directory. Several things may have
>> gone wrong.
>>
>> - Someone may have called seekdir.
>> - The saved kernfs node may have been renamed.
>> - The saved kernfs node may have been moved to a different directory in
>> kernfs.
>> - the saved kernfs node may have been deleted.
>>
>> If any of those are true the code needs to do the rbtree lookup.
>
> So, given that the whole thing is protected by a mutex which protects
> modifications, it could be an option to simply keep track of who's
> iterating what and shift them on removals. IOW, always keep cursor
> pointing to the next thing to visit and if that gets removed shift the
> cursor to the next one.
Yes. We could.
The primary case we have to worry about is someone using seekdir,
and for that we always need the rbtree lookup. For seekdir
we could invalidate the saved entry and make things simpler
that way.
We could add list_head to the kernfs_node and create:
struct kernfs_dir_file {
struct list_head entry;
struct kernfs_node *kn;
}
And point at that from struct file->private_data.
I don't know if it would be worth the trouble to do that over a quick
check to make certain the kernfs_node is what it is expected to be.
But that is an option.
Part of the pain of supporting seekdir is that the offset we expose
to userspace in has to be 32bit to support 32bit userspace applications.
Which unfortunately is small enough that if nothing else a name
collision can be brute forced. So we can not avoid handling collisions.
Sigh, I have found another issue with kernfs_fop_readdir.
We are not currently protecting file->private_data with the kernfs_mutex
or any other kind of serialization. Which means if two processes are
calling readdir on the same file descriptor we might get unpredictable
behavior.
It doesn't look too bad and easy enough to fix, but definitely something
to be watchful of.
Eric
On Tue, Jun 05, 2018 at 12:47:33PM -0500, Eric W. Biederman wrote:
> Sigh, I have found another issue with kernfs_fop_readdir.
>
> We are not currently protecting file->private_data with the kernfs_mutex
> or any other kind of serialization. Which means if two processes are
> calling readdir on the same file descriptor we might get unpredictable
> behavior.
>
> It doesn't look too bad and easy enough to fix, but definitely something
> to be watchful of.
As discussed off-list - this is not a problem; getdents() et.al. are
serialized on per-struct-file basis by fdget_pos() in relevant syscalls,
since all directories automatically get FMODE_ATOMIC_POS in ->f_mode.