LinuxLists.cc - Re: sysfs: tagged directories not merged completely yet

[permalink] [raw]

Subject: Re: sysfs: tagged directories not merged completely yet

On Fri, Oct 03, 2008 at 11:13:31AM +0100, Al Viro wrote:
> On Tue, Sep 23, 2008 at 11:23:57AM -0700, Eric W. Biederman wrote:
> > Benjamin Thery <[email protected]> writes:
> > >
> > > Oh.
> > > It's a pity Al couldn't re-review them before. We've already lost a lot
> > > of time with this patchset and it's blocking easier testing of network
> > > namespaces (right now, with a mainline kernel, we have to disable sysfs
> > > to build network namespaces).
> >
> > I am confident that we have a good base with these patches and the rest of
> > the work can be done incrementally on top of them if any issues turn up.
> >
> > Al recent rework of sysctl has a very similar structure.
>
> No, it does not. My apologies for delay, but here are more printable parts
> of review.

<snip>

Al, thanks a lot for the review, I really appreciate it.

Eric, I've removed the following patches from my tree so you can rework
them, if you want to.

driver-core/sysfs-support-for-preventing-unmounts.patch
driver-core/sysfs-sysfs_get_dentry-add-a-sb-parameter.patch
driver-core/sysfs-implement-__sysfs_get_dentry.patch
driver-core/sysfs-rename-support-multiple-superblocks.patch
driver-core/sysfs-introduce-sysfs_sd_setattr-and-fix-sysfs_chmod.patch
driver-core/sysfs-sysfs_chmod_file-handle-multiple-superblocks.patch
driver-core/sysfs-implement-sysfs-tagged-directory-support.patch
driver-core/sysfs-merge-sysfs_rename_dir-and-sysfs_move_dir.patch
driver-core/sysfs-implement-sysfs_delete_link-and-sysfs_rename_link.patch
driver-core/driver-core-implement-tagged-directory-support-for-device-classes.patch
driver-core/sysfs-remove-sysfs_create_link_nowarn.patch
driver-core/sysfs-revert-netns-fix-device-renaming-for-sysfs.patch
driver-core/netns-enable-tagging-for-net_class-directories-in-sysfs.patch
driver-core/sysfs-user-namespaces-fix-bug-with-clone-with-fairsched.patch

thanks,

greg k-h

2008-10-07 08:16:20

[permalink] [raw]

Subject: Re: sysfs: tagged directories not merged completely yet

Al Viro <[email protected]> writes:

> On Tue, Sep 23, 2008 at 11:23:57AM -0700, Eric W. Biederman wrote:
>> Benjamin Thery <[email protected]> writes:
>> >
>> > Oh.
>> > It's a pity Al couldn't re-review them before. We've already lost a lot
>> > of time with this patchset and it's blocking easier testing of network
>> > namespaces (right now, with a mainline kernel, we have to disable sysfs
>> > to build network namespaces).
>>
>> I am confident that we have a good base with these patches and the rest of
>> the work can be done incrementally on top of them if any issues turn up.
>>
>> Al recent rework of sysctl has a very similar structure.

Al thank you for taking a fresh look at sysfs.

For the rest of the readers. Most of the bugs and what seem to
me to be the most significant bugs that Al is seeing existing
without my sysfs tagged directory patchset.

> No, it does not. My apologies for delay, but here are more printable parts
> of review.

The concept of a single directory tree in the basic data structure
where only some parts show up depending upon the namespace you are in
conceptually the same, and that is what I was referring to.

I'm going to attempt to step back a bit and see if I can convert this into
a constructive exchange. Picking on the nasty details without discussing
the reasons for those details won't get us anywhere.

>From the point of view of the VFS sysfs is a distributed filesystem.

As a distributed filesystem we don't always go through the VFS sanity
checks when a filesystem change occurs, and only update the VFS caches
to the state of the change.

As a distributed filesystem sysfs internally needs to provide all of
the locking necessary to ensure the sysfs data structures are consistent.

Currently there are two basic paths into sysfs:
- Directly to the kernel internal state to the internal sysfs state.
This is where all mutation happens.
- Read-only through the VFS.

In the case of directory structure mutation, the mutation has happened
and the VFS simply must be told about it.

Which means that in a sane and simple design of a global filesystem
the locking would work was follows:

o Obtain filesystem lock
o Modify filesystem internal state.
o For each mounted instance
o Obtain VFS locks
o Update the mounted instance.
o Release VFS locks
o Release filesystem lock.

The above is simple and allows the same locking ordering for all accesses.

Unfortunately the VFS does not give us the opportunity to takes the locks
in the proper order. And it insists on grabbing a VFS lock before we can
grab a global lock.

Which in my case leads to weirdness like:

> a) You do *not* share struct inode between the superblocks, for fsck sake!

I can't currently use separate inodes because of the current locking of sysfs,
because by the current VFS limitations. In particular:

The currently locking order is:
mutex_lock(&inode->i_mutex);
mutex_lock(&sysfs_mutex);

If I use separate inodes that locking order can not be maintained. Frankly I would
love to change this.

> d) You REALLY, REALLY do not unhash busy dentries of directories.

Quiet frankly if the VFS requires the dcache to be out of sync
with the filesystem that is a VFS bug.

If the filesystem deletes the directory then it should be unhashed
so it can not be looked up.

Currently I know unhashing a busy directory potentially causes
mount leaks. Are you thinking of anything else?

> * may I politely suggest that
> again:
> mutex_lock(&A);
> if (!mutex_trylock(&B)) {
> mutex_unlock(&A);
> goto again;
> }
>
> is somewhat, er, deficient way to deal with buggered locking hierarchy?

Nah. In this case just a silly oversight. It is a trivial fix.

> Not to mention anything else, that's obviously FUBAR on UP box - if we
> have B contended, we've just killed the box dead since we'll never give
> the CPU back to whoever happens to hold B. See sysfs_mv_dir() for a lovely
> example.

Good point about UP deadlocks.

It shouldn't be hard to do the equivalent of lock_rename() and unlock_rename().

> c) You do *not* change dentry tree topology without s_vfs_rename_mutex on
> affected superblock. That, BTW, is broken in mainline sysfs as well.

Hardly. sb->s_vfs_rename_mutex only serializes against renames in the VFS.
As such it isn't useful to protect the sysfs data structures. That is
the job of sysfs_rename_mutex. At which point sb->s_vfs_rename_mutex
is redundant.

The only places I can see where s_vfs_rename_mutex is taken is in
lock_rename(), unlock_rename() and d_materialise_unique().
d_materialise_unique() only gets called by NFS. lock_rename() and
unlock_rename() only get called from renameat which always fails in the
case of sysfs and we aren't doing anything silly like hard links to
directories so I don't see a way to support that path.

> b) You do *not* grab i_mutex on ancestors after having grabbed it on
> file, as sysfs_chmod_file() does.

Ouch! Good point. That is simple enough to solve by reordering the
operations.

> In addition to that, there are interesting internal problems:
> * inumbers are released by final sysfs_put(); that can happen before the
> final iput() on corresponding inode. Guess what happens if new entry is
> created in the meanwhile, reuses the same inumber and lookup gets to
> sysfs_get_inode() on it?

Cute. Sounds like we need to put a reference to the sysfs_dirent into the inode.

> * sysfs_count_nlink() is called from sysfs_fill_super() without sysfs_mutex;
> now this sucker can get called at any moment.

> * just what is protecting ->s_tag?

sd->s_tag can only be changed by sysfs_mv_dir so it should share the same locking
as anything else that can be renamed.

> * __sysfs_remove_dir() appears to assume that subdirectories are possible;
> AFAICS, if we *do* get them, we get very screwed after remove_dir().

I haven't touched that one, and I am tired tonight. Are you thinking
of something in addition to an early sysfs_put and a

> * everything else aside, the internal locking is extremely heavy. For
> fsck sake, guys, a single system-wide mutex that can be grabbed for the
> duration of readdir on any directory and block just about anything
> in the filesystem? Just mmap() something over NFS on a slow link and
> do getdents() to such buffer. Watch a *lot* of stuff getting buggered.
> Hell, you can't even do ifconfig up while that sucker is held...

I don't see where we come anywhere close to sysfs for running ifconfig up.

Certainly ifconfig up works fine with sysfs compiled out and I haven't
spotted the path from the networking code to sysfs in that instance.
sysfs really only shows up in adding and removing network interfaces
in my experience.

> Seriously, people, it's getting worse than devfs had ever been ;-/

I haven't looked at devfs so I wouldn't know. I just know it feels
futile a lot of time to try and improve sysfs.

Eric

2008-10-07 08:36:14

[permalink] [raw]

Subject: Re: sysfs: tagged directories not merged completely yet

Greg KH <[email protected]> writes:

> Al, thanks a lot for the review, I really appreciate it.

> Eric, I've removed the following patches from my tree so you can rework
> them, if you want to.

Well what Al focused on has very little to do with tagged directories
and mostly do with sysfs itself. So my current plan is to write
incremental patches that fix sysfs_chmod_file and sysfs_mv_dir,
in the next couple of hours and call it good for the moment.

Unless someone will give an example of how having multiple superblocks
sharing inodes is a problem in practice for sysfs and call it good
for 2.6.28. Certainly it shouldn't be an issue if the network namespace
code is compiled out. And it should greatly improve testing of the
network namespace to at least have access to sysfs.

Later Tejun or I or possibly someone else who cares can go back
and simplify the sysfs locking to remove the need for multiple
superblocks sharing inodes, and to address the other big nasties in
the current sysfs implementation.

Greg I agree with Al that sysfs isn't perfect but we sure aren't going
to fix it if you keep dropping or taking years to merge every patch
from the people working on it, and then dropping those patches because
someone frowns at them.

Eric

2008-10-07 09:01:50

by Daniel Lezcano

[permalink] [raw]

Subject: Re: sysfs: tagged directories not merged completely yet

Al Viro wrote:
> On Tue, Sep 23, 2008 at 11:23:57AM -0700, Eric W. Biederman wrote:
>> Benjamin Thery <[email protected]> writes:
>>> Oh.
>>> It's a pity Al couldn't re-review them before. We've already lost a lot
>>> of time with this patchset and it's blocking easier testing of network
>>> namespaces (right now, with a mainline kernel, we have to disable sysfs
>>> to build network namespaces).
>> I am confident that we have a good base with these patches and the rest of
>> the work can be done incrementally on top of them if any issues turn up.
>>
>> Al recent rework of sysctl has a very similar structure.
>
> No, it does not. My apologies for delay, but here are more printable parts
> of review.
>
> First of all, this stuff breaks just about every damn integrity constraint VFS
> knows of. It tries to tiptoe through the resulting minefield - without
> success. So the first group of comments will be of "you *really* don't
> do $FOO" variety. I'm very far from being convinced that we want to
> special-case in VFS every kind of weirdness sysfs happens to do; in effect,
> that would require adding a FS_IS_SYSFS_SO_BEND_OVER fs type flag and making
> a lot of locking conditional on that.
>
> a) You do *not* share struct inode between the superblocks, for fsck sake!
> b) You do *not* grab i_mutex on ancestors after having grabbed it on
> file, as sysfs_chmod_file() does.
> c) You do *not* change dentry tree topology without s_vfs_rename_mutex on
> affected superblock. That, BTW, is broken in mainline sysfs as well.
> d) You REALLY, REALLY do not unhash busy dentries of directories.
>
> In addition to that, there are interesting internal problems:
> * inumbers are released by final sysfs_put(); that can happen before the
> final iput() on corresponding inode. Guess what happens if new entry is
> created in the meanwhile, reuses the same inumber and lookup gets to
> sysfs_get_inode() on it?
> * may I politely suggest that
> again:
> mutex_lock(&A);
> if (!mutex_trylock(&B)) {
> mutex_unlock(&A);
> goto again;
> }
> is somewhat, er, deficient way to deal with buggered locking hierarchy?
> Not to mention anything else, that's obviously FUBAR on UP box - if we
> have B contended, we've just killed the box dead since we'll never give
> the CPU back to whoever happens to hold B. See sysfs_mv_dir() for a lovely
> example.
> * sysfs_count_nlink() is called from sysfs_fill_super() without sysfs_mutex;
> now this sucker can get called at any moment.
> * just what is protecting ->s_tag?
> * __sysfs_remove_dir() appears to assume that subdirectories are possible;
> AFAICS, if we *do* get them, we get very screwed after remove_dir().
> * everything else aside, the internal locking is extremely heavy. For
> fsck sake, guys, a single system-wide mutex that can be grabbed for the
> duration of readdir on any directory and block just about anything
> in the filesystem? Just mmap() something over NFS on a slow link and
> do getdents() to such buffer. Watch a *lot* of stuff getting buggered.
> Hell, you can't even do ifconfig up while that sucker is held...
>
> Seriously, people, it's getting worse than devfs had ever been ;-/

Thank you Al for reviewing the patchset.

-- Daniel

2008-10-07 09:16:42

[permalink] [raw]

Subject: Re: sysfs: tagged directories not merged completely yet

Hello, Al.

Thanks for reviewing. I'll reply to what I can from the top of my
head for now.

Al Viro wrote:
> c) You do *not* change dentry tree topology without s_vfs_rename_mutex on
> affected superblock. That, BTW, is broken in mainline sysfs as well.

Hmmm...

/*
* The next field is for VFS *only*. No filesystems have any business
* even looking at it. You had been warned.
*/
struct mutex s_vfs_rename_mutex; /* Kludge */

How is a distributed filesystem supposed to do this?

> In addition to that, there are interesting internal problems:
> * inumbers are released by final sysfs_put(); that can happen before the
> final iput() on corresponding inode. Guess what happens if new entry is
> created in the meanwhile, reuses the same inumber and lookup gets to
> sysfs_get_inode() on it?

Heh... right. Will fix.

> * may I politely suggest that
> again:
> mutex_lock(&A);
> if (!mutex_trylock(&B)) {
> mutex_unlock(&A);
> goto again;
> }
> is somewhat, er, deficient way to deal with buggered locking hierarchy?
> Not to mention anything else, that's obviously FUBAR on UP box - if we
> have B contended, we've just killed the box dead since we'll never give
> the CPU back to whoever happens to hold B. See sysfs_mv_dir() for a lovely
> example.

Yeah, that was dumb. I should have just used inode_double_lock()
there.

> * sysfs_count_nlink() is called from sysfs_fill_super() without sysfs_mutex;
> now this sucker can get called at any moment.

In mainline, being single sb fs, it was okay. Yeah, it now needs to
be fixed.

> * __sysfs_remove_dir() appears to assume that subdirectories are possible;
> AFAICS, if we *do* get them, we get very screwed after remove_dir().

sysfs currently requires its users to remove all the children before
removing a directory, which is a bit dumb. I once posted patches to
make remove recursive and am still planning on refreshing and posting
them.

> * everything else aside, the internal locking is extremely heavy. For
> fsck sake, guys, a single system-wide mutex that can be grabbed for the
> duration of readdir on any directory and block just about anything
> in the filesystem? Just mmap() something over NFS on a slow link and
> do getdents() to such buffer. Watch a *lot* of stuff getting buggered.
> Hell, you can't even do ifconfig up while that sucker is held...

Ah.. right, filler can be an issue but all other operations are from
memory to memory, there just isn't much point in elaborating its
locking. Memory usage has been a much bigger problem and making
locking per sysfs_dirent increases memory consumption and on
crazy-large irons things like that become quite noticeable.

If the filler is a real concern, I think it's better to just decouple
it rather than making sysfs locking fine-grained. sysfs metadata
might as well be protected by a single spinlock if it can be decoupled
from vfs locking and stuff. It's just an in-memory tree which isn't
used too often.

> Seriously, people, it's getting worse than devfs had ever been ;-/

I don't know about devfs either but sysfs today is in much better
shape than say about one and half year ago. Triggering deadlocks and
oopses from userland was quite easy back then.

Generally, the VFS layer isn't too easy for sysfs which is a bit like
distributed filesystem but has more strict here-and-now rule (all
changes should be visible instantaneously). At the beginning, sysfs
didn't have much metadata itself, it just used the VFS data structures
but that was too large so sysfs_dirent got introduced and it tried to
update VFS data structures as necessary and (this is when I started
working on it) the current code and Eric's patcheset evolved from
there.

Maybe it can be done better by taking more traditional distributed
filesystem approach - re/invalidation on access. I don't know whether
it will fit sysfs's needs but if it can be done, sysfs would be able
to ride along with other distributed filesystems and become much more
conventional in its interfacing with VFS.

Thanks.

--
tejun

2008-10-07 10:56:18

[permalink] [raw]

Subject: [PATCH 0/3] minor sysfs tagged directory fixes

This patches are much more reviewable as incremental fixes than as a
respun patchset. So on top of my existing patchset.

Eric W. Biederman (3):
sysfs: Remove lock ordering violation in sysfs_chmod_file.
sysfs: Fix and sysfs_mv_dir by using lock_rename.
sysfs: Take sysfs_mutex when fetching the root inode.

2008-10-07 10:56:31

[permalink] [raw]

Subject: [PATCH 3/3] sysfs: Take sysfs_mutex when fetching the root inode.

sysfs_get_inode ultimately calls sysfs_count_nlink when the a
directory inode is fectched. sysfs_count_nlink needs to be called
under the sysfs_mutex to guard against the unlikely but possible
scenario that the root directory is changing as we are counting the
number entries in it.

Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/sysfs/mount.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
index 8f2237a..59e55f0 100644
--- a/fs/sysfs/mount.c
+++ b/fs/sysfs/mount.c
@@ -65,7 +65,9 @@ static int sysfs_fill_super(struct super_block *sb, void *data, int silent,

/* get root inode, initialize and unlock it */
error = -ENOMEM;
+ mutex_lock(&sysfs_mutex);
inode = sysfs_get_inode(&sysfs_root);
+ mutex_unlock(&sysfs_mutex);
if (!inode) {
pr_debug("sysfs: could not get root inode\n");
goto out_err;
--
1.5.3.rc6.17.g1911

2008-10-07 10:57:00

[permalink] [raw]

Subject: [PATCH 2/3] sysfs: Fix and sysfs_mv_dir by using lock_rename.

sysfs_mv_dir picks an arbitrary mutex lock ordering and busy waits
until it can grab the inode mutexes in that order. This can dead lock
on UP systems.

Instead use the generic lock_rename that guarantees we grab parent
directories before child directories, so there is not need to spin and
no chance of deadlock.

Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/sysfs/dir.c | 43 +++++++++++++++++++------------------------
1 files changed, 19 insertions(+), 24 deletions(-)

diff --git a/fs/sysfs/dir.c b/fs/sysfs/dir.c
index a76fb54..94f2ed1 100644
--- a/fs/sysfs/dir.c
+++ b/fs/sysfs/dir.c
@@ -955,12 +955,22 @@ err_out:
return error;
}

+static void sysfs_lock_rename(struct sysfs_rename_struct *srs)
+{
+ lock_rename(srs->old_parent, srs->new_parent);
+}
+
+static void sysfs_unlock_rename(struct sysfs_rename_struct *srs)
+{
+ unlock_rename(srs->old_parent, srs->new_parent);
+}
+
static int sysfs_mv_dir(struct sysfs_dirent *sd,
struct sysfs_dirent *new_parent_sd, const char *new_name)
{
struct list_head todo;
struct sysfs_rename_struct *srs;
- struct inode *old_parent_inode = NULL, *new_parent_inode = NULL;
+ struct sysfs_dirent *old_parent_sd;
const char *dup_name = NULL;
const void *old_tag, *tag;
int error;
@@ -971,11 +981,12 @@ static int sysfs_mv_dir(struct sysfs_dirent *sd,
if (!new_parent_sd)
new_parent_sd = &sysfs_root;

+ old_parent_sd = sysfs_get(sd->s_parent);
old_tag = sd->s_tag;
- tag = sysfs_creation_tag(sd->s_parent, sd);
+ tag = sysfs_creation_tag(old_parent_sd, sd);

error = 0;
- if ((sd->s_parent == new_parent_sd) && (old_tag == tag) &&
+ if ((old_parent_sd == new_parent_sd) && (old_tag == tag) &&
(strcmp(sd->s_name, new_name) == 0))
goto out; /* nothing to do */

@@ -986,22 +997,8 @@ static int sysfs_mv_dir(struct sysfs_dirent *sd,
goto out_release;
}

- error = -ENOMEM;
- mutex_lock(&sysfs_mutex);
- old_parent_inode = sysfs_get_inode(sd->s_parent);
- new_parent_inode = sysfs_get_inode(new_parent_sd);
- mutex_unlock(&sysfs_mutex);
- if (!old_parent_inode || !new_parent_inode)
- goto out_release;
-
-again:
- mutex_lock(&old_parent_inode->i_mutex);
- if (old_parent_inode != new_parent_inode) {
- if (!mutex_trylock(&new_parent_inode->i_mutex)) {
- mutex_unlock(&old_parent_inode->i_mutex);
- goto again;
- }
- }
+ if (!list_empty(&todo))
+ sysfs_lock_rename(list_first_entry(&todo, struct sysfs_rename_struct, list));
mutex_lock(&sysfs_mutex);

error = -EEXIST;
@@ -1048,17 +1045,15 @@ again:
error = 0;
out_unlock:
mutex_unlock(&sysfs_mutex);
- if (new_parent_inode != old_parent_inode)
- mutex_unlock(&new_parent_inode->i_mutex);
- mutex_unlock(&old_parent_inode->i_mutex);
+ if (!list_empty(&todo))
+ sysfs_unlock_rename(list_first_entry(&todo, struct sysfs_rename_struct, list));
kfree(dup_name);

out_release:
- iput(new_parent_inode);
- iput(old_parent_inode);
post_rename(&todo);
sysfs_release_supers();
out:
+ sysfs_put(old_parent_sd);
mutex_unlock(&sysfs_rename_mutex);
return error;
}
--
1.5.3.rc6.17.g1911

2008-10-07 10:56:46

[permalink] [raw]

Subject: [PATCH 1/3] sysfs: Remove lock ordering violation in sysfs_chmod_file.

It is a wee bit subtle but sysfs_get_dentry grabs inode->i_mutex.
of potentially all of the parents of sd. So I can not hold
the inode mutex of the directory while it is called.

Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/sysfs/file.c | 37 ++++++++++++++++---------------------
1 files changed, 16 insertions(+), 21 deletions(-)

diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c
index 091c0de..8b572b6 100644
--- a/fs/sysfs/file.c
+++ b/fs/sysfs/file.c
@@ -582,7 +582,6 @@ int sysfs_chmod_file(struct kobject *kobj, struct attribute *attr, mode_t mode)
{
struct sysfs_dirent *victim_sd = NULL;
struct super_block *sb;
- struct inode * inode = NULL;
struct iattr newattrs;
int rc;

@@ -591,42 +590,38 @@ int sysfs_chmod_file(struct kobject *kobj, struct attribute *attr, mode_t mode)
if (!victim_sd)
goto out;

- rc = -ENOENT;
- mutex_lock(&sysfs_mutex);
- inode = sysfs_get_inode(victim_sd);
- mutex_unlock(&sysfs_mutex);
- if (!inode)
- goto out;
-
mutex_lock(&sysfs_rename_mutex);
sysfs_grab_supers();
- mutex_lock(&inode->i_mutex);
-
- newattrs.ia_mode = (mode & S_IALLUGO) | (inode->i_mode & ~S_IALLUGO);
- newattrs.ia_valid = ATTR_MODE | ATTR_CTIME;
- newattrs.ia_ctime = current_fs_time(inode->i_sb);
- rc = sysfs_sd_setattr(victim_sd, inode, &newattrs);
- if (rc)
- goto out_unlock;

list_for_each_entry(sb, &sysfs_fs_type.fs_supers, s_instances) {
+ struct dentry * victim;
+ struct inode * inode;
+
/* Ignore it when the dentry does not exist on the
* target superblock.
*/
- struct dentry * victim = sysfs_get_dentry(sb, victim_sd);
+ mutex_lock(&sysfs_mutex);
+ victim = sysfs_get_dentry(sb, victim_sd);
+ mutex_unlock(&sysfs_mutex);
if (IS_ERR(victim))
continue;

- fsnotify_change(victim, newattrs.ia_valid);
+ inode = victim->d_inode;
+ mutex_lock(&inode->i_mutex);
+ newattrs.ia_mode = (mode & S_IALLUGO) | (inode->i_mode & ~S_IALLUGO);
+ newattrs.ia_valid = ATTR_MODE | ATTR_CTIME;
+ newattrs.ia_ctime = current_fs_time(inode->i_sb);
+ rc = sysfs_sd_setattr(victim_sd, inode, &newattrs);
+ if (!rc)
+ fsnotify_change(victim, newattrs.ia_valid);
+
+ mutex_unlock(&inode->i_mutex);
dput(victim);
}

- out_unlock:
- mutex_unlock(&inode->i_mutex);
sysfs_release_supers();
mutex_unlock(&sysfs_rename_mutex);
out:
- iput(inode);
sysfs_put(victim_sd);
return rc;
}
--
1.5.3.rc6.17.g1911

2008-10-07 12:06:18

[permalink] [raw]

Subject: Re: sysfs: tagged directories not merged completely yet

> If the filler is a real concern, I think it's better to just decouple
> it rather than making sysfs locking fine-grained. sysfs metadata
> might as well be protected by a single spinlock if it can be decoupled
> from vfs locking and stuff. It's just an in-memory tree which isn't
> used too often.

I think with a little care we can make the sysfs read side rcu
protected which would remove any real locking from lookup
and readdir.

> Generally, the VFS layer isn't too easy for sysfs which is a bit like
> distributed filesystem but has more strict here-and-now rule (all
> changes should be visible instantaneously). At the beginning, sysfs
> didn't have much metadata itself, it just used the VFS data structures
> but that was too large so sysfs_dirent got introduced and it tried to
> update VFS data structures as necessary and (this is when I started
> working on it) the current code and Eric's patcheset evolved from
> there.

> Maybe it can be done better by taking more traditional distributed
> filesystem approach - re/invalidation on access. I don't know whether
> it will fit sysfs's needs but if it can be done, sysfs would be able
> to ride along with other distributed filesystems and become much more
> conventional in its interfacing with VFS.

The revalidate on access model doesn't appear to have a way to track
remote renames. Something sysfs supports.

I have just spent a little bit of time thinking it through. I had previously
thought that we could take advantage of the fact that sysfs only allows
VFS reads we could fix our backwards lock ordering by optimizing the read
side with rcu. Unfortunately the VFS still takes locks on rename and similar
paths despite the fact sysfs does not implement those paths functions. Therefore
whatever we do has to be handle all VFS operations even if we don't support
them. Weird, but true.

We may need to delay dentry unhashing until revalidate. I think I see
some issues if we don't do that.

Eric

2008-10-07 12:21:55

[permalink] [raw]

Subject: Re: sysfs: tagged directories not merged completely yet

Eric W. Biederman wrote:
>> If the filler is a real concern, I think it's better to just decouple
>> it rather than making sysfs locking fine-grained. sysfs metadata
>> might as well be protected by a single spinlock if it can be decoupled
>> from vfs locking and stuff. It's just an in-memory tree which isn't
>> used too often.
>
> I think with a little care we can make the sysfs read side rcu
> protected which would remove any real locking from lookup
> and readdir.

IIRC, the original readdir implementation put a cursor entry to walk
through the children list. The implementation was horribly broken in
a number of different ways (ISTR problems with locking and multiple
and different type of walkers) and I just gutted out all the
complexity out and made it simple as getting it correct was far more
important and there seemed to be little need for optimization.

Yeah, using RCU sounds like a plan.

>> Generally, the VFS layer isn't too easy for sysfs which is a bit like
>> distributed filesystem but has more strict here-and-now rule (all
>> changes should be visible instantaneously). At the beginning, sysfs
>> didn't have much metadata itself, it just used the VFS data structures
>> but that was too large so sysfs_dirent got introduced and it tried to
>> update VFS data structures as necessary and (this is when I started
>> working on it) the current code and Eric's patcheset evolved from
>> there.
>>
>> Maybe it can be done better by taking more traditional distributed
>> filesystem approach - re/invalidation on access. I don't know whether
>> it will fit sysfs's needs but if it can be done, sysfs would be able
>> to ride along with other distributed filesystems and become much more
>> conventional in its interfacing with VFS.
>
> The revalidate on access model doesn't appear to have a way to track
> remote renames. Something sysfs supports.

Yeap, IIRC, one of the reasons why sysfs wasn't converted over to
sysfs was because sysfs guarantees inode doesn't change over rename or
move so that notifications keep working over renames.

> I have just spent a little bit of time thinking it through. I had
> previously thought that we could take advantage of the fact that
> sysfs only allows VFS reads we could fix our backwards lock ordering
> by optimizing the read side with rcu. Unfortunately the VFS still
> takes locks on rename and similar paths despite the fact sysfs does
> not implement those paths functions. Therefore whatever we do has
> to be handle all VFS operations even if we don't support them.
> Weird, but true.
>
> We may need to delay dentry unhashing until revalidate. I think I see
> some issues if we don't do that.

Ah... okay. It shouldn't be difficult, right?

Thanks.

--
tejun

2008-10-07 21:20:26

by Dave Hansen

[permalink] [raw]

Subject: Re: [PATCH 1/3] sysfs: Remove lock ordering violation in sysfs_chmod_file.

On Tue, 2008-10-07 at 03:49 -0700, Eric W. Biederman wrote:
> It is a wee bit subtle but sysfs_get_dentry grabs inode->i_mutex.
> of potentially all of the parents of sd. So I can not hold
> the inode mutex of the directory while it is called.

Hi Eric,

This patch looks good to me. In my quest to parse exactly what was
going on, I rewrote a description of the patch. Could you look over
this and see if I'm on target?

--

Before this patch, inode->i_mutex is held in order to keep the inode's
mode and ctime from changing out underneath us. If we didn't do this,
you could potentially get garbage when reading them out of the old
inode. We calculated these new permissions once since it is redundant
to do it several times.

We also need to perform a sysfs_get_dentry() operation on the
sysfs_dirent in order to find all the dentries on each sb. This needs
to be performed once for each sb in which the inode appears. "[B]ut
sysfs_get_dentry grabs inode->i_mutex. of potentially all of the parents
of sd. So I can not hold the inode mutex of the directory while it is
called."

This patch drops the inode->i_mutex over the entire "for each
sysfs_dirent" loop. It, instead, reacquires and drops it each time we
need to calculate the new mode/ctime for the target dentries. It does
this away from the sysfs_get_dentry() call now.

-- Dave

2008-10-07 21:24:44

by Dave Hansen

[permalink] [raw]

Subject: Re: [PATCH 2/3] sysfs: Fix and sysfs_mv_dir by using lock_rename.

On Tue, 2008-10-07 at 03:51 -0700, Eric W. Biederman wrote:
> sysfs_mv_dir picks an arbitrary mutex lock ordering and busy waits
> until it can grab the inode mutexes in that order. This can dead lock
> on UP systems.
>
> Instead use the generic lock_rename that guarantees we grab parent
> directories before child directories, so there is not need to spin and
> no chance of deadlock.

This is a vast improvement over what was there before. Nice!

-- Dave

2008-10-07 22:31:19

[permalink] [raw]

Subject: Re: sysfs: tagged directories not merged completely yet

On Tue, Oct 07, 2008 at 01:27:17AM -0700, Eric W. Biederman wrote:
> Unless someone will give an example of how having multiple superblocks
> sharing inodes is a problem in practice for sysfs and call it good
> for 2.6.28. Certainly it shouldn't be an issue if the network namespace
> code is compiled out. And it should greatly improve testing of the
> network namespace to at least have access to sysfs.

But if the network namespace code is in? THen we have problems, right?
And that's the whole point here.

The fact that you are trying to limit userspace view of in-kernel data
structures, based on that specific user, is, in my opinion, crazy.

Why not just keep all users from seeing sysfs, and then have a user
daemon doing something on top of FUSE if you really want to see this
kind of stuff.

The "leakage" just seems too hard to stop.

> Later Tejun or I or possibly someone else who cares can go back
> and simplify the sysfs locking to remove the need for multiple
> superblocks sharing inodes, and to address the other big nasties in
> the current sysfs implementation.

I know how the whole "we'll go back later and fix it up" stuff works,
I've used that excuse too many times in the past myself. Never happens
:)

> Greg I agree with Al that sysfs isn't perfect but we sure aren't going
> to fix it if you keep dropping or taking years to merge every patch
> from the people working on it, and then dropping those patches because
> someone frowns at them.

"years"? Come on, these did take a while due to travel and other stuff.
These are core kernel changes, and need time to ensure that they work
properly, and get the proper review from people who understand this kind
of stuff.

And to call Al a generic "someone", is just rude and disrespectful. I
trust his opinion in this area far more than I do yours, to be honest.

This whole series is dropped, if you want to resubmit them, feel free
to, _after_ adressing his issues.

bah,

greg k-h

2008-10-07 22:36:22

[permalink] [raw]

Subject: Re: [PATCH 1/3] sysfs: Remove lock ordering violation in sysfs_chmod_file.

Dave Hansen <[email protected]> writes:

> On Tue, 2008-10-07 at 03:49 -0700, Eric W. Biederman wrote:
>> It is a wee bit subtle but sysfs_get_dentry grabs inode->i_mutex.
>> of potentially all of the parents of sd. So I can not hold
>> the inode mutex of the directory while it is called.
>
> Hi Eric,
>
> This patch looks good to me. In my quest to parse exactly what was
> going on, I rewrote a description of the patch. Could you look over
> this and see if I'm on target?

It sounds like you are on target.

> Before this patch, inode->i_mutex is held in order to keep the inode's
> mode and ctime from changing out underneath us. If we didn't do this,
> you could potentially get garbage when reading them out of the old
> inode. We calculated these new permissions once since it is redundant
> to do it several times.

Yes. updates to mtime and ctime need to be serialized by holding
the inode semaphore.

> We also need to perform a sysfs_get_dentry() operation on the
> sysfs_dirent in order to find all the dentries on each sb.

To find the dentry for a particular sb.

> This needs
> to be performed once for each sb in which the inode appears. "[B]ut
> sysfs_get_dentry grabs inode->i_mutex. of potentially all of the parents
> of sd. So I can not hold the inode mutex of the directory while it is
> called."

Yes.

> This patch drops the inode->i_mutex over the entire "for each
> sysfs_dirent" loop. It, instead, reacquires and drops it each time we
> need to calculate the new mode/ctime for the target dentries. It does
> this away from the sysfs_get_dentry() call now.

Yes.

Eric

2008-10-07 22:54:35

[permalink] [raw]

Subject: Re: sysfs: tagged directories not merged completely yet

Quoting Greg KH ([email protected]):
> On Tue, Oct 07, 2008 at 01:27:17AM -0700, Eric W. Biederman wrote:
> > Unless someone will give an example of how having multiple superblocks
> > sharing inodes is a problem in practice for sysfs and call it good
> > for 2.6.28. Certainly it shouldn't be an issue if the network namespace
> > code is compiled out. And it should greatly improve testing of the
> > network namespace to at least have access to sysfs.
>
> But if the network namespace code is in? THen we have problems, right?
> And that's the whole point here.
>
> The fact that you are trying to limit userspace view of in-kernel data
> structures, based on that specific user, is, in my opinion, crazy.
>
> Why not just keep all users from seeing sysfs, and then have a user
> daemon doing something on top of FUSE if you really want to see this
> kind of stuff.

Well the blocker is really that when you create a new network namespace,
it wants to create a new loopback interface, but
/sys/devices/virtual/net/lo already exists. That's the same issue with
user namespace when the fair scheduler is enabled, which tries to
re-create /sys/kernel/uids/0.

Otherwise yeah at least for my own uses, containers wouldn't need to
look at /sys at all.

Heck you wouldn't even need FUSE, just mount -t tmpfs /sys/class/net
and manually link the right devices from /sys/devices/virtual/net.

-serge

2008-10-07 23:20:53

[permalink] [raw]

Subject: Re: sysfs: tagged directories not merged completely yet

Hello, a bit of additions after some sleep.

Tejun Heo wrote:
> Eric W. Biederman wrote:
> IIRC, the original readdir implementation put a cursor entry to walk
> through the children list. The implementation was horribly broken in
> a number of different ways (ISTR problems with locking and multiple
> and different type of walkers) and I just gutted out all the
> complexity out and made it simple as getting it correct was far more
> important and there seemed to be little need for optimization.
>
> Yeah, using RCU sounds like a plan.

Heh... it did sound like a plan but I don't think the plan would solve
the problem. filldir can't be put in rcu read critical section. :-p

>> The revalidate on access model doesn't appear to have a way to track
>> remote renames. Something sysfs supports.
>
> Yeap, IIRC, one of the reasons why sysfs wasn't converted over to
> sysfs was because sysfs guarantees inode doesn't change over rename or
> move so that notifications keep working over renames.

s/over to sysfs/over to revalidation/ and s/inode/dentry/. Maybe we can
just ignore dnotify? :-(

Thanks.

--
tejun

2008-10-07 23:37:37

[permalink] [raw]

Subject: Re: sysfs: tagged directories not merged completely yet

Hello, Greg.

Greg KH wrote:
> On Tue, Oct 07, 2008 at 01:27:17AM -0700, Eric W. Biederman wrote:
>> Unless someone will give an example of how having multiple superblocks
>> sharing inodes is a problem in practice for sysfs and call it good
>> for 2.6.28. Certainly it shouldn't be an issue if the network namespace
>> code is compiled out. And it should greatly improve testing of the
>> network namespace to at least have access to sysfs.
>
> But if the network namespace code is in? THen we have problems, right?
> And that's the whole point here.
>
> The fact that you are trying to limit userspace view of in-kernel data
> structures, based on that specific user, is, in my opinion, crazy.

Well, that's the whole point of all the namespace stuff. If we're
gonna do namespaces, view of in-kernel data structures need to be
limited and modified one way or the other.

> Why not just keep all users from seeing sysfs, and then have a user
> daemon doing something on top of FUSE if you really want to see this
> kind of stuff.

That sounds nice. Out of ignorance, how is the /proc dealt with?
Maybe we can have some unified approach for this multiple views of the
system stuff.

> The "leakage" just seems too hard to stop.
>
>> Later Tejun or I or possibly someone else who cares can go back
>> and simplify the sysfs locking to remove the need for multiple
>> superblocks sharing inodes, and to address the other big nasties in
>> the current sysfs implementation.
>
> I know how the whole "we'll go back later and fix it up" stuff works,
> I've used that excuse too many times in the past myself. Never happens
> :)

Heh... I've been telling Jeff I would update libata API doc right
after the next big change for about two years now. :-)

>> Greg I agree with Al that sysfs isn't perfect but we sure aren't going
>> to fix it if you keep dropping or taking years to merge every patch
>> from the people working on it, and then dropping those patches because
>> someone frowns at them.
>
> "years"? Come on, these did take a while due to travel and other stuff.
> These are core kernel changes, and need time to ensure that they work
> properly, and get the proper review from people who understand this kind
> of stuff.

Eric has tried hard for quite some time to improve sysfs and get the
namespace stuff merged and it hasn't been a smooth process and mostly
for non-technical reasons at that (his changes crashing with mine and
me being lazy played a big part). So, now everything seemed to go in
and got dropped again, so I think he has good reasons to get
frustrated, so it would be really nice if we can discuss this with
some civility.

> And to call Al a generic "someone", is just rude and disrespectful. I
> trust his opinion in this area far more than I do yours, to be honest.

Al's review was very helpful but it wasn't the most respectful one
either. I understand that's his style and you do too, right? Then, I
don't think it's fair to call Eric rude and disrepectful for calling
Al a generic "someone". Let's let that be his style too.

> This whole series is dropped, if you want to resubmit them, feel free
> to, _after_ adressing his issues.

I think the more important thing to discuss is how this namespace
stuff is gonna proceed. I'm quite ignorant about the issue and one of
the reasons why I acked the changes although I had my reservations was
becuase the namespace approach seemed to have been agreed upon. Is
it? Can somebody hammer the big picture regarding namespaces into my
small head?

Thanks.

--
tejun

2008-10-07 23:42:56

[permalink] [raw]

Subject: Re: sysfs: tagged directories not merged completely yet

On Tue, Oct 07, 2008 at 05:54:24PM -0500, Serge E. Hallyn wrote:
> Quoting Greg KH ([email protected]):
> > On Tue, Oct 07, 2008 at 01:27:17AM -0700, Eric W. Biederman wrote:
> > > Unless someone will give an example of how having multiple superblocks
> > > sharing inodes is a problem in practice for sysfs and call it good
> > > for 2.6.28. Certainly it shouldn't be an issue if the network namespace
> > > code is compiled out. And it should greatly improve testing of the
> > > network namespace to at least have access to sysfs.
> >
> > But if the network namespace code is in? THen we have problems, right?
> > And that's the whole point here.
> >
> > The fact that you are trying to limit userspace view of in-kernel data
> > structures, based on that specific user, is, in my opinion, crazy.
> >
> > Why not just keep all users from seeing sysfs, and then have a user
> > daemon doing something on top of FUSE if you really want to see this
> > kind of stuff.
>
> Well the blocker is really that when you create a new network namespace,
> it wants to create a new loopback interface, but
> /sys/devices/virtual/net/lo already exists. That's the same issue with
> user namespace when the fair scheduler is enabled, which tries to
> re-create /sys/kernel/uids/0.
>
> Otherwise yeah at least for my own uses, containers wouldn't need to
> look at /sys at all.
>
> Heck you wouldn't even need FUSE, just mount -t tmpfs /sys/class/net
> and manually link the right devices from /sys/devices/virtual/net.

Great, that sounds like a solution.

So tell me again why we need these huge sysfs reworks? :)

thanks,

greg k-h

2008-10-08 00:06:20

[permalink] [raw]

Subject: Re: sysfs: tagged directories not merged completely yet

Tejun Heo <[email protected]> writes:

> Hello, a bit of additions after some sleep.

> Heh... it did sound like a plan but I don't think the plan would solve
> the problem. filldir can't be put in rcu read critical section. :-p

There is srcu and there is the trick of grabbing the reference count
on the current sysfs_dirent over the filldir and dropping the rcu
lock (which works for proc).

To cut down on lock overhead from user space accesses that works.

>>> The revalidate on access model doesn't appear to have a way to track
>>> remote renames. Something sysfs supports.
>>
>> Yeap, IIRC, one of the reasons why sysfs wasn't converted over to
>> sysfs was because sysfs guarantees inode doesn't change over rename or
>> move so that notifications keep working over renames.
>
> s/over to sysfs/over to revalidation/ and s/inode/dentry/. Maybe we can
> just ignore dnotify? :-(

Well there are more cases than dnotify, there is the renaming of directories
in sysfs, although rare that I think get awkward if we use revalidation.

I'm still not certain how we can get the lock ordering so it doesn't
cause us problems. I will look at revalidation and what the other
distributed filesystems are doing and see if that might work. If it
doesn't we need refactor the VFS locking.

Eric

2008-10-08 00:12:18

[permalink] [raw]

Subject: Re: sysfs: tagged directories not merged completely yet

Quoting Greg KH ([email protected]):
> On Tue, Oct 07, 2008 at 05:54:24PM -0500, Serge E. Hallyn wrote:
> > Quoting Greg KH ([email protected]):
> > > On Tue, Oct 07, 2008 at 01:27:17AM -0700, Eric W. Biederman wrote:
> > > > Unless someone will give an example of how having multiple superblocks
> > > > sharing inodes is a problem in practice for sysfs and call it good
> > > > for 2.6.28. Certainly it shouldn't be an issue if the network namespace
> > > > code is compiled out. And it should greatly improve testing of the
> > > > network namespace to at least have access to sysfs.
> > >
> > > But if the network namespace code is in? THen we have problems, right?
> > > And that's the whole point here.
> > >
> > > The fact that you are trying to limit userspace view of in-kernel data
> > > structures, based on that specific user, is, in my opinion, crazy.
> > >
> > > Why not just keep all users from seeing sysfs, and then have a user
> > > daemon doing something on top of FUSE if you really want to see this
> > > kind of stuff.
> >
> > Well the blocker is really that when you create a new network namespace,
> > it wants to create a new loopback interface, but
> > /sys/devices/virtual/net/lo already exists. That's the same issue with
> > user namespace when the fair scheduler is enabled, which tries to
> > re-create /sys/kernel/uids/0.
> >
> > Otherwise yeah at least for my own uses, containers wouldn't need to
> > look at /sys at all.
> >
> > Heck you wouldn't even need FUSE, just mount -t tmpfs /sys/class/net
> > and manually link the right devices from /sys/devices/virtual/net.
>
> Great, that sounds like a solution.
>
> So tell me again why we need these huge sysfs reworks? :)

Because :

> > Well the blocker is really that when you create a new network namespace,
> > it wants to create a new loopback interface, but
> > /sys/devices/virtual/net/lo already exists. That's the same issue with

So at least we'd have to do something to allow creation of 'duplicate'
devices in different namespaces. It might be fine if we just ended up
with /sys/devices/virtual/net/lo, if created in a child net namespace,
be named /sys/devices/virtual/net/lo.childXYZ. Then userspace can
mount -t tmpfs none /sys/class/net and ln -s
/sys/devices/virtual/net/lo.childXYZ /sys/class/net/lo.

-serge

2008-10-08 00:23:53

[permalink] [raw]

Subject: Re: sysfs: tagged directories not merged completely yet

Hello,

Eric W. Biederman wrote:
> Tejun Heo <[email protected]> writes:
>
>> Hello, a bit of additions after some sleep.
>
>> Heh... it did sound like a plan but I don't think the plan would solve
>> the problem. filldir can't be put in rcu read critical section. :-p
>
> There is srcu and there is the trick of grabbing the reference count
> on the current sysfs_dirent over the filldir and dropping the rcu
> lock (which works for proc).

Hmmm... I'm probably missing something (and being lazy) but how does it
guarantee the validity of the next pointer after dropping the rcu lock?

> To cut down on lock overhead from user space accesses that works.
>
>>>> The revalidate on access model doesn't appear to have a way to track
>>>> remote renames. Something sysfs supports.
>>> Yeap, IIRC, one of the reasons why sysfs wasn't converted over to
>>> sysfs was because sysfs guarantees inode doesn't change over rename or
>>> move so that notifications keep working over renames.
>> s/over to sysfs/over to revalidation/ and s/inode/dentry/. Maybe we can
>> just ignore dnotify? :-(
>
> Well there are more cases than dnotify, there is the renaming of directories
> in sysfs, although rare that I think get awkward if we use revalidation.
>
> I'm still not certain how we can get the lock ordering so it doesn't
> cause us problems. I will look at revalidation and what the other
> distributed filesystems are doing and see if that might work. If it
> doesn't we need refactor the VFS locking.

Yeah, if we can make sysfs behave like other distributed filesystems, it
would be great. :-)

Thanks.

--
tejun

2008-10-08 00:41:58

[permalink] [raw]

Subject: Re: sysfs: tagged directories not merged completely yet

On Tue, Oct 07, 2008 at 07:12:03PM -0500, Serge E. Hallyn wrote:
> Quoting Greg KH ([email protected]):
> > On Tue, Oct 07, 2008 at 05:54:24PM -0500, Serge E. Hallyn wrote:
> > > Quoting Greg KH ([email protected]):
> > > > On Tue, Oct 07, 2008 at 01:27:17AM -0700, Eric W. Biederman wrote:
> > > > > Unless someone will give an example of how having multiple superblocks
> > > > > sharing inodes is a problem in practice for sysfs and call it good
> > > > > for 2.6.28. Certainly it shouldn't be an issue if the network namespace
> > > > > code is compiled out. And it should greatly improve testing of the
> > > > > network namespace to at least have access to sysfs.
> > > >
> > > > But if the network namespace code is in? THen we have problems, right?
> > > > And that's the whole point here.
> > > >
> > > > The fact that you are trying to limit userspace view of in-kernel data
> > > > structures, based on that specific user, is, in my opinion, crazy.
> > > >
> > > > Why not just keep all users from seeing sysfs, and then have a user
> > > > daemon doing something on top of FUSE if you really want to see this
> > > > kind of stuff.
> > >
> > > Well the blocker is really that when you create a new network namespace,
> > > it wants to create a new loopback interface, but
> > > /sys/devices/virtual/net/lo already exists. That's the same issue with
> > > user namespace when the fair scheduler is enabled, which tries to
> > > re-create /sys/kernel/uids/0.
> > >
> > > Otherwise yeah at least for my own uses, containers wouldn't need to
> > > look at /sys at all.
> > >
> > > Heck you wouldn't even need FUSE, just mount -t tmpfs /sys/class/net
> > > and manually link the right devices from /sys/devices/virtual/net.
> >
> > Great, that sounds like a solution.
> >
> > So tell me again why we need these huge sysfs reworks? :)
>
> Because :
>
> > > Well the blocker is really that when you create a new network namespace,

No, wait. Why would you want to do such a thing in the first place?

> > > it wants to create a new loopback interface, but
> > > /sys/devices/virtual/net/lo already exists. That's the same issue with
>
> So at least we'd have to do something to allow creation of 'duplicate'
> devices in different namespaces. It might be fine if we just ended up
> with /sys/devices/virtual/net/lo, if created in a child net namespace,
> be named /sys/devices/virtual/net/lo.childXYZ. Then userspace can
> mount -t tmpfs none /sys/class/net and ln -s
> /sys/devices/virtual/net/lo.childXYZ /sys/class/net/lo.

ick.

I agree with Tejun here, what's this whole network namespace stuff, what
problems is it trying to solve and what are its goals?

thanks,

greg k-h

2008-10-08 00:46:19

[permalink] [raw]

Subject: Re: sysfs: tagged directories not merged completely yet

Greg KH <[email protected]> writes:

> On Tue, Oct 07, 2008 at 01:27:17AM -0700, Eric W. Biederman wrote:
>> Unless someone will give an example of how having multiple superblocks
>> sharing inodes is a problem in practice for sysfs and call it good
>> for 2.6.28. Certainly it shouldn't be an issue if the network namespace
>> code is compiled out. And it should greatly improve testing of the
>> network namespace to at least have access to sysfs.
>
> But if the network namespace code is in? THen we have problems, right?

Possibly. Tejun and I looked at it a while ago and could not see any
problems that you could trigger by having multiple superblocks share
the same inodes but there might be some. I don't like that aspect
of it any more than Al does, I just haven't found an alternative
that works, nor have I seen a suggestion that is better.

> And that's the whole point here.

And my point is that compared to the other bugs Al found in sysfs
my contribution is miniscule and minor, and pretty much fixed
at this point.

Do I have to rewrite all of sysfs to the point where Al can not find
bugs in it before I can merge my changes?

If that is the standard while it seems ridiculous I can work with
it. I need something i can work with.

> The fact that you are trying to limit userspace view of in-kernel data
> structures, based on that specific user, is, in my opinion, crazy.
>
> Why not just keep all users from seeing sysfs, and then have a user
> daemon doing something on top of FUSE if you really want to see this
> kind of stuff.

As Serge said. The problem is very simply that with the network
namespaces we create devices in different namespaces that have
the same name. lo being the first one we hit. It wasn't my
idea to export them to sysfs but it has happened.

We need a way to put those devices in sysfs that doesn't break
sysfs and is in some form backwards compatible with code that
is running today.

Given that there is at least one directory per physical NIC
in sysfs that I need to handle, the easiest approach I have
found is simply to have a way of having entries with the
same name in the same directory and have a filter on the superblock
of sysfs so different ones are shown.

> The "leakage" just seems too hard to stop.

The goal is not to hide the fact you are in anamespace. So
information "leakage" isn't a problem. I simply need a configuration
where something works.

If I could do what I am trying to do with FUSE I would be happy to
do it that way, and for more esoteric things I have actively suggested
it.

> I know how the whole "we'll go back later and fix it up" stuff works,
> I've used that excuse too many times in the past myself. Never happens
> :)

Sometimes. That has to be balanced against the it is pointless to
submit patches effect. Which I am feeling pretty strongly right now.

A lot of false criticism and what feels like over reaction mixed in
with the few real bugs.

>> Greg I agree with Al that sysfs isn't perfect but we sure aren't going
>> to fix it if you keep dropping or taking years to merge every patch
>> from the people working on it, and then dropping those patches because
>> someone frowns at them.
>
> "years"? Come on, these did take a while due to travel and other stuff.

Yes. "years". This last round hasn't been too bad in comparison
3-4 months. Just before the start of the previous merge window to now.

> These are core kernel changes, and need time to ensure that they work
> properly, and get the proper review from people who understand this kind
> of stuff.

I totally agree. And for that I have no problem if they are not merged
into the stable release. At the same time if they are not at least in
a tree that merges into linux-next no one takes them seriously and the
code is just ignored.

> And to call Al a generic "someone", is just rude and disrespectful. I
> trust his opinion in this area far more than I do yours, to be honest.

> This whole series is dropped, if you want to resubmit them, feel free
> to, _after_ adressing his issues.

His issues are that sysfs has problems.

Except for one I have already addressed Al's issues that were in my code.

Greg do you understand sysfs well enough to know if I have addressed Al's issues?

> bah,

Eric

2008-10-08 01:06:18

[permalink] [raw]

Subject: Re: sysfs: tagged directories not merged completely yet

Tejun Heo <[email protected]> writes:

> Hmmm... I'm probably missing something (and being lazy) but how does it
> guarantee the validity of the next pointer after dropping the rcu lock?

So in next_tid which is essentially what we would be doing.
I grab the rcu_lock. Check something to see if the task I have
is still on the list, if it is then I know the next is valid until
the end of the rcu grace period. Then I follow the next pointer,
and grab the lock again.

rcu is pain to get right but at least it is localized pain.

>> I'm still not certain how we can get the lock ordering so it doesn't
>> cause us problems. I will look at revalidation and what the other
>> distributed filesystems are doing and see if that might work. If it
>> doesn't we need refactor the VFS locking.
>
> Yeah, if we can make sysfs behave like other distributed filesystems, it
> would be great. :-)

I don't think we can make it work but I think we need to exhaust that
avenue before saying that the VFS has to change.

Eric

2008-10-08 01:36:18

[permalink] [raw]

Subject: Re: sysfs: tagged directories not merged completely yet

Greg KH <[email protected]> writes:

> bah,

Greg I want to thank you for working with me and making time to do
this even though you are not a fan of what I am doing and generally
keeping this on a technical basis. I do appreciate it, even if I
am frustrated right now, and feeling like the bar has been set
impossibly hight.

Eric

2008-10-08 14:18:58

[permalink] [raw]

Subject: Re: sysfs: tagged directories not merged completely yet

Quoting Greg KH ([email protected]):
> On Tue, Oct 07, 2008 at 07:12:03PM -0500, Serge E. Hallyn wrote:
> > Quoting Greg KH ([email protected]):
> > > On Tue, Oct 07, 2008 at 05:54:24PM -0500, Serge E. Hallyn wrote:
> > > > Quoting Greg KH ([email protected]):
> > > > > On Tue, Oct 07, 2008 at 01:27:17AM -0700, Eric W. Biederman wrote:
> > > > > > Unless someone will give an example of how having multiple superblocks
> > > > > > sharing inodes is a problem in practice for sysfs and call it good
> > > > > > for 2.6.28. Certainly it shouldn't be an issue if the network namespace
> > > > > > code is compiled out. And it should greatly improve testing of the
> > > > > > network namespace to at least have access to sysfs.
> > > > >
> > > > > But if the network namespace code is in? THen we have problems, right?
> > > > > And that's the whole point here.
> > > > >
> > > > > The fact that you are trying to limit userspace view of in-kernel data
> > > > > structures, based on that specific user, is, in my opinion, crazy.
> > > > >
> > > > > Why not just keep all users from seeing sysfs, and then have a user
> > > > > daemon doing something on top of FUSE if you really want to see this
> > > > > kind of stuff.
> > > >
> > > > Well the blocker is really that when you create a new network namespace,
> > > > it wants to create a new loopback interface, but
> > > > /sys/devices/virtual/net/lo already exists. That's the same issue with
> > > > user namespace when the fair scheduler is enabled, which tries to
> > > > re-create /sys/kernel/uids/0.
> > > >
> > > > Otherwise yeah at least for my own uses, containers wouldn't need to
> > > > look at /sys at all.
> > > >
> > > > Heck you wouldn't even need FUSE, just mount -t tmpfs /sys/class/net
> > > > and manually link the right devices from /sys/devices/virtual/net.
> > >
> > > Great, that sounds like a solution.
> > >
> > > So tell me again why we need these huge sysfs reworks? :)
> >
> > Because :
> >
> > > > Well the blocker is really that when you create a new network namespace,
>
> No, wait. Why would you want to do such a thing in the first place?

So I can have db2, a few apaches, etc, each in different containers with
their network devices and their own ipfilter rules.

So I can take one of those apache containers and migrate it along with
its ip address to another machine.

So I can do the openvz/vserver thing and run a 'virtual machine' (or 50)
without the overhead of another full OS. Now like Eric said our goal
isn't to fool the distro installed in the container and not let it know
it's in a container. But the same tools should be able to administer
inside a container as outside a container. That was the reason for the
filtering of /proc to show the right pids inside a container, for
instance.

So given that, what I describe below should probably suffice. Though I
wonder whether things depending on uevents will get messed up in a
container. It should be fine, I assume, so long as the devicename (lo)
is sent along withthe filename (lo.childXYZ).

> > > > it wants to create a new loopback interface, but
> > > > /sys/devices/virtual/net/lo already exists. That's the same issue with
> >
> > So at least we'd have to do something to allow creation of 'duplicate'
> > devices in different namespaces. It might be fine if we just ended up
> > with /sys/devices/virtual/net/lo, if created in a child net namespace,
> > be named /sys/devices/virtual/net/lo.childXYZ. Then userspace can
> > mount -t tmpfs none /sys/class/net and ln -s
> > /sys/devices/virtual/net/lo.childXYZ /sys/class/net/lo.
>
> ick.
>
> I agree with Tejun here, what's this whole network namespace stuff, what
> problems is it trying to solve and what are its goals?
>
> thanks,
>
> greg k-h

2008-10-14 01:20:28

[permalink] [raw]

Subject: Re: sysfs: tagged directories not merged completely yet

Tejun Heo <[email protected]> writes:

> Hello, Greg.
>
> Greg KH wrote:
>> On Tue, Oct 07, 2008 at 01:27:17AM -0700, Eric W. Biederman wrote:
>>> Unless someone will give an example of how having multiple superblocks
>>> sharing inodes is a problem in practice for sysfs and call it good
>>> for 2.6.28. Certainly it shouldn't be an issue if the network namespace
>>> code is compiled out. And it should greatly improve testing of the
>>> network namespace to at least have access to sysfs.
>>
>> But if the network namespace code is in? THen we have problems, right?
>> And that's the whole point here.
>>
>> The fact that you are trying to limit userspace view of in-kernel data
>> structures, based on that specific user, is, in my opinion, crazy.
>
> Well, that's the whole point of all the namespace stuff. If we're
> gonna do namespaces, view of in-kernel data structures need to be
> limited and modified one way or the other.
>
>> Why not just keep all users from seeing sysfs, and then have a user
>> daemon doing something on top of FUSE if you really want to see this
>> kind of stuff.
>
> That sounds nice. Out of ignorance, how is the /proc dealt with?
> Maybe we can have some unified approach for this multiple views of the
> system stuff.

/proc uses just about every trick in the book to make this work.

/proc/sys uses a magic d_compare method.

/proc/net becomes a symlink to /proc/<pid>/net and we get completely
different directory trees below that. Shortly that code will
use auto mounts of a proc_net filesystem, that has different
super blocks one for each different network namespace.

/proc/sysvipc/* simply returns different values from it's files depending
upon which process is reading them.

/proc itself has multiple super blocks one for each different pid namespace.

The long term direction is to be able to see everything at once. If
you mount all of the filesystems multiple times in the proper way.
That allows monitoring software to watch what is going on inside of a
container without a challenge, and it makes it a user space policy
decision how much an individual container sees.

For sysfs there isn't have the option of putting things under
/proc/<pid>, the directories I am interested in (at least for network
devices) are scattered all over sysfs and come and go with device
hotplug events so I don't see a realistic way of splitting those
directories out into their own filesystem.

>From a user interface design perspective I don't see a good
alternative to having /sys/class/net/, /sys/virtual/net/, and all of
the other directories different based on network namespace. Then have
the network namespace be specified by super block. Looking at current
and doing the magic d_compare trick almost works, but it runs into
problems with sysfs_get_dentry.

>From the perspective of the internal sysfs data structures tagged
dirents are clean and simple so I don't see a reason to re-architect
that.

I have spent the last several days looking deeply at what the vfs
can do, and how similar situations are handled. My observations
are:
1) exportfs from nfsd is similar to our kobject to sysfs_dirent layer,
and solves that set of problems cleanly, including remote rename.
So there is no fundamental reason we need inverted twisting locking in
sysfs, or otherwise violate existing vfs rules.
2) i_mutex seems to protect very little if anything that we care about.
The dcache has it's own set of locks. So we may be able to completely
avoid taking i_mutex in sysfs and simplify things enormously.
Currently I believe we are very similar to ocfs2 in terms of locking
requirements.
3) For i_notify and d_notify that seems to require pinning the inode
or the dentry in question, so I see no reason why a d_revalidate
style of update will have problems.
4) For finer locking granularity of readdir. All we need to do is do
the semi-expensive restart for each dirent, and the problem is
trivially solved.
5) Large directories are a potential performance problem in sysfs.

So it appears that the path forward is:
- Cleanup sysfs locking and other issues.
- Return to the network namespace code.

Possibly with an intermediate step of only showing the network
devices in the initial network namespace in sysfs.

> Can somebody hammer the big picture regarding namespaces into my
> small head?

100,000 foot view. A namespace introduces a scope so multiple
objects can have the same name. Like network devices.

10,000 foot view. The network namespace looks to user space
as if the kernel has multiple independent network stacks.

1000 foot view. I have two network devices named lo, and sysfs
does not currently have a place for me to put them.

Leakage and being able to fool an application that it has the entire
kernel to itself are not concerns. The goal is simply to get the
entire object name to object translation boundary and the namespace
work is done. We have largely achieved, and the code to do
so once complete is reasonable enough that it should be no
worse than dealing with any other kernel bug.

Eric

2008-10-14 08:01:20

[permalink] [raw]

Subject: Re: sysfs: tagged directories not merged completely yet

Hello, Eric.

Eric W. Biederman wrote:
>> That sounds nice. Out of ignorance, how is the /proc dealt with?
>> Maybe we can have some unified approach for this multiple views of the
>> system stuff.
>
> /proc uses just about every trick in the book to make this work.
>
> /proc/sys uses a magic d_compare method.
>
> /proc/net becomes a symlink to /proc/<pid>/net and we get completely
> different directory trees below that. Shortly that code will
> use auto mounts of a proc_net filesystem, that has different
> super blocks one for each different network namespace.
>
> /proc/sysvipc/* simply returns different values from it's files depending
> upon which process is reading them.
>
> /proc itself has multiple super blocks one for each different pid namespace.

Aieeeee... I wanna run screaming and crying. Any chance these can be
done using FUSE? FUSE is pretty flexible and should be able to
emulate most of proc files w/o too much difficulty.

> The long term direction is to be able to see everything at once. If
> you mount all of the filesystems multiple times in the proper way.
> That allows monitoring software to watch what is going on inside of a
> container without a challenge, and it makes it a user space policy
> decision how much an individual container sees.
>
> For sysfs there isn't have the option of putting things under
> /proc/<pid>, the directories I am interested in (at least for network
> devices) are scattered all over sysfs and come and go with device
> hotplug events so I don't see a realistic way of splitting those
> directories out into their own filesystem.
>
> From a user interface design perspective I don't see a good
> alternative to having /sys/class/net/, /sys/virtual/net/, and all of
> the other directories different based on network namespace. Then have
> the network namespace be specified by super block. Looking at current
> and doing the magic d_compare trick almost works, but it runs into
> problems with sysfs_get_dentry.
>

And can we do the same thing for sysfs using FUSE? So that not only
the policy but also the implementation is in userland? The changes
are quite pervasive and makes the whole thing pretty difficult to
follow.

> From the perspective of the internal sysfs data structures tagged
> dirents are clean and simple so I don't see a reason to re-architect
> that.

Heh... you know I have some reservations on that one too. :-)

> I have spent the last several days looking deeply at what the vfs
> can do, and how similar situations are handled. My observations
> are:

Thanks. Much appreciated.

> 1) exportfs from nfsd is similar to our kobject to sysfs_dirent layer,
> and solves that set of problems cleanly, including remote rename.
> So there is no fundamental reason we need inverted twisting locking in
> sysfs, or otherwise violate existing vfs rules.

eGreat. IIRC, it does it by not moving the existing dentry but
invalidating it, right?

The current twisted state is more of what's left from the original
tree-of-dentries-and-inodes implementation. It would be great to do
proper distributed fs instead.

> 2) i_mutex seems to protect very little if anything that we care about.
> The dcache has it's own set of locks. So we may be able to completely
> avoid taking i_mutex in sysfs and simplify things enormously.
> Currently I believe we are very similar to ocfs2 in terms of locking
> requirements.

I think the timestamps are one of the things it protects.

> 3) For i_notify and d_notify that seems to require pinning the inode
> or the dentry in question, so I see no reason why a d_revalidate
> style of update will have problems.

Because the existing notifications won't be moved over to the new
dentry. dnotify wouldn't work the same way. ISTR that was the reason
why I didn't do the d_revalidate thing, but I don't think it really
matters. dnotify on sysfs nodes doesn't work properly anyway.

> 4) For finer locking granularity of readdir. All we need to do is do
> the semi-expensive restart for each dirent, and the problem is
> trivially solved.

That can show the same entry multiple times or skip existing entries.
I think it's better to put fake entries and implement iterators.

> 5) Large directories are a potential performance problem in sysfs.

Yes, it is. It hasn't been an issue till now. You're worrying about
look up performance, right? If that's a real concern we can link sd's
into a hash table, but I'm not sure tho. For listing, O(n) is the
best we can do anyway and after the initial lookup, the result would
be cached via dcache anyway, so I'm not really sure how much adding a
hashtable will buy us.

> So it appears that the path forward is:
> - Cleanup sysfs locking and other issues.
> - Return to the network namespace code.
>
> Possibly with an intermediate step of only showing the network
> devices in the initial network namespace in sysfs.
>
>> Can somebody hammer the big picture regarding namespaces into my
>> small head?
>
> 100,000 foot view. A namespace introduces a scope so multiple
> objects can have the same name. Like network devices.
>
> 10,000 foot view. The network namespace looks to user space
> as if the kernel has multiple independent network stacks.
>
> 1000 foot view. I have two network devices named lo, and sysfs
> does not currently have a place for me to put them.
>
> Leakage and being able to fool an application that it has the entire
> kernel to itself are not concerns. The goal is simply to get the
> entire object name to object translation boundary and the namespace
> work is done. We have largely achieved, and the code to do
> so once complete is reasonable enough that it should be no
> worse than dealing with any other kernel bug.

Yes, I'm aware of the goals. What I'm curious about is the consensus
regarding network namespace and all its implications. It adds a lot
of complexities over a lot of places. e.g. following the sysfs code
becomes quite a bit more difficult after the namespace changes (maybe
it's just me but still). So, I was asking whether people generally
agree that having the namespace thing is worth the added complexities.

I think it serves pretty small group of users. Hosting service
providers and people trying to migrate processes from one machine to
another, both of which can be served pretty well with virtualization.
It does have higher overhead both processing power and memory wise but
IIUC the former is being actively worked on w/ new processor features
like nested paging tables and all and memory is really cheap these
days, so I'm a bit skeptical how much this is needed and how much we
should pay for it.

Another venue to explore is whether the partial view of proc and sysfs
can be implemented in less pervasive way. Implementing it via FUSE
might not be easier per-se but I think it would be better to do it
that way if we can instead of adding complexities to both proc and
sysfs.

One last thing that came to mind is, how would uevents be handled?
ie. what happens if a network card which is presented as ethN in the
namespace goes away? How does the system deal with it?

Thanks.

--
tejun

2008-10-14 12:20:26

[permalink] [raw]

Subject: Re: sysfs: tagged directories not merged completely yet

Tejun Heo <[email protected]> writes:

> Aieeeee... I wanna run screaming and crying. Any chance these can be
> done using FUSE? FUSE is pretty flexible and should be able to
> emulate most of proc files w/o too much difficulty.

I don't see how FUSE can help. The problem is getting the information
out of the kernel, and not breaking backwards compatiblity while we
do it. As I understand FUSE it just allows for user space filesystems.
Which is great if I want to hide information.

> And can we do the same thing for sysfs using FUSE? So that not only
> the policy but also the implementation is in userland? The changes
> are quite pervasive and makes the whole thing pretty difficult to
> follow.

I don't see how. If userspace doesn't have the information I don't
see how placing a filter will allow it to show up there.

The challenge is to not conflict on network device names. If someone can
think of where we can put the network devices that are in different
network namespaces in sysfs so they don't conflict when they have the
same name I have no problem with that. But where can we put them?

>> From the perspective of the internal sysfs data structures tagged
>> dirents are clean and simple so I don't see a reason to re-architect
>> that.
>
> Heh... you know I have some reservations on that one too. :-)

Well compared to the rest of it the part with dirents is just a handful
of lines of code. The vfs part is the expensive and hairy part.

>> I have spent the last several days looking deeply at what the vfs
>> can do, and how similar situations are handled. My observations
>> are:
>
> Thanks. Much appreciated.
>
>> 1) exportfs from nfsd is similar to our kobject to sysfs_dirent layer,
>> and solves that set of problems cleanly, including remote rename.
>> So there is no fundamental reason we need inverted twisting locking in
>> sysfs, or otherwise violate existing vfs rules.
>
> eGreat. IIRC, it does it by not moving the existing dentry but
> invalidating it, right?
>
> The current twisted state is more of what's left from the original
> tree-of-dentries-and-inodes implementation. It would be great to do
> proper distributed fs instead.

Yes. I am looking at that.

>> 2) i_mutex seems to protect very little if anything that we care about.
>> The dcache has it's own set of locks. So we may be able to completely
>> avoid taking i_mutex in sysfs and simplify things enormously.
>> Currently I believe we are very similar to ocfs2 in terms of locking
>> requirements.
>
> I think the timestamps are one of the things it protects.

Yes. I think parts of the page cache and anything in the inode itself
is protected by i_mutex. As for timestamsp or anything else that
we really care about we can and should put them in sysfs_dirent and
we can have the stat method recreate it, and possibly have d_revalidate
refresh it.

>> 3) For i_notify and d_notify that seems to require pinning the inode
>> or the dentry in question, so I see no reason why a d_revalidate
>> style of update will have problems.
>
> Because the existing notifications won't be moved over to the new
> dentry. dnotify wouldn't work the same way. ISTR that was the reason
> why I didn't do the d_revalidate thing, but I don't think it really
> matters. dnotify on sysfs nodes doesn't work properly anyway.

Reasonable. I have seen two ways of handling rename properly.
Some weird variant d_splice_alias or some cleaner variant of what
we are doing today.

>> 4) For finer locking granularity of readdir. All we need to do is do
>> the semi-expensive restart for each dirent, and the problem is
>> trivially solved.
>
> That can show the same entry multiple times or skip existing entries.
> I think it's better to put fake entries and implement iterators.

The guarantee is that we will see all entries that are there for the
duration of readdir, we order the directory by inode, and stick
the inode number in f_pos. So now we don't have the problem of
returning the same entry multiple times or skipping existing entries.

>> 5) Large directories are a potential performance problem in sysfs.
>
> Yes, it is. It hasn't been an issue till now. You're worrying about
> look up performance, right?

Lookup, create, unlink and if we drop the lock during readdir, readdir
restart. The all require a linear scan.

> If that's a real concern we can link sd's
> into a hash table, but I'm not sure tho. For listing, O(n) is the
> best we can do anyway and after the initial lookup, the result would
> be cached via dcache anyway, so I'm not really sure how much adding a
> hashtable will buy us.

Depends on how many devices people are adding and removing dynamically
I guess. sysctl has had that issue so I am thinking about it. I
figure we need to make things work properly first.

>> So it appears that the path forward is:
>> - Cleanup sysfs locking and other issues.
>> - Return to the network namespace code.
>>
>> Possibly with an intermediate step of only showing the network
>> devices in the initial network namespace in sysfs.
>>
>>> Can somebody hammer the big picture regarding namespaces into my
>>> small head?
>>
>> 100,000 foot view. A namespace introduces a scope so multiple
>> objects can have the same name. Like network devices.
>>
>> 10,000 foot view. The network namespace looks to user space
>> as if the kernel has multiple independent network stacks.
>>
>> 1000 foot view. I have two network devices named lo, and sysfs
>> does not currently have a place for me to put them.
>>
>> Leakage and being able to fool an application that it has the entire
>> kernel to itself are not concerns. The goal is simply to get the
>> entire object name to object translation boundary and the namespace
>> work is done. We have largely achieved, and the code to do
>> so once complete is reasonable enough that it should be no
>> worse than dealing with any other kernel bug.
>
> Yes, I'm aware of the goals. What I'm curious about is the consensus
> regarding network namespace and all its implications. It adds a lot
> of complexities over a lot of places.

Not really. It is really very straight forward. 99% of the modified
code simply has an extra pointer dereference.

Except for sysfs the network namespace code that has merged is in a
very usable state. There are a few little things like iptables
support that still needs some work. From a practical standpoint sysfs
was one of the first things I started working on and it is one of the
last things to be done.

> e.g. following the sysfs code
> becomes quite a bit more difficult after the namespace changes (maybe
> it's just me but still).

Some of it yes. Which asks for a more comprehensive solution. Part
of the challenge is that there has been insistence on an especially
generic solution, in sysfs and I'm not certain that has helped.

> So, I was asking whether people generally
> agree that having the namespace thing is worth the added complexities.

To my knowledge yes. Most of the cost is trivial, and it makes
a darn good excuse to clean up problem code.

> I think it serves pretty small group of users. Hosting service
> providers and people trying to migrate processes from one machine to
> another, both of which can be served pretty well with virtualization.
> It does have higher overhead both processing power and memory wise but
> IIUC the former is being actively worked on w/ new processor features
> like nested paging tables and all and memory is really cheap these
> days, so I'm a bit skeptical how much this is needed and how much we
> should pay for it.

So far sysfs is the most costly and the hardest part. Most of the
cost is in the noise and in the design.

One thing the namespaces fundamentally get you is scaling. You can
run probably 10x more environments on a single server. Which makes
then cheaper and available, on all hardware.

Beyond that there are people who actually just want to use a single
namespace for what you can do. They are general tools and are useful
in more ways than just checkpoint restart and virtualization.

Think what happens if you are a switch/router and you switch two
different networks both using overlaping addresses in the 10.x segment.

Or think how much easier it is to test routing with just a single machine.

All kinds of interesting uses.

> Another venue to explore is whether the partial view of proc and sysfs
> can be implemented in less pervasive way. Implementing it via FUSE
> might not be easier per-se but I think it would be better to do it
> that way if we can instead of adding complexities to both proc and
> sysfs.

This isn't a partial view thing really. This is how do I put it all
in there not have conflicts and preserve backwards compatibility.

In proc. I have work as hard as I can to build a design that will let
us see it all without sacrificing backwards compatibility. With /proc/<pid>
I have a natural place to put data in a per process view. I don't
have that in sysfs, and sysfs at some point stopped being about just
the hardware. So the only way I have found to have places for everything
is to do multiple mounts.

> One last thing that came to mind is, how would uevents be handled?
> ie. what happens if a network card which is presented as ethN in the
> namespace goes away? How does the system deal with it?

It is probably worth a double check. Coming in all physical network
devices happen in the initial network namespace so that direction isn't
a problem. Worse case I expect we figure out how to add a field that
specifies enough about the network namespace so the events can be relayed
to appropriate part of user space.

Eric

2008-10-14 18:53:37

[permalink] [raw]

Subject: Re: sysfs: tagged directories not merged completely yet

Quoting Tejun Heo ([email protected]):
> >> Can somebody hammer the big picture regarding namespaces into my
> >> small head?
> >
> > 100,000 foot view. A namespace introduces a scope so multiple
> > objects can have the same name. Like network devices.
> >
> > 10,000 foot view. The network namespace looks to user space
> > as if the kernel has multiple independent network stacks.
> >
> > 1000 foot view. I have two network devices named lo, and sysfs
> > does not currently have a place for me to put them.
> >
> > Leakage and being able to fool an application that it has the entire
> > kernel to itself are not concerns. The goal is simply to get the
> > entire object name to object translation boundary and the namespace
> > work is done. We have largely achieved, and the code to do
> > so once complete is reasonable enough that it should be no
> > worse than dealing with any other kernel bug.
>
> Yes, I'm aware of the goals. What I'm curious about is the consensus
> regarding network namespace and all its implications. It adds a lot
> of complexities over a lot of places. e.g. following the sysfs code
> becomes quite a bit more difficult after the namespace changes (maybe
> it's just me but still). So, I was asking whether people generally
> agree that having the namespace thing is worth the added complexities.
>
> I think it serves pretty small group of users. Hosting service

I don't think that's true.

Let's say i want to run debootstrap and set up a minimal image
to run postfix. Now if I want to run that on my laptop as its own
minimal separate machine, I need to run qemu or kvm. That's huge.

Once we finally get network namespaces(-sysfs) finished, I can set up a
10-line config file, download and install
https://sourceforge.net/projects/lxc/, run

lxc-execute -n postfix-cont /bin/bash

and voila, I have postfix running as though on a separate machine,
but with none of the kvm/qemu overhead. Which means that instead
of being able to do one at a time, i can do... hundreds? So I
think this is something everyone will find useful - but of course
I *am* biased :)

> providers and people trying to migrate processes from one machine to
> another, both of which can be served pretty well with virtualization.
> It does have higher overhead both processing power and memory wise but
> IIUC the former is being actively worked on w/ new processor features
> like nested paging tables and all and memory is really cheap these
> days, so I'm a bit skeptical how much this is needed and how much we
> should pay for it.
>
> Another venue to explore is whether the partial view of proc and sysfs
> can be implemented in less pervasive way. Implementing it via FUSE
> might not be easier per-se but I think it would be better to do it

Again fuse doesn't address the *core* issue (sysfs needing a way to
create files for multiple devicenames with same name). But I believe
Benjamin was looking into a minimal patch to fix that. Benjamin,
have you gotten anywhere with that?

> that way if we can instead of adding complexities to both proc and
> sysfs.
>
> One last thing that came to mind is, how would uevents be handled?
> ie. what happens if a network card which is presented as ethN in the
> namespace goes away? How does the system deal with it?
>
> Thanks.
>
> --
> tejun

2008-10-15 00:50:24

[permalink] [raw]

Subject: Re: sysfs: tagged directories not merged completely yet

"Serge E. Hallyn" <[email protected]> writes:

> Again fuse doesn't address the *core* issue (sysfs needing a way to
> create files for multiple devicenames with same name). But I believe
> Benjamin was looking into a minimal patch to fix that. Benjamin,
> have you gotten anywhere with that?

I would love to hear a minimal strategy for that.

The only minimal strategy user space wise is to create multiple superblocks.
Anything else I an think of violates backwards compatibility.

Eric

2008-10-15 11:08:10

[permalink] [raw]

Subject: Re: sysfs: tagged directories not merged completely yet

Eric W. Biederman wrote:
> Tejun Heo <[email protected]> writes:
>
>> Aieeeee... I wanna run screaming and crying. Any chance these can be
>> done using FUSE? FUSE is pretty flexible and should be able to
>> emulate most of proc files w/o too much difficulty.
>
> I don't see how FUSE can help. The problem is getting the information
> out of the kernel, and not breaking backwards compatiblity while we
> do it. As I understand FUSE it just allows for user space filesystems.
> Which is great if I want to hide information.

Well, you can modify the information too. FUSE can easily present
ethN as eth0 of a namespace. I don't know how it will play out with
the actual network interfaces tho.

>> And can we do the same thing for sysfs using FUSE? So that not only
>> the policy but also the implementation is in userland? The changes
>> are quite pervasive and makes the whole thing pretty difficult to
>> follow.
>
> I don't see how. If userspace doesn't have the information I don't
> see how placing a filter will allow it to show up there.
>
> The challenge is to not conflict on network device names. If someone can
> think of where we can put the network devices that are in different
> network namespaces in sysfs so they don't conflict when they have the
> same name I have no problem with that. But where can we put them?

I was thinking about just letting them be ethN with unique Ns and
letting FUSE server present some of them as ethX on namespaces.

>>> 2) i_mutex seems to protect very little if anything that we care about.
>>> The dcache has it's own set of locks. So we may be able to completely
>>> avoid taking i_mutex in sysfs and simplify things enormously.
>>> Currently I believe we are very similar to ocfs2 in terms of locking
>>> requirements.
>> I think the timestamps are one of the things it protects.
>
> Yes. I think parts of the page cache and anything in the inode itself
> is protected by i_mutex. As for timestamsp or anything else that
> we really care about we can and should put them in sysfs_dirent and
> we can have the stat method recreate it, and possibly have d_revalidate
> refresh it.

Some of the timestamps are not on the sysfs_dirent because 1. nobody
cared (the original sd implementation didn't preserve it) and
2. of memory overhead.

>>> 3) For i_notify and d_notify that seems to require pinning the inode
>>> or the dentry in question, so I see no reason why a d_revalidate
>>> style of update will have problems.
>> Because the existing notifications won't be moved over to the new
>> dentry. dnotify wouldn't work the same way. ISTR that was the reason
>> why I didn't do the d_revalidate thing, but I don't think it really
>> matters. dnotify on sysfs nodes doesn't work properly anyway.
>
> Reasonable. I have seen two ways of handling rename properly.
> Some weird variant d_splice_alias or some cleaner variant of what
> we are doing today.

FWIW, I think it would be just fine to invalidate the renamed dentry.

>>> 4) For finer locking granularity of readdir. All we need to do is do
>>> the semi-expensive restart for each dirent, and the problem is
>>> trivially solved.
>> That can show the same entry multiple times or skip existing entries.
>> I think it's better to put fake entries and implement iterators.
>
> The guarantee is that we will see all entries that are there for the
> duration of readdir, we order the directory by inode, and stick
> the inode number in f_pos. So now we don't have the problem of
> returning the same entry multiple times or skipping existing entries.

Right, great. :-)

>>> 5) Large directories are a potential performance problem in sysfs.
>> Yes, it is. It hasn't been an issue till now. You're worrying about
>> look up performance, right?
>
> Lookup, create, unlink and if we drop the lock during readdir, readdir
> restart. The all require a linear scan.
>
>> If that's a real concern we can link sd's
>> into a hash table, but I'm not sure tho. For listing, O(n) is the
>> best we can do anyway and after the initial lookup, the result would
>> be cached via dcache anyway, so I'm not really sure how much adding a
>> hashtable will buy us.
>
> Depends on how many devices people are adding and removing dynamically
> I guess. sysctl has had that issue so I am thinking about it. I
> figure we need to make things work properly first.

Yeap, let's think about optimization later. The problem hasn't come
up yet even on machines where the memory footprint of sysfs dentries
and inodes posed serious problems, so I don't think optimizing it is a
high priority at this point.

>>> Leakage and being able to fool an application that it has the entire
>>> kernel to itself are not concerns. The goal is simply to get the
>>> entire object name to object translation boundary and the namespace
>>> work is done. We have largely achieved, and the code to do
>>> so once complete is reasonable enough that it should be no
>>> worse than dealing with any other kernel bug.
>> Yes, I'm aware of the goals. What I'm curious about is the consensus
>> regarding network namespace and all its implications. It adds a lot
>> of complexities over a lot of places.
>
> Not really. It is really very straight forward. 99% of the modified
> code simply has an extra pointer dereference.
>
> Except for sysfs the network namespace code that has merged is in a
> very usable state. There are a few little things like iptables
> support that still needs some work. From a practical standpoint sysfs
> was one of the first things I started working on and it is one of the
> last things to be done.
>
>> e.g. following the sysfs code
>> becomes quite a bit more difficult after the namespace changes (maybe
>> it's just me but still).
>
> Some of it yes. Which asks for a more comprehensive solution. Part
> of the challenge is that there has been insistence on an especially
> generic solution, in sysfs and I'm not certain that has helped.

Well, I suppose most of that blame falls on me but I still can't bring
myself to agree with the current implementation. The biggest problem
I have is that the implementation doesn't really show in straight
forward manner what it tries to achieve (showing partial tree
depending on sb).

>> So, I was asking whether people generally agree that having the
>> namespace thing is worth the added complexities.
>
> To my knowledge yes. Most of the cost is trivial, and it makes
> a darn good excuse to clean up problem code.

Getting the clean up part in usually isn't a problem, right? But
getting in the actual namespace part is (and should be).

>> I think it serves pretty small group of users. Hosting service
>> providers and people trying to migrate processes from one machine to
>> another, both of which can be served pretty well with virtualization.
>> It does have higher overhead both processing power and memory wise but
>> IIUC the former is being actively worked on w/ new processor features
>> like nested paging tables and all and memory is really cheap these
>> days, so I'm a bit skeptical how much this is needed and how much we
>> should pay for it.
>
> So far sysfs is the most costly and the hardest part. Most of the
> cost is in the noise and in the design.

As ugly as it is, it's designed to export internal data structure
as-is to userland and bound tightly with the driver model in
not-so-orthodox way, so it's not very inclined to dance at the tune of
namespaces. :-)

> One thing the namespaces fundamentally get you is scaling. You can
> run probably 10x more environments on a single server. Which makes
> then cheaper and available, on all hardware.
>
> Beyond that there are people who actually just want to use a single
> namespace for what you can do. They are general tools and are useful
> in more ways than just checkpoint restart and virtualization.
>
> Think what happens if you are a switch/router and you switch two
> different networks both using overlaping addresses in the 10.x segment.
>
> Or think how much easier it is to test routing with just a single machine.
>
> All kinds of interesting uses.

Other than the scaling argument which probably applies mostly to the
hosting people, I also find other usages interesting but not sure how
popular they would be.

>> Another venue to explore is whether the partial view of proc and sysfs
>> can be implemented in less pervasive way. Implementing it via FUSE
>> might not be easier per-se but I think it would be better to do it
>> that way if we can instead of adding complexities to both proc and
>> sysfs.
>
> This isn't a partial view thing really. This is how do I put it all
> in there not have conflicts and preserve backwards compatibility.
>
> In proc. I have work as hard as I can to build a design that will let
> us see it all without sacrificing backwards compatibility. With /proc/<pid>
> I have a natural place to put data in a per process view. I don't
> have that in sysfs, and sysfs at some point stopped being about just
> the hardware. So the only way I have found to have places for everything
> is to do multiple mounts.

As I wrote above, FUSE can modify or create data in-flight, but it's
not like proc and sysfs are the only ones which need to be changed, so
it might not matter after all.

>> One last thing that came to mind is, how would uevents be handled?
>> ie. what happens if a network card which is presented as ethN in the
>> namespace goes away? How does the system deal with it?
>
> It is probably worth a double check. Coming in all physical network
> devices happen in the initial network namespace so that direction isn't
> a problem. Worse case I expect we figure out how to add a field that
> specifies enough about the network namespace so the events can be relayed
> to appropriate part of user space.

Thanks.

--
tejun

2008-10-15 13:42:40

[permalink] [raw]

Subject: Re: sysfs: tagged directories not merged completely yet

Quoting Eric W. Biederman ([email protected]):
> "Serge E. Hallyn" <[email protected]> writes:
>
> > Again fuse doesn't address the *core* issue (sysfs needing a way to
> > create files for multiple devicenames with same name). But I believe
> > Benjamin was looking into a minimal patch to fix that. Benjamin,
> > have you gotten anywhere with that?
>
> I would love to hear a minimal strategy for that.

Oh I just meant for kernel-space. So if a container is creating lo,
it will create a device named lo, but the sysfs file will be called
lo_1 or something.

> The only minimal strategy user space wise is to create multiple superblocks.
> Anything else I an think of violates backwards compatibility.

Yes, the above would require that the container either not mount
sysfs, ignore sysfs, or tweak sysfs using
mount -t tmpfs none /sys/class/net
mount --bind /sys/devices/virtual/net/lo_1 /sys/class/net/lo
or using fuse.

I'd definately prefer the sysfs tagging approach. But I'd prefer
the above over never being able to use network namespaces on a
standard distro (with sysfs enabled).

-serge

2008-10-15 13:55:14

by Benjamin Thery

[permalink] [raw]

Subject: Re: sysfs: tagged directories not merged completely yet

Serge E. Hallyn wrote:
> Quoting Eric W. Biederman ([email protected]):
>> "Serge E. Hallyn" <[email protected]> writes:
>>
>>> Again fuse doesn't address the *core* issue (sysfs needing a way to
>>> create files for multiple devicenames with same name). But I believe
>>> Benjamin was looking into a minimal patch to fix that. Benjamin,
>>> have you gotten anywhere with that?
>> I would love to hear a minimal strategy for that.
>
> Oh I just meant for kernel-space. So if a container is creating lo,
> it will create a device named lo, but the sysfs file will be called
> lo_1 or something.

I've started working on a patch that implements what you suggested:
add a suffix representing the netns to net device entries in sysfs.

It does work, but there are some issues (in addition to the fact that
these new device entries doesn't look very clean in /sys/class/net :) ).

For example, what is a good suffix for the device name.
We only have 4 bytes left for the suffix:

BUS_ID_SIZE - IFNAMSIZ = 4
sysfs net device
name length name length

Benjamin

>
>> The only minimal strategy user space wise is to create multiple superblocks.
>> Anything else I an think of violates backwards compatibility.
>
> Yes, the above would require that the container either not mount
> sysfs, ignore sysfs, or tweak sysfs using
> mount -t tmpfs none /sys/class/net
> mount --bind /sys/devices/virtual/net/lo_1 /sys/class/net/lo
> or using fuse.
>
> I'd definately prefer the sysfs tagging approach. But I'd prefer
> the above over never being able to use network namespaces on a
> standard distro (with sysfs enabled).
>
> -serge
>
>

--
B e n j a m i n T h e r y - BULL/DT/Open Software R&D

http://www.bull.com

2008-10-16 22:00:27