2001-11-16 10:08:14

by NeilBrown

Subject: Devlinks. Code. (Dcache abuse?)


I read in an interview recently:
http://www.newsforge.com/article.pl?sid=01/11/07/1516223&mode=thread

that Alan Cox was quoted as saying:
-------
One of the big issues therefore is how you handle naming and finding
objects that can shuffle around over time and between boots. Linus
favours a more devfs like approach, I favour a real /dev file system
and creating nodes on the fly by a usermode daemon. This allows you to
remember file permissions and policy more easily. It also means you
need a way to describe a device (traditionally a major/minor
numbering) which Linus doesn't like.
--------

And I thought that it was really time that this was resolved.

Herewith is a proposal for "a real /dev file system" that still uses
a "devfs like approach" to naming and avoids the problems of "traditional
... major/minor numbering".

Some of you with good memories might remember my little treatise from
May last year titled "devfs - the missing link" where I introduced the
idea of a "devlink". You can read it at:
http://lwn.net/2000/0518/a/nb-devfs.html

Below is code that implements devlinks, though some
of the details are different to the original proposal.

To recap, the fundamental idea is:

A device special file is a gateway between a user (admin)
controlled name space (the filesystem) and a kernel imposed name
space (major/minor numbers) that recognises and imposes access
control (owner/group/permissions).

The (a) problem with this is that major/minor numbers are too limited,
even if we go to lots more bits. Textual names are better.
So:

A devlink is a gateway between a user (or admin) controlled name
space (the filesystem) and a kernel imposed name space (the devfs
names) that recognises and imposes access control
(owner/group/permissions).

Notice that the only difference is that we now use names instead of
numbers. This is (as I understand it) the important difference that
is needed.

==== The user-land perspective ====

A Devlink looks like a symlink with the "sticky" (S_ISVTX) bit set.
Indeed, that is how it is stored on a filesystem.

To use devlinks you must apply the patch below and compile with DEVFS
support and with CONFIG_DEVLINK. You do not need to mount devfs at
all.

To create a devlink, you use mknod on a pre-existing symlink. The
mknod must request a device (block or char) with device number 0,0.
e.g.
ln -s tty /dev/TTY
mknod /dev/TTY b 0 0

This will create a devlink called "/dev/TTY" which points to the name
"tty" in devfs space.

ls -l /dev/TTY

will show the devlink.

ls -lL /dev/TTY

will show the traditional device special file.

Once you have turned a symlink into a devlink, chmod and chown will
work directly on the devlink so you can change the permissions and
ownership freely. The ownership and permissions are automatically
imposed on everything that the devlink points to.

A devlink can point to anything in the devfs namespace, not just
devices. e.g

ln -s ide /dev/ide
mknod /dev/ide b 0 0

will make /dev/ide be a devlink to the ide tree within devfs.
Then
cd /dev/ide
will work and allow you to move around the directory tree. Everything
in the directory tree will have the same ownership and permissions as
the devlink has, except for the execute bits. For directories, the
execute bits are copied from the read bits. For non-directories, the
execute bits are cleared.
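
The permission rule above can be modelled in user space roughly as
follows. This is only a sketch of the rule as described (and as
dl_setattr in the patch appears to implement it); the function name
devlink_child_mode is invented here, and the octal masks stand in for
the kernel's S_IRUGO and S_IWUGO.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <sys/stat.h>

/* Hypothetical model: compute the mode of an object seen beneath a
 * devlink from the object's own mode and the devlink's mode. */
mode_t devlink_child_mode(mode_t child_mode, mode_t devlink_mode)
{
    mode_t mode;

    /* The child must already have a file type (dir, chr, blk, ...). */
    assert((child_mode & S_IFMT) != 0);

    /* Keep the child's type (and sticky bit), drop its permissions. */
    mode = child_mode & (S_IFMT | S_ISVTX);

    /* Read and write bits come from the devlink; execute bits do not. */
    mode |= devlink_mode & 0666;

    /* For directories, each read bit is copied to the execute bit. */
    if (S_ISDIR(mode))
        mode |= (devlink_mode & 0444) >> 2;

    return mode;
}
```

For example, beneath a devlink with permissions 0640 a directory would
show up as 0750 and a block device as 0640.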

You cannot do

ln -s '' /devices
mknod /devices b 0 0

and get the full devfs namespace under /devices, but only because
of a shortcoming of the devfs code (that normally would never be
asked to do this anyway). It could fairly easily be fixed but it
didn't seem worth the effort for this proof-of-concept.

When you 'cd' in a devlink tree, e.g.
cd /dev/ide/host0

the path you get with "/bin/pwd" looks just like what you would
expect: "/dev/ide/host0" in the example. The subtree of the devfs
namespace appears to be beneath the devlink almost as if the devlink
itself were a directory.

The inodes in the devlink tree have a different device number than
that of the devlink as though it were a separately mounted filesystem,
but there is no mountpoint appearing in /proc/mounts.

When you use devlinks you do still see major and minor device numbers,
but getting rid of them completely would be too much of a departure
from the Unix that we know and love. However once you use devlinks,
the actual numbers in use become much less important. You can have a
"real" /dev tree with access policy set up, and the kernel can assign
different device numbers at each reboot, and everything will continue
to work. Just the names need to stay the same.

==== The internal perspective ====

The objects that you reach when you follow a devlink are all
objects in a new filesystem, the "devlink" filesystem.
The devlink filesystem is an FS_SINGLE|FS_NOMOUNT filesystem that
can never be mounted.

When vfs_follow_link is called on a symlink with the S_ISVTX bit set,
"devlink_find" is called which creates (if necessary) a dentry and
inode that reflect the target devfs object, and attaches this dentry
underneath the devlink's dentry. The dentry has a name of ".", though
this is hidden from user-space (a name of "" might be a better
choice).
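
To illustrate how the "." name can stay hidden when a path is printed,
here is a small user-space model of walking from a node up to the root
and skipping any component named ".". It is not the kernel's __d_path;
struct node and sketch_path are names invented for this sketch, and a
NULL parent is assumed to mark the root.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a dentry: a name and a parent pointer. */
struct node {
    const char *name;
    const struct node *parent;   /* NULL for the root */
};

/* Build the path of n right-to-left at the end of buf.
 * Returns a pointer to the start of the path inside buf,
 * or NULL if buf is too small. */
char *sketch_path(const struct node *n, char *buf, size_t buflen)
{
    char *end;

    assert(buf != NULL && buflen > 0);
    end = buf + buflen;
    *--end = '\0';

    while (n != NULL && n->parent != NULL) {
        size_t len = strlen(n->name);

        /* The attachment dentry is called "."; leave it out. */
        if (len == 1 && n->name[0] == '.') {
            n = n->parent;
            continue;
        }
        if ((size_t)(end - buf) < len + 1)
            return NULL;
        end -= len;
        memcpy(end, n->name, len);
        *--end = '/';
        n = n->parent;
    }
    if (*end == '\0') {          /* n was the root itself */
        if (end == buf)
            return NULL;
        *--end = '/';
    }
    return end;
}
```

With a chain root -> "dev" -> "ide" -> "." -> "host0", this yields
"/dev/ide/host0" rather than "/dev/ide/./host0".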

For this to work, vfs_follow_link needs to know the dentry that is
being followed as well as the name and parent directory, so the patch
adds another argument to vfs_follow_link.

There is a new lookup flag, LOOKUP_DEVLINK, which can accompany
LOOKUP_FOLLOW and means "don't follow devlinks".
This allows chmod to work directly on devlinks.
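
The decision made for the last path component can be written out as a
small truth function. This is a user-space model, not kernel code: the
SK_ names are invented for the sketch, 64 is the value the patch gives
LOOKUP_DEVLINK, and the value 1 for LOOKUP_FOLLOW is an assumption.

```c
#include <stdbool.h>

/* Flag values used by this sketch only. */
enum {
    SK_LOOKUP_FOLLOW  = 1,   /* assumed value */
    SK_LOOKUP_DEVLINK = 64   /* value used in the patch */
};

#define SK_S_ISVTX 01000     /* sticky bit */

/* Should the final component be followed?  mode is the inode mode,
 * has_follow_link says whether the inode has a follow_link method. */
bool sk_follows_last_component(unsigned int flags, unsigned int mode,
                               bool has_follow_link)
{
    if (!(flags & SK_LOOKUP_FOLLOW) || !has_follow_link)
        return false;
    /* A sticky link (devlink) is left alone only when the caller
     * passed LOOKUP_DEVLINK; ordinary symlinks are always followed. */
    return (mode & SK_S_ISVTX) == 0 || (flags & SK_LOOKUP_DEVLINK) == 0;
}
```

So an open() style lookup still follows a devlink, while a chmod or
chown style lookup (FOLLOW plus DEVLINK) stops at the devlink but keeps
following ordinary symlinks.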

vfs_mknod is changed so that if a blk or chr device is requested with
device number 0,0, and the name refers to a symlink, the S_ISVTX bit is
forced on. Any other chmod on a symlink is never allowed to change the
S_ISVTX bit.
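
The two mode rules in that paragraph can be sketched as plain
arithmetic on mode words. Again this is a user-space model with
invented function names, and the way chmod combines the requested bits
with the existing type bits is an assumption about the caller.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <sys/stat.h>

/* mknod of a 0,0 device on an existing symlink: keep the symlink
 * type, take the permission bits from the request, force sticky on. */
mode_t devlink_mode_after_mknod(mode_t symlink_mode, mode_t requested)
{
    assert(S_ISLNK(symlink_mode));
    return (symlink_mode & S_IFMT) | (requested & 0777) | S_ISVTX;
}

/* chmod on a symlink or devlink: permissions may change, but the
 * sticky bit always keeps the value it had before. */
mode_t link_mode_after_chmod(mode_t current, mode_t requested)
{
    mode_t mode;

    assert(S_ISLNK(current));
    mode = (current & S_IFMT) | (requested & 07777);
    if (current & S_ISVTX)
        mode |= S_ISVTX;
    else
        mode &= ~(mode_t)S_ISVTX;
    return mode;
}
```

The effect is that only the mknod path can turn a symlink into a
devlink, and nothing in this model turns a devlink back into a plain
symlink.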

There is plenty of room for races with bits of devfs disappearing
underneath devlink. This is largely due to the current nature of
devfs. I think that the new devfs core that Richard is working on
will mean that these races can trivially be dealt with.

Probably the most ugly bits of the implementation are:
1/ the fact that dentries are attached to a devlink (which is
not a directory), and that these dentries are in a different
filesystem to the devlink.
2/ the special casing to hide the "." dentries in follow_dotdot
and __d_path.

1/ is very different from the way things are now, but the more I think
about it the more comfortable I feel with it. There may be issues
about what happens when the devlink is unlinked, but they can be dealt
with I'm sure.
2/ is clearly ugly. I would like to find a nicer solution but as yet,
none has presented itself. Possibly making the name "" instead of
"." would make it a bit less ugly, but not much.

Enough chatting. Here is the code, against 2.4.15-pre5. Comments
welcome.

NeilBrown


--- ./fs/proc/base.c 2001/11/16 03:56:34 1.1
+++ ./fs/proc/base.c 2001/11/16 09:06:47 1.2
@@ -911,7 +911,7 @@
{
char tmp[30];
sprintf(tmp, "%d", current->pid);
- return vfs_follow_link(nd,tmp);
+ return vfs_follow_link(nd,tmp,dentry);
}

static struct inode_operations proc_self_inode_operations = {
--- ./fs/proc/generic.c 2001/11/16 03:56:34 1.1
+++ ./fs/proc/generic.c 2001/11/16 09:06:48 1.2
@@ -219,7 +219,7 @@
static int proc_follow_link(struct dentry *dentry, struct nameidata *nd)
{
char *s=((struct proc_dir_entry *)dentry->d_inode->u.generic_ip)->data;
- return vfs_follow_link(nd, s);
+ return vfs_follow_link(nd, s, dentry);
}

static struct inode_operations proc_link_inode_operations = {
--- ./fs/nfs/symlink.c 2001/11/16 03:56:34 1.1
+++ ./fs/nfs/symlink.c 2001/11/16 09:06:48 1.2
@@ -93,7 +93,8 @@
{
struct inode *inode = dentry->d_inode;
struct page *page = NULL;
- int res = vfs_follow_link(nd, nfs_getlink(inode,&page));
+ int res = vfs_follow_link(nd, nfs_getlink(inode,&page),
+ dentry);
if (page) {
kunmap(page);
page_cache_release(page);
--- ./fs/ext2/symlink.c 2001/11/16 03:56:34 1.1
+++ ./fs/ext2/symlink.c 2001/11/16 09:06:48 1.2
@@ -29,7 +29,7 @@
static int ext2_follow_link(struct dentry *dentry, struct nameidata *nd)
{
char *s = (char *)dentry->d_inode->u.ext2_i.i_data;
- return vfs_follow_link(nd, s);
+ return vfs_follow_link(nd, s, dentry);
}

struct inode_operations ext2_fast_symlink_inode_operations = {
--- ./fs/sysv/symlink.c 2001/11/16 03:56:34 1.1
+++ ./fs/sysv/symlink.c 2001/11/16 09:06:48 1.2
@@ -16,7 +16,7 @@
static int sysv_follow_link(struct dentry *dentry, struct nameidata *nd)
{
char *s = (char *)dentry->d_inode->u.sysv_i.i_data;
- return vfs_follow_link(nd, s);
+ return vfs_follow_link(nd, s,dentry);
}

struct inode_operations sysv_fast_symlink_inode_operations = {
--- ./fs/ufs/symlink.c 2001/11/16 03:56:34 1.1
+++ ./fs/ufs/symlink.c 2001/11/16 09:06:48 1.2
@@ -36,7 +36,7 @@
static int ufs_follow_link(struct dentry *dentry, struct nameidata *nd)
{
char *s = (char *)dentry->d_inode->u.ufs_i.i_u1.i_symlink;
- return vfs_follow_link(nd, s);
+ return vfs_follow_link(nd, s, dentry);
}

struct inode_operations ufs_fast_symlink_inode_operations = {
--- ./fs/autofs/symlink.c 2001/11/16 03:56:34 1.1
+++ ./fs/autofs/symlink.c 2001/11/16 09:06:48 1.2
@@ -21,7 +21,7 @@
static int autofs_follow_link(struct dentry *dentry, struct nameidata *nd)
{
char *s=((struct autofs_symlink *)dentry->d_inode->u.generic_ip)->data;
- return vfs_follow_link(nd, s);
+ return vfs_follow_link(nd, s, dentry);
}

struct inode_operations autofs_symlink_inode_operations = {
--- ./fs/devfs/base.c 2001/11/16 03:56:34 1.1
+++ ./fs/devfs/base.c 2001/11/16 09:06:48 1.2
@@ -3049,7 +3049,7 @@
up_read (&symlink_rwsem);
if (copy)
{
- err = vfs_follow_link (nd, copy);
+ err = vfs_follow_link (nd, copy, dentry);
kfree (copy);
}
else err = -ENOMEM;
--- ./fs/autofs4/symlink.c 2001/11/16 03:56:35 1.1
+++ ./fs/autofs4/symlink.c 2001/11/16 09:06:48 1.2
@@ -23,7 +23,7 @@
{
struct autofs_info *ino = autofs4_dentry_ino(dentry);

- return vfs_follow_link(nd, ino->u.symlink);
+ return vfs_follow_link(nd, ino->u.symlink, dentry);
}

struct inode_operations autofs4_symlink_inode_operations = {
--- ./fs/freevxfs/vxfs_immed.c 2001/11/16 03:56:35 1.1
+++ ./fs/freevxfs/vxfs_immed.c 2001/11/16 09:06:48 1.2
@@ -101,7 +101,7 @@
{
struct vxfs_inode_info *vip = VXFS_INO(dp->d_inode);

- return (vfs_follow_link(np, vip->vii_immed.vi_immed));
+ return (vfs_follow_link(np, vip->vii_immed.vi_immed, dentry));
}

/**
--- ./fs/jffs2/symlink.c 2001/11/16 03:56:35 1.1
+++ ./fs/jffs2/symlink.c 2001/11/16 09:06:48 1.2
@@ -99,7 +99,7 @@
if (IS_ERR(buf))
return PTR_ERR(buf);

- ret = vfs_follow_link(nd, buf);
+ ret = vfs_follow_link(nd, buf, dentry);
kfree(buf);
return ret;
}
--- ./fs/ext3/symlink.c 2001/11/16 04:32:20 1.1
+++ ./fs/ext3/symlink.c 2001/11/16 09:06:48 1.2
@@ -30,7 +30,7 @@
static int ext3_follow_link(struct dentry *dentry, struct nameidata *nd)
{
char *s = (char *)dentry->d_inode->u.ext3_i.i_data;
- return vfs_follow_link(nd, s);
+ return vfs_follow_link(nd, s, dentry);
}

struct inode_operations ext3_fast_symlink_inode_operations = {
--- ./fs/attr.c 2001/11/16 03:56:35 1.1
+++ ./fs/attr.c 2001/11/16 09:06:48 1.2
@@ -80,6 +80,13 @@
if (ia_valid & ATTR_CTIME)
inode->i_ctime = attr->ia_ctime;
if (ia_valid & ATTR_MODE) {
+ if (S_ISLNK(inode->i_mode) &&
+ ! (ia_valid&ATTR_FORCE)) {
+ if (inode->i_mode & S_ISVTX)
+ attr->ia_mode |= S_ISVTX;
+ else
+ attr->ia_mode &= ~S_ISVTX;
+ }
inode->i_mode = attr->ia_mode;
if (!in_group_p(inode->i_gid) && !capable(CAP_FSETID))
inode->i_mode &= ~S_ISGID;
--- ./fs/open.c 2001/11/16 03:56:35 1.1
+++ ./fs/open.c 2001/11/16 09:06:50 1.2
@@ -236,7 +236,7 @@
struct inode * inode;
struct iattr newattrs;

- error = user_path_walk(filename, &nd);
+ error = user_path_walk_devlink(filename, &nd);
if (error)
goto out;
inode = nd.dentry->d_inode;
@@ -280,7 +280,7 @@
struct inode * inode;
struct iattr newattrs;

- error = user_path_walk(filename, &nd);
+ error = user_path_walk_devlink(filename, &nd);

if (error)
goto out;
@@ -491,7 +491,7 @@
int error;
struct iattr newattrs;

- error = user_path_walk(filename, &nd);
+ error = user_path_walk_devlink(filename, &nd);
if (error)
goto out;
inode = nd.dentry->d_inode;
@@ -581,7 +581,7 @@
struct nameidata nd;
int error;

- error = user_path_walk(filename, &nd);
+ error = user_path_walk_devlink(filename, &nd);
if (!error) {
error = chown_common(nd.dentry, user, group);
path_release(&nd);
--- ./fs/namei.c 2001/11/16 03:56:35 1.1
+++ ./fs/namei.c 2001/11/16 09:06:50 1.2
@@ -419,6 +419,13 @@
spin_unlock(&dcache_lock);
dput(nd->dentry);
nd->dentry = dentry;
+#ifdef CONFIG_DEVLINK
+ /* Can't leave the current directory
+ * as a devlink ...
+ */
+ if (S_ISLNK(dentry->d_inode->i_mode))
+ continue;
+#endif
break;
}
parent=nd->mnt->mnt_parent;
@@ -591,7 +598,9 @@
;
inode = dentry->d_inode;
if ((lookup_flags & LOOKUP_FOLLOW)
- && inode && inode->i_op && inode->i_op->follow_link) {
+ && inode && inode->i_op && inode->i_op->follow_link
+ && ((inode->i_mode&S_ISVTX)==0 || (lookup_flags& LOOKUP_DEVLINK)==0)
+ ) {
err = do_follow_link(dentry, nd);
dput(dentry);
if (err)
@@ -1215,6 +1224,20 @@
if ((S_ISCHR(mode) || S_ISBLK(mode)) && !capable(CAP_MKNOD))
goto exit_lock;

+#ifdef CONFIG_DEVLINK
+ if ((S_ISCHR(mode) || S_ISBLK(mode))
+ && dev == MKDEV(0,0)
+ && dentry->d_inode
+ && S_ISLNK(dentry->d_inode->i_mode)) {
+ struct iattr attr;
+ attr.ia_valid = ATTR_MODE | ATTR_FORCE;
+ attr.ia_mode = (dentry->d_inode->i_mode & S_IFMT)
+ | (mode & S_IRWXUGO)
+ | S_ISVTX;
+ error = notify_change(dentry, &attr);
+ goto exit_lock;
+ }
+#endif
error = may_create(dir, dentry);
if (error)
goto exit_lock;
@@ -1943,13 +1966,32 @@
}

static inline int
-__vfs_follow_link(struct nameidata *nd, const char *link)
+__vfs_follow_link(struct nameidata *nd, const char *link, struct dentry *dentry)
{
int res = 0;
char *name;
if (IS_ERR(link))
goto fail;

+#ifdef CONFIG_DEVLINK
+ /* If dentry->d_inode is a symlink with sticky bit set, then
+ * this is a devlink. That means that we lookup "link"
+ * in devfs and make a dentry+inode in the devlink filesystem
+ * and then attach the dentry just under this dentry
+ * Kind-a gross I know...
+ */
+ if (!(nd->mnt->mnt_flags & MNT_NODEV)
+ && dentry->d_inode
+ && (dentry->d_inode->i_mode & S_ISVTX)) {
+ dentry = devlink_find(dentry, link);
+ if (IS_ERR(dentry))
+ return PTR_ERR(dentry);
+ dput(nd->dentry);
+ nd->dentry = dentry;
+ nd->last_type = LAST_BIND;
+ return 0;
+ }
+#endif
if (*link == '/') {
path_release(nd);
if (!walk_init_root(link, nd))
@@ -1976,9 +2018,9 @@
return PTR_ERR(link);
}

-int vfs_follow_link(struct nameidata *nd, const char *link)
+int vfs_follow_link(struct nameidata *nd, const char *link, struct dentry *dentry)
{
- return __vfs_follow_link(nd, link);
+ return __vfs_follow_link(nd, link, dentry);
}

/* get the link contents into pagecache */
@@ -2020,7 +2062,7 @@
{
struct page *page = NULL;
char *s = page_getlink(dentry, &page);
- int res = __vfs_follow_link(nd, s);
+ int res = __vfs_follow_link(nd, s, dentry);
if (page) {
kunmap(page);
page_cache_release(page);
--- ./fs/devlink.c 2001/11/16 03:56:35 1.1
+++ ./fs/devlink.c 2001/11/16 09:06:50 1.2
@@ -0,0 +1,493 @@
+
+#include <linux/types.h>
+#include <linux/errno.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/fs.h>
+#include <linux/devfs_fs.h>
+#include <linux/devfs_fs_kernel.h>
+#include <linux/sem.h>
+#include <linux/stat.h>
+
+/*
+ * devlink filesystem
+ * this filesystem presents the devfs_entry tree
+ * as a filesystem.
+ * All permissions are inherited from a devlink "mountpoint",
+ * which is really an attachment point.
+ *
+ * Directories appear as directories
+ * char/block specials as those char/block specials
+ * symlinks as .... devlinks?
+ *
+ * dentry trees are attached under devlinks, entirely contravening any
+ * sense of taste
+ */
+
+#define DEVLINK_MAGIC 0xde5501ec
+
+/* ------------------------------------------------------------------*/
+/* Following extracted from devfs/base.c */
+
+struct directory_type
+{
+ struct devfs_entry *first;
+ struct devfs_entry *last;
+ unsigned int num_removable;
+};
+
+struct file_type
+{
+ unsigned long size;
+};
+
+struct device_type
+{
+ unsigned short major;
+ unsigned short minor;
+};
+
+struct fcb_type /* File, char, block type */
+{
+ uid_t default_uid;
+ gid_t default_gid;
+ void *ops;
+ union
+ {
+ struct file_type file;
+ struct device_type device;
+ }
+ u;
+ unsigned char auto_owner:1;
+ unsigned char aopen_notify:1;
+ unsigned char removable:1; /* Belongs in device_type, but save space */
+ unsigned char open:1; /* Not entirely correct */
+ unsigned char autogen:1; /* Belongs in device_type, but save space */
+};
+
+struct symlink_type
+{
+ unsigned int length; /* Not including the NULL-terminator */
+ char *linkname; /* This is NULL-terminated */
+};
+
+struct fifo_type
+{
+ uid_t uid;
+ gid_t gid;
+};
+
+struct devfs_inode /* This structure is for "persistent" inode storage */
+{
+ time_t atime;
+ time_t mtime;
+ time_t ctime;
+ unsigned int ino; /* Inode number as seen in the VFS */
+ struct dentry *dentry;
+ umode_t mode;
+ uid_t uid;
+ gid_t gid;
+};
+
+struct devfs_entry
+{
+ void *info;
+ union
+ {
+ struct directory_type dir;
+ struct fcb_type fcb;
+ struct symlink_type symlink;
+ struct fifo_type fifo;
+ }
+ u;
+ struct devfs_entry *prev; /* Previous entry in the parent directory */
+ struct devfs_entry *next; /* Next entry in the parent directory */
+ struct devfs_entry *parent; /* The parent directory */
+ struct devfs_entry *slave; /* Another entry to unregister */
+ struct devfs_inode inode;
+ umode_t mode;
+ unsigned short namelen; /* I think 64k+ filenames are a way off... */
+ unsigned char registered:1;
+ unsigned char hide:1;
+ unsigned char no_persistence:1;
+ char name[1]; /* This is just a dummy: the allocated array is
+ bigger. This is NULL-terminated */
+};
+
+/* ---------------------------------------------------------- */
+
+static inline devfs_handle_t i2d(struct inode *i)
+{
+ return (devfs_handle_t)(i->u.generic_ip);
+}
+static inline void set_de(struct inode *i, devfs_handle_t de)
+{
+ i->u.generic_ip = de;
+}
+
+static struct inode *devlink_inode(devfs_handle_t de);
+static int dl_setattr(struct dentry *dentry, struct iattr *attr);
+struct super_block *devlink_sb;
+
+static struct inode_operations devlink_dir_ops;
+static struct file_operations devlink_dir_fops;
+static struct inode_operations devlink_dev_ops;
+static struct inode_operations devlink_link_ops;
+static struct dentry_operations devlink_dops;
+static struct super_operations devlink_sops;
+
+
+struct dentry *devlink_find(struct dentry *ldentry, const char *link)
+{
+ /*
+ * here we find or create a dentry and matching inode
+ * for the devfs thing pointed to by "link"
+ * The dentry gets attached under ldentry with the
+ * name ".".
+ * If there is already a dentry we check if it
+ * has the right de. If it does we just use that
+ * If it has the wrong de, we unhash it first.
+ *
+ */
+
+ struct qstr dot = { ".", 1, 0 };
+ struct dentry *dentry = d_lookup(ldentry, &dot);
+ struct inode *inode;
+ int err;
+ devfs_handle_t de =
+ devfs_find_handle(NULL, link, 0, 0, 0, 0);
+
+ down(&ldentry->d_inode->i_sem);
+ if (dentry) {
+ if (!dentry->d_inode) {
+ if (de == NULL)
+ goto out;
+ } else {
+ if (i2d(dentry->d_inode) == de)
+ goto out;
+ }
+
+ d_drop(dentry);
+ dput(dentry);
+ }
+ /* We need to allocate a new dentry...*/
+ err = -ENOMEM;
+ dentry = d_alloc(ldentry, &dot);
+ if (!dentry)
+ goto err;
+ if (!de) {
+ d_add(dentry, NULL);
+ goto out;
+ }
+ /* looks like we need an inode too */
+ inode = devlink_inode(de);
+ if (!inode)
+ goto err;
+ d_add(dentry, inode);
+ dl_setattr(dentry, NULL);
+
+ out:
+ up(&ldentry->d_inode->i_sem);
+ return dentry;
+
+ err:
+ up(&ldentry->d_inode->i_sem);
+ return ERR_PTR(err);
+}
+
+static int dl_setattr(struct dentry *dentry, struct iattr *attr)
+{
+ /* We don't allow any attribute setting
+ * but whenever anyone tries, we make sure that
+ * our attributes match our parent
+ */
+ struct inode *inode, *parent;
+ inode = dentry->d_inode;
+
+ /* find the dentry for the devlink */
+ do {
+ dentry = dentry->d_parent;
+ } while (dentry != dentry->d_parent
+ && ! S_ISLNK(dentry->d_inode->i_mode));
+ parent = dentry->d_inode;
+
+ if (inode) {
+ inode->i_uid = parent->i_uid;
+ inode->i_gid = parent->i_gid;
+ inode->i_mode =
+ (inode->i_mode & (S_IFMT|S_ISVTX))
+ | (parent->i_mode & (S_IRUGO|S_IWUGO));
+
+ /* copy Read to eXecute for directories */
+ if (S_ISDIR(inode->i_mode))
+ inode->i_mode |= (parent->i_mode & S_IRUGO)>>2;
+ }
+ return 0;
+}
+
+/* devlink_inode creates a new inode for a given
+ * devfs entry.
+ *
+ */
+static struct inode *devlink_inode(devfs_handle_t de)
+{
+ struct inode *inode;
+ int mode;
+
+ if (de == NULL)
+ return NULL;
+ if (!de->registered)
+ return NULL;
+ if (S_ISDIR(de->mode))
+ mode = S_IFDIR;
+ else if (S_ISCHR(de->mode))
+ mode = S_IFCHR;
+ else if (S_ISBLK(de->mode))
+ mode = S_IFBLK;
+ else if (S_ISLNK(de->mode))
+ mode = S_IFLNK|S_ISVTX;
+ else
+ return NULL;
+ inode = get_empty_inode();
+ if (inode) {
+ inode->i_sb = devlink_sb;
+ inode->i_ino = de->inode.ino;
+ inode->i_dev = devlink_sb->s_dev;
+
+ inode->i_mode= mode;
+ inode->i_nlink=1;
+ inode->i_uid = inode->i_gid = 0;
+ inode->i_size = 0;
+ inode->i_atime = inode->i_mtime = inode->i_ctime =
+ CURRENT_TIME;
+ inode->i_blocks = 0;
+ switch(mode) {
+ case S_IFCHR:
+ case S_IFBLK:
+ init_special_inode(inode, mode,
+ MKDEV (de->u.fcb.u.device.major,
+ de->u.fcb.u.device.minor));
+ inode->i_op = &devlink_dev_ops;
+ break;
+ case S_IFDIR:
+ inode->i_op = &devlink_dir_ops;
+ inode->i_fop = &devlink_dir_fops;
+ break;
+ case S_IFLNK|S_ISVTX:
+ inode->i_op = &devlink_link_ops;
+ }
+ set_de(inode, de);
+ }
+ return inode;
+}
+static struct dentry *dl_lookup(struct inode *dir, struct dentry *dentry)
+{
+ devfs_handle_t de = i2d(dir);
+ devfs_handle_t d2 = devfs_find_handle(de,
+ dentry->d_name.name,
+ 0, 0, 0, 0);
+ struct inode *inode;
+
+ dentry->d_op = &devlink_dops;
+
+ inode = devlink_inode(d2);
+ d_add(dentry, inode);
+ dl_setattr(dentry, NULL);
+ return NULL;
+}
+
+static int dl_readdir(struct file *file, void *cookie, filldir_t func)
+{
+ struct inode *inode = file->f_dentry->d_inode;
+ devfs_handle_t parent = i2d(inode);
+ devfs_handle_t de;
+ int i=0;
+ int err=0;
+ int stored=0;
+ long pos = file->f_pos;
+
+ switch(pos) {
+ case 0:
+ err = func(cookie, ".", 1, 0, inode->i_ino, DT_DIR);
+ if (err<0) break;
+ stored++;
+ pos++;
+ /* FALLTHROUGH */
+ case 1:
+ err = func(cookie, "..", 2, 1, file->f_dentry->d_parent->d_inode->i_ino,
+ DT_DIR);
+ if (err < 0) break;
+ stored++;
+ pos++;
+ /* FALLTHROUGH */
+ default:
+ i=1;
+ for (de = parent->u.dir.first; de != NULL; de=de->next) {
+ if (!de->registered)
+ continue;
+ i++;
+ if (i<pos)
+ continue;
+ err = func(cookie, de->name, de->namelen,
+ pos, de->inode.ino,
+ de->mode >> 12);
+ if (err < 0) break;
+ pos++;
+ stored++;
+ }
+ }
+ file->f_pos = pos;
+ if (err < 0 && err != -EINVAL)
+ return err;
+ return stored;
+}
+
+static int dl_readlink(struct dentry *dentry, char *buf, int bufsiz)
+{
+
+ devfs_handle_t de = i2d(dentry->d_inode);
+ char *link = de->u.symlink.linkname;
+ if (link && de->registered)
+ /* WARNING bad race here */
+ return vfs_readlink(dentry, buf, bufsiz, link);
+
+ return -ENODEV;
+}
+
+static int dl_follow_link(struct dentry *dentry, struct nameidata *nd)
+{
+ devfs_handle_t de = i2d(dentry->d_inode);
+ char *link = de->u.symlink.linkname;
+ if (link && de->registered) {
+ /* WARNING bad race here */
+ struct dentry *d = devlink_find(dentry, link);
+ dput(nd->dentry);
+ nd->dentry = d;
+ return 0;
+ } else
+ return -ENODEV;
+}
+
+static int dl_revalidate(struct dentry *dentry, int flags)
+{
+ devfs_handle_t de;
+
+ if (dentry->d_inode == NULL) {
+ /* negative dentry.
+ * For a re-lookup
+ */
+ return 0;
+ }
+ dl_setattr(dentry, NULL); /* update attributes */
+ de = i2d(dentry->d_inode);
+ if (de->registered)
+ return 1;
+ d_drop(dentry); /* extreme prejudice... */
+ return 0;
+}
+
+static int dl_i_revalidate(struct dentry *dentry)
+{
+ if (dl_revalidate(dentry, 0))
+ return 0;
+ else
+ return -ENODEV;
+}
+
+static void dl_put_inode(struct inode *inode)
+{
+/* devfs_handle_t de = i2d(inode); */
+ /* I would really like to drop the reference count
+ * on de, but there isn't one...
+ */
+}
+
+static struct vfsmount *devlink_mnt;
+
+static struct super_block *dl_read_super(struct super_block *sb,
+ void *data, int silent)
+{
+ /* I wonder what I really need here ...
+ * just copy some stuff from pipefs for now
+ */
+ struct inode *root = new_inode(sb);
+ if (!root)
+ return NULL;
+ root->i_mode = S_IFDIR | S_IRUSR | S_IWUSR;
+ root->i_uid = root->i_gid = 0;
+ root->i_atime = root->i_mtime = root->i_ctime = CURRENT_TIME;
+ sb->s_blocksize = 1024;
+ sb->s_blocksize_bits = 10;
+ sb->s_magic = DEVLINK_MAGIC;
+ sb->s_op = &devlink_sops;
+ sb->s_root = d_alloc(NULL, &(const struct qstr) { "devlink:", 8, 0 });
+ if (!sb->s_root) {
+ iput(root);
+ return NULL;
+ }
+ sb->s_root->d_sb = sb;
+ sb->s_root->d_parent = sb->s_root;
+ d_instantiate(sb->s_root, root);
+ devlink_sb = sb;
+ return sb;
+
+}
+
+static struct inode_operations devlink_dir_ops = {
+ lookup: dl_lookup,
+ setattr: dl_setattr,
+ revalidate: dl_i_revalidate,
+};
+
+static struct file_operations devlink_dir_fops = {
+ readdir: dl_readdir,
+};
+static struct inode_operations devlink_dev_ops = {
+ setattr: dl_setattr,
+ revalidate: dl_i_revalidate,
+};
+
+static struct inode_operations devlink_link_ops = {
+ readlink: dl_readlink,
+ follow_link: dl_follow_link,
+ setattr: dl_setattr,
+ revalidate: dl_i_revalidate,
+};
+
+static struct dentry_operations devlink_dops = {
+ d_revalidate: dl_revalidate,
+};
+
+static struct super_operations devlink_sops = {
+ put_inode: dl_put_inode,
+};
+
+
+static DECLARE_FSTYPE(dl_fs_type, "devlink", dl_read_super, FS_SINGLE|FS_NOMOUNT);
+
+static int __init init_dl_fs(void)
+{
+ int err = register_filesystem(&dl_fs_type);
+ if (!err) {
+ devlink_mnt = kern_mount(&dl_fs_type);
+ err = PTR_ERR(devlink_mnt);
+ if (IS_ERR(devlink_mnt))
+ unregister_filesystem(&dl_fs_type);
+ else
+ err = 0;
+ }
+ return err;
+}
+
+static void __exit exit_dl_fs(void)
+{
+ unregister_filesystem(&dl_fs_type);
+ mntput(devlink_mnt);
+}
+
+module_init(init_dl_fs)
+module_exit(exit_dl_fs)
+
+
+
+
--- ./fs/Makefile 2001/11/16 03:56:35 1.1
+++ ./fs/Makefile 2001/11/16 09:06:50 1.2
@@ -25,6 +25,8 @@
subdir-$(CONFIG_PROC_FS) += proc
subdir-y += partitions

+obj-$(CONFIG_DEVLINK) += devlink.o
+
# Do not add any filesystems before this line
subdir-$(CONFIG_EXT3_FS) += ext3 # Before ext2 so root fs can be ext3
subdir-$(CONFIG_JBD) += jbd
--- ./fs/Config.in 2001/11/16 03:56:35 1.1
+++ ./fs/Config.in 2001/11/16 09:06:50 1.2
@@ -64,6 +64,7 @@
dep_bool '/dev file system support (EXPERIMENTAL)' CONFIG_DEVFS_FS $CONFIG_EXPERIMENTAL
dep_bool ' Automatically mount at boot' CONFIG_DEVFS_MOUNT $CONFIG_DEVFS_FS
dep_bool ' Debug devfs' CONFIG_DEVFS_DEBUG $CONFIG_DEVFS_FS
+dep_bool ' Devlink support (VeryExperimental)' CONFIG_DEVLINK $CONFIG_DEVFS_FS

# It compiles as a module for testing only. It should not be used
# as a module in general. If we make this "tristate", a bunch of people
--- ./fs/dcache.c 2001/11/16 03:59:12 1.1
+++ ./fs/dcache.c 2001/11/16 09:06:50 1.2
@@ -976,6 +976,13 @@
}
parent = dentry->d_parent;
namelen = dentry->d_name.len;
+#ifdef CONFIG_DEVLINK
+ /* avoid /./ appearing in paths beneath devlinks */
+ if (namelen == 1 && dentry->d_name.name[0] == '.') {
+ dentry = parent;
+ continue;
+ }
+#endif
buflen -= namelen + 1;
if (buflen < 0)
break;
--- ./mm/shmem.c 2001/11/16 03:56:35 1.1
+++ ./mm/shmem.c 2001/11/16 09:06:51 1.2
@@ -1150,7 +1150,7 @@

static int shmem_follow_link_inline(struct dentry *dentry, struct nameidata *nd)
{
- return vfs_follow_link(nd, (const char *)SHMEM_I(dentry->d_inode));
+ return vfs_follow_link(nd, (const char *)SHMEM_I(dentry->d_inode), dentry);
}

static int shmem_readlink(struct dentry *dentry, char *buffer, int buflen)
@@ -1174,7 +1174,7 @@
if (res)
return res;

- res = vfs_follow_link(nd, kmap(page));
+ res = vfs_follow_link(nd, kmap(page), dentry);
kunmap(page);
page_cache_release(page);
return res;
--- ./include/linux/fs.h 2001/11/16 03:56:09 1.2
+++ ./include/linux/fs.h 2001/11/16 09:06:51 1.3
@@ -313,8 +313,6 @@
#include <linux/proc_fs_i.h>
#include <linux/usbdev_fs_i.h>
#include <linux/jffs2_fs_i.h>
-#include <linux/cramfs_fs_sb.h>
-
/*
* Attribute flags. These should be or-ed together to figure out what
* has been changed!
@@ -1258,6 +1256,11 @@
#define LOOKUP_POSITIVE (8)
#define LOOKUP_PARENT (16)
#define LOOKUP_NOALT (32)
+#ifdef CONFIG_DEVLINK
+#define LOOKUP_DEVLINK (64) /* Don't follow devlinks at end of path */
+#else
+#define LOOKUP_DEVLINK (0)
+#endif
/*
* Type of the last component on LOOKUP_PARENT
*/
@@ -1295,6 +1298,8 @@
extern struct dentry * lookup_hash(struct qstr *, struct dentry *);
#define user_path_walk(name,nd) __user_walk(name, LOOKUP_FOLLOW|LOOKUP_POSITIVE, nd)
#define user_path_walk_link(name,nd) __user_walk(name, LOOKUP_POSITIVE, nd)
+#define user_path_walk_devlink(name,nd) __user_walk(name, LOOKUP_FOLLOW|LOOKUP_POSITIVE|LOOKUP_DEVLINK, nd)
+extern struct dentry *devlink_find(struct dentry *ldentry, const char *link);

extern void iput(struct inode *);
extern void force_delete(struct inode *);
@@ -1387,7 +1392,7 @@
extern struct file_operations generic_ro_fops;

extern int vfs_readlink(struct dentry *, char *, int, const char *);
-extern int vfs_follow_link(struct nameidata *, const char *);
+extern int vfs_follow_link(struct nameidata *, const char *, struct dentry *);
extern int page_readlink(struct dentry *, char *, int);
extern int page_follow_link(struct dentry *, struct nameidata *);
extern struct inode_operations page_symlink_inode_operations;
--- ./Documentation/Configure.help 2001/11/16 03:56:35 1.1
+++ ./Documentation/Configure.help 2001/11/16 09:06:51 1.2
@@ -14114,6 +14114,23 @@

If unsure, say N.

+devlink support (Rather Experimental)
+CONFIG_DEVLINK
+ This provides support for devlinks, which are a cross between symlinks
+ and device special files.
+ ln -s /dev/path/to/device/in/devfs /dev/myname
+ mknod /dev/myname b 0 0
+ which creates a devlink which looks like a symlink, but has the sticky
+ bit set. Chmod and chown on devlinks affect the link, not
+ the thing pointed to (at least not directly).
+ When you access a devlink, it provides you with an image of the
+ thing named in devfs, but with the same owner/group/perm as the
+ devlink has.
+ You don't need to have devfs mounted for this to work, but it must
+ be compiled in.
+
+ This code has lots of potential races. Don't use it in production.
+
NFS file system support
CONFIG_NFS_FS
If you are connected to some other (usually local) Unix computer


2001-11-16 10:18:55

by Alan

Subject: Re: Devlinks. Code. (Dcache abuse?)

> A device special file is a gateway between a user (admin)
> controlled name space (the filesystem) and a kernel imposed name
> space (major/minor numbers) that recognises and imposes access
> control (owner/group/permissions).
>
> The (a) problem with this is that major/minor numbers are too limited,

Textual names have unsolved problems too
1. Who administers the namespace
2. When trademarks get entangled, what's the disputes procedure

Do you want to create a situation where a future kernel is likely to be
forced to change a device naming because an "official" vendor driver appears
too and they demand the namespace and wave trademarks around ?

> A Devlink looks like a symlink with the "sticky" (S_ISVTX) bit set.
> Indeed, that is how it is stored on a filesystem.

That seems basically sound. I'm not sure about the devfs part but that
is a separate matter.

Alan

2001-11-16 10:33:46

by Alexander Viro

Subject: Re: Devlinks. Code. (Dcache abuse?)



On Fri, 16 Nov 2001, Neil Brown wrote:

> + if (!(nd->mnt->mnt_flags & MNT_NODEV)
> + && dentry->d_inode
> + && (dentry->d_inode->i_mode & S_ISVTX)) {
> + dentry = devlink_find(dentry, link);

You are breaking vfsmount refcounting. Badly.

2001-11-16 19:26:02

by Andrew Pimlott

Subject: Re: Devlinks. Code. (Dcache abuse?)

Neil,

I'm just a user (not a kernel hacker), but I strongly support this
idea. It is unix-ish yet addresses the problem space aptly. One of
the best parts in my view is that it allows devfs to expose multiple
views of the hardware (eg, organized by bus, by function, by uuid),
and the admin can then choose the most appropriate. Another is that
it puts to rest claims that devfs is policy in the kernel, because
devlinks would give the admin the same flexibility he has with a
traditional /dev.

In fact, I have brought the same concept up in private mail with
Richard Gooch and at least once on this list:
http://www.uwsg.iu.edu/hypermail/linux/kernel/0103.3/0563.html .
Albert Cahalan replied with a brief proposal that differs a bit from
yours:
http://www.uwsg.iu.edu/hypermail/linux/kernel/0103.3/0574.html .

I also have a few comments on your implementation.

> To create a devlink, you use mknod on a pre-existing symlink. The
> mknod must request a device (block or char) with device number 0,0.
> e.g.
> ln -s tty /dev/TTY
> mknod /dev/TTY b 0 0
>
> This will create a devlink called "/dev/TTY" which points to the name
> "tty" in devfs space.

I think it would be a mistake to have the symlink implicitly rooted
in the devfs name space. One, it breaks the principle of least
surprise. Two, it means that the target of the symlink suddenly
changes (from /dev/tty to tty in the devfs namespace) when you do
the mknod. Three, it precludes the possibility of extending
"devlinks" to point to normal files (in which case, devlink isn't
the right name), which I don't think should be dismissed.

I also think that lchmod might be a more elegant system call
interface.

> ls -l /dev/TTY
>
> will show the devlink.
>
> ls -lL /dev/TTY
>
> will show the traditional device special file.

But this would require a patch to ls or libc, no? I think this can
be done such that old tools still show something reasonable.

> Once you have turned a symlink into a devlink, chmod and chown will
> work directly on the devlink so you can change the permissions and
> ownership freely. The ownership and permissions are automatically
> imposed on everything that the devlink points to.
>
> A devlink can point to anything in the devfs namespace, not just
> devices. e.g
>
> ln -s ide /dev/ide
> mknod /dev/ide b 0 0
>
> will make /dev/ide be a devlink the the ide tree within devfs.
> Then
> cd /dev/ide
> will work and allow you to move around the directory tree. Everything
> in the directory tree will have the same ownership and permissions as
> the devlink has, except for the execute bits. For directories, the
> execute bits are copied from the read bits. For non-directories, the
> execute bits are cleared.

I haven't thought this through carefully, but I think applying too
much "devfs magic" to devlinks is a mistake. I think the results
should be what a unix user would intuitively expect from a "symlink
with permissions". So, I don't like the idea of a devlink giving
access recursively. Some of the later ideas (eg, the pwd magic)
strike me as questionable as well.

> You cannot do
>
> ln -s '' /devices
> mknod /devices b 0 0
>
> and get the full devfs namespace under /devices, but only because
> of a shortcoming the the devfs code (that normally would never be
> asked to do this anyway). It could fairly easily be fixed but it
> didn't seem worth the effort for this proof-of-concept.

This would be ugly and inconsistent anyway. Nowhere else in unix
does an empty path make sense. The only sane interpretation would
be an entry in the devfs root whose name is the empty string.

BTW, as to Alan's objections, I think they are greatly overblown.
With sane device registration API's, managing the namespace should
be no harder than managing the module name namespace. And if
trademark owners come after us for calling things by their names, we
likely have bigger problems (all of userspace that refers to the
verboten device name would have to be changed).

The potential pain is worth it. Once we get used to hierarchical
text names for kernel objects, we won't know how we did without
them.

Andrew

2001-11-16 20:42:41

by NeilBrown

Subject: Re: Devlinks. Code. (Dcache abuse?)

On Friday November 16, [email protected] wrote:
> Neil,
>
> I'm just a user (not a kernel hacker), but I strongly support this
> idea. It is unix-ish yet addresses the problem space aptly. One of
> the best parts in my view is that it allows devfs to expose multiple
> views of the hardware (eg, organized by bus, by function, by uuid),
> and the admin can then choose the most appropriate. Another is that
> it puts to rest claims that devfs is policy in the kernel, because
> devlinks would give the admin the same flexibility he has with a
> traditional /dev.

Thank you.

>
> In fact, I have brought the same concept up in private mail with
> Richard Gooch and at least once on this list:
> http://www.uwsg.iu.edu/hypermail/linux/kernel/0103.3/0563.html .
> Albert Cahalan replied with a brief proposal that differs a bit from
> yours:
> http://www.uwsg.iu.edu/hypermail/linux/kernel/0103.3/0574.html .

Substantial similarities, yes.
The idea of "setuid symlinks" is cute, and I have toyed with it. But
implementing a general "setuid symlink" that could point anywhere
would require very intrusive changes to the VFS layer.
That doesn't necessarily mean that it is a bad thing, but you would
have to weigh the cost/benefit carefully.

>
> I also have a few comments on your implementation.
>
> > To create a devlink, you use mknod on a pre-existing symlink. The
> > mknod must request a device (block or char) with device number 0,0.
> > e.g.
> > ln -s tty /dev/TTY
> > mknod /dev/TTY b 0 0
> >
> > This will create a devlink called "/dev/TTY" which points to the name
> > "tty" in devfs space.
>
> I think it would be a mistake to have the symlink implicitly rooted
> in the devfs name space. One, it breaks the principle of least
> surprise. Two, it means that the target of the symlink suddenly
> changes (from /dev/tty to tty in the devfs namespace) when you do
> the mknod. Three, it precludes the possibility of extending
> "devlinks" to point to normal files (in which case, devlink isn't
> the right name), which I don't think should be dismissed.

Well, the value of the symlink is simply interpreted by devfs.
You could use
/dev/tty
or
/devices/tty
or
////IloveLinux//tty
and get the same result as
tty

I would probably recommend people use "/devices/tty", and then if they
boot with a non-devlink kernel they can mount devfs at /devices and
get a working system, and if they boot with a non-devfs kernel they
can make some devices under /devices and have a working system.
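
One reading of the four examples above, as a userspace sketch in Python
rather than the kernel code: a relative value is looked up in devfs as-is,
and for an absolute value the leading slashes and the first component are
discarded. The function name devfs_name and the exact stripping rule are
assumptions inferred from the examples, not taken from the patch.

```python
def devfs_name(link_value: str) -> str:
    """Map a devlink's stored value to a name relative to the devfs root.

    Assumed rule (inferred from the examples in this mail, not from the
    patch): a relative value is used as-is; for an absolute value, the
    first component (the would-be mount point) is dropped.
    """
    if not link_value.startswith("/"):
        return link_value
    # Split on "/" and discard the empty strings produced by runs of slashes.
    parts = [p for p in link_value.split("/") if p]
    # parts[0] is the mount-point-like prefix ("dev", "devices", ...).
    return "/".join(parts[1:])


for value in ("tty", "/dev/tty", "/devices/tty", "////IloveLinux//tty"):
    print(f"{value!r:24} -> {devfs_name(value)!r}")
```

Under this rule a value of "/dev" maps to the empty name, i.e. the devfs
root itself.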

>
> I also think that lchmod might be a more elegant system call
> interface.

Except that "lchmod" doesn't exist. I didn't want to add any new
system calls.

>
> > ls -l /dev/TTY
> >
> > will show the devlink.
> >
> > ls -lL /dev/TTY
> >
> > will show the traditional device special file.
>
> But this would require a patch to ls or libc, no? I think this can
> be done such that old tools still show something reasonable.

No. To userspace, devlinks look a lot like symlinks. Try it and see.
Actually they look more like those magic symlinks in /proc. You
follow them and you get somewhere, but it isn't necessarily the same
place that you would reach by doing a readlink and then following the result.
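
As a rough illustration of what an unmodified tool would see, here is a
Python sketch that classifies an lstat() mode, assuming (as described
earlier in the thread) that a devlink is reported as a symlink with the
sticky (S_ISVTX) bit set. The helper name is_devlink is hypothetical, and
the modes are built by hand because no stock kernel provides devlinks.

```python
import stat


def is_devlink(mode: int) -> bool:
    """True if an lstat() st_mode describes a symlink with the sticky bit set."""
    return stat.S_ISLNK(mode) and bool(mode & stat.S_ISVTX)


# Hand-built modes; on a patched kernel one would pass
# os.lstat(path).st_mode instead.
print(is_devlink(stat.S_IFLNK | stat.S_ISVTX | 0o777))  # symlink + sticky bit
print(is_devlink(stat.S_IFLNK | 0o777))                 # ordinary symlink
print(is_devlink(stat.S_IFDIR | stat.S_ISVTX | 0o777))  # sticky directory
```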

>
> > Then
> > cd /dev/ide
> > will work and allow you to move around the directory tree. Everything
> > in the directory tree will have the same ownership and permissions as
> > the devlink has, except for the execute bits. For directories, the
> > execute bits are copied from the read bits. For non-directories, the
> > execute bits are cleared.
>
> I haven't thought this through carefully, but I think applying too
> much "devfs magic" to devlinks is a mistake. I think the results
> should be what a unix user would intuitively expect from a "symlink
> with permissions". So, I don't like the idea of a devlink giving
> access recursively. Some of the later ideas (eg, the pwd magic)
> strike me as questionable as well.
>

Devices can appear in the devfs namespace spontaneously (hotplug). If
devlinks could only point to devices, then you would *have* to have a
daemon just to be able to see the things.
This way, you only need a daemon (or user-space helper) if you
want to do clever things with permissions or other configuration.
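
The permission rule quoted above (everything under a devlink takes the
devlink's permissions, with the execute bits copied from the read bits
for directories and cleared for everything else) can be sketched in
Python as follows. The function name image_mode is hypothetical; this
models only the rule as stated in the mail, not the patch itself.

```python
import stat


def image_mode(devlink_mode: int, target_is_dir: bool) -> int:
    """Permission bits shown for an object reached through a devlink.

    Rule as quoted in this thread: copy the devlink's permissions, then
    for directories set each execute bit from the matching read bit,
    and for non-directories clear all execute bits.
    """
    perm = devlink_mode & 0o777
    exec_bits = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH  # 0o111
    perm &= ~exec_bits
    if target_is_dir:
        read_bits = perm & (stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)  # 0o444
        perm |= read_bits >> 2  # r (4) shifted down two places lands on x (1)
    return perm


print(oct(image_mode(0o640, target_is_dir=True)))
print(oct(image_mode(0o640, target_is_dir=False)))
```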

> > You cannot do
> >
> > ln -s '' /devices
> > mknod /devices b 0 0
> >
> > and get the full devfs namespace under /devices, but only because
> > of a shortcoming the the devfs code (that normally would never be
> > asked to do this anyway). It could fairly easily be fixed but it
> > didn't seem worth the effort for this proof-of-concept.
>
> This would be ugly and inconsistent anyway. Nowhere else in unix
> does an empty path make sense. The only sane interpretation would
> be an entry in the devfs root whose name is the empty string.

The empty string *should*, and *used to*, mean "the current
directory" (*). But some standards body somewhere broke that about 15
years ago.

As I mention above, it is the same as
ln -s /dev /devices
mknod /devices b 0 0

NeilBrown


(*)
The correct syntax for filenames is:

directoryname: slash                # means root
             | empty                # means current directory
             | filename slash       # means directory stored in the file

filename:      directoryname member # means that member of the directory


slash  == "/+"     # one or more slashes
empty  == ""       # no characters
member == "[^/]+"  # one or more non-slash characters


Most (all) directories contain "." as a name for themselves. Thus
".", which is <empty> followed by "." is a name for the file that
contains the current directory.

That worked fine in 4.4BSD and Edition 7 Unix. But SysV broke it.
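
For anyone who wants to play with the grammar in the footnote, here is a
small Python sketch of a parser for it. The function name parse and the
shape of its return value are illustrative choices, not anything from an
existing tool; the two token patterns are the slash and member patterns
given above.

```python
import re

_TOKEN = re.compile(r"/+|[^/]+")  # slash == "/+", member == "[^/]+"


def parse(name: str):
    """Parse a name per the footnote grammar.

    Returns (start, members, kind): start is "root" or "current",
    members is the list of directory members walked in order, and kind
    says whether the whole string is a filename or a directoryname.
    """
    tokens = _TOKEN.findall(name)
    start = "current"            # directoryname: empty
    if tokens and tokens[0].startswith("/"):
        start = "root"           # directoryname: slash
        tokens = tokens[1:]
    members = [t for t in tokens if not t.startswith("/")]
    # A name ending in a member is a filename; anything else (empty,
    # bare root, or a trailing slash) is a directoryname.
    ends_in_member = bool(tokens) and not tokens[-1].startswith("/")
    kind = "filename" if ends_in_member else "directoryname"
    return start, members, kind


for example in ("", "/", ".", "a/b", "/usr//lib/"):
    print(repr(example), parse(example))
```

Note that the empty string parses as a directoryname that starts at, and
stays in, the current directory, which is the point of the footnote.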

2001-11-19 03:47:45

by NeilBrown

Subject: Re: Devlinks. Code. (Dcache abuse?)

On Friday November 16, [email protected] wrote:
> > A device special file is a gateway between a user (admin)
> > controlled name space (the filesystem) and a kernel imposed name
> > space (major/minor numbers) that recognises and imposes access
> > control (owner/group/permissions).
> >
> > The (a) problem with this is that major/minor numbers are too limited,
>
> Textual names have unsolved problems too
> 1. Who administers the namespace

LANANA (the Linux Assigned Names And Numbers Authority)??

> 2. When trademarks get entangled whats the disputes procedure
>
> Do you want to create a situation where a future kernel is likely to be
> forced to change a device naming because an "official" vendor driver appears
> too and they demand the namespace and wave trademarks around ?
>

I admit I hadn't thought about this sort of issue at all. Trademarks
are certainly something we want to avoid.
So I thought about it for a couple of days, and:

Looking through the names provided by "devfs", I don't see any place
for a trade-mark-like name to go. It is mostly generic names, like
"disc", "cdrom", "scsi", "tty", "host", and numbers.

As I understand trademarks, they are granted for a particular context.
There is no conflict between "Dove" as a brand name for soap, "Dove"
as a brand name for chocolate, and "Dove" as used by bird watchers in
their taxonomy.

Similarly, device names that Linux uses are, and should be, generic
words and exist in a name-space that is not really trademark-able.

Interestingly, I have a driver for a battery-backed memory card made
by "Micro Memory" (or "umem"). The driver presents the card as a
normal block device and you would use it just like a disc drive (only
with lower latency). So I would like to identify it as a "disc
drive".
But Linux has no generic "disc" type, so I really need to allocate a
new major number and give it a name (for /proc/devices) which will
inevitably be "umem" or similar. Thus the current scheme seems to
encourage using trademarks more than a properly structured scheme
would.

However my thoughts about names didn't stop there, so nor will this
email :-)
While the naming in "devfs" has a lot of good points, it does not have
a clear, over-arching structure, so it is not clear how/where best to
add things. This points to another problem with textual name spaces
that you did not mention: They are just *too* flexible.
They give you enough rope to shoot yourself in the foot, and all
that. Just look at procfs.

So I think that it is very important that a simple and elegant naming
scheme is used that, to use the vernacular:
(a) is _right_ and
(b) is right.

What seems right to me is to have a three level hierarchy with clear
meanings for the three levels so that in every case, the choice of
name will be obvious, and where trade marks and such will just never
be considered as candidates.
The three levels (which correspond loosely to major/minor/char-or-block)
are "address-space", "address", and "interface".
The "address space" would be something like
"scsi", "pci", "disc", "tty", "usb", "printer", "devid", "md"

Each of them is a very generic term.
The "address" is something that is specific to the address space.
It will sometimes be a "physical" address, sometimes a sequential
number, sometimes a content-based address, and often a combination of
2 of those. It will very often be numeric. It may have an internal
hierarchical structure.
So in the "scsi" address space, addresses would be 4 numbers
representing host, bus, device, unit. "host" would be a sequential
number, the others have some external significance.
In the "pci" address space, you would have bus/device/function.
The "disc" address space would be simple sequentially assigned
numbers.

"devid" addresses are device id's such as pci ids, or usb ids, or
pcmcia device ids (does that work, are they all one big address
space??) followed by an instance number: 1, 2, 3 etc.

The "interface" part indicates what sort of object can be found at
that address in that address space.

For example pci bus 1, device 4, function 1 might be a SCSI
controller, the 3rd scsi controller in the system,
so
pci/1/4/1/scsi
is a devlink pointing at scsi/3
This host has one bus (channel) with several devices (ids). Device 3 is a
disc drive, the 12th disc drive in the system, so
scsi/3/0/3/0/disc
is a devlink to disc/12, and
scsi/3/0/3/0/generic
is a char-special device which provides generic access to the device.

This disc has 4 partitions, so
disc/12/0 is a block device for the whole disc
disc/12/1 is a block device for the first partition
disc/12/2 is a block device for the second partition
etc.

devid/9005/00CF/1

is an alias rather than a real device, and so would not be
a directory containing interfaces but a devlink to pci/1/4/1.
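
To make the aliasing concrete, here is a toy Python model of how the
three devlinks in this example would resolve. The table DEVLINKS, the
function resolve, the longest-prefix matching and the hop limit are all
assumptions made for illustration; the mail does not say how a lookup
that passes through several devlinks should be carried out.

```python
# Devlinks from the example in this mail: alias -> target.
DEVLINKS = {
    "pci/1/4/1/scsi": "scsi/3",
    "scsi/3/0/3/0/disc": "disc/12",
    "devid/9005/00CF/1": "pci/1/4/1",
}


def resolve(name: str, devlinks: dict, max_hops: int = 8) -> str:
    """Follow devlink aliases until no prefix of the name is an alias.

    At each hop the longest prefix (on a "/" boundary) that is a devlink
    is replaced by its target. max_hops guards against alias loops.
    """
    for _ in range(max_hops):
        parts = name.split("/")
        for cut in range(len(parts), 0, -1):
            prefix = "/".join(parts[:cut])
            if prefix in devlinks:
                name = "/".join([devlinks[prefix]] + parts[cut:])
                break
        else:
            return name  # no prefix matched: this is the primary name
    raise RuntimeError("too many levels of devlinks")


print(resolve("devid/9005/00CF/1/scsi", DEVLINKS))
print(resolve("scsi/3/0/3/0/disc/1", DEVLINKS))
```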

You might note that the names of address spaces are often also the
names of interfaces. "scsi" and "disc" fill both roles in the above
example.
So a full name to a target might have the form:
interface/address/interface/address/interface/address
e.g.
pci/1/4/1/scsi/0/3/0/disc/1
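
A short Python sketch of how such a full name could be split back into
its (interface, address) pairs. It assumes that every address component
starts with a digit and every interface word does not; that holds for
the examples in this mail but is my simplification, and the function
name split_chain is hypothetical.

```python
def split_chain(name: str):
    """Split interface/address/interface/address/... into pairs.

    Returns a list of (interface, [address components]) tuples.
    Assumes address components start with a digit and interface
    words do not.
    """
    chain = []
    for part in name.split("/"):
        if part[:1].isdigit():
            if not chain:
                raise ValueError("name must start with an interface word")
            chain[-1][1].append(part)  # extend the current address
        else:
            chain.append((part, []))   # start a new interface
    return chain


for interface, address in split_chain("pci/1/4/1/scsi/0/3/0/disc/1"):
    print(interface, "/".join(address))
```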

I am undecided if this should be broken with devlinks as in the above
example, or if there should be a primary name with no "sequence
number" addresses, and everything else is devlink aliases.
i.e.
scsi/3 -> pci/1/4/1/scsi
disc/12 -> scsi/0/3/0/disc

This is more like what devfs does.

In a fairly real sense, it doesn't make any difference, but I'm not
sure yet.

In many ways this is similar to the naming that devfs uses.
Some differences are:
- devfs puts lots of redundant words in the address part:
"host" "bus" "cdrom" etc.
I gather these are Linus-mandated. But I find them *very*
noisy.
- devfs puts a lot of miscellaneous stuff in the top level.
I would want to group them into one namespace. e.g.:
misc/memory/mem
misc/memory/kmem
misc/memory/zero
misc/memory/null
misc/random/random
misc/random/urandom
- Some parts of devfs use a slightly different structure. For
example, "pty" contains both master and slave devices, with the "m"
or "s" preceding the number. The above scheme would instead give
an address space of "pty", addresses of sequential numbers, and
interfaces of "master" and "slave", so
pty/1/master instead of pty/m1
pty/2/slave instead of pty/s2
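
The pty renaming is mechanical enough to show as a few lines of Python.
The function name pty_new_name is hypothetical, and the old-style
pattern is taken only from the two examples above.

```python
import re

# Old devfs-style pty name as in the examples above: pty/m<N> or pty/s<N>.
_OLD_PTY = re.compile(r"^pty/([ms])(\d+)$")


def pty_new_name(old: str) -> str:
    """Rewrite pty/m1 as pty/1/master and pty/s2 as pty/2/slave."""
    match = _OLD_PTY.match(old)
    if match is None:
        raise ValueError(f"not an old-style pty name: {old!r}")
    role = {"m": "master", "s": "slave"}[match.group(1)]
    return f"pty/{match.group(2)}/{role}"


print(pty_new_name("pty/m1"))
print(pty_new_name("pty/s2"))
```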


What I would really like to see is a very light weight naming scheme
used internally by the kernel, and devlinks and devfs should just be
different ways to expose that scheme to userspace.... I wonder how
much code that would take....

NeilBrown

2001-11-19 04:02:36

by NeilBrown

Subject: Re: Devlinks. Code. (Dcache abuse?)

On Friday November 16, [email protected] wrote:
>
>
> On Fri, 16 Nov 2001, Neil Brown wrote:
>
> > + if (!(nd->mnt->mnt_flags & MNT_NODEV)
> > + && dentry->d_inode
> > + && (dentry->d_inode->i_mode & S_ISVTX)) {
> > + dentry = devlink_find(dentry, link);
>
> You are breaking vfsmount refcounting. Badly.

I looked, and I cannot see it.
I never change the refcount on any vfsmount, nor do I make
or destroy any references to any vfsmount.
In this piece of code we don't even own a reference to "dentry" (the
caller does) so assigning over it isn't a problem either.

About the only thing that might be a bit odd here is that we change
nd->dentry a few lines later without changing nd->mnt. But the new
dentry is always in the same dentry tree (though it is owned
by a different filesystem).

Would you care to give a few more details?

NeilBrown

2001-11-19 08:53:48

by Andreas Dilger

Subject: Re: Devlinks. Code. (Dcache abuse?)

On Nov 19, 2001 14:47 +1100, Neil Brown wrote:
> - devfs puts a lot of miscellaneous stuff in the top level.
> I would want to group them into one namespace. e.g.:
> misc/memory/mem
> misc/memory/kmem
> misc/memory/zero
> misc/memory/null
> misc/random/random
> misc/random/urandom

Erm, what about the millions+ of scripts/apps that reference /dev/zero
or /dev/null?

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/

2001-11-19 10:09:54

by Alan

Subject: Re: Devlinks. Code. (Dcache abuse?)

> As I understand trademarks, they are granted for a particular context.
> There is no conflict between "Dove" as a brand name for soap, "Dove"
> as a brand name for chocolate, and "Dove" as used by bird watchers in
> their taxonomy.

We have already had a vendor threaten legal action if we didn't change the
name of a file system before merging it with the kernel.

Alan

2001-11-19 10:31:05

by NeilBrown

Subject: Re: Devlinks. Code. (Dcache abuse?)

On Monday November 19, [email protected] wrote:
> On Nov 19, 2001 14:47 +1100, Neil Brown wrote:
> > - devfs puts a lot of miscellaneous stuff in the top level.
> > I would want to group them into one namespace. e.g.:
> > misc/memory/mem
> > misc/memory/kmem
> > misc/memory/zero
> > misc/memory/null
> > misc/random/random
> > misc/random/urandom
>
> Erm, what about the millions+ of scripts/apps that reference /dev/zero
> or /dev/null?

# ln -s misc/memory/null /dev/null
# mknod /dev/null c 0 0

I was talking about how naming should look inside the kernel. The
names that are presented to user-space are up to user-space. The
names that are used internally should make sense internally.
mknod is there to provide a gateway between the two.
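
Since the external names are up to user-space, a boot script could
generate the two commands above from a table. A Python sketch follows;
the table COMPAT and the function devlink_commands are hypothetical, and
the /dev/zero entry is only an illustration built from the internal
names proposed earlier in the thread.

```python
import shlex

# External (user-space) name -> internal name. Illustrative entries only.
COMPAT = {
    "/dev/null": "misc/memory/null",
    "/dev/zero": "misc/memory/zero",
}


def devlink_commands(external: str, internal: str, kind: str = "c"):
    """Return the ln/mknod pair that would turn `external` into a devlink.

    `kind` is "c" or "b"; the device number is always 0 0, as in the
    commands shown in this thread.
    """
    ext, target = shlex.quote(external), shlex.quote(internal)
    return [f"ln -s {target} {ext}", f"mknod {ext} {kind} 0 0"]


for external, internal in COMPAT.items():
    for command in devlink_commands(external, internal):
        print(command)
```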

Devfs defines internal names which are, to some extent, chosen to
match expected external names. This is putting policy in the kernel,
which is one of the complaints about devfs.
Devfs actually has a bit each way. There is a concept of "compatible"
names which are imposed on devfs by devfsd, so that internal names
don't have to match external names. But many internal names do still
match historical external names.

I would like a very well defined and rigidly adhered to structure for
internal names. This should be chosen to match the internal structure
(actual or planned) of the kernel. And there should be a mechanism to allow
user-space to define external names to map to those internal names.

NeilBrown

2001-11-19 10:40:25

by NeilBrown

Subject: Re: Devlinks. Code. (Dcache abuse?)

On Monday November 19, [email protected] wrote:
> > As I understand trademarks, they are granted for a particular context.
> > There is no conflict between "Dove" as a brand name for soap, "Dove"
> > as a brand name for chocolate, and "Dove" as used by bird watchers in
> > their taxonomy.
>
> We have already had a vendor threaten legal action if we didn't change the
> name of a file system before merging it with the kernel.
>
> Alan

I think you missed part of my point.
There are lots of different name spaces in the kernel.
Filesystem names. Driver names. Module names.

Some of them may well have trademark related issues.

But the namespace that is the current issue, the namespace of
currently available devices, is not a namespace where I would expect
trademarks to ever come up. It is a name space of interfaces and
instances.

So I hold that "trademark issues" is not a valid reason to avoid
moving from pure b/c+major+minor names to textual names as the
preferred names for currently available devices.

NeilBrown

2001-11-19 11:05:28

by Erik Andersen

Subject: Re: Devlinks. Code. (Dcache abuse?)

On Mon Nov 19, 2001 at 10:17:29AM +0000, Alan Cox wrote:
> > As I understand trademarks, they are granted for a particular context.
> > There is no conflict between "Dove" as a brand name for soap, "Dove"
> > as a brand name for chocolate, and "Dove" as used by bird watchers in
> > their taxonomy.
>
> We have already had a vendor threaten legal action if we didn't change the
> name of a file system before merging it with the kernel.

Sore losers I guess. :-)

-Erik

--
Erik B. Andersen http://codepoet-consulting.com/
--This message was written using 73% post-consumer electrons--

2001-11-19 11:07:18

by Alan

Subject: Re: Devlinks. Code. (Dcache abuse?)

> I think you missed part of my point.
> There are lots of different name spaces in the kernel.
> Filesystem names. Driver names. Module names.
>
> But the namespace that is the current issue, the namespace of
> currently available devices, is not a namespace where I would expect
> trademarks to ever come up. It is name space of interfaces and
> instances.

You mean like adaptec/aic7xxx/0 for the first aic7xxx controller when you
want to refer to an Adaptec card? And yes, you do need the ability to do
that kind of thing, not just talk generically about "disks".

So I still seek an answer. "Shrug, probably won't happen" isn't a good one.

Alan

2001-11-19 11:27:41

by Alexander Viro

Subject: Re: Devlinks. Code. (Dcache abuse?)



On Mon, 19 Nov 2001, Neil Brown wrote:
> I was thinking:
>
> devid/9005/00cf/0

And that you would call a text? We are just trading two numbers for
a bunch of them. Better yet, how about a driver that treats several
cards as identical?

2001-11-19 11:21:40

by NeilBrown

Subject: Re: Devlinks. Code. (Dcache abuse?)

On Monday November 19, [email protected] wrote:
> > I think you missed part of my point.
> > There are lots of different name spaces in the kernel.
> > Filesystem names. Driver names. Module names.
> >
> > But the namespace that is the current issue, the namespace of
> > currently available devices, is not a namespace where I would expect
> > trademarks to ever come up. It is name space of interfaces and
> > instances.
>
> You mean like adaptec/aic7xxx/0 for the first aic7xxx controller when you
> want to refer to an adaptec card ? And yes - you do need the ability to do
> that kind of thing, not just talk generically about "disks".
>
> So I still seek an answer. "Shrug, probably wont happen" isnt a good
> one

I was thinking:

devid/9005/00cf/0

Now maybe the numbers can be trade marks too (I always liked "S3"'s id: 5333).
However this number is extracted from the device in question. Surely
if I have a device that reports itself as "9005:00cf", then there can
be no trademark violation in addressing the device as "the one which
calls itself 9005:00cf".
There may well be cases where a textual name is more appropriate:
camera/Kodak DX3115/0/3/thumbnail
but if it is a name that you extract from the device, then you should
be safe. If there is a trademark violation, then it is in the device,
not in the operating system.

I guess that leaves

sound/SoundBlaster100%Compatible/

as a potential problem... but if the device is sold as "100%
Soundblaster compatible", then any trade mark has already been
violated.

I appreciate that "Shrug, probably wont happen" isn't really good
enough, but we cannot stop development of generic kernel facilities
out of fear of reprisals.

NeilBrown

>
> Alan

Subject: Re: Devlinks. Code. (Dcache abuse?)



--On Monday, 19 November, 2001 11:14 AM +0000 Alan Cox
<[email protected]> wrote:

> You mean like adaptec/aic7xxx/0 for the first aic7xxx controller when you
> want to refer to an adaptec card ? And yes - you do need the ability to do
> that kind of thing, not just talk generically about "disks".

Which trademark law are you violating by having that in a directory
name path, which you are not also violating by having it in the
kernel source, make config, name of the module and its printk()
on load, etc. etc.?

--
Alex Bligh

2001-11-19 21:29:27

by Alan

Subject: Re: Devlinks. Code. (Dcache abuse?)

> Which trademark law are you violating by having that in a directory
> name path, which you are not also violating by having it in the
> kernel source, make config, name of the module and its printk()
> on load, etc. etc.

You can change all the other names with almost zero impact

Subject: Re: Devlinks. Code. (Dcache abuse?)

Alan,

>> Which trademark law are you violating by having that in a directory
>> name path, which you are not also violating by having it in the
>> kernel source, make config, name of the module and its printk()
>> on load, etc. etc.
>
> You can change all the other names with almost zero impact

Ah - OK; dname/dt didn't occur to me, but still this is
a consequence of a violation, not the violation itself; what
aspect of trademark law is a problem?

There are a few other examples of this. /proc/cpuinfo has

shed[amb].121$ cat /proc/cpuinfo
...
vendor_id : GenuineIntel
...
model name : Pentium III (Coppermine)

That's 2, if not 3 trademarks without acknowledgement that
might be searched for by userspace programs.

The solution is presumably that lanana doesn't accept /registered/
trademarks without a GPL compatible license from the trademark
holder. I don't believe you would have too much of a problem
with unregistered trademarks.

In any case, most trademark law has some concept of 'fair use'.
See the difficulty many trademark holders have in suing
registrants of [trademark]sucks.[registrysuffix]. I think
the use in terms of supporting hardware is pretty
fair. Cloning competing OS functionality is closer to
the wind I admit.

(Only tangentially relevant, but for amusement value and a
beautifully argued case read
http://arbiter.wipo.int/domains/decisions/html/2001/d2001-0918.html
enjoyment almost guaranteed.)

--
Alex Bligh

2001-11-20 01:05:54

by NeilBrown

Subject: Re: Devlinks. Code. (Dcache abuse?)

On Monday November 19, [email protected] wrote:
>
>
> On Mon, 19 Nov 2001, Neil Brown wrote:
> > I was thinking:
> >
> > devid/9005/00cf/0
>
> And that you would call a text? We are just trading two numbers for
> a bunch of them.

I disagree with the word "just". A bunch of numbers, with no fixed
limit, is *much* more usable than the 17 bits we now have, or the 33
bits that we might get.
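
For scale, the arithmetic behind those two figures in Python. Reading 17
as an 8-bit major plus an 8-bit minor plus one block-or-char bit, and 33
as a 32-bit device number plus that same bit, is my assumption about how
the bits are being counted; the function name is hypothetical.

```python
def distinct_names(bits: int) -> int:
    """Number of distinct device identifiers expressible in `bits` bits."""
    return 2 ** bits


current = distinct_names(8 + 8 + 1)   # major + minor + block/char
proposed = distinct_names(32 + 1)     # wider device number + block/char
print(current)    # 131072
print(proposed)   # 8589934592
```

A list of numbers with no fixed length has no such ceiling at all.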

Look at the naming used for MIBs, as in SNMP. It's just a list of
numbers, but it is very flexible and expressive. Not that I think
MIB names are elegant. But they are expressive. Structure needs to
be imposed to get elegance.

So even if it were just a bunch of numbers it would be significantly
better. Allowing words is just icing on the cake.

> Better yet, how about a driver that treats several
> cards as identical?

What is so interesting about that? I'm not really interested in what
driver is being used. Just in what the hardware is, and what the
interface is. modutils might care about the driver, but devlinks
don't.

NeilBrown

2001-11-24 23:46:02

by Rob Landley

Subject: Re: Devlinks. Code. (Dcache abuse?)

On Monday 19 November 2001 06:21, Neil Brown wrote:
> On Monday November 19, [email protected] wrote:
> > > I think you missed part of my point.
> > > There are lots of different name spaces in the kernel.
> > > Filesystem names. Driver names. Module names.
> > >
> > > But the namespace that is the current issue, the namespace of
> > > currently available devices, is not a namespace where I would expect
> > > trademarks to ever come up. It is name space of interfaces and
> > > instances.
> >
> > You mean like adaptec/aic7xxx/0 for the first aic7xxx controller when you
> > want to refer to an adaptec card ? And yes - you do need the ability to
> > do that kind of thing, not just talk generically about "disks".
> >
> > So I still seek an answer. "Shrug, probably wont happen" isnt a good
> > one
>
> I was thinking:
>
> devid/9005/00cf/0
>
> Now maybe the numbers can be trade marks too (I always liked "S3"'s id:
> 5333). However this number is extracted from the device in question.

The reason Intel came up with the name "Pentium" is that a judge ruled
they couldn't trademark a number like "386" or "486" to stop AMD from using
it. Just a data point. What the law REALLY says these days is anybody's
guess, and you can be sure somebody's lobbying to make it worse...

The law is a lot like poker: bluffing and wagering more than your opponent
can afford is often more important than what your cards say. The MS
antitrust trial shows how when you stonewall it can take years for any
enforcement action to work its way through the bureaucracy, by which point
the issue is moot. And the RIAA shows how somebody without a leg to stand on
can get a really biased and/or ignorant judge to decide that PI should
henceforth be 3 in all government documents. But if you live your life in
fear of being sued, you can't even go out and buy groceries...

So a vendor THREATENING to sue is normal. Making threats is really cheap.
Actually following through requires spending money and taking a potential
public relations hit that can make it onto yahoo's business report where
investors read it and drive the stock price down, which they'd generally
rather like to avoid. Doesn't mean they won't, but it doesn't mean a form
letter on company letterhead justifies digging a bomb shelter...

Rob