2006-10-12 07:08:36

by Joel Becker

[permalink] [raw]
Subject: Re: [ckrm-tech] [PATCH 0/5] Allow more than PAGESIZE data read in configfs

On Wed, Oct 11, 2006 at 07:18:36PM -0700, Matt Helsley wrote:
> On Wed, 2006-10-11 at 15:06 -0700, Joel Becker wrote:
> > I suspect I was too harsh here. Can you tell me what you are

Btw, Mark says 'hi' and that we met at his Scotch tasting party.
I didn't realize this was you ;-)

> A little too harsh IMHO -- this isn't something I'm suggesting out of
> laziness, lack of thought, nor am I some corporate stooge unfamiliar
> with Linux. I am well aware of this mantra. I am also well aware that
> there are counterexamples in sysfs (which has the same mantra):
>
> /sys/devices/pciXXXX:XX/XXXX:XX:XX.X/resource
> /sys/block/hda/stat
> /sys/block/hda/dev

Sure, but mere existence doesn't make it right ;-)

> Each of these has more than one number in them. And they don't all use
> the same delimeter. So yes, parsing is required in existing sysfs files.
> And the parsing I'm proposing is simpler than that needed in the pci
> resource file.

As sysfs grew, it became very clear that if you require parsing,
someone in userspace will get it wrong. It's happened with pci
resources, etc. What about people who say "I'll just add a field, no
one will notice" and everything breaks? That's just not acceptible.

> In fact we've discussed the problem with you in the past (on ckrm-tech)
> and described what we're looking for. You suggested adding file(s) to
> sysfs that replace this attribute. It was an interesting suggestion but
> does not work because we have one of these lists for each item in
> configfs. I think all of this discussion is in the ckrm-tech list
> archives too (in MARC, possibly gmane, etc.).

Sure it works. You have one per resource group. In
resource_group_make_object(), you sysfs_mkdir() the sysfs file. There
is no technical limitation at all. Now, it is nice that the pid list
lives in the configfs tree from a location standpoint. Sure. But is it
worth breaking the simplicity of configfs. Greg says NO! You say YES!
I say "I'm not sure". And here we are.

> A patch would best explain what I'm proposing. But since I don't have a
> working patch yet I've tried to answer your questions below.

Ok, some meat, good.

> Well that's sort of what we want. Not an array of u32s but a list of
> pids. Among other things Resource Groups uses configfs to let userspace
> create named groups of tasks. The list of pids describing the group of
> tasks can easily exceed one page in size when output via the members
> configfs attribute.

And here we go. In userspace, someone can't just do:

ATTR=$(cat /sys/kernel/config/ckrm/myresource/attr)

because ATTR needs to be split and parsed. And what if you decide to
change it from "<pid>\n" to "<pid> <tgid>\n" per line? Oops, can't do
it, or you break an ABI.

> Since there are so many of them and they don't have the same lifetime
> characteristics as a configfs object we don't want to incur the space
> and complexity cost of creating and managing one configfs attribute per
> pid. At the same time, the groups themselves do have the lifetime of a
> configfs object and there are few of them.

First, I agree that pids as an object each is very expensive.
If the groups are few, of course, creating a sysfs object per group
isn't that hard :-)

> Where's the large-attribute discussion going on?

lkml and also offline, though we're bringing it back to lkml.
I'm going to cc this too, I hope you don't mind.

> WARNING: It's possible I've confused my configfs terminology a bit in
> my description below.

I'll try and make do. :-)

> So for example we would have items with the following attributes:
>
> /sys/kernel/config/res_group/members:
> 1
> 2
> 3
> 4567

Is this toplevel members file the "not part of any other group"
list?

> /sys/kernel/config/res_group/mhelsleys_tasks/members:
> 4568
>
> /sys/kernel/config/res_group/jbeckers_tasks/members:
> 4569
>
> This is how Resource Groups (formerly known as CKRM) currently
> implements the list -- as a single attribute with one pid printed per
> line. This doesn't fit the accepted OVPA rule and doesn't fit in one
> page on large systems with crazy numbers of tasks.
>
> We've considered creating one attribute per pid but frankly it makes
> the rest of the code much nastier in every way (code complexity, memory
> use, locking, etc). We also considered your earlier suggestion of having
> a custom sysfs file for this. However it just doesn't fit: each list is
> different so we'd need one file per resource group. This naturally
> suggests mimicking the configfs tree in /sys but that's uniformly worse
> than one pid per attribute.

How is this more complex? One object per attribute is large,
but if the number of groups is "few" as you say above, isn't that
simpler?

> The OVPA rule exists for some very good reasons. We want to avoid the
> nightmares encountered with /proc. Specifically the nightmare of
> organizing and parsing all of those structure values. /proc files often
> incorporate too much structure into the body of the file and that in
> turn requires complex parsing; which tends to break compatibility as the
> structures they reflect change. (I've worked on some userspace C code
> that parses some /proc/sys/net files and, yes, it's nasty..)
>
> Hence I think we could allow one value per line in addition to one
> value per attribute -- but only under very specific restrictions. The
> restrictions could be enforced by an additional configfs object which
> represents simple lists and lacks direct access to the char buffer.
> Also, unlike Chandra's seq_file patches, a different interface would
> avoid impacting existing uses of configfs.

You're effectively suggesting that a specific attribute type of
"repeated value of type X". No mixed types, no exploded structures,
just a "list of this attr" sort of thing. This does fit my personal
requirement of avoiding a generic, abusable system.
Greg, what do you think?

Joel

--

"Depend on the rabbit's foot if you will, but remember, it didn't
help the rabbit."
- R. E. Shay

Joel Becker
Principal Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127


2006-10-12 21:44:34

by Paul Jackson

[permalink] [raw]
Subject: Re: [ckrm-tech] [PATCH 0/5] Allow more than PAGESIZE data read in configfs

> And what if you decide to
> change it from "<pid>\n" to "<pid> <tgid>\n" per line?

I think that's a good argument for never changing the format of one
of these files, rather than a good argument for against a vector of
scalars of identical type and purpose.

And I'd agree that we should not use multiple values per file to
represent a structure either - so I'd agree that we should not allow
"<pid> <tgid>\n" in the first place.

In the cpuset file system:
1) There is one value, or one vector of equivalent scalar values, per file.
2) Once released into the wild, a file never changes what it does or how
it looks.
3) It's ok to add new files.
4) But, at least in the case of cpusets, not ok to add directories, as
the file system represents one directory per one cpuset. No other
directories not representing cpusets are allowed.

This configfs flap feels to me like someone slightly overgeneralized
the lesson to be learned from previous problems displaying entire,
evolving, structures in a single file, and then is being a bit over
zealous enforcing the resulting rule.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-10-12 22:51:55

by Joel Becker

[permalink] [raw]
Subject: Re: [ckrm-tech] [PATCH 0/5] Allow more than PAGESIZE data read in configfs

On Thu, Oct 12, 2006 at 02:44:20PM -0700, Paul Jackson wrote:
> > And what if you decide to
> > change it from "<pid>\n" to "<pid> <tgid>\n" per line?
>
> I think that's a good argument for never changing the format of one
> of these files, rather than a good argument for against a vector of
> scalars of identical type and purpose.

Sure, no dispute here. My argument point isn't that "vector of
scalars is inherently evil" (I'll leave that aside), it's that a
facility allowing or encouraging the format change or blowup is
unhealthy.

> And I'd agree that we should not use multiple values per file to
> represent a structure either - so I'd agree that we should not allow
> "<pid> <tgid>\n" in the first place.

Cool, we're together on that as well. I know that we can't
fully prevent stupidity, but encouraging better behavior is something I
always try to do.

> This configfs flap feels to me like someone slightly overgeneralized
> the lesson to be learned from previous problems displaying entire,
> evolving, structures in a single file, and then is being a bit over
> zealous enforcing the resulting rule.

Someone (hi, it's me) rather tried to overspecialize configfs.
It's intentionally not all things to all people. Kitchen sinks cause
clogs, as it were.
And I'm intentionally overzealous enforcing it. Better I need
to be beat over the head before I accept a good idea than I accept bad
ones with an "eh?"

Joel

--

One look at the From:
understanding has blossomed
.procmailrc grows
- Alexander Viro

Joel Becker
Principal Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127

2006-10-13 00:01:51

by Paul Jackson

[permalink] [raw]
Subject: Re: [ckrm-tech] [PATCH 0/5] Allow more than PAGESIZE data read in configfs

Joel wrote:
> Sure, no dispute here. My argument point isn't that "vector of
> scalars is inherently evil" (I'll leave that aside), it's that a
> facility allowing or encouraging the format change or blowup is
> unhealthy.

Ok ... yes simply extending the size is changing the protocol
at too low a level ... an invitation to future abuse. I'll agree
somewhat with your concern there.

Is there some way to accept an array of scalars? Some change
in the API that configfs presents to the kernel code using it,
that would let that code say "here's an array of u32, of length
so-and-so?"

Hmmm ... having spouted off for three messages now ... decided to
actually look at this ;).

Let me consider a different approach ...

At the top of Documentation/filesystems/configfs/configfs.txt,
I see:

configfs is a ram-based filesystem that provides the converse of
sysfs's functionality. Where sysfs is a filesystem-based view of
kernel objects, configfs is a filesystem-based manager of kernel
objects, or config_items.

I guess this means that sysfs presents the gauges, and configfs
the knobs.

This seems like a bit of an artificial split to me - view (mostly
read) here, and manage (mostly write) here.

Not entirely surprisingly, I'm fond of the cpuset approach -- all
things cpuset-related in one place, both viewing and managing.

Similarly, /proc is a repository for all things task related, and
/dev for all devices (well, with various exceptions, contradictions,
confusions, and historical raisins ...)

There should indeed be a place for various kernel guages and knobs that
don't merit having their own entire file system. And since we already
have a separate sysfs and configfs, better to leave that aspect be, as
two separate name spaces ... despite my rant about this being an
artificial split a few lines above.

And if configfs wants to (continue to) impose a rule that there is a
single, simple scalar per file, that may well fit its needs and guide
its future extensions in the best way.

But if something like Resource Groups comes along, I'd think it deserves
its own file system name space (though sometimes I am sympathetic to
suggestions to merge this one into the cpuset name space -- not sure.)

Underneath each of these filesystems, sysfs, configfs, cpuset, resource
groups, ... it would seem ideal if we had a single kernel file system
infrastructure. Actually, we're only a half a layer from having that,
with vfs. It just takes a fair bit of glue to construct any of these
file systems out of vfs primitives.

I wonder if there might be someway to share that glue? I am envisioning
a new glue layer, mostly in the kernel internal headers and lib
directory, that sits on top of the current vfs, and makes it easy for
virtual file system presenters such as sysfs, configfs, cpusets and
resource groups to construct the particular virtual file system they
require.

I would expect this glue layer to support vectors of scalars, even if
some of its users, such as configfs, chose not to expose that
possibility. Arguments between configfs and resource group developers
over the value of presenting vectors suggest we've gone astray -
enforcing commonality where it is not beneficial to do so.

I would not expect such a change to using common vfs glue to change
much, if at all, existing client kernel code that uses configfs or
sysfs (or cpusets.) The configfs example code in:
Documentation/filesystems/configfs/configfs_example.c
should continue to work, as is.

I would expect to reduce a little some cut and paste code duplication,
hence reduce the kernel text size and reduce kernel maintenance and
improve a little the ease with which future features (Resource Groups,
say) can add their own virtual file system hierarchies.

This means creating a new kernel internal API, by which the various
clients (sysfs, configfs, cpusets, ...) of this common glue layer can
represent the particular instance of such a virtual file system that
they require, and hook in their operational methods. Then it means
porting the existing such virtual file systems, such as configfs, sysfs
and cpusets, to this common glue infrastructure.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-10-13 23:37:43

by Matt Helsley

[permalink] [raw]
Subject: Re: [ckrm-tech] [PATCH 0/5] Allow more than PAGESIZE data read in configfs

On Thu, 2006-10-12 at 00:08 -0700, Joel Becker wrote:
> On Wed, Oct 11, 2006 at 07:18:36PM -0700, Matt Helsley wrote:
> > On Wed, 2006-10-11 at 15:06 -0700, Joel Becker wrote:

<snip>

> > /sys/devices/pciXXXX:XX/XXXX:XX:XX.X/resource
> > /sys/block/hda/stat
> > /sys/block/hda/dev
>
> Sure, but mere existence doesn't make it right ;-)
>
> > Each of these has more than one number in them. And they don't all use
> > the same delimeter. So yes, parsing is required in existing sysfs files.
> > And the parsing I'm proposing is simpler than that needed in the pci
> > resource file.
>
> As sysfs grew, it became very clear that if you require parsing,
> someone in userspace will get it wrong. It's happened with pci
> resources, etc. What about people who say "I'll just add a field, no
> one will notice" and everything breaks? That's just not acceptible.

I agree. And that means introducing new ABIs should have a high barrier
to acceptance. But in my opinion it also shouldn't prohibit carefully
considered deviation from the guidelines.

> > In fact we've discussed the problem with you in the past (on ckrm-tech)
> > and described what we're looking for. You suggested adding file(s) to
> > sysfs that replace this attribute. It was an interesting suggestion but
> > does not work because we have one of these lists for each item in
> > configfs. I think all of this discussion is in the ckrm-tech list
> > archives too (in MARC, possibly gmane, etc.).
>
> Sure it works. You have one per resource group. In
> resource_group_make_object(), you sysfs_mkdir() the sysfs file. There

That's the easy part. Next we need to make the pid attribute whenever a
new task is created. And delete it when the task dies. And move it
around whenever it changes groups. Is there rename() support in /sys? If
not, would changes to allow rename() be acceptable (I'm worried it would
impact alot of assumptions made in the existing code)?

> is no technical limitation at all. Now, it is nice that the pid list
> lives in the configfs tree from a location standpoint. Sure. But is it
> worth breaking the simplicity of configfs. Greg says NO! You say YES!
> I say "I'm not sure". And here we are.

Consider that having two very similar (but not symlinked!) trees in
both /sys/ ... /res_group and /sys/kernel/config/res_group could be
rather confusing to userspace programmers and users alike.

It would be strange because when you rmdir a group
in /sys/kernel/config/res_group... a directory in /sys would also
disappear. Yet you can't mkdir or rmdir the /sys dirs. And to edit the
group resources you'd be editting stuff in configfs but to add or remove
tasks to the group you'd be mv'ing stuff in /sys..

> > A patch would best explain what I'm proposing. But since I don't have a
> > working patch yet I've tried to answer your questions below.
>
> Ok, some meat, good.
>
> > Well that's sort of what we want. Not an array of u32s but a list of
> > pids. Among other things Resource Groups uses configfs to let userspace
> > create named groups of tasks. The list of pids describing the group of
> > tasks can easily exceed one page in size when output via the members
> > configfs attribute.
>
> And here we go. In userspace, someone can't just do:
>
> ATTR=$(cat /sys/kernel/config/ckrm/myresource/attr)

Hmm, that suggests a good point. While some one *can* do that or:

ATTR=( $(cat /sys/kernel/config/ckrm/myresource/attr) )

the space available for environment variables is limited. So attempting
to store a large (What's "large"?) attribute in an environment variable
is a potentially buggy practice. This is a significant problem affecting
large attributes in general.

> because ATTR needs to be split and parsed. And what if you decide to
> change it from "<pid>\n" to "<pid> <tgid>\n" per line? Oops, can't do
> it, or you break an ABI.

Once we have the pid the rest is easily reachable. I think Paul
Jackson's response -- we can add but not change existing files -- is the
best answer here.

> > Since there are so many of them and they don't have the same lifetime
> > characteristics as a configfs object we don't want to incur the space
> > and complexity cost of creating and managing one configfs attribute per
> > pid. At the same time, the groups themselves do have the lifetime of a
> > configfs object and there are few of them.
>
> First, I agree that pids as an object each is very expensive.
> If the groups are few, of course, creating a sysfs object per group
> isn't that hard :-)

One per group isn't hard or very expensive -- but that's not the issue
I was bringing up. One per pid is expensive because we'd need to alloc
and associate that object with the task. And then, if we're tearing down
all resource groups (the root group) we need to iterate over all the
tasks and free the object. We'd also need to free the object when the
task is freed. Those can race so we'd need some locking...

This is much more complex than being allowed to print out one pid per
line.

<snip>

> > WARNING: It's possible I've confused my configfs terminology a bit in
> > my description below.
>
> I'll try and make do. :-)
>
> > So for example we would have items with the following attributes:
> >
> > /sys/kernel/config/res_group/members:
> > 1
> > 2
> > 3
> > 4567
>
> Is this toplevel members file the "not part of any other group"
> list?

Yes. They all are -- a pid can only be in one members file. Pids 1, 2,
3, 4567 won't show up in any other members file.

> > /sys/kernel/config/res_group/mhelsleys_tasks/members:
> > 4568

pid 4568 won't show up in any other members file.

> > /sys/kernel/config/res_group/jbeckers_tasks/members:
> > 4569

etc.

> > We've considered creating one attribute per pid but frankly it makes
> > the rest of the code much nastier in every way (code complexity, memory
> > use, locking, etc). We also considered your earlier suggestion of having
> > a custom sysfs file for this. However it just doesn't fit: each list is
> > different so we'd need one file per resource group. This naturally
> > suggests mimicking the configfs tree in /sys but that's uniformly worse
> > than one pid per attribute.
>
> How is this more complex? One object per attribute is large,
> but if the number of groups is "few" as you say above, isn't that
> simpler?

There are two parts to the complexity: code complexity and the number
of userspace pieces to deal with. I think that in both of these
categories the OVPA approach is more complex. Here's how I see it:

|Complexity of|
| OVPA | OVPL |
----------------------+------+------+
Code | more | less |
| | |
Number of filesystems | | |
reflecting resource | 2 | 1 |
group hierarchy | | |
| | |
Attribute | less | more |
complexity | | |
----------------------+------+------+
Total | more | less |

I think the "more" in the OVPL column is much much smaller than the
"more" in the OVPA column. Lastly the OVPL code can be shared while the
OVPA code most likely cannot be shared.

> > The OVPA rule exists for some very good reasons. We want to avoid the
> > nightmares encountered with /proc. Specifically the nightmare of
> > organizing and parsing all of those structure values. /proc files often
> > incorporate too much structure into the body of the file and that in
> > turn requires complex parsing; which tends to break compatibility as the
> > structures they reflect change. (I've worked on some userspace C code
> > that parses some /proc/sys/net files and, yes, it's nasty..)
> >
> > Hence I think we could allow one value per line in addition to one
> > value per attribute -- but only under very specific restrictions. The
> > restrictions could be enforced by an additional configfs object which
> > represents simple lists and lacks direct access to the char buffer.
> > Also, unlike Chandra's seq_file patches, a different interface would
> > avoid impacting existing uses of configfs.
>
> You're effectively suggesting that a specific attribute type of
> "repeated value of type X". No mixed types, no exploded structures,
> just a "list of this attr" sort of thing. This does fit my personal
> requirement of avoiding a generic, abusable system.

Exactly.

Cheers,
-Matt Helsley

2006-10-14 00:10:51

by Joel Becker

[permalink] [raw]
Subject: Re: [ckrm-tech] [PATCH 0/5] Allow more than PAGESIZE data read in configfs

On Fri, Oct 13, 2006 at 04:37:38PM -0700, Matt Helsley wrote:
> > Sure it works. You have one per resource group. In
> > resource_group_make_object(), you sysfs_mkdir() the sysfs file. There
>
> That's the easy part. Next we need to make the pid attribute whenever a
> new task is created. And delete it when the task dies. And move it
> around whenever it changes groups. Is there rename() support in /sys? If
> not, would changes to allow rename() be acceptable (I'm worried it would
> impact alot of assumptions made in the existing code)?

No, you don't create a pid attribute per task. The sysfs file
is literally your large attribute. So, instead of echoing a new pid to
"/sys/kernel/config/ckrm/group1/pids", you echo to
"/sys/ckrm/group1/pids". To display them all, you just cat
"/sys/ckrm/group1/pids". It's exactly like the file you want in
configfs, just located in a place where it is allowed.

> Consider that having two very similar (but not symlinked!) trees in
> both /sys/ ... /res_group and /sys/kernel/config/res_group could be
> rather confusing to userspace programmers and users alike.

Not really. It's not identical (tons of attributes live in the
configfs part but not the sysfs part), and it has a clear deliniation of
what each does.

> It would be strange because when you rmdir a group
> in /sys/kernel/config/res_group... a directory in /sys would also
> disappear. Yet you can't mkdir or rmdir the /sys dirs. And to edit the

This is no different than tons of sysfs and procfs functionality
today.

> Hmm, that suggests a good point. While some one *can* do that or:
>
> ATTR=( $(cat /sys/kernel/config/ckrm/myresource/attr) )
>
> the space available for environment variables is limited. So attempting
> to store a large (What's "large"?) attribute in an environment variable
> is a potentially buggy practice. This is a significant problem affecting
> large attributes in general.

If you can't do it in sh, it's pretty much out of scope. This
is a taste rule I use, because like to shell program. Sure, not the end
of the world (not a hard rule, I guess), but worth thinking about.

> There are two parts to the complexity: code complexity and the number
> of userspace pieces to deal with. I think that in both of these
> categories the OVPA approach is more complex. Here's how I see it:

By your definition, sysfs, configfs, and other fs-style control
mechanisms are too complex. We should all just be using ioctl() so that
coders and users have only one namespace :-)

> > You're effectively suggesting that a specific attribute type of
> > "repeated value of type X". No mixed types, no exploded structures,
> > just a "list of this attr" sort of thing. This does fit my personal
> > requirement of avoiding a generic, abusable system.
>
> Exactly.

How do you implement it? Full on seq_file with restrictions
(ops->start,stop,next,show)? Some sort of array (how do I placehold
where the last read(2) was)? Some sort of linked list (again with the
placeholding and locking)? Anything short of seq_file+restrictions
would be perhaps binding that traversal, no?

Joel

--

"When I am working on a problem I never think about beauty. I
only think about how to solve the problem. But when I have finished, if
the solution is not beautiful, I know it is wrong."
- Buckminster Fuller

Joel Becker
Principal Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127

2006-10-14 04:40:43

by Greg KH

[permalink] [raw]
Subject: Re: [ckrm-tech] [PATCH 0/5] Allow more than PAGESIZE data read in configfs

On Thu, Oct 12, 2006 at 05:01:25PM -0700, Paul Jackson wrote:
> Underneath each of these filesystems, sysfs, configfs, cpuset, resource
> groups, ... it would seem ideal if we had a single kernel file system
> infrastructure. Actually, we're only a half a layer from having that,
> with vfs. It just takes a fair bit of glue to construct any of these
> file systems out of vfs primitives.

No, it's pretty small to create a ram-based filesystem these days. Look
at securityfs and debugfs for two simple examples that you should copy
if you want to create your own. If you can come up with ways of making
those files smaller, then that would be great. But as it is, it is
_really_ simple to do this right now, especially with two working
examples to copy from :)

thanks,

greg k-h

2006-10-15 01:06:51

by Matt Helsley

[permalink] [raw]
Subject: Re: [ckrm-tech] [PATCH 0/5] Allow more than PAGESIZE data read in configfs

On Fri, 2006-10-13 at 17:09 -0700, Joel Becker wrote:
> On Fri, Oct 13, 2006 at 04:37:38PM -0700, Matt Helsley wrote:
> > > Sure it works. You have one per resource group. In
> > > resource_group_make_object(), you sysfs_mkdir() the sysfs file. There
> >
> > That's the easy part. Next we need to make the pid attribute whenever a
> > new task is created. And delete it when the task dies. And move it
> > around whenever it changes groups. Is there rename() support in /sys? If
> > not, would changes to allow rename() be acceptable (I'm worried it would
> > impact alot of assumptions made in the existing code)?
>
> No, you don't create a pid attribute per task. The sysfs file
> is literally your large attribute. So, instead of echoing a new pid to
> "/sys/kernel/config/ckrm/group1/pids", you echo to
> "/sys/ckrm/group1/pids". To display them all, you just cat
> "/sys/ckrm/group1/pids". It's exactly like the file you want in
> configfs, just located in a place where it is allowed.

Oh, sorry. I was still operating on the one-value-per-attribute
assumption. This indeed looks like it would work.

> > Consider that having two very similar (but not symlinked!) trees in
> > both /sys/ ... /res_group and /sys/kernel/config/res_group could be
> > rather confusing to userspace programmers and users alike.
>
> Not really. It's not identical (tons of attributes live in the
> configfs part but not the sysfs part), and it has a clear deliniation of
> what each does.

Clear delineation to who? I'm not convinced this is any less confusing
to a userspace programmer than parsing a single newline between multiple
values in a configfs attribute.

> > It would be strange because when you rmdir a group
> > in /sys/kernel/config/res_group... a directory in /sys would also
> > disappear. Yet you can't mkdir or rmdir the /sys dirs. And to edit the
>
> This is no different than tons of sysfs and procfs functionality
> today.

Yup.

> > There are two parts to the complexity: code complexity and the number
> > of userspace pieces to deal with. I think that in both of these
> > categories the OVPA approach is more complex. Here's how I see it:
>
> By your definition, sysfs, configfs, and other fs-style control
> mechanisms are too complex. We should all just be using ioctl() so that
> coders and users have only one namespace :-)

That's an absurd conclusion to draw from my argument that one
filesystem-based approach is less complex than another.

> > > You're effectively suggesting that a specific attribute type of
> > > "repeated value of type X". No mixed types, no exploded structures,
> > > just a "list of this attr" sort of thing. This does fit my personal
> > > requirement of avoiding a generic, abusable system.
> >
> > Exactly.
>
> How do you implement it? Full on seq_file with restrictions
> (ops->start,stop,next,show)?

That was the plan.

Cheers,
-Matt Helsley

2006-10-15 19:07:45

by Paul Jackson

[permalink] [raw]
Subject: Re: [ckrm-tech] [PATCH 0/5] Allow more than PAGESIZE data read in configfs

Joel, responding to Matt, wrote:
> > Hmm, that suggests a good point. While some one *can* do that or:
> >
> > ATTR=( $(cat /sys/kernel/config/ckrm/myresource/attr) )
> >
> > the space available for environment variables is limited. So attempting
> > to store a large (What's "large"?) attribute in an environment variable
> > is a potentially buggy practice. This is a significant problem affecting
> > large attributes in general.
>
> If you can't do it in sh, it's pretty much out of scope. This
> is a taste rule I use, because like to shell program. Sure, not the end
> of the world (not a hard rule, I guess), but worth thinking about.

I too like to shell program. Such potentially long vectors can
be worked in the shell, just not placed in a single command line
argument or environment variable.

Constructs that do work (using the list of process id's in the
file /dev/cpuset/tasks for these examples) include:

1. while read pid
do
... do something with each $pid ...
done < /dev/cpuset/tasks


2. < /dev/cpuset/tasks xargs ps -flp


3. sed -un p < /dev/cpuset/foo/tasks > /dev/cpuset/tasks

Such shell constructs for dealing with vectors that might be longer
than argument or environment length limits have been needed and known
for decades.

In an earlier message, Joel wrote:
> You're effectively suggesting that a specific attribute type of
> "repeated value of type X". No mixed types, no exploded structures,
> just a "list of this attr" sort of thing. This does fit my personal
> requirement of avoiding a generic, abusable system.

Yes - what I call above a "vector."

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-10-16 19:33:46

by Chandra Seetharaman

[permalink] [raw]
Subject: Re: [ckrm-tech] [PATCH 0/5] Allow more than PAGESIZE data read in configfs

On Fri, 2006-10-13 at 17:09 -0700, Joel Becker wrote:
> On Fri, Oct 13, 2006 at 04:37:38PM -0700, Matt Helsley wrote:
> > > Sure it works. You have one per resource group. In
> > > resource_group_make_object(), you sysfs_mkdir() the sysfs file. There
> >
> > That's the easy part. Next we need to make the pid attribute whenever a
> > new task is created. And delete it when the task dies. And move it
> > around whenever it changes groups. Is there rename() support in /sys? If
> > not, would changes to allow rename() be acceptable (I'm worried it would
> > impact alot of assumptions made in the existing code)?
>
> No, you don't create a pid attribute per task. The sysfs file
> is literally your large attribute. So, instead of echoing a new pid to
> "/sys/kernel/config/ckrm/group1/pids", you echo to
> "/sys/ckrm/group1/pids". To display them all, you just cat
> "/sys/ckrm/group1/pids". It's exactly like the file you want in
> configfs, just located in a place where it is allowed.

>From what I see, sysfs also has the PAGESIZE limitation. If that _is_
the case, then moving to sysfs does not help us any. Correct me if I am
wrong.

Won't we have the same arguments that we have now ?

<snip>

>
--

----------------------------------------------------------------------
Chandra Seetharaman | Be careful what you choose....
- [email protected] | .......you may get it.
----------------------------------------------------------------------


2006-10-16 23:08:12

by Joel Becker

[permalink] [raw]
Subject: Re: [ckrm-tech] [PATCH 0/5] Allow more than PAGESIZE data read in configfs

On Mon, Oct 16, 2006 at 12:33:41PM -0700, Chandra Seetharaman wrote:
> >From what I see, sysfs also has the PAGESIZE limitation. If that _is_
> the case, then moving to sysfs does not help us any. Correct me if I am
> wrong.
>
> Won't we have the same arguments that we have now ?

If you use a 'struct attribute' in sysfs, sure, you have the
same limit. The difference is that sysfs allows you to install your own
files. So you could create a seq_file of your own and insert it into
your sysfs directory.

Joel

--

"I think it would be a good idea."
- Mahatma Ghandi, when asked what he thought of Western
civilization

Joel Becker
Principal Software Developer
Oracle
E-mail: [email protected]
Phone: (650) 506-8127