Hi!
After having hacked up a patch to convert /proc/net/ip_conntrack
to seq_file (which was quite ah experience, I will comment on that in a
different mail), I am facing a different issue.
I haven't found a way to use seq_file for information that is not
exported as a global variable.
Let's say I have some hash tables that are allocated during runtime of
the system, on users demand. I have no way of knowing how many there
will be and how the user will want to call them.
For every of those hashtables I want to create a file in /proc and
export the data using seq_file(). Since the data objects are all the
same, I'd like to use the same seq_operations.{start,next,show,stop}
functions. The whole struct seq_operations would be part of a larger
structure that already exists for every hash table.
However, how do I know which hashtable is to be read, when the
seq_operations.start() function is called? I would somehow need a
pointer back to the hashtable from the file itself. And please don't
tell me to call d_path() and find the correct hash table by the
filename.
The problem is, that seq_file is already using the file.private_data
member...
Any ideas?
Thanks for enlightening a networking hacker about the magic of the
virtual filesystem ;)
Please Cc' me in replies, that makes the job easier for my mail filters.
--
- Harald Welte <[email protected]> http://www.netfilter.org/
============================================================================
"Fragmentation is like classful addressing -- an interesting early
architectural error that shows how much experimentation was going
on while IP was being designed." -- Paul Vixie
On Fri, 14 Nov 2003, Harald Welte wrote:
> The problem is, that seq_file is already using the file.private_data
> member...
there is a seq_file->private pointer where you can store private (per file
structure) data. That is what I do and it works very nice, BUT the problem
is that the ->private pointer was only added to seq_file recently
(2.4.20 if I remember correctly) although seq_file API was present since
2.4.15.
Kind regards
Tigran
On Fri, Nov 14, 2003 at 09:01:58PM +0000, Tigran Aivazian wrote:
> On Fri, 14 Nov 2003, Harald Welte wrote:
> > The problem is, that seq_file is already using the file.private_data
> > member...
>
> there is a seq_file->private pointer where you can store private (per file
> structure) data. That is what I do and it works very nice, BUT the problem
> is that the ->private pointer was only added to seq_file recently
> (2.4.20 if I remember correctly) although seq_file API was present since
> 2.4.15.
that doesn't help. As I am aware, the seq_file structure is only
allocated in the seq_open() call. How does seq_open() know which
private data (i.e. hash table) to associate with struct file?
The only moment I know which htable corresponds to a proc entry is at
the time where I call proc_net_create() [or a similar function]. So the
information would need to be associated with the dentry or whatever...
seq_file() is allocated way too late.
> Kind regards
> Tigran
--
- Harald Welte <[email protected]> http://www.netfilter.org/
============================================================================
"Fragmentation is like classful addressing -- an interesting early
architectural error that shows how much experimentation was going
on while IP was being designed." -- Paul Vixie
On Sat, Nov 15, 2003 at 10:38:33AM +0100, Harald Welte wrote:
> that doesn't help. As I am aware, the seq_file structure is only
> allocated in the seq_open() call. How does seq_open() know which
> private data (i.e. hash table) to associate with struct file?
Why should seq_open() know that? Its caller does and it can set the damn
thing to whatever it wants.
> The only moment I know which htable corresponds to a proc entry is at
> the time where I call proc_net_create() [or a similar function]. So the
> information would need to be associated with the dentry or whatever...
> seq_file() is allocated way too late.
Wrong.
On Sat, Nov 15, 2003 at 05:18:44PM +0000, [email protected] wrote:
> On Sat, Nov 15, 2003 at 10:38:33AM +0100, Harald Welte wrote:
> > that doesn't help. As I am aware, the seq_file structure is only
> > allocated in the seq_open() call. How does seq_open() know which
> > private data (i.e. hash table) to associate with struct file?
>
> Why should seq_open() know that? Its caller does and it can set the damn
> thing to whatever it wants.
So who is the caller? it's the ->open() member of struct
file_operations. and struct file_operations doesn't have some private
member where I could hide my pointer before saving it to
seq_file.private in seq_open().
> Wrong.
Hm, maybe somebody could enlighten me then. Maybe this is a stupid
qestion, but I wasn't able to figure that out after reading all the
structures, etc.
--
- Harald Welte <[email protected]> http://www.netfilter.org/
============================================================================
"Fragmentation is like classful addressing -- an interesting early
architectural error that shows how much experimentation was going
on while IP was being designed." -- Paul Vixie
Dear Harald,
The thing to be aware of is that seq API is limited to 1 page of data per
read(2) call. Some people loudly proclaim "seq API is unlimited, unlike
the old /proc formalism which was limited to 1 page" but they are
quiet about 1-page limitation for a single read(2) (due to hardcoded
kmalloc(size) in seq_file.c). Why is this important? Because if your
->start/stop routines take/drop some spinlocks then you have to know your
position and re-verify it on the next read(2) if your data is more than 1
page and thus could not be read atomically (i.e. while holding the
spinlocks).
The way I do it is by looking at the integer 'offset' parameter passed to
'start' and walk the linked list to the same 'distance' from the head (if
there is still anything that far) and verify if I am looking at the "same"
(in quotes because the concept of "same" has to be defined, not
with 100% accuracy :) element as it was at the end of the previous page.
The above was the only unpleasant complication with dealing with seq API.
The rest seemed very smooth and worked as expected. And the ->private
field of seq_file was very useful to maintain the information per open
instance of the file, i.e. per 'file' structure. I didn't understand why
it is not applicable in your case, but maybe I am missing the details of
your implementation or something like that.
Kind regards
Tigran
On Sat, Nov 15, 2003 at 07:49:43PM +0000, Tigran Aivazian wrote:
> Dear Harald,
>
> The thing to be aware of is that seq API is limited to 1 page of data per
> read(2) call. Some people loudly proclaim "seq API is unlimited, unlike
> the old /proc formalism which was limited to 1 page" but they are
> quiet about 1-page limitation for a single read(2) (due to hardcoded
> kmalloc(size) in seq_file.c). Why is this important? Because if your
What? It will happily return more than 1 page if you get an entry longer
than that.
What hardcoded kmalloc(size)?
> ->start/stop routines take/drop some spinlocks then you have to know your
> position and re-verify it on the next read(2) if your data is more than 1
> page and thus could not be read atomically (i.e. while holding the
> spinlocks).
If you mean that read(2) can decide to return less than full user buffer even
though more data is available - tough, that's something userland *must* be
able to deal with.
Especially on syntetic and potentially large files - the only alternative
would be immediate DoS on OOM. Regardless of the implementation.
On Sat, Nov 15, 2003 at 06:33:10PM +0100, Harald Welte wrote:
> On Sat, Nov 15, 2003 at 05:18:44PM +0000, [email protected] wrote:
> > On Sat, Nov 15, 2003 at 10:38:33AM +0100, Harald Welte wrote:
> > > that doesn't help. As I am aware, the seq_file structure is only
> > > allocated in the seq_open() call. How does seq_open() know which
> > > private data (i.e. hash table) to associate with struct file?
> >
> > Why should seq_open() know that? Its caller does and it can set the damn
> > thing to whatever it wants.
>
> So who is the caller? it's the ->open() member of struct
> file_operations. and struct file_operations doesn't have some private
> member where I could hide my pointer before saving it to
> seq_file.private in seq_open().
If arguments of ->open() were not enough to find your data, how the hell would
current code manage to find it?
You've got inode; you've got (if that's on procfs) proc_dir_entry - from
inode; you've got dentry (from struct file *). If that's not enough to
find your data, what is?
Which files do you have in mind?
On Sat, 15 Nov 2003 [email protected] wrote:
> > The thing to be aware of is that seq API is limited to 1 page of data per
> > read(2) call. Some people loudly proclaim "seq API is unlimited, unlike
> > the old /proc formalism which was limited to 1 page" but they are
> > quiet about 1-page limitation for a single read(2) (due to hardcoded
> > kmalloc(size) in seq_file.c). Why is this important? Because if your
>
> What? It will happily return more than 1 page if you get an entry longer
> than that.
>
> What hardcoded kmalloc(size)?
>
in fs/seq_file.c:seq_read()
if (!m->buf) {
m->buf = kmalloc(m->size = PAGE_SIZE, GFP_KERNEL);
> > ->start/stop routines take/drop some spinlocks then you have to know your
> > position and re-verify it on the next read(2) if your data is more than 1
> > page and thus could not be read atomically (i.e. while holding the
> > spinlocks).
>
> If you mean that read(2) can decide to return less than full user buffer even
> though more data is available - tough, that's something userland *must* be
> able to deal with.
are you saying that it is possible for userspace to request, say, 256K and
say 250K of available data could be returned (with seq_file API) on a
single read(2) call? I thought this is impossible and hence in the
->show() routine I do this (similar to seq_printf()):
struct task_struct *p = (struct task_struct *)v;
struct proc *proc = (struct proc *)m->private;
pack some data from task_struct into 'proc' ...
if (m->count + PROCLEN < m->size) {
memcpy(m->buf + m->count, proc, PROCLEN);
m->count += PROCLEN;
return 0;
} else {
m->count = m->size;
return -1;
}
So, as soon as ->show() is asked to display an element which didn't fit in
the m->buf page it returns -1 and so the user gets a page (or almost a
page, i.e. as many entries as fit in there).
If you know of a way to return more than a page of data in one go, please
do tell me.
But if such way existed then why in mm/slab.c:s_start() does exactly what
I described, i.e. walks the list to the offset '*pos' saved from previous
page?
Kind regards
Tigran
>
> Especially on syntetic and potentially large files - the only alternative
> would be immediate DoS on OOM. Regardless of the implementation.
>
On Sat, 15 Nov 2003 [email protected] wrote:
> If you mean that read(2) can decide to return less than full user buffer even
> though more data is available - tough, that's something userland *must* be
> able to deal with.
I think I now understand better what you mean. And that is not what I
meant. Yes, of course userspace must be able to deal with less data than
what it asked for. I agree.
But I was referring to the complication on the kernel side, whereby if the
data (collectively, all the entries) is more than 1 page then ->stop()
must be called and a page returned to user and then the kernel must build
another page but all it knows is the integer 'offset'. Anything could have
happened to the list (task list in this case) since ->stop() routine
dropped the spinlock. So, it is not obvious from which position to
start building the data for the new page.
Looking at mm/slab.c implementation I see that it just walks the integer
distance from the head of the list. Simple but not 100% correct, I think.
I.e. it can miss an entry if the list has changed between two read(2)s.
Kind regards
Tigran
On Sat, Nov 15, 2003 at 08:50:55PM +0000, Tigran Aivazian wrote:
> Looking at mm/slab.c implementation I see that it just walks the integer
> distance from the head of the list. Simple but not 100% correct, I think.
> I.e. it can miss an entry if the list has changed between two read(2)s.
Define "miss". What sort of warranties do you expect there?
On Sat, Nov 15, 2003 at 08:41:45PM +0000, Tigran Aivazian wrote:
> in fs/seq_file.c:seq_read()
>
> if (!m->buf) {
> m->buf = kmalloc(m->size = PAGE_SIZE, GFP_KERNEL);
... and below
m->buf = kmalloc(m->size <<= 1, GFP_KERNEL);
> are you saying that it is possible for userspace to request, say, 256K and
> say 250K of available data could be returned (with seq_file API) on a
> single read(2) call? I thought this is impossible and hence in the
> So, as soon as ->show() is asked to display an element which didn't fit in
> the m->buf page it returns -1 and so the user gets a page (or almost a
> page, i.e. as many entries as fit in there).
a) it's not a page; if any ->show() will need more, buffer will grow.
b) unless you are willing to allocate 250Kb per read(), you *can't* do
what you are asking for. Regardless of implementation. Which is an
immediate DoS (again, regardless of implementation).
read() is not atomic. Never had been. With seq_file you are guaranteed
that what you get will consist of entire entries (IOW, if previous read()
had ended inside the entry, the rest will be preserved for the next read()).
There are no stronger warranties and there never had been.
On Sat, 15 Nov 2003 [email protected] wrote:
> On Sat, Nov 15, 2003 at 08:50:55PM +0000, Tigran Aivazian wrote:
>
> > Looking at mm/slab.c implementation I see that it just walks the integer
> > distance from the head of the list. Simple but not 100% correct, I think.
> > I.e. it can miss an entry if the list has changed between two read(2)s.
>
> Define "miss". What sort of warranties do you expect there?
Ok, suppose the list of entries consists of elements:
A -> B -> C -> D ->( A )
where "->" is a "next" pointer and the last element points back to A, i.e.
circular list.
Suppose each element requires 2040 bytes of data to be passed to the user
app.
Now, the user application issues a call:
struct elem elem[10];
len = read(fd, elem, 10*sizeof(struct elem));
(small digression about PAGE_SIZE Now, if you just strace dd
if=/proc/slabinfo bs=8192 you will see that a read for 8192 bytes returned
only 4056 bytes and then the next read for 8192 bytes returned 66 bytes.
This is why I was saying that a single read is limited to a single page.)
Now, back to our abstract list. The normal (intuitive, e.g. seq_printf())
->show() implementation will manage to pack A and B into m->buf but
hitting the element C it will have to return -1 (otherwise the kernel will
oops because m->buf isn't large enough). So, the userspace got back A and
B in a single read, i.e. 4080 bytes. On the next read() the ->start()
routine will try to arrive at the element with offset 2 (I think it
counted from 0). And since ->stop() routine was called we dropped the
spinlock guarding the list. So, which element is now at distance 2 from A?
It's not necessarily C because it could have been unlinked. But this is OK
because if C was unlinked then we don't want to see it anyway. The bad
case is if the element B was unlinked then the element at distance 2 would
be D and we would "miss" C. This is the sense in which we can "miss" an
element.
Is there any error in the above?
Kind regards
Tigran
Hi Al,
I remembered the other two areas where, maybe, seq API can be slightly
improved:
a) no "THIS_MODULE" style module refcounting, so I had to do manual
MOD_INC_USE_COUNT/MOD_DEC_USE_COUNT in ->open/release. I am aware of the
deficiencies of this approach, of course (it's been discussed too many
times in the last several years).
b) no way to reset the 'offset' to 0 when the ->next() detects that it is
back at the head of linked list, i.e. when it should return NULL. It's OK
for a user app to detect that (e.g. check proc->pid == 0 thus it's a
"swapper" and so the beginning of the next chunk of processes) but it
also has to issue lseek(fd, 0, SEEK_SET), otherwise the offsets will keep
growing larger and larger and the kernel has to loop around that list (in
->start() when it tries to walk a set distance from the head) many times
unnecessarily and so the performance goes down. I tried doing something
like this:
m->index = m->count = m->from = *ppos = 0;
in the ->next() function whenever it detects that the 'next' element is
'init_task' but it didn't help. And I understand why, i.e. read(2) really
is supposed to increment the offset correctly, so what I require would
break the normal Unix read(2) semantics. So, maybe forcing user app to
issue lseek(fd, 0, SEEK_SET) is the only sensible solution... If you have
better ideas, please let me know.
Kind regards
Tigran
On Sat, Nov 15, 2003 at 08:50:55PM +0000, Tigran Aivazian wrote:
> But I was referring to the complication on the kernel side, whereby if the
> data (collectively, all the entries) is more than 1 page then ->stop()
> must be called and a page returned to user and then the kernel must build
> another page but all it knows is the integer 'offset'. Anything could have
> happened to the list (task list in this case) since ->stop() routine
> dropped the spinlock. So, it is not obvious from which position to
> start building the data for the new page.
Yes, I am well aware of that issue. As for the connection tracking
table or the dstlimit htables I was referring to, I am actually relating
'pos' to the bucket number, not to the logical entry in the table.
The number of hash buckets is constant, so that's at least something the
user can count on.
However, there will be a problem when there are too many entries within
a single bucket - since all of them would have to be returned within a
single read() call.
> Kind regards
> Tigran
--
- Harald Welte <[email protected]> http://www.netfilter.org/
============================================================================
"Fragmentation is like classful addressing -- an interesting early
architectural error that shows how much experimentation was going
on while IP was being designed." -- Paul Vixie
On Sat, Nov 15, 2003 at 08:36:07PM +0000, [email protected] wrote:
> > So who is the caller? it's the ->open() member of struct
> > file_operations. and struct file_operations doesn't have some private
> > member where I could hide my pointer before saving it to
> > seq_file.private in seq_open().
>
> If arguments of ->open() were not enough to find your data, how the hell would
> current code manage to find it?
>
> You've got inode; you've got (if that's on procfs) proc_dir_entry - from
> inode; you've got dentry (from struct file *). If that's not enough to
> find your data, what is?
thanks, I've now found a way to deal with the problem. after calling
create_proc_entry(), I put a pointer to my hash table in
proc_dir_entry->data. The proc_dir_entry->proc_fops->open() function
then calls seq_open() and sets seq_file->private to PDE(inode). This
way the seq_operations->start/next/show functions can find out the
> Which files do you have in mind?
It's not part of the stock kernel, but something I've been working on
the last couple of days. An iptables match module called 'dstlimit'.
(http://cvs.netfilter.org/netfilter-extensions/matches_targets/dstlimit/)
This match creates a new hash table (yes a table, not just an entry) for
every rule in which you use that match. Mainly for testing/debugging
purpose, I want to be able to read the contents of each hash table via a
/proc file (/proc/net/ipt_dstlimit/*).
The implementation is now done in the way I described above, and it
seems to be working quite fine.
Thanks to everybody helping me with this issue.
--
- Harald Welte <[email protected]> http://www.netfilter.org/
============================================================================
"Fragmentation is like classful addressing -- an interesting early
architectural error that shows how much experimentation was going
on while IP was being designed." -- Paul Vixie
On Sun, Nov 16, 2003 at 07:27:23AM +0000, Tigran Aivazian wrote:
> Hi Al,
>
> I remembered the other two areas where, maybe, seq API can be slightly
> improved:
>
> a) no "THIS_MODULE" style module refcounting, so I had to do manual
> MOD_INC_USE_COUNT/MOD_DEC_USE_COUNT in ->open/release. I am aware of the
> deficiencies of this approach, of course (it's been discussed too many
> times in the last several years).
You don't need to do that. Look, these ->open() and ->release()
are not some new methods - they are ->open() and ->release() of your
struct file. The fact that they happen to call functions from seq_file.c
doesn't change anything - they are struct file methods, sitting in some
instance of struct file_operations. And just as with any such instance,
you have ->owner in struct file_operations. Which will be honoured by
open(2) - just as with any other file.
IOW, there is no need for any special rmmod protection of iterator.
Normal protection of file methods will be enough - after all, even if
iterator is not in the same module, the code in our ->open() directly
refers to it. I.e. we have a direct dependency and as long as module
where our file_operations are is there, the module with our iterators
will stay around.
> b) no way to reset the 'offset' to 0 when the ->next() detects that it is
> back at the head of linked list, i.e. when it should return NULL. It's OK
Let me get it straight - you want an infinite file, with no EOF anywhere
and contents more or less repeating itself? _And_ you want a working
lseek() on that?
On Mon, Nov 17, 2003 at 08:21:34AM +0000, Tigran Aivazian wrote:
> Now, since there is no way to detect EOF, other than by reading an extra
> page and discovering that it belongs to the next iteration, we have to do
> the lseek(fd, 0, SEEK_SET) anyway.
> So, the "auto-rewinding" read would only help the cases where application
> doesn't need to differentiate between samples and is happy to just
> continuously read chunks packed into pages one by one as fast as
> possible. In this case it doesn't need to lseek to 0, so auto-rewinding on
> kernel side would prevent it from slowing down.
If you're going to repeatedly read from 0 pread() sounds like a good
alternative to read() + lseek(), plus no kernel changes required to
get rid of the lseek().
-- wli
On Mon, 17 Nov 2003 [email protected] wrote:
> > a) no "THIS_MODULE" style module refcounting, so I had to do manual
> > MOD_INC_USE_COUNT/MOD_DEC_USE_COUNT in ->open/release. I am aware of the
> > deficiencies of this approach, of course (it's been discussed too many
> > times in the last several years).
>
> You don't need to do that. Look, these ->open() and ->release()
> are not some new methods - they are ->open() and ->release() of your
> struct file. The fact that they happen to call functions from seq_file.c
> doesn't change anything - they are struct file methods, sitting in some
> instance of struct file_operations. And just as with any such instance,
> you have ->owner in struct file_operations. Which will be honoured by
> open(2) - just as with any other file.
Thank you. Yes, I have now changed my code to do what you suggest and
it works fine.
> > b) no way to reset the 'offset' to 0 when the ->next() detects that it is
> > back at the head of linked list, i.e. when it should return NULL. It's OK
>
> Let me get it straight - you want an infinite file, with no EOF anywhere
> and contents more or less repeating itself? _And_ you want a working
> lseek() on that?
I only need lseek to offset 0 and it does work when called from userspace
as lseek(fd, 0, SEEK_SET). But what I wanted is "auto-rewinding read()".
So, when the driver detects that we reached EOF (i.e. hit 'init_task' in
->next()) it should somehow cause the offset to be reset to 0 but return
the read(2) to user as normal. I didn't figure out how to do this. The
intuitive solution *ppos = 0 in ->next() didn't work.
On the other hand, some more thought suggests that even if such
"auto-rewinding read" existed, perhaps it wouldn't help after all.
Because, the user app is doing the following to get information about all
processes efficiently:
first_loop = 1
while(1) {
read as much as can fit in a page (i.e. single read(2) syscall)
if !first_loop and the first element of the page (it can only be
first) is 'swapper'
then
lseek(fd, 0, SEEK_SET);
break;
fi
if (first_loop)
first_loop = 0;
}
So, we read an extra page in vain on each iteration. But this is OK as
performance is not an issue (my module is already 26 times faster than a
highly optimized NT implementation).
Now, since there is no way to detect EOF, other than by reading an extra
page and discovering that it belongs to the next iteration, we have to do
the lseek(fd, 0, SEEK_SET) anyway.
So, the "auto-rewinding" read would only help the cases where application
doesn't need to differentiate between samples and is happy to just
continuously read chunks packed into pages one by one as fast as
possible. In this case it doesn't need to lseek to 0, so auto-rewinding on
kernel side would prevent it from slowing down.
Kind regards
Tigran
On Mon, 17 Nov 2003, William Lee Irwin III wrote:
> On Mon, Nov 17, 2003 at 08:21:34AM +0000, Tigran Aivazian wrote:
> > Now, since there is no way to detect EOF, other than by reading an extra
> > page and discovering that it belongs to the next iteration, we have to do
> > the lseek(fd, 0, SEEK_SET) anyway.
> > So, the "auto-rewinding" read would only help the cases where application
> > doesn't need to differentiate between samples and is happy to just
> > continuously read chunks packed into pages one by one as fast as
> > possible. In this case it doesn't need to lseek to 0, so auto-rewinding on
> > kernel side would prevent it from slowing down.
>
> If you're going to repeatedly read from 0 pread() sounds like a good
> alternative to read() + lseek(), plus no kernel changes required to
> get rid of the lseek().
The reason why I didn't use pread(2) is because I have to do multiple
calls to read(2). There is no way that I know of to pack more than a
single page into a single read(2) with seq_file API.
Yes, I remember Al saying "it's not a page" but in practice it still
appears to be limited to a page unless someone shows a sample seq_file
module which can provide more than a page of data on a single read(2). The
implementations I have looked at in the kernel (e.g. mm/slab.c) are
limited to a single page per read(2).
Kind regards
Tigran
On Mon, Nov 17, 2003 at 08:38:24AM +0000, Tigran Aivazian wrote:
> The reason why I didn't use pread(2) is because I have to do multiple
> calls to read(2). There is no way that I know of to pack more than a
> single page into a single read(2) with seq_file API.
Well, one way to do it is to say that if you don't get a short read,
then the file is longer.
On Mon, Nov 17, 2003 at 08:38:24AM +0000, Tigran Aivazian wrote:
> Yes, I remember Al saying "it's not a page" but in practice it still
> appears to be limited to a page unless someone shows a sample seq_file
> module which can provide more than a page of data on a single read(2). The
> implementations I have looked at in the kernel (e.g. mm/slab.c) are
> limited to a single page per read(2).
There's a retry loop where the buffer size is doubled each iteration
that looks to me like automagic sizing in the code for seq_read(). I
can't say I've actually tried to rely on getting more than a page
at a time through seq_read(), though.
-- wli
The retry loop:
while (1) {
pos = m->index;
p = m->op->start(m, &pos);
err = PTR_ERR(p);
if (!p || IS_ERR(p))
break;
err = m->op->show(m, p);
if (err)
break;
if (m->count < m->size)
goto Fill;
m->op->stop(m, p);
kfree(m->buf);
m->buf = kmalloc(m->size <<= 1, GFP_KERNEL);
if (!m->buf)
goto Enomem;
m->count = 0;
}
On Mon, Nov 17, 2003 at 12:48:04AM -0800, William Lee Irwin III wrote:
> There's a retry loop where the buffer size is doubled each iteration
> that looks to me like automagic sizing in the code for seq_read(). I
> can't say I've actually tried to rely on getting more than a page
> at a time through seq_read(), though.
Woops. This looks like it only makes sure there's enough to get the
first ->show() into the buffer; I see that it later gives up when
m->count == m->size once the first ->show() has enough bufferspace to
complete later on. So if all ->show() operations to do are less than
PAGE_SIZE, it'll never hand back more than a page at a time, and may
hand back short reads prior to EOF, which doesn't bode well for my
short read == EOF idea (maybe not a great assumption in general).
-- wli
On Mon, Nov 17, 2003 at 09:42:09AM +0000, Tigran Aivazian wrote:
> Here are two files: simple.c kernel module and user.c user test program.
> If you (or anyone) believe it is possible to return more than a single
> page on a read(2) please change them accordingly and let me know.
Sorry, I needed to amend that. It's not a fixed PAGE_SIZE buffer; the
buffer is only resized up to the point it allows a single ->show()
call to succeed, and then you get short reads if you try to go beyond
the buffer size in a single read.
You could (in theory) get this to succeed > PAGE_SIZE reads by doing
the operation all in a single ->show() call, but that will have some
large overheads and will also have a retry loop where the buffer size
is doubled until the ->show() call succeeds incurred at least once
per open() at the time of the first read().
-- wli
On Mon, Nov 17, 2003 at 01:03:39AM -0800, William Lee Irwin III wrote:
> On Mon, Nov 17, 2003 at 12:48:04AM -0800, William Lee Irwin III wrote:
> > There's a retry loop where the buffer size is doubled each iteration
> > that looks to me like automagic sizing in the code for seq_read(). I
> > can't say I've actually tried to rely on getting more than a page
> > at a time through seq_read(), though.
>
> Woops. This looks like it only makes sure there's enough to get the
> first ->show() into the buffer; I see that it later gives up when
> m->count == m->size once the first ->show() has enough bufferspace to
> complete later on. So if all ->show() operations to do are less than
> PAGE_SIZE, it'll never hand back more than a page at a time, and may
> hand back short reads prior to EOF, which doesn't bode well for my
> short read == EOF idea (maybe not a great assumption in general).
That's "broken assumption", thank you very much. There is one and only one
indication of EOF - read() returning 0. Any application that treats short
read as EOF is broken. As in "apply Stevens to offender's cranium until
something gives; if it's the book, replace with fresh copy and repeat".
Here are two files: simple.c kernel module and user.c user test program.
If you (or anyone) believe it is possible to return more than a single
page on a read(2) please change them accordingly and let me know.
simple.c
=======================
/*
* simple.c --- Provide access to process data via /proc/scan
*/
#include <linux/init.h>
#include <linux/sched.h>
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/mm.h>
#include <linux/file.h>
#include <linux/proc_fs.h>
#include <linux/seq_file.h>
#include <asm/uaccess.h>
/* Red Hat Linux 2.4.20-8 has its own next_task() macro */
#ifndef next_task
#define next_task(p) (p)->next_task
#endif
struct proc {
int pid; /* process id */
int uid; /* real user id */
char comm[17]; /* process name */
char wasted[200]; /* wasted space, filled with 'A' */
};
#define PROCLEN (sizeof(struct proc))
MODULE_DESCRIPTION("Simple module for testing seq_file");
MODULE_AUTHOR("Tigran Aivazian <[email protected]>");
MODULE_LICENSE("GPL");
EXPORT_NO_SYMBOLS;
static int simple_show(struct seq_file *m, void *v)
{
struct task_struct *p = (struct task_struct *)v;
struct proc proc;
proc.pid = p->pid;
proc.uid = p->uid;
strncpy(proc.comm, p->comm, 16);
memset(proc.wasted, 'A', 200);
if (m->count + PROCLEN < m->size) {
memcpy(m->buf + m->count, &proc, PROCLEN);
m->count += PROCLEN;
return 0;
} else {
m->count = m->size;
return -1;
}
}
static void *c_start(struct seq_file *m, loff_t *ppos)
{
struct task_struct *p = &init_task;
int n = *ppos;
read_lock(&tasklist_lock);
if (n == 0)
return (void *)p;
while (n--)
p = next_task(p);
return (void *)p;
}
static void *c_next(struct seq_file *m, void *v, loff_t *ppos)
{
struct task_struct *p= (struct task_struct *)v;
p = next_task(p);
if (p == &init_task)
p = NULL;
else
++*ppos;
return (void *)p;
}
static void c_stop(struct seq_file *m, void *v)
{
read_unlock(&tasklist_lock);
}
struct seq_operations seqops = {
.start = c_start,
.next = c_next,
.stop = c_stop,
.show = simple_show,
};
static int simple_open(struct inode *inode, struct file *file)
{
return seq_open(file, &seqops);
}
static struct file_operations fileops = {
.owner = THIS_MODULE,
.open = simple_open,
.read = seq_read,
.llseek = seq_lseek,
.release = seq_release,
};
static int __init simple_init(void)
{
struct proc_dir_entry *entry = create_proc_entry("scan", 0, NULL);
if (entry) {
entry->proc_fops = &fileops;
printk(KERN_INFO "simple: /proc/scan created\n");
return 0;
} else {
printk(KERN_ERR "simple: failed to create /proc/scan\n");
return -1;
}
}
static void __exit simple_exit(void)
{
remove_proc_entry("scan", 0);
}
module_init(simple_init)
module_exit(simple_exit)
user.c
=============================
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <string.h>
#include <errno.h>
struct proc {
int pid; /* process id */
int uid; /* real user id */
char comm[17]; /* process name */
char wasted[200]; /* wasted space, filled with 'A' */
};
#define PROCLEN (sizeof(struct proc))
#define MAXPROCS 100
struct proc procs[MAXPROCS];
int main(int argc, char *argv[])
{
int fd, len, nproc, i, j, count = 0;
static int first_scan = 1;
fd = open("/proc/scan", O_RDONLY);
if (fd == -1) {
fprintf(stderr, "%s: open(\"/proc/scan\"), errno=%d (%s)\n",
argv[0], errno, strerror(errno));
exit(1);
}
while (1) {
len = read(fd, procs, MAXPROCS*PROCLEN);
nproc = len/PROCLEN;
if ((procs[0].pid == 0) && !first_scan) {
printf("Total: count=%d procs\n\n", count);
count = 0;
break;
}
printf("\tnproc=%d\n", nproc);
if (first_scan)
first_scan = 0;
count += nproc;
printf("%d bytes (%d procs)\n", len, nproc);
for (i=0; i<nproc; i++) {
printf("proc[%d]: pid=%d, uid=%d, comm=<%s>\n",
i, procs[i].pid, procs[i].uid, procs[i].comm);
for (j=0; j<200; j++)
if (procs[i].wasted[j] != 'A') {
fprintf(stderr, "Corruption i=%d, j=%d\n", i, j);
exit(1);
}
}
}
return 0;
}
On Mon, Nov 17, 2003 at 09:42:09AM +0000, Tigran Aivazian wrote:
> Here are two files: simple.c kernel module and user.c user test program.
> If you (or anyone) believe it is possible to return more than a single
> page on a read(2) please change them accordingly and let me know.
> while (1) {
> len = read(fd, procs, MAXPROCS*PROCLEN);
> nproc = len/PROCLEN;
Broken userland code. You expect read() to return a multiple of PROCLEN,
for one thing. For another, you have (several lines below) a broken
loop termination logics.
EOF had been reached when read() returns 0. Until then read() returns
an arbitrary amount of bytes between 1 and 'size' argument. Since you
are using read(2) directly, use it correctly...
On Mon, 17 Nov 2003 [email protected] wrote:
> EOF had been reached when read() returns 0. Until then read() returns
> an arbitrary amount of bytes between 1 and 'size' argument. Since you
> are using read(2) directly, use it correctly...
I know that for read(2) in general but I thought that the whole point of
using "sequential record" files (aka seq_file) is that they guarantee
read(2) to return a number of fixed-size records, hence the name
"sequential record". Especially since the ->show() function is packing
those records into m->buf as "whole" entities, not in "halves" or such.
I believe you of course (as you wrote seq_file) but it still seems that as
long as the implementation (i.e. my module) is working with whole records
it is ok to assume that read() will return them as whole, i.e. not break a
record between two calls to read(2). Is this really not true?
Kind regards
Tigran
On Mon, Nov 17, 2003 at 10:08:45AM +0000, Tigran Aivazian wrote:
> On Mon, 17 Nov 2003 [email protected] wrote:
> > EOF had been reached when read() returns 0. Until then read() returns
> > an arbitrary amount of bytes between 1 and 'size' argument. Since you
> > are using read(2) directly, use it correctly...
>
> I know that for read(2) in general but I thought that the whole point of
> using "sequential record" files (aka seq_file) is that they guarantee
> read(2) to return a number of fixed-size records, hence the name
> "sequential record". Especially since the ->show() function is packing
> those records into m->buf as "whole" entities, not in "halves" or such.
No. What you are guaranteed is that continuous reading from the stream
will not give you broken entries. IOW, if read() gets a part of record,
subsequent reads will get the rest of _that_ _record_.
Note that behaviour you are asking for would break a *lot* of things.
Consider the following code:
for (left = BUFSIZE; left; left -= n) {
n = read(fd, buf + BUFSIZE - left, left);
if (n <= 0)
break;
}
IOW, fill the buffer from file, until EOF/error/buffer becoming full.
With your semantics we are fscked - as soon as we have less than one
"record" left in a buffer, we are in a hopeless situation. read()
can't return an entire record (no place to put it); it can't return
0 (that would be impossible to distinguish from EOF) and there is
no acceptable error value that would indicate such situation. read(2)
is not getdents(2) - there we *do* have an error value for situation
like that ("buffer too small").
Pretty much anything that does buffered IO (e.g. stdio code) contains
equivalents of the above.
_If_ you want datagrams - implement datagrams and don't expect grep/cat/etc.
to work on them.
> I believe you of course (as you wrote seq_file) but it still seems that as
> long as the implementation (i.e. my module) is working with whole records
> it is ok to assume that read() will return them as whole, i.e. not break a
> record between two calls to read(2). Is this really not true?
It's really not true.
*NOTE*: in your case I would consider putting a cursor into task list
and moving it around on ->start() and ->stop(). That can give you
more or less accurate snapshot with multiple read(2).
However, it's not obvious that the thing is worth the effort (you'd need
to make sure that no other code scanning the list will be disturbed when
it sees your cursor).
What is *not* acceptable (regardless of the implementation, be it done via
seq_file or not): ability to get the entire list of processes in single
read(2). User-exploitable OOM is not a good thing. No matter how you
implement read(2), you would have to get sufficient amount of memory locked
in core - you need to hold a spinlock to generate the contents, to start
with, so no "copy to userland in chunks" scheme would work.