On Thu, 7 Dec 2000, Szabolcs Szakacsits wrote:
> On Thu, 7 Dec 2000, Tigran Aivazian wrote:
> > On Thu, 7 Dec 2000, Szabolcs Szakacsits wrote:
> > > Read the whole get_empty_filp function, especially this part,
> > I have read the whole function, including the above code, of course. The
> > new_one label has nothing to do with freelists -- it adds the file to the
> > anon_list, where the new arrivales from the slab cache go. The goto
> > new_one above is there simply to initialize the structure with sane
> > initial values
> OK, 2.2 has put_inuse(f); instead of putting it to anon_list, so 2.4
> seems ok.
Back to common sense ;) Nevertheless what you wrote additionally
get_empty_filp returns an allocated file struct that gets to be used.
So ignoring your four emails arguing kernel is ok, I downloaded
2.4-test11-pre7 and tried it out.
root# echo 1024 > /proc/sys/fs/file-max
Unpatched kernel,
user% ./fd-exhaustion # e.g. while(1) open("/dev/null",...);
root# cat /proc/sys/fs/file-nr
cat: /proc/sys/fs/file-nr: Too many open files in system
The above happens even with increased NR_RESERVED_FILES to 96 [no
wonder, get_empty_filp is broken].
With the patch below,
user% ./fd-exhaustion
root# cat /proc/sys/fs/file-nr
946 0 1024
or
1024 78 1024
or
something that also works
The patch also has a fix not to allocate potentially more file
structs than NR_FILES on SMP.
Unfortunately NR_RESERVED_FILES needs to be increased to be useful
[i.e. e.g. to make ssh|login+ps|kill work for superuser]. Other way
would be to more aggressively free unused file structs if kernel is
short on free fd's.
> > There are even books (Understanding the Linux
> > Kernel by Bovet et all) which describe this freelist in the
> > current context so your patch will require updates to the books.
Checked this part of the book, ok for 2.0 but not for 2.[24].
Szaka
diff -ur linux-2.4.0-test12-pre7/fs/file_table.c linux/fs/file_table.c
--- linux-2.4.0-test12-pre7/fs/file_table.c Fri Dec 8 08:17:12 2000
+++ linux/fs/file_table.c Sun Dec 10 17:05:55 2000
@@ -32,39 +32,36 @@
{
static int old_max = 0;
struct file * f;
+ int total_free;
file_list_lock();
- if (files_stat.nr_free_files > NR_RESERVED_FILES) {
- used_one:
- f = list_entry(free_list.next, struct file, f_list);
- list_del(&f->f_list);
- files_stat.nr_free_files--;
- new_one:
- memset(f, 0, sizeof(*f));
- atomic_set(&f->f_count,1);
- f->f_version = ++event;
- f->f_uid = current->fsuid;
- f->f_gid = current->fsgid;
- list_add(&f->f_list, &anon_list);
- file_list_unlock();
- return f;
- }
- /*
- * Use a reserved one if we're the superuser
- */
- if (files_stat.nr_free_files && !current->euid)
- goto used_one;
- /*
- * Allocate a new one if we're below the limit.
- */
- if (files_stat.nr_files < files_stat.max_files) {
+ total_free = files_stat.max_files - files_stat.nr_files + files_stat.nr_free_files;
+ if (total_free > NR_RESERVED_FILES || (total_free && !current->euid)) {
+ if (files_stat.nr_free_files) {
+ /* used_one */
+ f = list_entry(free_list.next, struct file, f_list);
+ list_del(&f->f_list);
+ files_stat.nr_free_files--;
+ new_one:
+ memset(f, 0, sizeof(*f));
+ atomic_set(&f->f_count,1);
+ f->f_version = ++event;
+ f->f_uid = current->fsuid;
+ f->f_gid = current->fsgid;
+ list_add(&f->f_list, &anon_list);
+ file_list_unlock();
+ return f;
+ }
+ /*
+ * Allocate a new one if we're below the limit.
+ */
+ files_stat.nr_files++;
file_list_unlock();
f = kmem_cache_alloc(filp_cachep, SLAB_KERNEL);
file_list_lock();
- if (f) {
- files_stat.nr_files++;
+ if (f)
goto new_one;
- }
+ files_stat.nr_files--;
/* Big problems... */
printk("VFS: filp allocation failed\n");
diff -ur linux-2.4.0-test12-pre7/include/linux/fs.h linux/include/linux/fs.h
--- linux-2.4.0-test12-pre7/include/linux/fs.h Fri Dec 8 15:06:55 2000
+++ linux/include/linux/fs.h Sun Dec 10 17:37:52 2000
@@ -57,7 +57,7 @@
extern int leases_enable, dir_notify_enable, lease_break_time;
#define NR_FILE 8192 /* this can well be larger on a larger system */
-#define NR_RESERVED_FILES 10 /* reserved for root */
+#define NR_RESERVED_FILES 96 /* reserved for root */
#define NR_SUPER 256
#define MAY_EXEC 1
Hi,
> user% ./fd-exhaustion # e.g. while(1) open("/dev/null",...);
> root# cat /proc/sys/fs/file-nr
> cat: /proc/sys/fs/file-nr: Too many open files in system
>
> The above happens even with increased NR_RESERVED_FILES to 96 [no
> wonder, get_empty_filp is broken].
no, it is not broken. But your experiment is broken. Don't do cat file-nr
but compile this C program
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char *argv[])
{
int fd, len;
static char buf[2048];
fd = open("/proc/sys/fs/file-nr", O_RDONLY);
if (fd == -1) {
perror("open");
exit(1);
}
while (1) {
len = read(fd, buf, 1024);
printf("len=%d %s", len, buf);
lseek(fd, 0, SEEK_SET);
sleep(1);
}
return 0;
}
and leave it running while doing experiments on the other console. You
will see that everything is fine -- there is no bug. No wonder you saw the
bug -- you ignored my 4 emails telling you otherwise :)
Regards,
Tigran
Hi,
I have just retested under test12-pre7 and confirm my previous findings
i.e. there is no problem with get_empty_filp(), at least of the kind you
describe.
The definition of reserved file structures for root is -- those which are
taken from the freelist until it is not empty. In this sense it works and
I don't see why you think this definition is useless (it was like that for
ages, at least for the entire duration of 2.4.x kernels).
In brief -- it works, no patching is needed, unless you found some other
problem (e.g. you mentioned something about allocating more than NR_FILES
on SMP -- what do you mean?) which you are not explaining clearly.
You just say "it is broken and here is the patch" but that, imho, is not
enough. (ok, one could overcome the laziness and actually _read_ your
patch to see what you _think_ is broken but surely it is better if you
explain it yourself?). The current state of get_empty_filp() is simple and
readable -- how can it be broken?
Regards,
Tigran
On Sun, 10 Dec 2000, Tigran Aivazian wrote:
> enough. (ok, one could overcome the laziness and actually _read_ your
> patch to see what you _think_ is broken but surely it is better if you
> explain it yourself?).
This was not meant to say that I have not read your patch. Of course, I
have read it and already commented that you seem to be suggesting a new
policy for reserved file allocation, namely to allow touching the slab
cache for the sake of satisfying the root'd allocation requests. This,
imho, is unnecessary because the current design of
get_empty_filp() makes use of the freelist for this purpose -- there is no
need to extend or drop this constraint -- the simple check if freelist has
elements still, is quite sufficient.
Regards,
Tigran
On Sun, 10 Dec 2000, Tigran Aivazian wrote:
> > user% ./fd-exhaustion # e.g. while(1) open("/dev/null",...);
> > root# cat /proc/sys/fs/file-nr
> > cat: /proc/sys/fs/file-nr: Too many open files in system
> >
> > The above happens even with increased NR_RESERVED_FILES to 96 [no
> > wonder, get_empty_filp is broken].
>
> no, it is not broken. But your experiment is broken. Don't do cat file-nr
> but compile this C program
Ok, now I understand why you can't see the problem ;) You lookup the
values in user space but I did it [additionally] in kernel space [also
I think I understand what happens ;)]. I guess with the code below you
claim I shouldn't see values like this when file struct allocations
started by user apps,
1024 0 1024
Or 0 shouldn't be between 0 and NR_RESERVED_FILES. Right? Wrong. I saw
it happens, you can reproduce it if you lookup the nr_free_files
value, allocate that much by root, don't release them and
immediately after this start to allocate fd's by user app. Note, if
you already hit nr_files = max_files you won't ever be able to
reproduce the above - but this is a half solution, kernel 2.0 was
fine, get_empty_filp was broke somewhere between 2.0 and 2.1 and it's
still broken. With the patch the functionality is back and also works
the way what the authors of the book mentioned believe ;)
It's quite funny, because before I was also told this is broken but I
couldn't believe it, so I look the code and tested it, the report was
right ...
Still disagree? ;)
Szaka
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <unistd.h>
> #include <fcntl.h>
> #include <stdio.h>
> #include <stdlib.h>
>
> int main(int argc, char *argv[])
> {
> int fd, len;
> static char buf[2048];
>
> fd = open("/proc/sys/fs/file-nr", O_RDONLY);
> if (fd == -1) {
> perror("open");
> exit(1);
> }
> while (1) {
> len = read(fd, buf, 1024);
> printf("len=%d %s", len, buf);
> lseek(fd, 0, SEEK_SET);
> sleep(1);
> }
> return 0;
> }
>
> and leave it running while doing experiments on the other console. You
> will see that everything is fine -- there is no bug. No wonder you saw the
> bug -- you ignored my 4 emails telling you otherwise :)
>
> Regards,
> Tigran
>
>
On Sun, 10 Dec 2000, Tigran Aivazian wrote:
> problem (e.g. you mentioned something about allocating more than NR_FILES
> on SMP -- what do you mean?) which you are not explaining clearly.
E.g. situation, only one file struct left for allocation. One CPU goes
into get_empty_filp and before kmem_cache_alloc unlocks file_list,
another CPU gets also into get_empty_filp and locks file_list at the
top and goes on the same path, the end result potentially can be both
will increase nr_files instead of only one. But I don't think it's a
big issue at *present* that could cause any problems ...
> You just say "it is broken and here is the patch" but that, imho, is not
> enough. (ok, one could overcome the laziness and actually _read_ your
> patch to see what you _think_ is broken but surely it is better if you
> explain it yourself?).
Sorry I didn't explain, I thought it's short enough and significantly
faster to understand reading the code then my poor English ;)
Szaka
On Sun, 10 Dec 2000, Szabolcs Szakacsits wrote:
> Or 0 shouldn't be between 0 and NR_RESERVED_FILES. Right? Wrong. I saw
> it happens, you can reproduce it if you lookup the nr_free_files
> value, allocate that much by root, don't release them and
> immediately after this start to allocate fd's by user app.
Ok, let's slowly understand each other. The scenario you suggest is:
a) measure nr_free_files and let root exhaust all freelist entries. Now
the freelist is empty and nr_free_files = 0.
b) now let the user app allocate (from the slab cache) lots of new file
structures. He can keep doing so until nr_files hits max_files.
Now what? Now root tries to allocate some and obviously he can't because
the freelist is empty. So what? Where is the bug? This is an expected
behaviour to me -- since the freelist is empty there are no more reserved
file structures for root. Which part of this do you believe is wrong? Or
do you believe that this is not the case at all? I.e. something else
happens in the above scenario which I am still missing? If so, please
explain.
If, however, you believe that the above _is_ the case but it should _not_
happen then you are proposing a completely new policy of file structure
allocation which you believe is superior. It is quite possible so let's
all understand your new policy and let Linus decide whether it's better
than the existing one. But if so, don't tell me you are fixing a bug
because it is not a bug -- it's a redesign of file structure allocator.
Regards,
Tigran
On Sun, 10 Dec 2000, Tigran Aivazian wrote:
> Ok, let's slowly understand each other. The scenario you suggest is:
>
> a) measure nr_free_files and let root exhaust all freelist entries. Now
> the freelist is empty and nr_free_files = 0.
>
> b) now let the user app allocate (from the slab cache) lots of new file
> structures. He can keep doing so until nr_files hits max_files.
>
> Now what? Now root tries to allocate some and obviously he can't because
> the freelist is empty
.
... and neither can he allocate from the slab cache since nr_files ==
max_files now.
Now, if I understand your proposal correctly -- you would like to redefine
NR_RESERVED_FILES to be _guaranteed_ for root at any time regardless of
the allocation pattern instead of their current definition as guaranteed
number of elements on the _freelist_? Right?
So you would like to deny normal users some requests from the slab cache
if otherwise root's new NR_RESERVED_FILES wouldn't be honoured?
Did I understand your idea correctly? Such policy sounds sensible but is
certainly a redesign of file allocator and should be carefully checked if
it's ok in all cases...
If you agree with the above then we never disagreed to begin with -- I
just insisted (and still insist) that it is the expected behaviour,
perhaps not 100% satisfactory.
Also, I agree with your suggestion to increase NR_RESERVED_FILES -- with
both policies (the current and your new one) the current value of 10 is
way too small.
Regards,
Tigran
On Sun, 10 Dec 2000, Tigran Aivazian wrote:
> If, however, you believe that the above _is_ the case but it should _not_
> happen then you are proposing a completely new policy of file structure
> allocation which you believe is superior. It is quite possible so let's
> all understand your new policy and let Linus decide whether it's better
> than the existing one. But if so, don't tell me you are fixing a bug
> because it is not a bug -- it's a redesign of file structure allocator.
If it's not a bug then
- this comment from include/linux/fs.h should be deleted
#define NR_RESERVED_FILES 10 /* reserved for root */
- books should be updated
- people's mind also who believe kernel reserves fd's for superuser
Kernel from 2.1 plays lottery in this regards. And this would be
another sad fact that the kernel is exteremely poor *out of the box*
in regards security and relaibility ...
Szaka
On Sun, 10 Dec 2000, Szabolcs Szakacsits wrote:
>
> - this comment from include/linux/fs.h should be deleted
> #define NR_RESERVED_FILES 10 /* reserved for root */
well, not really -- it is "reserved" right now too, it is just root is
allowed to use up all the reserved entries in the beginning and then when
the normal user uses up all the "non-reserved" ones (from slab
cache) there would be nothing left for the root.
But let us not argue about the above definition of "reserved" -- that is
not productive. Let's do something productive -- namely, take your idea to
the next logical step. Since you have proven that the freelist mechanism
or concept of "reserve file structures" is not 100% satisfactory as is
then how about removing the freelist altogether? I.e. what about serving
each allocation request directly from the slab cache and imposing any
"reserved for root" policy purely by the nr_xxx_files counters?
The only argument against such idea would be "taking elements off the
freelist is probably faster than allocating from slab". But maybe not? Who
measured it? if slab allocator is so fast that the difference is
negligible then applying the same idea all over the kernel (e.g. to
superblocks etc) will remove a lot of code, which is a good thing.
Regards,
Tigran
On Sun, 10 Dec 2000, Tigran Aivazian wrote:
> On Sun, 10 Dec 2000, Szabolcs Szakacsits wrote:
> > - this comment from include/linux/fs.h should be deleted
> > #define NR_RESERVED_FILES 10 /* reserved for root */
> well, not really -- it is "reserved" right now too, it is just root is
> allowed to use up all the reserved entries in the beginning and then when
> the normal user uses up all the "non-reserved" ones (from slab
> cache) there would be nothing left for the root.
And what real functionality does this provide? Close to nada. This is
why I told you if you are right then it's usefuless. So I think this
is a bug that was introduced accidentaly overlooking NR_RESERVED_FILES
functionality when get_empty_filp was rewritten to use the slab.
> But let us not argue about the above definition of "reserved" -- that is
> not productive.
Agree, this is why I made the patch ;) Also, this stupid
misunderstanding and waste of time between us is a *very* typical
example of the result of the super inferior Linux kernel source code
management. No way to dig up who and why dropped the reserved file
functionality about three years ago. "Hidden", unexplained patches
slip in almost every patch-set. Some developers think they can save a
huge amount of time by this "model", they just ignore other developers
and support people who need to understand what, when, why and by who a
changes happened. And because of lack of enough information [look,
both of us have and I think understand the code, still we don't agree]
the end result is that, apparently by now too many times the ball is
dropped back to these developers who get buried by even more job. This
is just one sign Linux has a hard future and unfortunately there are
others .... In general Linux is still one of the best today but
without addressing and solving current development problems it will
not be true after a couple of years. Linux remains just another Unix
and lose in 1:100 to another OS. The source is with us but it should
be used properly ....
> Let's do something productive -- namely, take your idea to
> the next logical step. Since you have proven that the freelist mechanism
> or concept of "reserve file structures" is not 100% satisfactory as is
This is also a difference between us. You look the problem from a
theoretical point of you, saying it's not 100%, I consider it from
practical point of you and say it gives close to 0% functionality for
users.
> then how about removing the freelist altogether? I.e. what about serving
I'm fine with the current implementation and more interested in bug
fixes. There could be one reason against the patch, performance. The
patch below has the same fix and TUX will give exactly the same
numbers [get_empty_filp code remains ugly but at least fast].
Szaka
diff -ur linux-2.4.0-test12-pre7/fs/file_table.c linux/fs/file_table.c
--- linux-2.4.0-test12-pre7/fs/file_table.c Fri Dec 8 08:17:12 2000
+++ linux/fs/file_table.c Mon Dec 11 10:40:41 2000
@@ -57,7 +57,9 @@
/*
* Allocate a new one if we're below the limit.
*/
- if (files_stat.nr_files < files_stat.max_files) {
+ if ((files_stat.nr_files < files_stat.max_files) && (!current->euid ||
+ NR_RESERVED_FILES - files_stat.nr_free_files <
+ files_stat.max_files - files_stat.nr_files)) {
file_list_unlock();
f = kmem_cache_alloc(filp_cachep, SLAB_KERNEL);
file_list_lock();
diff -ur linux-2.4.0-test12-pre7/include/linux/fs.h linux/include/linux/fs.h
--- linux-2.4.0-test12-pre7/include/linux/fs.h Fri Dec 8 15:06:55 2000
+++ linux/include/linux/fs.h Sun Dec 10 17:37:52 2000
@@ -57,7 +57,7 @@
extern int leases_enable, dir_notify_enable, lease_break_time;
#define NR_FILE 8192 /* this can well be larger on a larger system */
-#define NR_RESERVED_FILES 10 /* reserved for root */
+#define NR_RESERVED_FILES 128 /* reserved for root */
#define NR_SUPER 256
#define MAY_EXEC 1