pts/blogbench-1.1.0 is a benchmark designed to replicate the load of a
real-world busy file server, with multiple threads issuing random reads,
writes, and rewrites. When running the default configuration with multiple
parallel threads, hot spinlock contention on file_lock is observed in
alloc_fd(), file_close_fd() and put_unused_fd().

These 3 patches reduce the file_lock critical section in alloc_fd() and
close_fd(). As a result, pts/blogbench-1.1.0 read improves by 32% and write
by 15%, with over 30% of kernel cycles saved, on an Intel ICX 160-core
configuration with v6.8-rc6.
Yu Ma (3):
fs/file.c: add fast path in alloc_fd()
fs/file.c: conditionally clear full_fds
fs/file.c: move sanity_check from alloc_fd() to put_unused_fd()
fs/file.c | 27 ++++++++++++++++-----------
1 file changed, 16 insertions(+), 11 deletions(-)
--
2.43.0
For most cases there is an available fd in the lower 64 bits of the open_fds
bitmap when we look for a free slot. Skip the 2-level search via
find_next_zero_bit() for this common fast path.

Look directly for a free bit in the lower 64 bits of the open_fds bitmap when
a free slot is available there, because:
(1) The fd allocation algorithm always allocates fds from low to high. Lower
bits in the open_fds bitmap are used much more frequently than higher bits.
(2) After the fdt is expanded (the bitmap size doubles on each expansion), it
is never shrunk. The search size increases, but open fds rarely reach the
higher bits.
(3) There is a fast path inside find_next_zero_bit() when size <= 64, which
speeds up the search.

With the fast path added to alloc_fd() via a one-time bitmap search,
pts/blogbench-1.1.0 read improves by 20% and write by 10% on an Intel ICX
160-core configuration with v6.8-rc6.
Reviewed-by: Tim Chen <[email protected]>
Signed-off-by: Yu Ma <[email protected]>
---
fs/file.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/fs/file.c b/fs/file.c
index 3b683b9101d8..e8d2f9ef7fd1 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -510,8 +510,13 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
if (fd < files->next_fd)
fd = files->next_fd;
- if (fd < fdt->max_fds)
+ if (fd < fdt->max_fds) {
+ if (~fdt->open_fds[0]) {
+ fd = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, fd);
+ goto success;
+ }
fd = find_next_fd(fdt, fd);
+ }
/*
* N.B. For clone tasks sharing a files structure, this test
@@ -531,7 +536,7 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
*/
if (error)
goto repeat;
-
+success:
if (start <= files->next_fd)
files->next_fd = fd + 1;
--
2.43.0
Each 64-bit word in open_fds maps to one bit in full_fds_bits. It is very
likely that the corresponding bit in full_fds_bits has already been cleared
by a previous __clear_open_fd(). Check the bit in full_fds_bits before
clearing it, to avoid an unnecessary write and the cache bouncing it causes.
See commit fc90888d07b8 ("vfs: conditionally clear close-on-exec flag") for
a similar optimization.

Together with patch 1, this improves pts/blogbench-1.1.0 read by 28% and
write by 14% on an Intel ICX 160-core configuration with v6.8-rc6.
Reviewed-by: Tim Chen <[email protected]>
Signed-off-by: Yu Ma <[email protected]>
---
fs/file.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/file.c b/fs/file.c
index e8d2f9ef7fd1..a0e94a178c0b 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -268,7 +268,9 @@ static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt)
static inline void __clear_open_fd(unsigned int fd, struct fdtable *fdt)
{
__clear_bit(fd, fdt->open_fds);
- __clear_bit(fd / BITS_PER_LONG, fdt->full_fds_bits);
+ fd /= BITS_PER_LONG;
+ if (test_bit(fd, fdt->full_fds_bits))
+ __clear_bit(fd, fdt->full_fds_bits);
}
static unsigned int count_open_files(struct fdtable *fdt)
--
2.43.0
alloc_fd() has a sanity check inside to make sure the struct file mapped to
the allocated fd is NULL. Move the sanity check from the performance-critical
alloc_fd() path to the non-critical put_unused_fd() path.

The initial NULL condition is assured by zero initialization in init_file;
we just need to make sure the slot is NULL when the fd is recycled. There
are 3 functions that call __put_unused_fd() to return an fd:
file_close_fd_locked(), do_close_on_exec() and put_unused_fd(). The first
two already implement the NULL check; add it to put_unused_fd() as well to
cover all release paths.

Combined with patches 1 and 2 in the series, pts/blogbench-1.1.0 read
improves by 32% and write by 15% on an Intel ICX 160-core configuration with
v6.8-rc6.
Reviewed-by: Tim Chen <[email protected]>
Signed-off-by: Yu Ma <[email protected]>
---
fs/file.c | 14 ++++++--------
1 file changed, 6 insertions(+), 8 deletions(-)
diff --git a/fs/file.c b/fs/file.c
index a0e94a178c0b..59d62909e2e3 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -548,13 +548,6 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
else
__clear_close_on_exec(fd, fdt);
error = fd;
-#if 1
- /* Sanity check */
- if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
- printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
- rcu_assign_pointer(fdt->fd[fd], NULL);
- }
-#endif
out:
spin_unlock(&files->file_lock);
@@ -572,7 +565,7 @@ int get_unused_fd_flags(unsigned flags)
}
EXPORT_SYMBOL(get_unused_fd_flags);
-static void __put_unused_fd(struct files_struct *files, unsigned int fd)
+static inline void __put_unused_fd(struct files_struct *files, unsigned int fd)
{
struct fdtable *fdt = files_fdtable(files);
__clear_open_fd(fd, fdt);
@@ -583,7 +576,12 @@ static void __put_unused_fd(struct files_struct *files, unsigned int fd)
void put_unused_fd(unsigned int fd)
{
struct files_struct *files = current->files;
+ struct fdtable *fdt = files_fdtable(files);
spin_lock(&files->file_lock);
+ if (unlikely(rcu_access_pointer(fdt->fd[fd]))) {
+ printk(KERN_WARNING "put_unused_fd: slot %d not NULL!\n", fd);
+ rcu_assign_pointer(fdt->fd[fd], NULL);
+ }
__put_unused_fd(files, fd);
spin_unlock(&files->file_lock);
}
--
2.43.0
On Fri, Jun 14, 2024 at 12:34:16PM -0400, Yu Ma wrote:
> alloc_fd() has a sanity check inside to make sure the FILE object mapping to the
Total nitpick: FILE is the libc thing, I would refer to it as 'struct
file'. See below for the actual point.
> Combined with patch 1 and 2 in series, pts/blogbench-1.1.0 read improved by
> 32%, write improved by 15% on Intel ICX 160 cores configuration with v6.8-rc6.
>
> Reviewed-by: Tim Chen <[email protected]>
> Signed-off-by: Yu Ma <[email protected]>
> ---
> fs/file.c | 14 ++++++--------
> 1 file changed, 6 insertions(+), 8 deletions(-)
>
> diff --git a/fs/file.c b/fs/file.c
> index a0e94a178c0b..59d62909e2e3 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -548,13 +548,6 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> else
> __clear_close_on_exec(fd, fdt);
> error = fd;
> -#if 1
> - /* Sanity check */
> - if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
> - printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
> - rcu_assign_pointer(fdt->fd[fd], NULL);
> - }
> -#endif
>
I was going to ask when anyone last saw this fire and suggest getting rid
of it if enough time(tm) has passed. Turns out it does show up sometimes;
the latest result I found is 2017 vintage:
https://groups.google.com/g/syzkaller-bugs/c/jfQ7upCDf9s/m/RQjhDrZ7AQAJ
So you are moving this to another locked area, but one which does not
execute in the benchmark?
Patch 2/3 states 28% read and 14% write increase, this commit message
claims it goes up to 32% and 15% respectively -- pretty big. I presume
this has to do with bouncing a line containing the fd.
I would argue moving this check elsewhere is about as good as removing
it altogether, but that's for the vfs overlords to decide.
All that aside, looking at disasm of alloc_fd it is pretty clear there
is time to save, for example:
	if (unlikely(fd >= fdt->max_fds)) {
		if (fd >= end) {
			error = -EMFILE;
			goto out;
		}
		error = expand_files(files, fd);
		if (error < 0)
			goto out;
		if (error)
			goto repeat;
	}
This elides 2 branches and a func call in the common case. Completely
untested, maybe has some brainfarts, feel free to take without credit
and further massage the routine.
Moreover my disasm shows that even looking for a bit results in
a func call(!) to _find_next_zero_bit -- someone(tm) should probably
massage it into another inline.
After the above massaging is done and if it turns out the check has to
stay, you can plausibly damage-control it with prefetch -- issue it
immediately after finding the fd number, before any other work.
All that said, by the above I'm confident there is still *some*
performance left on the table despite the lock.
> out:
> spin_unlock(&files->file_lock);
> @@ -572,7 +565,7 @@ int get_unused_fd_flags(unsigned flags)
> }
> EXPORT_SYMBOL(get_unused_fd_flags);
>
> -static void __put_unused_fd(struct files_struct *files, unsigned int fd)
> +static inline void __put_unused_fd(struct files_struct *files, unsigned int fd)
> {
> struct fdtable *fdt = files_fdtable(files);
> __clear_open_fd(fd, fdt);
> @@ -583,7 +576,12 @@ static void __put_unused_fd(struct files_struct *files, unsigned int fd)
> void put_unused_fd(unsigned int fd)
> {
> struct files_struct *files = current->files;
> + struct fdtable *fdt = files_fdtable(files);
> spin_lock(&files->file_lock);
> + if (unlikely(rcu_access_pointer(fdt->fd[fd]))) {
> + printk(KERN_WARNING "put_unused_fd: slot %d not NULL!\n", fd);
> + rcu_assign_pointer(fdt->fd[fd], NULL);
> + }
> __put_unused_fd(files, fd);
> spin_unlock(&files->file_lock);
> }
On Sat, Jun 15, 2024 at 06:41:45AM +0200, Mateusz Guzik wrote:
> On Fri, Jun 14, 2024 at 12:34:16PM -0400, Yu Ma wrote:
> > alloc_fd() has a sanity check inside to make sure the FILE object mapping to the
>
> Total nitpick: FILE is the libc thing, I would refer to it as 'struct
> file'. See below for the actual point.
>
> > Combined with patch 1 and 2 in series, pts/blogbench-1.1.0 read improved by
> > 32%, write improved by 15% on Intel ICX 160 cores configuration with v6.8-rc6.
> >
> > Reviewed-by: Tim Chen <[email protected]>
> > Signed-off-by: Yu Ma <[email protected]>
> > ---
> > fs/file.c | 14 ++++++--------
> > 1 file changed, 6 insertions(+), 8 deletions(-)
> >
> > diff --git a/fs/file.c b/fs/file.c
> > index a0e94a178c0b..59d62909e2e3 100644
> > --- a/fs/file.c
> > +++ b/fs/file.c
> > @@ -548,13 +548,6 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> > else
> > __clear_close_on_exec(fd, fdt);
> > error = fd;
> > -#if 1
> > - /* Sanity check */
> > - if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
> > - printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
> > - rcu_assign_pointer(fdt->fd[fd], NULL);
> > - }
> > -#endif
> >
>
> I was going to ask when was the last time anyone seen this fire and
> suggest getting rid of it if enough time(tm) passed. Turns out it does
> show up sometimes, latest result I found is 2017 vintage:
> https://groups.google.com/g/syzkaller-bugs/c/jfQ7upCDf9s/m/RQjhDrZ7AQAJ
>
> So you are moving this to another locked area, but one which does not
> execute in the benchmark?
>
> Patch 2/3 states 28% read and 14% write increase, this commit message
> claims it goes up to 32% and 15% respectively -- pretty big. I presume
> this has to do with bouncing a line containing the fd.
>
> I would argue moving this check elsewhere is about as good as removing
> it altogether, but that's for the vfs overlords to decide.
>
> All that aside, looking at disasm of alloc_fd it is pretty clear there
> is time to save, for example:
>
> if (unlikely(nr >= fdt->max_fds)) {
> if (fd >= end) {
> error = -EMFILE;
> goto out;
> }
> error = expand_files(fd, fd);
> if (error < 0)
> goto out;
> if (error)
> goto repeat;
> }
>
Now that I wrote it I noticed the fd < end check has to be performed
regardless of max_fds -- someone could have changed rlimit to a lower
value after using a higher fd. But the main point stands: the call to
expand_files and associated error handling don't have to be there.
> This elides 2 branches and a func call in the common case. Completely
> untested, maybe has some brainfarts, feel free to take without credit
> and further massage the routine.
>
> Moreover my disasm shows that even looking for a bit results in
> a func call(!) to _find_next_zero_bit -- someone(tm) should probably
> massage it into another inline.
>
> After the above massaging is done and if it turns out the check has to
> stay, you can plausibly damage-control it with prefetch -- issue it
> immediately after finding the fd number, before any other work.
>
> All that said, by the above I'm confident there is still *some*
> performance left on the table despite the lock.
>
> > out:
> > spin_unlock(&files->file_lock);
> > @@ -572,7 +565,7 @@ int get_unused_fd_flags(unsigned flags)
> > }
> > EXPORT_SYMBOL(get_unused_fd_flags);
> >
> > -static void __put_unused_fd(struct files_struct *files, unsigned int fd)
> > +static inline void __put_unused_fd(struct files_struct *files, unsigned int fd)
> > {
> > struct fdtable *fdt = files_fdtable(files);
> > __clear_open_fd(fd, fdt);
> > @@ -583,7 +576,12 @@ static void __put_unused_fd(struct files_struct *files, unsigned int fd)
> > void put_unused_fd(unsigned int fd)
> > {
> > struct files_struct *files = current->files;
> > + struct fdtable *fdt = files_fdtable(files);
> > spin_lock(&files->file_lock);
> > + if (unlikely(rcu_access_pointer(fdt->fd[fd]))) {
> > + printk(KERN_WARNING "put_unused_fd: slot %d not NULL!\n", fd);
> > + rcu_assign_pointer(fdt->fd[fd], NULL);
> > + }
> > __put_unused_fd(files, fd);
> > spin_unlock(&files->file_lock);
> > }
On Fri, Jun 14, 2024 at 12:34:14PM -0400, Yu Ma wrote:
> There is available fd in the lower 64 bits of open_fds bitmap for most cases
> when we look for an available fd slot. Skip 2-levels searching via
> find_next_zero_bit() for this common fast path.
>
> Look directly for an open bit in the lower 64 bits of open_fds bitmap when a
> free slot is available there, as:
> (1) The fd allocation algorithm would always allocate fd from small to large.
> Lower bits in open_fds bitmap would be used much more frequently than higher
> bits.
> (2) After fdt is expanded (the bitmap size doubled for each time of expansion),
> it would never be shrunk. The search size increases but there are few open fds
> available here.
> (3) There is fast path inside of find_next_zero_bit() when size<=64 to speed up
> searching.
>
> With the fast path added in alloc_fd() through one-time bitmap searching,
> pts/blogbench-1.1.0 read is improved by 20% and write by 10% on Intel ICX 160
> cores configuration with v6.8-rc6.
>
> Reviewed-by: Tim Chen <[email protected]>
> Signed-off-by: Yu Ma <[email protected]>
> ---
> fs/file.c | 9 +++++++--
> 1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/fs/file.c b/fs/file.c
> index 3b683b9101d8..e8d2f9ef7fd1 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -510,8 +510,13 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> if (fd < files->next_fd)
> fd = files->next_fd;
>
> - if (fd < fdt->max_fds)
> + if (fd < fdt->max_fds) {
> + if (~fdt->open_fds[0]) {
> + fd = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, fd);
> + goto success;
> + }
> fd = find_next_fd(fdt, fd);
> + }
>
> /*
> * N.B. For clone tasks sharing a files structure, this test
> @@ -531,7 +536,7 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
> */
> if (error)
> goto repeat;
> -
> +success:
> if (start <= files->next_fd)
> files->next_fd = fd + 1;
>
As indicated in my other e-mail, a process may reach a certain fd number
and then lower its rlimit(NOFILE). In that case the max_fds field can
happen to be higher than the limit, and the above patch will fail to check
for the (fd < end) case.
> --
> 2.43.0
>
On 6/15/2024 12:41 PM, Mateusz Guzik wrote:
> On Fri, Jun 14, 2024 at 12:34:16PM -0400, Yu Ma wrote:
>> alloc_fd() has a sanity check inside to make sure the FILE object mapping to the
> Total nitpick: FILE is the libc thing, I would refer to it as 'struct
> file'. See below for the actual point.
Good point, not a nitpick at all ;) I will update the wording in the commit message.
>> Combined with patch 1 and 2 in series, pts/blogbench-1.1.0 read improved by
>> 32%, write improved by 15% on Intel ICX 160 cores configuration with v6.8-rc6.
>>
>> Reviewed-by: Tim Chen <[email protected]>
>> Signed-off-by: Yu Ma <[email protected]>
>> ---
>> fs/file.c | 14 ++++++--------
>> 1 file changed, 6 insertions(+), 8 deletions(-)
>>
>> diff --git a/fs/file.c b/fs/file.c
>> index a0e94a178c0b..59d62909e2e3 100644
>> --- a/fs/file.c
>> +++ b/fs/file.c
>> @@ -548,13 +548,6 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
>> else
>> __clear_close_on_exec(fd, fdt);
>> error = fd;
>> -#if 1
>> - /* Sanity check */
>> - if (rcu_access_pointer(fdt->fd[fd]) != NULL) {
>> - printk(KERN_WARNING "alloc_fd: slot %d not NULL!\n", fd);
>> - rcu_assign_pointer(fdt->fd[fd], NULL);
>> - }
>> -#endif
>>
> I was going to ask when was the last time anyone seen this fire and
> suggest getting rid of it if enough time(tm) passed. Turns out it does
> show up sometimes, latest result I found is 2017 vintage:
> https://groups.google.com/g/syzkaller-bugs/c/jfQ7upCDf9s/m/RQjhDrZ7AQAJ
>
> So you are moving this to another locked area, but one which does not
> execute in the benchmark?
The consideration here, as mentioned, is to reduce the performance impact
while keeping the sanity check and its functionality, by moving it from the
critical path to a non-critical one: put_unused_fd() is mostly used for
error handling, when an fd has been allocated successfully but the struct
file could not be obtained in the next step.
>
> Patch 2/3 states 28% read and 14% write increase, this commit message
> claims it goes up to 32% and 15% respectively -- pretty big. I presume
> this has to do with bouncing a line containing the fd.
>
> I would argue moving this check elsewhere is about as good as removing
> it altogether, but that's for the vfs overlords to decide
>
> All that aside, looking at disasm of alloc_fd it is pretty clear there
> is time to save, for example:
>
> if (unlikely(nr >= fdt->max_fds)) {
> if (fd >= end) {
> error = -EMFILE;
> goto out;
> }
> error = expand_files(fd, fd);
> if (error < 0)
> goto out;
> if (error)
> goto repeat;
> }
>
> This elides 2 branches and a func call in the common case. Completely
> untested, maybe has some brainfarts, feel free to take without credit
> and further massage the routine.
>
> Moreover my disasm shows that even looking for a bit results in
> a func call(!) to _find_next_zero_bit -- someone(tm) should probably
> massage it into another inline.
>
> After the above massaging is done and if it turns out the check has to
> stay, you can plausibly damage-control it with prefetch -- issue it
> immediately after finding the fd number, before any other work.
>
> All that said, by the above I'm confident there is still *some*
> performance left on the table despite the lock.
Thank you Guzik for the quick check and good suggestions :) Yes, there are
extra branches and a function call that can be removed for better
performance. Considering the fix for the fast path, how about the flow in
the draft below (besides the massage to find_next_fd()):
	error = -EMFILE;
	if (fd < fdt->max_fds) {
		if (~fdt->open_fds[0]) {
			fd = find_next_zero_bit(fdt->open_fds,
						BITS_PER_LONG, fd);
			goto fastreturn;
		}
		fd = find_next_fd(fdt, fd);
	}

	if (unlikely(fd >= fdt->max_fds)) {
		error = expand_files(files, fd);
		if (error < 0)
			goto out;
		if (error)
			goto repeat;
	}
fastreturn:
	if (unlikely(fd >= end))
		goto out;
	if (start <= files->next_fd)
		files->next_fd = fd + 1;
	....
>> out:
>> spin_unlock(&files->file_lock);
>> @@ -572,7 +565,7 @@ int get_unused_fd_flags(unsigned flags)
>> }
>> EXPORT_SYMBOL(get_unused_fd_flags);
>>
>> -static void __put_unused_fd(struct files_struct *files, unsigned int fd)
>> +static inline void __put_unused_fd(struct files_struct *files, unsigned int fd)
>> {
>> struct fdtable *fdt = files_fdtable(files);
>> __clear_open_fd(fd, fdt);
>> @@ -583,7 +576,12 @@ static void __put_unused_fd(struct files_struct *files, unsigned int fd)
>> void put_unused_fd(unsigned int fd)
>> {
>> struct files_struct *files = current->files;
>> + struct fdtable *fdt = files_fdtable(files);
>> spin_lock(&files->file_lock);
>> + if (unlikely(rcu_access_pointer(fdt->fd[fd]))) {
>> + printk(KERN_WARNING "put_unused_fd: slot %d not NULL!\n", fd);
>> + rcu_assign_pointer(fdt->fd[fd], NULL);
>> + }
>> __put_unused_fd(files, fd);
>> spin_unlock(&files->file_lock);
>> }
On 6/15/2024 2:31 PM, Mateusz Guzik wrote:
> On Fri, Jun 14, 2024 at 12:34:14PM -0400, Yu Ma wrote:
>> There is available fd in the lower 64 bits of open_fds bitmap for most cases
>> when we look for an available fd slot. Skip 2-levels searching via
>> find_next_zero_bit() for this common fast path.
>>
>> Look directly for an open bit in the lower 64 bits of open_fds bitmap when a
>> free slot is available there, as:
>> (1) The fd allocation algorithm would always allocate fd from small to large.
>> Lower bits in open_fds bitmap would be used much more frequently than higher
>> bits.
>> (2) After fdt is expanded (the bitmap size doubled for each time of expansion),
>> it would never be shrunk. The search size increases but there are few open fds
>> available here.
>> (3) There is fast path inside of find_next_zero_bit() when size<=64 to speed up
>> searching.
>>
>> With the fast path added in alloc_fd() through one-time bitmap searching,
>> pts/blogbench-1.1.0 read is improved by 20% and write by 10% on Intel ICX 160
>> cores configuration with v6.8-rc6.
>>
>> Reviewed-by: Tim Chen <[email protected]>
>> Signed-off-by: Yu Ma <[email protected]>
>> ---
>> fs/file.c | 9 +++++++--
>> 1 file changed, 7 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/file.c b/fs/file.c
>> index 3b683b9101d8..e8d2f9ef7fd1 100644
>> --- a/fs/file.c
>> +++ b/fs/file.c
>> @@ -510,8 +510,13 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
>> if (fd < files->next_fd)
>> fd = files->next_fd;
>>
>> - if (fd < fdt->max_fds)
>> + if (fd < fdt->max_fds) {
>> + if (~fdt->open_fds[0]) {
>> + fd = find_next_zero_bit(fdt->open_fds, BITS_PER_LONG, fd);
>> + goto success;
>> + }
>> fd = find_next_fd(fdt, fd);
>> + }
>>
>> /*
>> * N.B. For clone tasks sharing a files structure, this test
>> @@ -531,7 +536,7 @@ static int alloc_fd(unsigned start, unsigned end, unsigned flags)
>> */
>> if (error)
>> goto repeat;
>> -
>> +success:
>> if (start <= files->next_fd)
>> files->next_fd = fd + 1;
>>
> As indicated in my other e-mail it may be a process can reach a certain
> fd number and then lower its rlimit(NOFILE). In that case the max_fds
> field can happen to be higher and the above patch will fail to check for
> the (fd < end) case.
Thanks for the good catch; I replied in that mail thread with details.
>
>> --
>> 2.43.0
>>