2005-09-14 18:31:39

by David Miller

[permalink] [raw]
Subject: [PATCH]: Brown paper bag in fs/file.c?


While tracking down a sparc64 bug, I spotted what seems to be a bug in
__free_fdtable(). (the sparc64 bug is that a user page gets
corrupted, and it's full of kernel filp pointers, I've verified that
the page mapped there is no longer in the SLAB cache, but the filp's
are still active in the SLAB, I've also ruled out cache aliasing and
tlb flushing bugs)

The bug is that free_fd_array() takes a "num" argument, but when
calling it from __free_fdtable() we're instead passing in the size in
bytes (ie. "num * sizeof(struct file *)").

How come this doesn't crash things for people? Perhaps I'm missing
something. fs/vmalloc.c should bark very loudly if we call it with a
non-vmalloc area address, since that is what would happen if we pass a
kmalloc() SLAB object address to vfree().

I think I know what might be happening. If the miscalculation means
that we kfree() the embedded fdarray, that would actually work just
fine, and free up the fdtable. I guess if the SLAB redzone stuff were
enabled for these caches, it would trigger when something like this
happens.

BTW, in mm/slab.c for FORCED_DEBUG we have:

if ((size < 4096 || fls(size-1) == fls(size-1+3*BYTES_PER_WORD)))
flags |= SLAB_RED_ZONE|SLAB_STORE_USER;

Isn't PAGE_SIZE intended here, not the constant "4096"? Just curious.

I'm about to reboot to see if this fixes the sparc64 problem, with my
luck it probably won't :-)

diff --git a/fs/file.c b/fs/file.c
--- a/fs/file.c
+++ b/fs/file.c
@@ -75,7 +75,7 @@ static void __free_fdtable(struct fdtabl
fdarray_size = fdt->max_fds * sizeof(struct file *);
free_fdset(fdt->open_fds, fdset_size);
free_fdset(fdt->close_on_exec, fdset_size);
- free_fd_array(fdt->fd, fdarray_size);
+ free_fd_array(fdt->fd, fdt->max_fds);
kfree(fdt);
}


2005-09-14 18:48:23

by David Miller

[permalink] [raw]
Subject: Re: [PATCH]: Brown paper bag in fs/file.c?

From: "David S. Miller" <[email protected]>
Date: Wed, 14 Sep 2005 11:31:33 -0700 (PDT)

> I'm about to reboot to see if this fixes the sparc64 problem, with my
> luck it probably won't :-)

It doesn't, but I think the patch is correct regardless.

2005-09-14 19:05:25

by David Miller

[permalink] [raw]
Subject: Re: [PATCH]: Brown paper bag in fs/file.c?

From: "David S. Miller" <[email protected]>
Date: Wed, 14 Sep 2005 11:48:21 -0700 (PDT)

> From: "David S. Miller" <[email protected]>
> Date: Wed, 14 Sep 2005 11:31:33 -0700 (PDT)
>
> > I'm about to reboot to see if this fixes the sparc64 problem, with my
> > luck it probably won't :-)
>
> It doesn't, but I think the patch is correct regardless.

Studying fs/file.c a bit more, there seems to be several of these
"size vs nr" errors in the allocation and freeing code.

What's going on here?

2005-09-14 19:24:23

by Dipankar Sarma

[permalink] [raw]
Subject: Re: [PATCH]: Brown paper bag in fs/file.c?

On Wed, Sep 14, 2005 at 11:31:33AM -0700, David S. Miller wrote:
>
> The bug is that free_fd_array() takes a "num" argument, but when
> calling it from __free_fdtable() we're instead passing in the size in
> bytes (ie. "num * sizeof(struct file *)").

Yes it is a bug. I think I messed it up while merging newer
changes with an older version where I was using size in bytes
to optimize.

>
> How come this doesn't crash things for people? Perhaps I'm missing
> something. fs/vmalloc.c should bark very loudly if we call it with a
> non-vmalloc area address, since that is what would happen if we pass a
> kmalloc() SLAB object address to vfree().
>
> I think I know what might be happening. If the miscalculation means
> that we kfree() the embedded fdarray, that would actually work just
> fine, and free up the fdtable. I guess if the SLAB redzone stuff were
> enabled for these caches, it would trigger when something like this
> happens.

__free_fdtable() is used only when the fdarray/fdset are vmalloced
(use of the workqueue) or there is a race between two expand_files().
That might be why we haven't seen this cause any explicit problem
so far.

This would be an appropriate patch - (untested). I will update
as soon as testing is done.

Thanks
Dipankar



Fixes the fdtable freeing in the case of vmalloced fdset/arrays.

Signed-off-by: Dipankar Sarma <[email protected]>
--


fs/file.c | 10 +++-------
1 files changed, 3 insertions(+), 7 deletions(-)

diff -puN fs/file.c~files-fix-fdtable-free fs/file.c
--- linux-2.6.14-rc1-fd/fs/file.c~files-fix-fdtable-free 2005-09-15 00:36:03.000000000 +0530
+++ linux-2.6.14-rc1-fd-dipankar/fs/file.c 2005-09-15 00:39:46.000000000 +0530
@@ -69,13 +69,9 @@ void free_fd_array(struct file **array,

static void __free_fdtable(struct fdtable *fdt)
{
- int fdset_size, fdarray_size;
-
- fdset_size = fdt->max_fdset / 8;
- fdarray_size = fdt->max_fds * sizeof(struct file *);
- free_fdset(fdt->open_fds, fdset_size);
- free_fdset(fdt->close_on_exec, fdset_size);
- free_fd_array(fdt->fd, fdarray_size);
+ free_fdset(fdt->open_fds, fdt->max_fdset);
+ free_fdset(fdt->close_on_exec, fdt->max_fdset);
+ free_fd_array(fdt->fd, fdt->max_fds);
kfree(fdt);
}


_

2005-09-14 19:57:55

by David Miller

[permalink] [raw]
Subject: Re: [PATCH]: Brown paper bag in fs/file.c?

From: Dipankar Sarma <[email protected]>
Date: Thu, 15 Sep 2005 00:48:42 +0530

> __free_fdtable() is used only when the fdarray/fdset are vmalloced
> (use of the workqueue) or there is a race between two expand_files().
> That might be why we haven't seen this cause any explicit problem
> so far.
>
> This would be an appropriate patch - (untested). I will update
> as soon as testing is done.

Thanks.

I still can't figure out what causes my sparc64 bug. Somehow a
kmalloc() chunk of file pointers gets freed too early, the SLAB is
shrunk due to memory pressure so the page containing that object gets
freed, that page ends up as an anonymous page in userspace, but filp
writes from the older usage occurs and corrupts the page.

I wonder if we simply leave a stale pointer around to the older
fd array in some case.

2005-09-14 20:21:30

by Dipankar Sarma

[permalink] [raw]
Subject: Re: [PATCH]: Brown paper bag in fs/file.c?

On Wed, Sep 14, 2005 at 12:57:50PM -0700, David S. Miller wrote:
> From: Dipankar Sarma <[email protected]>
> Date: Thu, 15 Sep 2005 00:48:42 +0530
>
> > __free_fdtable() is used only when the fdarray/fdset are vmalloced
> > (use of the workqueue) or there is a race between two expand_files().
> > That might be why we haven't seen this cause any explicit problem
> > so far.
> >
> > This would be an appropriate patch - (untested). I will update
> > as soon as testing is done.
>
> Thanks.
>
> I still can't figure out what causes my sparc64 bug. Somehow a
> kmalloc() chunk of file pointers gets freed too early, the SLAB is
> shrunk due to memory pressure so the page containing that object gets
> freed, that page ends up as an anonymous page in userspace, but filp
> writes from the older usage occurs and corrupts the page.
>
> I wonder if we simply leave a stale pointer around to the older
> fd array in some case.

Are you running with preemption enabled ? If so, fyi, I had sent
out a patch earlier that fixes locking for preemption.
Also, what triggers this in your machine ? I can try to reproduce
this albeit on a non-sparc64 box.

Thanks
Dipankar

2005-09-14 20:29:44

by David Miller

[permalink] [raw]
Subject: Re: [PATCH]: Brown paper bag in fs/file.c?

From: Dipankar Sarma <[email protected]>
Date: Thu, 15 Sep 2005 01:45:50 +0530

> Are you running with preemption enabled ? If so, fyi, I had sent
> out a patch earlier that fixes locking for preemption.
> Also, what triggers this in your machine ? I can try to reproduce
> this albeit on a non-sparc64 box.

No PREEMPT enabled.

I believe this bug has been around since before your RCU
changes, I've been seeing it for about half a year.

My test case is, on a uniprocessor with 1GB of ram, run the
following in parallel:

1) In one window, do a plain kernel build of 2.6.x

2) In another window, in a temporary directory, run this shell code:

while [ 1 ]
do
rm -rf .git *
cp -a ../linux-2.6/.git .
git-checkout-cache -a
git-update-cache --refresh
echo "Finished one pass..."
done

Adjust the path to a vanilla 2.6.x ".git" directory in #2 as needed.

Sometimes this can trigger in less than a minute. Other times it
takes 5 or 6 minutes, but other than this variance it is rather
reliable.

I think the key is to make sure you don't have any more than 1GB of
ram for this on a 64-bit box, so that there is sufficient memory
pressure from all of the ext3 writebacks and other fs activity in
order to shrink the file table SLABs and thus liberate the pages.

2005-09-14 21:17:49

by Eric Dumazet

[permalink] [raw]
Subject: [PATCH] reorder struct files_struct

--- linux-2.6.14-rc1/include/linux/file.h 2005-09-13 05:12:09.000000000 +0200
+++ linux-2.6.14-rc1-ed/include/linux/file.h 2005-09-15 01:09:13.000000000 +0200
@@ -34,12 +34,12 @@
*/
struct files_struct {
atomic_t count;
- spinlock_t file_lock; /* Protects all the below members. Nests inside tsk->alloc_lock */
struct fdtable *fdt;
struct fdtable fdtab;
fd_set close_on_exec_init;
fd_set open_fds_init;
struct file * fd_array[NR_OPEN_DEFAULT];
+ spinlock_t file_lock; /* Protects concurrent writers. Nests inside tsk->alloc_lock */
};

#define files_fdtable(files) (rcu_dereference((files)->fdt))


Attachments:
reorder_files_struct (660.00 B)

2005-09-14 21:42:22

by Peter Staubach

[permalink] [raw]
Subject: Re: [PATCH] reorder struct files_struct

Eric Dumazet wrote:

> Hi
>
> Browsing (and using) the excellent RCU infrastructure for files that
> was adopted for 2.6.14-rc1, I noticed that the file_lock spinlock sit
> close to mostly read fields of 'struct files_struct'
>
> In SMP (and NUMA) environnements, each time a thread wants to open or
> close a file, it has to acquire the spinlock, thus invalidating the
> cache line containing this spinlock on other CPUS. So other threads
> doing read()/write()/... calls that use RCU to access the file table
> are going to ask further memory (possibly NUMA) transactions to read
> again this memory line.
>
> Please consider applying this patch. It moves the spinlock to another
> cache line, so that concurrent threads can share the cache line
> containing 'count' and 'fdt' fields.


How was the performance impact of this change measured?

Thanx...

ps

2005-09-14 22:07:38

by Dipankar Sarma

[permalink] [raw]
Subject: Re: [PATCH] reorder struct files_struct

On Wed, Sep 14, 2005 at 11:17:42PM +0200, Eric Dumazet wrote:
> In SMP (and NUMA) environnements, each time a thread wants to open or close
> a file, it has to acquire the spinlock, thus invalidating the cache line
> containing this spinlock on other CPUS. So other threads doing
> read()/write()/... calls that use RCU to access the file table are going to
> ask further memory (possibly NUMA) transactions to read again this memory
> line.
>
> Please consider applying this patch. It moves the spinlock to another cache
> line, so that concurrent threads can share the cache line containing
> 'count' and 'fdt' fields.
>
> --- linux-2.6.14-rc1/include/linux/file.h 2005-09-13 05:12:09.000000000 +0200
> +++ linux-2.6.14-rc1-ed/include/linux/file.h 2005-09-15 01:09:13.000000000 +0200
> @@ -34,12 +34,12 @@
> */
> struct files_struct {
> atomic_t count;
> - spinlock_t file_lock; /* Protects all the below members. Nests inside tsk->alloc_lock */
> struct fdtable *fdt;
> struct fdtable fdtab;
> fd_set close_on_exec_init;
> fd_set open_fds_init;
> struct file * fd_array[NR_OPEN_DEFAULT];
> + spinlock_t file_lock; /* Protects concurrent writers. Nests inside tsk->alloc_lock */
> };
>
> #define files_fdtable(files) (rcu_dereference((files)->fdt))

For most apps without too many open fds, the embedded fd_sets
are going to be used. Wouldn't that mean that open()/close() will
invalidate the cache line containing fdt, fdtab by updating
the fd_sets ? If so, you optimization really doesn't help.


Thanks
Dipankar

2005-09-14 22:17:14

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] reorder struct files_struct

Dipankar Sarma <[email protected]> wrote:
>
> > --- linux-2.6.14-rc1/include/linux/file.h 2005-09-13 05:12:09.000000000 +0200
> > +++ linux-2.6.14-rc1-ed/include/linux/file.h 2005-09-15 01:09:13.000000000 +0200
> > @@ -34,12 +34,12 @@
> > */
> > struct files_struct {
> > atomic_t count;
> > - spinlock_t file_lock; /* Protects all the below members. Nests inside tsk->alloc_lock */
> > struct fdtable *fdt;
> > struct fdtable fdtab;
> > fd_set close_on_exec_init;
> > fd_set open_fds_init;
> > struct file * fd_array[NR_OPEN_DEFAULT];
> > + spinlock_t file_lock; /* Protects concurrent writers. Nests inside tsk->alloc_lock */
> > };
> >
> > #define files_fdtable(files) (rcu_dereference((files)->fdt))
>
> For most apps without too many open fds, the embedded fd_sets
> are going to be used. Wouldn't that mean that open()/close() will
> invalidate the cache line containing fdt, fdtab by updating
> the fd_sets ? If so, you optimization really doesn't help.

Guys, this is benchmarkable. fget() is astonishingly high in some
profiles - it's worth investigating.

2005-09-14 22:42:10

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH] reorder struct files_struct

Dipankar Sarma a ?crit :
> On Wed, Sep 14, 2005 at 11:17:42PM +0200, Eric Dumazet wrote:
>
>>In SMP (and NUMA) environnements, each time a thread wants to open or close
>>a file, it has to acquire the spinlock, thus invalidating the cache line
>>containing this spinlock on other CPUS. So other threads doing
>>read()/write()/... calls that use RCU to access the file table are going to
>>ask further memory (possibly NUMA) transactions to read again this memory
>>line.
>>
>>Please consider applying this patch. It moves the spinlock to another cache
>>line, so that concurrent threads can share the cache line containing
>>'count' and 'fdt' fields.
>>
>>--- linux-2.6.14-rc1/include/linux/file.h 2005-09-13 05:12:09.000000000 +0200
>>+++ linux-2.6.14-rc1-ed/include/linux/file.h 2005-09-15 01:09:13.000000000 +0200
>>@@ -34,12 +34,12 @@
>> */
>> struct files_struct {
>> atomic_t count;
>>- spinlock_t file_lock; /* Protects all the below members. Nests inside tsk->alloc_lock */
>> struct fdtable *fdt;
>> struct fdtable fdtab;
>> fd_set close_on_exec_init;
>> fd_set open_fds_init;
>> struct file * fd_array[NR_OPEN_DEFAULT];
>>+ spinlock_t file_lock; /* Protects concurrent writers. Nests inside tsk->alloc_lock */
>> };
>>
>> #define files_fdtable(files) (rcu_dereference((files)->fdt))
>
>
> For most apps without too many open fds, the embedded fd_sets
> are going to be used. Wouldn't that mean that open()/close() will
> invalidate the cache line containing fdt, fdtab by updating
> the fd_sets ? If so, you optimization really doesn't help.
>

If the embedded struct fdtable is used, then the only touched field is
'next_fd', so we could also move this field at the end of 'struct fdtable'

But I wonder if 'next_fd' really has to be in 'struct fdtable', maybe it could
be moved to 'struct files_struct' close to file_lock ?

If yes, the whole embedded struct fdtable is readonly.

Eric

2005-09-14 22:57:44

by Dipankar Sarma

[permalink] [raw]
Subject: Re: [PATCH] reorder struct files_struct

On Thu, Sep 15, 2005 at 12:42:03AM +0200, Eric Dumazet wrote:
> Dipankar Sarma a ?crit :
> >On Wed, Sep 14, 2005 at 11:17:42PM +0200, Eric Dumazet wrote:
> >
> >>--- linux-2.6.14-rc1/include/linux/file.h 2005-09-13
> >>05:12:09.000000000 +0200
> >>+++ linux-2.6.14-rc1-ed/include/linux/file.h 2005-09-15
> >>01:09:13.000000000 +0200
> >>@@ -34,12 +34,12 @@
> >> */
> >>struct files_struct {
> >> atomic_t count;
> >>- spinlock_t file_lock; /* Protects all the below members.
> >>Nests inside tsk->alloc_lock */
> >> struct fdtable *fdt;
> >> struct fdtable fdtab;
> >> fd_set close_on_exec_init;
> >> fd_set open_fds_init;
> >> struct file * fd_array[NR_OPEN_DEFAULT];
> >>+ spinlock_t file_lock; /* Protects concurrent writers. Nests
> >>inside tsk->alloc_lock */
> >>};
> >>
> >>#define files_fdtable(files) (rcu_dereference((files)->fdt))
> >
> >
> >For most apps without too many open fds, the embedded fd_sets
> >are going to be used. Wouldn't that mean that open()/close() will
> >invalidate the cache line containing fdt, fdtab by updating
> >the fd_sets ? If so, you optimization really doesn't help.
> >
>
> If the embedded struct fdtable is used, then the only touched field is
> 'next_fd', so we could also move this field at the end of 'struct fdtable'
>

Not just embedded fdtable, but also the embedded fdsets. I would expect
count, fdt, fdtab and the fdsets to fit into one cache line in
some archs.


> But I wonder if 'next_fd' really has to be in 'struct fdtable', maybe it
> could be moved to 'struct files_struct' close to file_lock ?

next_fd has to be in struct fdtable. It needs to be consistent
with whichever fdtable a lock-free reader sees.

>
> If yes, the whole embedded struct fdtable is readonly.

But not close_on_exec_init or open_fds_init. We would update them
on open/close.

Some benchmarking would be useful here.

Thanks
Dipankar

2005-09-14 23:19:53

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH] reorder struct files_struct

/*
* Bench program to exercice multi threads using open()/close()/read()/lseek() calls.
* Usage :
* bench [-t XX] [-l len]
* XX : number of threads
* len : bench time in seconds
* -s : small fdset : try to use embedded sruct fdtable
*/
#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

pthread_mutex_t mut = PTHREAD_MUTEX_INITIALIZER;

char *file = "/etc/passwd";
int sflag; /* small fdset */
int end_prog;
unsigned long work_done;

void *perform_work(void *arg)
{
int fd, i;
unsigned long units = 0;
char c;

/* force this program to open more than 64 fds */
if (!sflag)
for (i = 0 ; i < 64 ; i++)
open("/dev/null", O_RDONLY);

while (!end_prog) {
fd = open(file, O_RDONLY);
read(fd, &c, 1);
lseek(fd, 10, SEEK_SET);
read(fd, &c, 1);
lseek(fd, 20, SEEK_SET);
read(fd, &c, 1);
lseek(fd, 30, SEEK_SET);
read(fd, &c, 1);
lseek(fd, 40, SEEK_SET);
read(fd, &c, 1);
close(fd);
units++;
}
pthread_mutex_lock(&mut);
work_done += units;
pthread_mutex_unlock(&mut);
return 0;
}

void usage(int code)
{
fprintf(stderr, "Usage : bench [-s] [-t threads] [-l duration]\n");
exit(code);
}

int main(int argc, char *argv[])
{
int i, c;
int nbthreads = 2;
unsigned int length = 10;
pthread_t *tid;

while ((c = getopt(argc, argv, "st:l:")) != -1) {
if (c == 't')
nbthreads = atoi(optarg);
else if (c == 'l')
length = atoi(optarg);
else if (c == 's')
sflag = 1;
else usage(1);
}


tid = malloc(nbthreads*sizeof(pthread_t));
for (i = 0 ; i < nbthreads; i++)
pthread_create(tid + i, NULL, perform_work, NULL);

sleep(length);
end_prog = 1;
for (i = 0 ; i < nbthreads; i++)
pthread_join(tid[i], NULL);

pthread_mutex_lock(&mut);
printf("%d threads, %u seconds, work_done=%lu\n", nbthreads, length, work_done);
pthread_mutex_unlock(&mut);
return 0;
}


Attachments:
bench.c (1.83 kB)

2005-09-15 05:00:42

by Dipankar Sarma

[permalink] [raw]
Subject: Re: [PATCH] reorder struct files_struct

On Thu, Sep 15, 2005 at 01:19:47AM +0200, Eric Dumazet wrote:
> Dipankar Sarma a ?crit :
> >On Thu, Sep 15, 2005 at 12:42:03AM +0200, Eric Dumazet wrote:
> >
> >>Dipankar Sarma a ?crit :
>
> >>If yes, the whole embedded struct fdtable is readonly.
> >
> >
> >But not close_on_exec_init or open_fds_init. We would update them
> >on open/close.
>
> Yes, sure, but those fields are not part of the embedded struct fdtable

Those fdsets would share a cache line with fdt, fdtable which would
be invalidated on open/close. So, what is the point in moving
file_lock ?

Thanks
Dipankar

2005-09-15 06:17:48

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH] reorder struct files_struct

Dipankar Sarma a ?crit :
> On Thu, Sep 15, 2005 at 01:19:47AM +0200, Eric Dumazet wrote:
>
>>Dipankar Sarma a ?crit :
>>
>>>On Thu, Sep 15, 2005 at 12:42:03AM +0200, Eric Dumazet wrote:
>>>
>>>
>>>>Dipankar Sarma a ?crit :
>>
>>>>If yes, the whole embedded struct fdtable is readonly.
>>>
>>>
>>>But not close_on_exec_init or open_fds_init. We would update them
>>>on open/close.
>>
>>Yes, sure, but those fields are not part of the embedded struct fdtable
>
>
> Those fdsets would share a cache line with fdt, fdtable which would
> be invalidated on open/close. So, what is the point in moving
> file_lock ?
>

The point is that we gain nothing in this case for 32 bits platforms, but we
gain something on 64 bits platform. And for apps using more than
NR_OPEN_DEFAULT files we definitly win on both 32bits/64bits.

Maybe moving file_lock just before close_on_exec_init would be a better
choice, so that 32bits platform small apps touch one cache line instead of two.

sizeof(struct fdtable) = 40/0x28 on 32bits, 72/0x48 on 64 bits

struct files_struct {

/* mostly read */
atomic_t count; /* offset 0x00 */
struct fdtable *fdt; /* offset 0x04 (0x08 on 64 bits) */
struct fdtable fdtab; /* offset 0x08 (0x10 on 64 bits)*/

/* read/written for apps using small number of files */
fd_set close_on_exec_init; /* offset 0x30 (0x58 on 64 bits) */
fd_set open_fds_init; /* offset 0x34 (0x60 on 64 bits) */
struct file * fd_array[NR_OPEN_DEFAULT]; /* offset 0x38 (0x68 on 64 bits */
spinlock_t file_lock; /* 0xB8 (0x268 on 64 bits) */
}; /* size = 0xBC (0x270 on 64 bits) */

Moving next_fd from 'struct fdtable' to 'struct files_struct' is also a win
for 64bits platforms since sizeof(struct fdtable) become 64 : a nice power of
two, so 64 bytes are allocated instead of 128.

Eric

2005-09-15 09:41:04

by Dipankar Sarma

[permalink] [raw]
Subject: Re: [PATCH] reorder struct files_struct

On Thu, Sep 15, 2005 at 08:17:40AM +0200, Eric Dumazet wrote:
> Dipankar Sarma a ?crit :
> >On Thu, Sep 15, 2005 at 01:19:47AM +0200, Eric Dumazet wrote:
> >
> >Those fdsets would share a cache line with fdt, fdtable which would
> >be invalidated on open/close. So, what is the point in moving
> >file_lock ?
> >
>
> The point is that we gain nothing in this case for 32 bits platforms, but
> we gain something on 64 bits platform. And for apps using more than

I am not sure about that. IIRC, x86_64 has a 128-byte L1 cacheline.
So, count, fdt, fdtab, close_on_exec_init and open_fds_init would
all fit into one cache line. And close_on_exec_init will get updated
on open(). Also, most apps will not likely have more than the
default # of fds, it might not be a good idea to optimize for
that case.


> struct files_struct {
>
> /* mostly read */
> atomic_t count; /* offset 0x00 */
> struct fdtable *fdt; /* offset 0x04 (0x08 on 64 bits) */
> struct fdtable fdtab; /* offset 0x08 (0x10 on 64 bits)*/
>
> /* read/written for apps using small number of files */
> fd_set close_on_exec_init; /* offset 0x30 (0x58 on 64 bits) */
> fd_set open_fds_init; /* offset 0x34 (0x60 on 64 bits) */
> struct file * fd_array[NR_OPEN_DEFAULT]; /* offset 0x38 (0x68 on 64
> bits */
> spinlock_t file_lock; /* 0xB8 (0x268 on 64 bits) */
> }; /* size = 0xBC (0x270 on 64 bits) */
>
> Moving next_fd from 'struct fdtable' to 'struct files_struct' is also a win
> for 64bits platforms since sizeof(struct fdtable) become 64 : a nice power
> of two, so 64 bytes are allocated instead of 128.

Can you benchmark this on a higher end SMP/NUMA system ?

Thanks
Dipankar

2005-09-15 17:11:55

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH] reorder struct files_struct

/*
* Bench program to exercice multi threads using open()/close()/read()/lseek() calls.
* Usage :
* bench [-t XX] [-l len]
* XX : number of threads
* len : bench time in seconds
* -s : small fdset : try to use embedded sruct fdtable
*/
#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>
#include <string.h>
#include <sched.h>
#include <errno.h>
#include <signal.h>

static pthread_mutex_t mut = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

static int sflag; /* small fdset */
static int end_prog;
static unsigned long work_done;
static char onekilo[1024];

static void catch_alarm(int sig)
{
end_prog = 1;
}

static void *perform_work(void *arg)
{
int fd, i;
unsigned long units = 0;
int fds[64];
char filename[64];
char c;
strcpy(filename, "/tmp/benchXXXXXX");
fd = mkstemp(filename);
write(fd, onekilo, sizeof(onekilo));
close(fd);

/* force this program to open more than 64 fds */
if (!sflag)
for (i = 0 ; i < 64 ; i++)
fds[i] = open("/dev/null", O_RDONLY);

while (!end_prog) {
fd = open(filename, O_RDONLY);
read(fd, &c, 1);
lseek(fd, 10, SEEK_SET);
read(fd, &c, 1);
lseek(fd, 20, SEEK_SET);
read(fd, &c, 1);
lseek(fd, 30, SEEK_SET);
read(fd, &c, 1);
lseek(fd, 40, SEEK_SET);
read(fd, &c, 1);
close(fd);
units++;
}
unlink(filename);
if (!sflag)
for (i = 0 ; i < 64 ; i++)
close(fds[i]);
pthread_mutex_lock(&mut);
work_done += units;
pthread_mutex_unlock(&mut);
return 0;
}

static void *idle_thread(void *arg)
{
pthread_mutex_lock(&mut);
while (!end_prog) {
pthread_cond_wait(&cond, &mut);
}
pthread_mutex_unlock(&mut);
}

static void usage(int code)
{
fprintf(stderr, "Usage : bench [-i] [-s] [-a affinity_mask] [-t threads] [-l duration]\n");
exit(code);
}

int main(int argc, char *argv[])
{
int i, c;
int nbthreads = 2;
int iflag = 0;
unsigned int length = 10;
pthread_t *tid;
struct sigaction sg;
struct timeval t0, t1;
long mask = 0;

while ((c = getopt(argc, argv, "sit:l:a:")) != -1) {
if (c == 't')
nbthreads = atoi(optarg);
else if (c == 'l')
length = atoi(optarg);
else if (c == 's')
sflag = 1;
else if (c == 'i')
iflag = 1;
else if (c == 'a')
sscanf(optarg, "%li", &mask);
else usage(1);
}
if (mask != 0) {
int res = sched_setaffinity(0, &mask);
if (res != 0)
fprintf(stderr, "sched_affinity(0x%lx)->%d errno=%d\n", mask, res, errno);
}


tid = malloc(nbthreads*sizeof(pthread_t));
gettimeofday(&t0, NULL);
for (i = 1 ; i < nbthreads; i++)
pthread_create(tid + i, NULL, perform_work, NULL);
if (iflag)
pthread_create(tid, NULL, idle_thread, NULL);
memset(&sg, 0, sizeof(sg));
sg.sa_handler = catch_alarm;
sigaction(SIGALRM, &sg, NULL);
alarm(length);
perform_work(NULL);

if (iflag) {
pthread_cond_signal(&cond);
pthread_join(tid[0], NULL);
}
for (i = 1 ; i < nbthreads; i++)
pthread_join(tid[i], NULL);

gettimeofday(&t1, NULL);
t1.tv_sec -= t0.tv_sec;
t1.tv_usec -= t0.tv_usec;
if (t1.tv_usec < 0) {
t1.tv_usec += 1000000;
t1.tv_sec--;
}
pthread_mutex_lock(&mut);
printf("%d threads, %d.%06d seconds, work_done=%lu\n",
nbthreads, (int)t1.tv_sec, (int)t1.tv_usec, work_done);
pthread_mutex_unlock(&mut);
return 0;
}


Attachments:
bench.c (3.23 kB)

2005-09-15 21:06:17

by David Miller

[permalink] [raw]
Subject: Re: [PATCH]: Brown paper bag in fs/file.c?

From: Dipankar Sarma <[email protected]>
Date: Thu, 15 Sep 2005 01:45:50 +0530

> Are you running with preemption enabled ? If so, fyi, I had sent
> out a patch earlier that fixes locking for preemption.
> Also, what triggers this in your machine ? I can try to reproduce
> this albeit on a non-sparc64 box.

Dipankar, don't spend too much effort trying to reproduce my
problem. I'm back to thinking I might have some kind of sparc64
specific bug on my hands.

Thanks for all of your help.