2001-04-09 14:41:43

by Maneesh Soni

Subject: [RFC][PATCH] Scalable FD Management using Read-Copy-Update


Scalable FD Management Using Read-Copy-Update
---------------------------------------------

This patch provides a significant performance improvement in file
descriptor management for the SMP Linux kernel on a 4-way machine, with the
expectation of even higher gains on higher-end machines. The patch uses the
read-copy-update mechanism for Linux, published earlier at the SourceForge
site under the Linux Scalability Effort project:
http://lse.sourceforge.net/locking/rclock.html.

In the SMP kernel, performance is limited by the reader-writer lock taken
during various operations on files_struct, the majority of them in fget().
Although there is no severe contention for files->file_lock, since files_struct
is a per-task data structure, a significant performance penalty is still paid
merely to acquire the read lock: the lock's cache line bounces between CPUs
when multiple clones share the same files_struct. This was pointed out by
John Hawkes in his posting to the lse-tech mailing list:
http://marc.theaimsgroup.com/?l=lse-tech&m=98235007317770&w=2
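To illustrate the idea, here is a hypothetical userspace sketch (made-up
miniature types, not the code in this patch): an RCU-style lookup performs
only loads on the shared table, so the only shared cache line it dirties is
the per-file reference count, never a per-table lock word.

```c
#include <stddef.h>

/* Hypothetical miniatures of the kernel structures (illustration only). */
struct file_sk  { int refcount; };
struct files_sk { struct file_sk **fd; int max_fds; };

/* Lock-free reader path: a rough analogue of fget() without read_lock().
 * One load of the published array pointer, one load of the slot; no shared
 * lock word is ever written. */
static struct file_sk *fget_sketch(struct files_sk *files, unsigned int fd)
{
    struct file_sk **array = files->fd;     /* single dependent load */
    if (fd >= (unsigned int)files->max_fds)
        return NULL;
    struct file_sk *f = array[fd];
    if (f)
        f->refcount++;   /* touches the per-file line, not a table lock */
    return f;
}

/* Small self-check of the sketch. */
static int fget_demo(void)
{
    struct file_sk a = { 1 };
    struct file_sk *slots[2] = { &a, NULL };
    struct files_sk files = { slots, 2 };

    struct file_sk *got = fget_sketch(&files, 0);
    return got == &a && a.refcount == 2
        && fget_sketch(&files, 1) == NULL    /* empty slot */
        && fget_sketch(&files, 5) == NULL;   /* out of range */
}
```

On the read side this is the whole trick: with no read_lock() there is no
write to a shared lock cache line, so clones sharing one files_struct stop
bouncing it between CPUs.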

The improvement in performance while running the "chat" benchmark
(from http://lbs.sourceforge.net/) is about 30% in average throughput.
For both configurations the results compare the base kernel (2.4.2) against
the base kernel with files_struct_rcu-2.4.2-0.1.patch. Profiling data were
also collected using SGI's kernprof utility, and they show a considerable
decrease in the amount of time spent in fget(). The "chat" benchmark was run
with rooms=20 and messages=500. For each configuration the test was run 50
times and the average throughput, in messages per second, was taken.

1. 4-way PIII Xeon 700 MHz with 1MB L2 Cache and 1GB RAM
========================================================

Chat benchmark results
----------------------
Kernel version                                 Average Throughput

2.4.2                                                      191986
2.4.2+files_struct_rcu-2.4.2-0.1.patch                     253083

Improvement = 31.8%

kernprof results
---------------

Kernel Version - 2.4.2

default_idle [C01071EC]: 150696
schedule [C0112EE4]: 105452
__wake_up [C0113518]: 74030
tcp_sendmsg [C020FB14]: 29201
fget [C013436C]: 16318
__generic_copy_to_user [C023A13C]: 15477
USER [C0121DF4]: 12925
tcp_recvmsg [C0210A68]: 7737
system_call [C0109150]: 7399
mcount [C023A4E4]: 5509

Kernel version - 2.4.2+files_struct_rcu-2.4.2-0.1.patch

schedule [C0113174]: 101392
__wake_up [C01137D4]: 68182
default_idle [C01071EC]: 32833
tcp_sendmsg [C021ECE4]: 29318
__generic_copy_to_user [C024930C]: 15472
USER [C0122A00]: 12803
tcp_recvmsg [C021FC38]: 8170
system_call [C0109150]: 7636
mcount [C02496B4]: 6150
fget [C0134F58]: 5694

With this patch the routine fget() receives about 65% fewer hits.

2. 2-way PIII Xeon 700 MHz with 1MB L2 Cache and 1GB RAM
========================================================

Chat benchmark results
----------------------
Kernel version                                 Average Throughput

2.4.2                                                      209592
2.4.2+files_struct_rcu-2.4.2-0.1.patch                     222729

Improvement = 6.2%

kernprof results
-----------------
Kernel Version - 2.4.2

schedule [C0112EE4]: 37583
tcp_sendmsg [C020FB14]: 23381
default_idle [C01071EC]: 20025
__wake_up [C0113518]: 16141
fget [C013436C]: 12733
USER [C0121DF4]: 12474
__generic_copy_to_user [C023A13C]: 12046
tcp_recvmsg [C0210A68]: 7678
system_call [C0109150]: 7113
mcount [C023A4E4]: 4789

Kernel version - 2.4.2+files_struct_rcu-2.4.2-0.1.patch

default_idle [C01071EC]: 39086
schedule [C0113174]: 37278
tcp_sendmsg [C021ECE4]: 23148
__wake_up [C01137D4]: 15788
USER [C0122A00]: 12322
__generic_copy_to_user [C024930C]: 12138
tcp_recvmsg [C021FC38]: 7615
system_call [C0109150]: 7481
mcount [C02496B4]: 5310
fget [C0134F58]: 4858

The gains on the 4-way machine are much larger than on the 2-way, which makes
this patch even more necessary on higher-end SMP systems. Results for 8-way
will be published soon.

With this patch, updates to the files_struct, mainly expanding the fd_array
and expanding the bitmaps, are done using a callback mechanism. In the
routine expand_fd_array(), once the new array has been allocated, filled with
the old entries, and linked into the files_struct, a callback is registered
to free the old fd_array. Any existing users in the kernel keep using the old
fd_array, while new users see the expanded one. Once a quiescent state is
reached, the callback fires and the old fd_array is freed. Along similar
lines, the routine expand_fdset() has been modified to use a callback to free
the old bitmaps.
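The sequence above can be sketched in userspace C (hypothetical names; the
real patch uses rc_alloc_callback()/rc_callback() with RC_MEMSYNC() barriers
and xchg() to publish the new pointer):

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical miniatures of the kernel structures (illustration only). */
struct file_sk  { int dummy; };
struct files_sk { struct file_sk **fd; int max_fds; };

/* Copy-update expansion: returns the OLD array, which the caller must free
 * only after a quiescent state (the RCU callback firing); concurrent
 * readers see either the old or the new array, both valid. */
static struct file_sk **expand_sketch(struct files_sk *files, int nfds)
{
    struct file_sk **new_fds = calloc(nfds, sizeof(*new_fds));
    if (!new_fds)
        return NULL;
    memcpy(new_fds, files->fd, files->max_fds * sizeof(*new_fds));

    /* Publish the new array; the real code issues memory barriers here. */
    struct file_sk **old_fds = files->fd;
    files->fd = new_fds;
    files->max_fds = nfds;
    return old_fds;                /* defer-free this */
}

/* Small self-check: old entries survive, new tail is zeroed. */
static int expand_demo(void)
{
    struct file_sk a, b;
    struct files_sk files;
    files.max_fds = 2;
    files.fd = calloc(2, sizeof(struct file_sk *));
    files.fd[0] = &a;
    files.fd[1] = &b;

    struct file_sk **old = expand_sketch(&files, 8);
    int ok = old != files.fd && files.max_fds == 8
          && files.fd[0] == &a && files.fd[1] == &b
          && files.fd[7] == NULL;
    free(old);                     /* stands in for the callback firing */
    free(files.fd);
    return ok;
}
```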

Known Race Condition
--------------------
There is a possible race condition between sys_close() and fget() when both
run against the same files_struct on different CPUs. Though this problem has
not been observed so far, it is theoretically possible and is being worked
on. A patch with a solution for it will follow soon.
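A deterministic userspace replay of the suspected interleaving (hypothetical
miniature types, not the kernel code) shows the window: fget() loads the slot
just before sys_close() clears it and drops the last reference, leaving the
reader holding logically freed memory.

```c
#include <stdlib.h>

/* Hypothetical miniature of a file and a one-slot fd table. */
struct file_sk { int refcount; };

/* Replays the racy interleaving step by step on one CPU. */
static int race_demo(void)
{
    struct file_sk *f = malloc(sizeof(*f));
    f->refcount = 1;
    struct file_sk *table[1] = { f };

    /* CPU A, in fget(): loads the slot before it is cleared... */
    struct file_sk *seen = table[0];

    /* CPU B, in sys_close(): clears the slot and drops the last ref. */
    table[0] = NULL;
    int last_ref_dropped = (--f->refcount == 0);
    /* ...here the kernel would free f. */

    /* CPU A resumes: bumps the refcount of logically freed memory. */
    seen->refcount++;

    free(f);
    return last_ref_dropped;   /* 1: the use-after-free window exists */
}
```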


Usage Information
-----------------
The patch is built against the 2.4.2 kernel. Before applying this patch, the
user must first apply the read-copy-update patch for Linux
(rclock-2.4.2-0.1.patch), which can be obtained from
http://lse.sourceforge.net/locking/rclock.html.

The config options required for this patch are CONFIG_RCLOCK and CONFIG_FD_RCU.




Regards,
Maneesh

--
Maneesh Soni <[email protected]>
IBM Linux Technology Center,
IBM Software Lab, Bangalore, India


---------------------------------------patch------------------------------------

diff -urN linux-2.4.2+rc/arch/i386/config.in linux-2.4.2+rc+fs/arch/i386/config.in
--- linux-2.4.2+rc/arch/i386/config.in Thu Apr 5 16:10:07 2001
+++ linux-2.4.2+rc+fs/arch/i386/config.in Fri Apr 6 12:50:16 2001
@@ -155,6 +155,9 @@
bool 'Math emulation' CONFIG_MATH_EMULATION
bool 'MTRR (Memory Type Range Register) support' CONFIG_MTRR
bool 'Read-Copy-Update lock support' CONFIG_RCLOCK
+if [ "$CONFIG_RCLOCK" = "y" ]; then
+ bool 'File Descriptor Management using RCU' CONFIG_FD_RCU
+fi
bool 'Symmetric multi-processing support' CONFIG_SMP
if [ "$CONFIG_SMP" != "y" ]; then
bool 'APIC and IO-APIC support on uniprocessors' CONFIG_X86_UP_IOAPIC
diff -urN linux-2.4.2+rc/drivers/char/tty_io.c linux-2.4.2+rc+fs/drivers/char/tty_io.c
--- linux-2.4.2+rc/drivers/char/tty_io.c Wed Feb 14 02:45:04 2001
+++ linux-2.4.2+rc+fs/drivers/char/tty_io.c Thu Apr 5 18:04:45 2001
@@ -1841,7 +1841,9 @@
}
task_lock(p);
if (p->files) {
+#ifndef CONFIG_FD_RCU
read_lock(&p->files->file_lock);
+#endif
/* FIXME: p->files could change */
for (i=0; i < p->files->max_fds; i++) {
filp = fcheck_files(p->files, i);
@@ -1851,7 +1853,9 @@
break;
}
}
+#ifndef CONFIG_FD_RCU
read_unlock(&p->files->file_lock);
+#endif
}
task_unlock(p);
}
diff -urN linux-2.4.2+rc/fs/exec.c linux-2.4.2+rc+fs/fs/exec.c
--- linux-2.4.2+rc/fs/exec.c Sat Feb 10 08:33:39 2001
+++ linux-2.4.2+rc+fs/fs/exec.c Thu Apr 5 17:49:44 2001
@@ -475,7 +475,11 @@
{
long j = -1;

+#ifdef CONFIG_FD_RCU
+ spin_lock(&files->file_lock);
+#else
write_lock(&files->file_lock);
+#endif
for (;;) {
unsigned long set, i;

@@ -487,16 +491,28 @@
if (!set)
continue;
files->close_on_exec->fds_bits[j] = 0;
+#ifdef CONFIG_FD_RCU
+ spin_unlock(&files->file_lock);
+#else
write_unlock(&files->file_lock);
+#endif
for ( ; set ; i++,set >>= 1) {
if (set & 1) {
sys_close(i);
}
}
+#ifdef CONFIG_FD_RCU
+ spin_lock(&files->file_lock);
+#else
write_lock(&files->file_lock);
+#endif

}
+#ifdef CONFIG_FD_RCU
+ spin_unlock(&files->file_lock);
+#else
write_unlock(&files->file_lock);
+#endif
}

/*
diff -urN linux-2.4.2+rc/fs/fcntl.c linux-2.4.2+rc+fs/fs/fcntl.c
--- linux-2.4.2+rc/fs/fcntl.c Thu Nov 16 12:20:25 2000
+++ linux-2.4.2+rc+fs/fs/fcntl.c Thu Apr 5 18:00:41 2001
@@ -63,7 +63,11 @@
int error;
int start;

+#ifdef CONFIG_FD_RCU
+ spin_lock(&files->file_lock);
+#else
write_lock(&files->file_lock);
+#endif

repeat:
/*
@@ -109,7 +113,11 @@
{
FD_SET(fd, files->open_fds);
FD_CLR(fd, files->close_on_exec);
+#ifdef CONFIG_FD_RCU
+ spin_unlock(&files->file_lock);
+#else
write_unlock(&files->file_lock);
+#endif
fd_install(fd, file);
}

@@ -125,7 +133,11 @@
return ret;

out_putf:
+#ifdef CONFIG_FD_RCU
+ spin_unlock(&files->file_lock);
+#else
write_unlock(&files->file_lock);
+#endif
fput(file);
return ret;
}
@@ -136,7 +148,11 @@
struct file * file, *tofree;
struct files_struct * files = current->files;

+#ifdef CONFIG_FD_RCU
+ spin_lock(&files->file_lock);
+#else
write_lock(&files->file_lock);
+#endif
if (!(file = fcheck(oldfd)))
goto out_unlock;
err = newfd;
@@ -167,7 +183,11 @@
files->fd[newfd] = file;
FD_SET(newfd, files->open_fds);
FD_CLR(newfd, files->close_on_exec);
+#ifdef CONFIG_FD_RCU
+ spin_unlock(&files->file_lock);
+#else
write_unlock(&files->file_lock);
+#endif

if (tofree)
filp_close(tofree, files);
@@ -175,11 +195,19 @@
out:
return err;
out_unlock:
+#ifdef CONFIG_FD_RCU
+ spin_unlock(&files->file_lock);
+#else
write_unlock(&files->file_lock);
+#endif
goto out;

out_fput:
+#ifdef CONFIG_FD_RCU
+ spin_unlock(&files->file_lock);
+#else
write_unlock(&files->file_lock);
+#endif
fput(file);
goto out;
}
diff -urN linux-2.4.2+rc/fs/file.c linux-2.4.2+rc+fs/fs/file.c
--- linux-2.4.2+rc/fs/file.c Sat Feb 10 00:59:44 2001
+++ linux-2.4.2+rc+fs/fs/file.c Thu Apr 5 17:39:29 2001
@@ -14,6 +14,17 @@

#include <asm/bitops.h>

+#ifdef CONFIG_FD_RCU
+#include <linux/rclock.h>
+
+struct fdset_callback_func_args {
+ int nfds;
+ fd_set *new_openset;
+ fd_set *new_execset;
+};
+typedef struct fdset_callback_func_args fdset_callback_func_args_t;
+
+#endif

/*
* Allocate an fd array, using kmalloc or vmalloc.
@@ -48,6 +59,109 @@
vfree(array);
}

+
+#ifdef CONFIG_FD_RCU
+
+int fd_array_callback_func(rc_callback_t *cb, void *arg1, void *arg2)
+{
+ struct file **old_fds = (struct file **)arg1;
+ unsigned long nfds = (unsigned long ) arg2;
+
+ free_fd_array(old_fds, nfds);
+ rc_free_callback(cb);
+
+ return 0;
+}
+
+int expand_fd_array(struct files_struct *files, int nr)
+{
+ struct file **new_fds;
+ int error, nfds;
+ rc_callback_t *fd_cb;
+
+ error = -EMFILE;
+ if (files->max_fds >= NR_OPEN || nr >= NR_OPEN)
+ goto out;
+
+ nfds = files->max_fds;
+ spin_unlock(&files->file_lock);
+
+ /*
+ * Expand to the max in easy steps, and keep expanding it until
+ * we have enough for the requested fd array size.
+ */
+
+ do {
+#if NR_OPEN_DEFAULT < 256
+ if (nfds < 256)
+ nfds = 256;
+ else
+#endif
+ if (nfds < (PAGE_SIZE / sizeof(struct file *)))
+ nfds = PAGE_SIZE / sizeof(struct file *);
+ else {
+ nfds = nfds * 2;
+ if (nfds > NR_OPEN)
+ nfds = NR_OPEN;
+ }
+ } while (nfds <= nr);
+
+ error = -ENOMEM;
+
+ fd_cb = rc_alloc_callback(fd_array_callback_func, NULL, NULL,
+ GFP_ATOMIC);
+ new_fds = alloc_fd_array(nfds);
+ spin_lock(&files->file_lock);
+ if (!new_fds) {
+ rc_free_callback(fd_cb);
+ goto out;
+ }
+
+ /* Copy the existing array and install the new pointer */
+
+ if (nfds > files->max_fds) {
+ struct file **old_fds = files->fd;
+ int i = files->max_fds;
+
+
+ /* Don't copy/clear the array if we are creating a new
+ fd array for fork() */
+ if (i) {
+ memcpy(new_fds, old_fds, i * sizeof(struct file *));
+ /* clear the remainder of the array */
+ memset(&new_fds[i], 0,
+ (nfds-i) * sizeof(struct file *));
+ }
+
+ RC_MEMSYNC();
+
+ old_fds = xchg(&files->fd, new_fds);
+
+ RC_MEMSYNC();
+
+ i = xchg(&files->max_fds, nfds);
+
+ if (i) {
+ fd_cb->arg1 = old_fds;
+ fd_cb->arg2 = (void *) i;
+ rc_callback(fd_cb);
+ } else {
+ rc_free_callback(fd_cb);
+ }
+
+ } else {
+ /* Somebody expanded the array while we slept ... */
+ spin_unlock(&files->file_lock);
+ free_fd_array(new_fds, nfds);
+ rc_free_callback(fd_cb);
+ spin_lock(&files->file_lock);
+ }
+ error = 0;
+out:
+ return error;
+}
+
+#else /* CONFIG_FD_RCU */
/*
* Expand the fd array in the files_struct. Called with the files
* spinlock held for write.
@@ -123,6 +237,7 @@
out:
return error;
}
+#endif /* CONFIG_FD_RCU */

/*
* Allocate an fdset array, using kmalloc or vmalloc.
@@ -157,6 +272,117 @@
vfree(array);
}

+#ifdef CONFIG_FD_RCU
+
+int fdset_callback_func(rc_callback_t *cb, void *arg1, void *arg2)
+{
+ int nfds = ((fdset_callback_func_args_t *)arg1)->nfds;
+ fd_set *new_openset = ((fdset_callback_func_args_t *)arg1)->new_openset;
+ fd_set *new_execset = ((fdset_callback_func_args_t *)arg1)->new_execset;
+
+ free_fdset (new_openset, nfds);
+ free_fdset (new_execset, nfds);
+
+ kfree(arg1);
+ rc_free_callback(cb);
+
+ return 0;
+}
+
+int expand_fdset(struct files_struct *files, int nr)
+{
+ fd_set *new_openset = 0, *new_execset = 0;
+ int error, nfds = 0;
+ rc_callback_t *fd_cb = NULL;
+ fdset_callback_func_args_t *args = NULL;
+
+ error = -EMFILE;
+ if (files->max_fdset >= NR_OPEN || nr >= NR_OPEN)
+ goto out;
+
+ nfds = files->max_fdset;
+ spin_unlock(&files->file_lock);
+
+ /* Expand to the max in easy steps */
+ do {
+ if (nfds < (PAGE_SIZE * 8))
+ nfds = PAGE_SIZE * 8;
+ else {
+ nfds = nfds * 2;
+ if (nfds > NR_OPEN)
+ nfds = NR_OPEN;
+ }
+ } while (nfds <= nr);
+
+ error = -ENOMEM;
+ new_openset = alloc_fdset(nfds);
+ new_execset = alloc_fdset(nfds);
+
+ fd_cb = rc_alloc_callback(fdset_callback_func, NULL, NULL, GFP_ATOMIC);
+ args = (fdset_callback_func_args_t *)kmalloc(sizeof(*args), GFP_ATOMIC);
+
+ spin_lock(&files->file_lock);
+ if (!new_openset || !new_execset || !fd_cb || !args)
+ goto out;
+
+ error = 0;
+
+ /* Copy the existing tables and install the new pointers */
+ if (nfds > files->max_fdset) {
+ int i = files->max_fdset / (sizeof(unsigned long) * 8);
+ int count = (nfds - files->max_fdset) / 8;
+
+ /*
+ * Don't copy the entire array if the current fdset is
+ * not yet initialised.
+ */
+ if (i) {
+ memcpy (new_openset, files->open_fds, files->max_fdset/8);
+ memcpy (new_execset, files->close_on_exec, files->max_fdset/8);
+ memset (&new_openset->fds_bits[i], 0, count);
+ memset (&new_execset->fds_bits[i], 0, count);
+ }
+
+
+ RC_MEMSYNC();
+
+ nfds = xchg(&files->max_fdset, nfds);
+
+ RC_MEMSYNC();
+
+ new_openset = xchg(&files->open_fds, new_openset);
+
+ RC_MEMSYNC();
+
+ new_execset = xchg(&files->close_on_exec, new_execset);
+
+ args->nfds = nfds;
+ args->new_openset = new_openset;
+ args->new_execset = new_execset;
+
+ fd_cb->arg1 = args;
+
+ rc_callback(fd_cb);
+
+ return 0;
+ }
+ /* Somebody expanded the array while we slept ... */
+
+out:
+ spin_unlock(&files->file_lock);
+ if (new_openset)
+ free_fdset(new_openset, nfds);
+ if (new_execset)
+ free_fdset(new_execset, nfds);
+ if (fd_cb)
+ rc_free_callback(fd_cb);
+ if (args)
+ kfree(args);
+ spin_lock(&files->file_lock);
+ return error;
+}
+
+#else
/*
* Expand the fdset in the files_struct. Called with the files spinlock
* held for write.
@@ -230,3 +456,4 @@
return error;
}

+#endif /* CONFIG_FD_RCU */
diff -urN linux-2.4.2+rc/fs/file_table.c linux-2.4.2+rc+fs/fs/file_table.c
--- linux-2.4.2+rc/fs/file_table.c Tue Dec 5 23:57:31 2000
+++ linux-2.4.2+rc+fs/fs/file_table.c Fri Apr 6 19:25:43 2001
@@ -127,11 +127,15 @@
struct file * file;
struct files_struct *files = current->files;

+#ifndef CONFIG_FD_RCU
read_lock(&files->file_lock);
+#endif
file = fcheck(fd);
if (file)
get_file(file);
+#ifndef CONFIG_FD_RCU
read_unlock(&files->file_lock);
+#endif
return file;
}

diff -urN linux-2.4.2+rc/fs/open.c linux-2.4.2+rc+fs/fs/open.c
--- linux-2.4.2+rc/fs/open.c Sat Feb 10 00:59:44 2001
+++ linux-2.4.2+rc+fs/fs/open.c Thu Apr 5 17:45:22 2001
@@ -16,6 +16,9 @@
#include <linux/tty.h>

#include <asm/uaccess.h>
+#ifdef CONFIG_FD_RCU
+#include <linux/rclock.h>
+#endif

#define special_file(m) (S_ISCHR(m)||S_ISBLK(m)||S_ISFIFO(m)||S_ISSOCK(m))

@@ -688,7 +691,11 @@
int fd, error;

error = -EMFILE;
+#ifdef CONFIG_FD_RCU
+ spin_lock(&files->file_lock);
+#else
write_lock(&files->file_lock);
+#endif

repeat:
fd = find_next_zero_bit(files->open_fds,
@@ -737,7 +744,11 @@
error = fd;

out:
+#ifdef CONFIG_FD_RCU
+ spin_unlock(&files->file_lock);
+#else
write_unlock(&files->file_lock);
+#endif
return error;
}

@@ -818,7 +829,11 @@
struct file * filp;
struct files_struct *files = current->files;

+#ifdef CONFIG_FD_RCU
+ spin_lock(&files->file_lock);
+#else
write_lock(&files->file_lock);
+#endif
if (fd >= files->max_fds)
goto out_unlock;
filp = files->fd[fd];
@@ -827,11 +842,19 @@
files->fd[fd] = NULL;
FD_CLR(fd, files->close_on_exec);
__put_unused_fd(files, fd);
+#ifdef CONFIG_FD_RCU
+ spin_unlock(&files->file_lock);
+#else
write_unlock(&files->file_lock);
+#endif
return filp_close(filp, files);

out_unlock:
+#ifdef CONFIG_FD_RCU
+ spin_unlock(&files->file_lock);
+#else
write_unlock(&files->file_lock);
+#endif
return -EBADF;
}

diff -urN linux-2.4.2+rc/fs/proc/base.c linux-2.4.2+rc+fs/fs/proc/base.c
--- linux-2.4.2+rc/fs/proc/base.c Fri Nov 17 02:48:26 2000
+++ linux-2.4.2+rc+fs/fs/proc/base.c Thu Apr 5 18:03:44 2001
@@ -739,12 +739,16 @@
task_unlock(task);
if (!files)
goto out_unlock;
+#ifndef CONFIG_FD_RCU
read_lock(&files->file_lock);
+#endif
file = inode->u.proc_i.file = fcheck_files(files, fd);
if (!file)
goto out_unlock2;
get_file(file);
+#ifndef CONFIG_FD_RCU
read_unlock(&files->file_lock);
+#endif
put_files_struct(files);
inode->i_op = &proc_pid_link_inode_operations;
inode->i_size = 64;
@@ -760,7 +764,9 @@

out_unlock2:
put_files_struct(files);
+#ifndef CONFIG_FD_RCU
read_unlock(&files->file_lock);
+#endif
out_unlock:
iput(inode);
out:
diff -urN linux-2.4.2+rc/fs/select.c linux-2.4.2+rc+fs/fs/select.c
--- linux-2.4.2+rc/fs/select.c Sat Feb 10 00:59:44 2001
+++ linux-2.4.2+rc+fs/fs/select.c Thu Apr 5 17:50:20 2001
@@ -166,9 +166,13 @@
int retval, i, off;
long __timeout = *timeout;

+#ifndef CONFIG_FD_RCU
read_lock(&current->files->file_lock);
+#endif
retval = max_select_fd(n, fds);
+#ifndef CONFIG_FD_RCU
read_unlock(&current->files->file_lock);
+#endif

if (retval < 0)
return retval;
diff -urN linux-2.4.2+rc/include/linux/file.h linux-2.4.2+rc+fs/include/linux/file.h
--- linux-2.4.2+rc/include/linux/file.h Wed Aug 23 23:52:26 2000
+++ linux-2.4.2+rc+fs/include/linux/file.h Thu Apr 5 16:34:02 2001
@@ -12,21 +12,33 @@
{
struct files_struct *files = current->files;
int res;
+#ifndef CONFIG_FD_RCU
read_lock(&files->file_lock);
+#endif
res = FD_ISSET(fd, files->close_on_exec);
+#ifndef CONFIG_FD_RCU
read_unlock(&files->file_lock);
+#endif
return res;
}

static inline void set_close_on_exec(unsigned int fd, int flag)
{
struct files_struct *files = current->files;
+#ifdef CONFIG_FD_RCU
+ spin_lock(&files->file_lock);
+#else
write_lock(&files->file_lock);
+#endif
if (flag)
FD_SET(fd, files->close_on_exec);
else
FD_CLR(fd, files->close_on_exec);
+#ifdef CONFIG_FD_RCU
+ spin_unlock(&files->file_lock);
+#else
write_unlock(&files->file_lock);
+#endif
}

static inline struct file * fcheck_files(struct files_struct *files, unsigned int fd)
@@ -66,9 +78,17 @@
{
struct files_struct *files = current->files;

+#ifdef CONFIG_FD_RCU
+ spin_lock(&files->file_lock);
+#else
write_lock(&files->file_lock);
+#endif
__put_unused_fd(files, fd);
+#ifdef CONFIG_FD_RCU
+ spin_unlock(&files->file_lock);
+#else
write_unlock(&files->file_lock);
+#endif
}

/*
@@ -87,12 +107,21 @@
static inline void fd_install(unsigned int fd, struct file * file)
{
struct files_struct *files = current->files;
-
+
+#ifdef CONFIG_FD_RCU
+ spin_lock(&files->file_lock);
+#else
write_lock(&files->file_lock);
+#endif
if (files->fd[fd])
BUG();
files->fd[fd] = file;
+
+#ifdef CONFIG_FD_RCU
+ spin_unlock(&files->file_lock);
+#else
write_unlock(&files->file_lock);
+#endif
}

void put_files_struct(struct files_struct *fs);
diff -urN linux-2.4.2+rc/include/linux/sched.h linux-2.4.2+rc+fs/include/linux/sched.h
--- linux-2.4.2+rc/include/linux/sched.h Thu Feb 22 05:39:58 2001
+++ linux-2.4.2+rc+fs/include/linux/sched.h Fri Apr 6 14:09:18 2001
@@ -167,7 +167,11 @@
*/
struct files_struct {
atomic_t count;
+#ifdef CONFIG_FD_RCU
+ spinlock_t file_lock;
+#else
rwlock_t file_lock;
+#endif
int max_fds;
int max_fdset;
int next_fd;
@@ -179,6 +183,22 @@
struct file * fd_array[NR_OPEN_DEFAULT];
};

+#ifdef CONFIG_FD_RCU
+#define INIT_FILES \
+{ \
+ count: ATOMIC_INIT(1), \
+ file_lock: SPIN_LOCK_UNLOCKED, \
+ max_fds: NR_OPEN_DEFAULT, \
+ max_fdset: __FD_SETSIZE, \
+ next_fd: 0, \
+ fd: &init_files.fd_array[0], \
+ close_on_exec: &init_files.close_on_exec_init, \
+ open_fds: &init_files.open_fds_init, \
+ close_on_exec_init: { { 0, } }, \
+ open_fds_init: { { 0, } }, \
+ fd_array: { NULL, } \
+}
+#else
#define INIT_FILES \
{ \
count: ATOMIC_INIT(1), \
@@ -193,6 +213,7 @@
open_fds_init: { { 0, } }, \
fd_array: { NULL, } \
}
+#endif

/* Maximum number of active map areas.. This is a random (large) number */
#define MAX_MAP_COUNT (65536)
diff -urN linux-2.4.2+rc/kernel/fork.c linux-2.4.2+rc+fs/kernel/fork.c
--- linux-2.4.2+rc/kernel/fork.c Sat Feb 10 00:59:44 2001
+++ linux-2.4.2+rc+fs/kernel/fork.c Thu Apr 5 16:22:08 2001
@@ -24,6 +24,9 @@
#include <asm/uaccess.h>
#include <asm/mmu_context.h>

+#ifdef CONFIG_FD_RCU
+#include <linux/rclock.h>
+#endif
/* The idle threads do not count.. */
int nr_threads;
int nr_running;
@@ -433,7 +436,11 @@

atomic_set(&newf->count, 1);

+#ifdef CONFIG_FD_RCU
+ newf->file_lock = SPIN_LOCK_UNLOCKED;
+#else
newf->file_lock = RW_LOCK_UNLOCKED;
+#endif
newf->next_fd = 0;
newf->max_fds = NR_OPEN_DEFAULT;
newf->max_fdset = __FD_SETSIZE;
@@ -446,13 +453,23 @@
size = oldf->max_fdset;
if (size > __FD_SETSIZE) {
newf->max_fdset = 0;
+#ifdef CONFIG_FD_RCU
+ spin_lock(&newf->file_lock);
+#else
write_lock(&newf->file_lock);
+#endif
error = expand_fdset(newf, size-1);
+#ifdef CONFIG_FD_RCU
+ spin_unlock(&newf->file_lock);
+#else
write_unlock(&newf->file_lock);
+#endif
if (error)
goto out_release;
}
+#ifndef CONFIG_FD_RCU
read_lock(&oldf->file_lock);
+#endif

open_files = count_open_files(oldf, size);

@@ -463,15 +480,27 @@
*/
nfds = NR_OPEN_DEFAULT;
if (open_files > nfds) {
+#ifndef CONFIG_FD_RCU
read_unlock(&oldf->file_lock);
+#endif
newf->max_fds = 0;
+#ifdef CONFIG_FD_RCU
+ spin_lock(&newf->file_lock);
+#else
write_lock(&newf->file_lock);
+#endif
error = expand_fd_array(newf, open_files-1);
+#ifdef CONFIG_FD_RCU
+ spin_unlock(&newf->file_lock);
+#else
write_unlock(&newf->file_lock);
+#endif
if (error)
goto out_release;
nfds = newf->max_fds;
+#ifndef CONFIG_FD_RCU
read_lock(&oldf->file_lock);
+#endif
}

old_fds = oldf->fd;
@@ -486,7 +515,9 @@
get_file(f);
*new_fds++ = f;
}
+#ifndef CONFIG_FD_RCU
read_unlock(&oldf->file_lock);
+#endif

/* compute the remainder to be cleared */
size = (newf->max_fds - open_files) * sizeof(struct file *);
diff -urN linux-2.4.2+rc/net/ipv4/netfilter/ipt_owner.c linux-2.4.2+rc+fs/net/ipv4/netfilter/ipt_owner.c
--- linux-2.4.2+rc/net/ipv4/netfilter/ipt_owner.c Mon Aug 14 01:23:19 2000
+++ linux-2.4.2+rc+fs/net/ipv4/netfilter/ipt_owner.c Thu Apr 5 18:06:21 2001
@@ -25,16 +25,22 @@
task_lock(p);
files = p->files;
if(files) {
+#ifndef CONFIG_FD_RCU
read_lock(&files->file_lock);
+#endif
for (i=0; i < files->max_fds; i++) {
if (fcheck_files(files, i) == skb->sk->socket->file) {
+#ifndef CONFIG_FD_RCU
read_unlock(&files->file_lock);
+#endif
task_unlock(p);
read_unlock(&tasklist_lock);
return 1;
}
}
+#ifndef CONFIG_FD_RCU
read_unlock(&files->file_lock);
+#endif
}
task_unlock(p);
out:
@@ -58,14 +64,18 @@
task_lock(p);
files = p->files;
if (files) {
+#ifndef CONFIG_FD_RCU
read_lock(&files->file_lock);
+#endif
for (i=0; i < files->max_fds; i++) {
if (fcheck_files(files, i) == file) {
found = 1;
break;
}
}
+#ifndef CONFIG_FD_RCU
read_unlock(&files->file_lock);
+#endif
}
task_unlock(p);
if(found)


2001-04-12 01:32:23

by Anton Blanchard

Subject: Re: [RFC][PATCH] Scalable FD Management using Read-Copy-Update


> This patch provides a very good performance improvement in file
> descriptor management for SMP linux kernel on a 4-way machine with the
> expectation of even higher gains on higher end machines. The patch uses the
> read-copy-update mechanism for Linux, published earlier at the sourceforge
> site under Linux Scalablity Effort project.
> http://lse.sourceforge.net/locking/rclock.html.

Good stuff!

It would be interesting to try a filesystem benchmark such as dbench. On
a quad PPC fget was chewing up more than its fair share of cpu time.

Anton

2001-04-12 15:42:31

by Maneesh Soni

Subject: Re: [Lse-tech] Re: [RFC][PATCH] Scalable FD Management using Read-Copy-Update

On Wed, Apr 11, 2001 at 06:29:30PM -0700, Anton Blanchard wrote:
>
> > This patch provides a very good performance improvement in file
> > descriptor management for SMP linux kernel on a 4-way machine with the
> > expectation of even higher gains on higher end machines. The patch uses the
> > read-copy-update mechanism for Linux, published earlier at the sourceforge
> > site under Linux Scalablity Effort project.
> > http://lse.sourceforge.net/locking/rclock.html.
>
> Good stuff!
>
> It would be interesting to try a filesystem benchmark such as dbench. On
> a quad PPC fget was chewing up more than its fair share of cpu time.
>
> Anton

Hello Anton,

Thank you for your suggestion. I tried dbench on a 4-way PIII Xeon box with
1MB L2 cache and 1GB of RAM, on an 8 GB ext2 partition behind an Adaptec 7896
SCSI controller. I ran "dbench 100" and "dbench 200" five times each and took
the average throughput:

Base (2.4.2) -
100 Average Throughput = 39.628 MB/sec
200 Average Throughput = 22.792 MB/sec

Base + files_struct patch -
100 Average Throughput = 39.874 MB/sec
200 Average Throughput = 23.174 MB/sec

These values are quite a bit lower than the ones in the README distributed
with the dbench tarball, though I think the numbers in the README were for a
similar machine running a 2.2.9 kernel.

As you can see, performance with the files_struct patch is almost the same as
the base. I suspect I am hitting some bottleneck other than fget() in both
the base and the patched versions; at least for the base version I would
expect numbers similar to those in the README for a comparable configuration.
I intend to do some profiling to look into this, but it would be helpful if
you could tell me whether anything is already known about it.

I am copying this to Andrew as well, in case he can also help. Also, if you
have any dbench numbers from 2.4.x kernels, please let me have a look at
those too.

Thank you,
Maneesh


--
Maneesh Soni <[email protected]>
IBM Linux Technology Center,
IBM Software Lab, Bangalore, India

2001-04-12 15:54:11

by Anton Blanchard

Subject: Re: [Lse-tech] Re: [RFC][PATCH] Scalable FD Management using Read-Copy-Update


Hi,

> Base (2.4.2) -
> 100 Average Throughput = 39.628 MB/sec
> 200 Average Throughput = 22.792 MB/sec
>
> Base + files_struct patch -
> 100 Average Throughput = 39.874 MB/sec
> 200 Average Throughput = 23.174 MB/sec
>
> I found this value quite less than the one present in the README distributed
> with dbench tarball. I think the numbers in the README were for a similar
> machine but with 2.2.9 kernel.

If you guestimate that each dbench client uses about 20M of RAM then dbench
100 has no chance of remaining in memory. Once you hit disk then spinlock
optimisations are all in the noise :) Smaller runs (< 30) should see
it stay in memory.

Also, if you turn off kupdated (so old buffers are not flushed out just
because they are old) and make the VM more aggressive about filling
memory with dirty buffers, then you will not hit the disk and
hopefully the optimisations will be more obvious.

killall -STOP kupdated
echo "90 64 64 256 500 3000 95 0 0" > /proc/sys/vm/bdflush

Remember to killall -CONT kupdated when you are finished :)

> I am copying this to Andrew also, if he can also help. Also if you have some
> dbench numbers from 2.4.x kernel, please let me have a look into those also.

The single CPU 333MHz POWER 3 I was playing with got 100MB/s when not
touching disk.

Anton

2001-04-12 18:22:56

by John Hawkes

Subject: Re: [Lse-tech] Re: [RFC][PATCH] Scalable FD Management using Read-Copy-Update

From: "Maneesh Soni" <[email protected]>
...
> Though I intend to do some
> profiling to look into this but it will be helpfull if you can tell me
> if there is some known thing regarding this.
...

Another useful tool is Lockmeter, which gives you visibility to what's
happening with spinlock_t and rwlock_t usage.
http://oss.sgi.com/projects/lockmeter

John Hawkes
[email protected]


2001-04-16 16:16:55

by Mark Hahn

Subject: Re: [RFC][PATCH] Scalable FD Management using Read-Copy-Update

> The improvement in performance while running "chat" benchmark
> (from http://lbs.sourceforge.net/) is about 30% in average throughput.

isn't this a solution in search of a problem?
does it make sense to redesign parts of the kernel for the sole
purpose of making a completely unrealistic benchmark run faster?

(the chat "benchmark" is a simple pingpong load-generator; it is
not in the same category as, say, specweb, since it does not do *any*
realistic (nonlocal) IO. the numbers "chat" returns are interesting,
but not indicative of any problem; perhaps even less than lmbench
components.)

2001-04-17 09:24:37

by Dipankar Sarma

Subject: Re: [RFC][PATCH] Scalable FD Management using Read-Copy-Update

Hi Mark,

[I am not sure if my earlier mail from lycos went out or not, if
it did, I apologize]

On Mon, Apr 16, 2001 at 12:16:25PM -0400, Mark Hahn wrote:
> > The improvement in performance while running "chat" benchmark
> > (from http://lbs.sourceforge.net/) is about 30% in average throughput.
>
> isn't this a solution in search of a problem?
> does it make sense to redesign parts of the kernel for the sole
> purpose of making a completely unrealistic benchmark run faster?

Irrespective of the usefulness of the "chat" benchmark, it seems
that there is a scalability problem as long as CLONE_FILES is
supported. John Hawkes (SGI) posted some nasty numbers from a
32-CPU MIPS machine to the lse-tech list some time ago.


>
> (the chat "benchmark" is a simple pingpong load-generator; it is
> not in the same category as, say, specweb, since it does not do *any*
> realistic (nonlocal) IO. the numbers "chat" returns are interesting,
> but not indicative of any problem; perhaps even less than lmbench
> components.)

"chat" results for large numbers of CPUs are indicative of a problem:
if a large number of threads share the files_struct through
CLONE_FILES, the performance of the application deteriorates
beyond 8 CPUs (going by John's numbers). It also shows how
sensitive performance can be to write access to shared-memory
locations such as spin-waiting locks.
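The sharing in question is easy to demonstrate in userspace (a hypothetical
demo; pthreads create threads with CLONE_FILES set, so all threads of a
process index one fd table): an fd opened by one thread is immediately usable
by another, and every fd operation in every thread goes through the same
files->file_lock.

```c
#include <pthread.h>
#include <string.h>
#include <unistd.h>

static int fds[2];

/* Runs in a thread that shares the fd table with its creator. */
static void *opener(void *arg)
{
    (void)arg;
    if (pipe(fds) != 0)          /* fds land in the shared fd table */
        fds[0] = fds[1] = -1;
    return NULL;
}

/* The main thread uses fds opened by the other thread: one shared fd
 * table, hence one shared files->file_lock cache line under load. */
static int clone_files_demo(void)
{
    pthread_t t;
    char buf[3] = { 0 };

    if (pthread_create(&t, NULL, opener, NULL) != 0)
        return 0;
    pthread_join(&t, NULL);
    if (fds[0] < 0)
        return 0;
    if (write(fds[1], "ok", 2) != 2)
        return 0;
    if (read(fds[0], buf, 2) != 2)
        return 0;
    close(fds[0]);
    close(fds[1]);
    return strcmp(buf, "ok") == 0;
}
```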


Thanks
Dipankar
--
Dipankar Sarma ([email protected])
IBM Linux Technology Center
IBM Software Lab, Bangalore, India.
Project Page: http://lse.sourceforge.net

2001-04-17 10:47:00

by Maneesh Soni

Subject: Re: Scalable FD Management ....

On Thu, Apr 12, 2001 at 08:51:18AM -0700, Anton Blanchard wrote:
>
> Hi,
>
> > Base (2.4.2) -
> > 100 Average Throughput = 39.628 MB/sec
> > 200 Average Throughput = 22.792 MB/sec
> >
> > Base + files_struct patch -
> > 100 Average Throughput = 39.874 MB/sec
> > 200 Average Throughput = 23.174 MB/sec
> >
> > I found this value quite less than the one present in the README distributed
> > with dbench tarball. I think the numbers in the README were for a similar
> > machine but with 2.2.9 kernel.
>
> If you guestimate that each dbench client uses about 20M of RAM then dbench
> 100 has no chance of remaining in memory. Once you hit disk then spinlock
> optimisations are all in the noise :) Smaller runs (< 30) should see
> it stay in memory.
>
> Also if you turn of kupdated (so old buffers are not flushed out just
> because they are old) and make the VM more agressive about filling
> memory with dirty buffers then you will not hit the disk and then
> hopefully the optimisations will be more obvious.
>
> killall -STOP kupdated
> echo "90 64 64 256 500 3000 95 0 0" > /proc/sys/vm/bdflush
>
> Remember to killall -CONT kupdated when you are finished :)
>
> > I am copying this to Andrew also, if he can also help. Also if you have some
> > dbench numbers from 2.4.x kernel, please let me have a look into those also.
>
> The single CPU 333MHz POWER 3 I was playing with got 100MB/s when not
> touching disk.
>
> Anton
> -

Hi Anton,

I ran the dbench test per your suggestions, and now I get throughput numbers
similar to yours.

But the throughput improvement from my patch is still missing. The reason, I
think, is that I did not get very many hits to the fget() routine. It would
be helpful if you could tell me how you got fget() chewing up more than its
fair share of CPU time.

For 30 clients:
Base(2.4.2) - Average Throughput = 235.139 MB/S
Base + Files_struct patch - Average Throughput = 235.751 MB/S

I also did profiling while running these tests, using kernprof. The fget()
hits are as below:

Base (2.4.2)                        304
Base + files_struct patch           189

The profiling sample size is quite big, though (the top entry is
default_idle, with 28471 hits), so fget()'s hit count is quite low relative
to default_idle.

As you can see, the files_struct patch reduces the number of hits to fget()
by around 37%.

I also looked at the dbench.c code. It creates its child processes with
fork(), not __clone().

I think fget() will affect performance in scenarios where the children are
created using clone() with the CLONE_FILES flag set, that is, when many child
processes share the parent's files_struct and all try to acquire the same
files->file_lock. In those scenarios we should see a considerable performance
improvement from the files_struct patch, as in the case of the "chat"
benchmark.

Regards,
Maneesh

--
Maneesh Soni <[email protected]>
http://lse.sourceforge.net/locking/rclock.html
IBM Linux Technology Center,
IBM Software Lab, Bangalore, India

2001-04-17 14:36:56

by Andi Kleen

Subject: Re: Scalable FD Management ....

On Tue, Apr 17, 2001 at 04:19:16PM +0530, Maneesh Soni wrote:
> But still the throughput improvement is not there for my patch. the reason, I
> think, is that I didnot get too many hits to fget() routine. It will be helpful
> if you can tell how you got fget() chewing up more than its fair share of CPU
> time.

dbench uses fork(), not clone, and therefore has no shared file_structs that
could be contended.


-Andi

2001-04-17 16:59:37

by Mark Hahn

Subject: Re: [RFC][PATCH] Scalable FD Management using Read-Copy-Update

> > isn't this a solution in search of a problem?
> > does it make sense to redesign parts of the kernel for the sole
> > purpose of making a completely unrealistic benchmark run faster?
>
> Irrespective of the usefulness of the "chat" benchmark, it seems
> that there is a problem of scalability as long as CLONE_FILES is
> supported. John Hawkes (SGI) posted some nasty numbers on a
> 32 CPU mips machine in the lse-tech list some time ago.

that's not the point. the point is that this has every sign of
being premature optimization. the "chat" benchmark does no work,
it only generates load. and yes, indeed, you can cause contention
if you apply enough load in the right places. this does NOT indicate
that any real apps apply the same load in the same places.