LinuxLists.cc - Re: Negative scalability by removal of lock_kernel()?(Was: Strange performance behavior of 2.4.0-test9)

[permalink] [raw]

Subject: Re: Negative scalability by removal of lock_kernel()?(Was: Strange performance behavior of 2.4.0-test9)

On Fri, Oct 27, 2000 at 03:13:33AM -0400, Alexander Viro wrote:
> I didn't look into recent changes in fs/locks.c, but I have quite problem
> inventing a scenario when _adding_ BKL (without reverting other changes)
> might give an absolute improvement. Well, I see a couple of really perverted
> scenarios, but... Seriously, folks, could you compare the 4 variants above
> and gather the contention data for the -test9 on your loads? That would help
> a lot.

I think it is easy to see. Switching between CPUs for criticial section
always has an latency because the lock/data update needs some time to cross
the bus

When you have two CPUs contending on common paths it is better to do:

CPU #0 CPU #1

grab big lock spin
do thing
release big lock ----> latency ---> grab big lock
do thing
do other things ....

than to do

grab small lock 1 spin
do small thing
release small lock 1 --> latency --> get small lock
do small thing
release lock
<--- latency ---
get small lock, fetch data
do small thing
release lock
---> latency ---> get small lock
do small thing
release lock
<--- latency ----

etc. The latencies add up and they're long because they're bus limited.
For common paths it is definitely useful to have bigger lock sections.

-Andi

2000-10-27 08:22:41

by Jeff V. Merkey

[permalink] [raw]

Subject: Re: Negative scalability by removal of lock_kernel()?(Was: Strange performance behavior of 2.4.0-test9)

On Fri, Oct 27, 2000 at 03:13:33AM -0400, Alexander Viro wrote:
>
>
> On Fri, 27 Oct 2000, Jeff V. Merkey wrote:
>
> >
> > Linux has lots of n-sqared linear list searches all over the place, and
> > there's a ton of spots I've seen it go linear by doing fine grained
> > manipulation of lock_kernel() [like in BLOCK.C in NWFS for sending async
> > IO to ll_rw_block()]. I could see where there would be many spots
> > where playing with this would cause problems.
> >
> > 2.5 will be better.
>
> fs/locks.c is one hell of a sick puppy. Nothing new about that. I'm kinda
> curious about "n-squared" searches in other places, though - mind showing
> them?

I've noticed in 2.2.X if I hold lock_kernel() over multiple calls to
ll_rw_block() while feeding it chains of buffer heads > 100 at a time
(one would think this would make the elevator get better numbers by
feeding it a bunch of buffer heads at once rahter than feeding them in
one at a time), performance gets slower rather than faster. I did
some profiling, and ll_rw_block() and functions beneath it was using a
lot of cycles. I think there may be cases where the code that
does the merging and sorting is hitting (O)(N)**2 instead of (O)(N).
When I compare the profile numbers to the buffer head window in
between calls to ll_rw_block(), the relationship is not (O)(N).

By default I just hold lock_kernel() over the calls to ll_rw_block()
for each buffer head in 2.2.X.

2.4 does not need lock_kernel() to be called any longer since ll_rw_block()
is now multi-threaded. The overall model of intereaction has not changed
in 2.4 from 2.2 much. I just no longer have to provide the serialization
for ll_rw_block() because it's provided underneath now, but what I saw
indicates that as the request list gets full, some of the logic is
not performing at a relationship of (O)(N) # requests vs. utilization.

This would indicate somewhere down in that code, there's a spot where
we end up spinning sround a lot mucking with lists or something.

:-)

Jeff

>
> BTW, what spinlocks get contention in variant without BKL? And what about
> comparison between the BKL and non-BKL versions? If it's something like
> BKL no BKL
> 4-way 50 20
> 8-way 30 30
> - something is certainly wrong, but restoring the BKL is _not_ a win.
>
> I didn't look into recent changes in fs/locks.c, but I have quite problem
> inventing a scenario when _adding_ BKL (without reverting other changes)
> might give an absolute improvement. Well, I see a couple of really perverted
> scenarios, but... Seriously, folks, could you compare the 4 variants above
> and gather the contention data for the -test9 on your loads? That would help
> a lot.
>

2000-10-27 10:12:19

by kumon

[permalink] [raw]

Subject: Re: Negative scalability by removal of lock_kernel()?(Was: Strange performance behavior of 2.4.0-test9)

Alexander Viro writes:
> BTW, what spinlocks get contention in variant without BKL? And what about
> comparison between the BKL and non-BKL versions? If it's something like
> BKL no BKL
> 4-way 50 20
> 8-way 30 30
> - something is certainly wrong, but restoring the BKL is _not_ a win.

the measured performance is:
4way->8way
2.4.0-test1 2816->3702 (31%up)
2.4.0-test8 4006->5287 (63%up)
2.4.0-test9 3669->2193 (40%down)

So, roughly speaking,
BKL no BKL
4-way 40 36
8-way 53 22
- Above are the comparison between test8 and test9.

Test9 with BKL shows 5249 @8way, nearly equal to test8 @8way.

> scenarios, but... Seriously, folks, could you compare the 4 variants above
> and gather the contention data for the -test9 on your loads? That would help
> a lot.

What kind of contention data you requested?
* kernel-lock spining time
* semaphore task switching time
...

Now, I understand the difference of test8 and test9.
test8 uses kernel_lock() to protect critical region,
test9 uses up/down semaphor for that.

FYI, I resend following data again.

> The following table shows one minute average of "vmstat 1".
>
> configuration Req/s | r b intr c-sw user os idle
> ------------------------+--------------------------------------------------
> 2.4.0t1 4cpu 2816 | 60 0 28877 5103 18 82 0
> 8cpu 3702 | 37 0 33495 16387 14 86 0
> 2.4.0t8 4cpu 4066 | 57 0 38743 8674 27 73 0
> 8cpu 5287 | 33 0 40755 30626 21 78 0
> 2.4.0t9 4cpu 3669 | 53 3 35822 18898 25 73 2
> 8cpu 2193 | 5 8 22114 94609 9 52 39

context switch count on test9 is 2-3 times bigger.

And, using the above table, average transaction time and its
breakdowns can be calculated as follows.

time/req USER OS IDLE
2.4.0t1 4cpu 1.42ms | 0.26ms 1.16ms 0ms
8cpu 2.16ms | 0.30ms 1.85ms 0ms
2.4.0t8 4cpu 0.98ms | 0.26ms 0.72ms 0ms
8cpu 1.44ms | 0.30ms 1.12ms 0ms
2.4.0t9 4cpu 1.09ms | 0.27ms 0.80ms 0.02ms
8cpu 3.65ms | 0.33ms 1.90ms 1.42ms

Test9 needs longer OS time than test8, especialy on 8cpu (80% longer).
Test9 needs longer USER time than test8 on 8cpu.
Test9 idle too much.

Frequent context switchs eat OS time.
Frequent process migrations eat USER time.

I think, the critical region is too small to be protected by semaphor
and task-switch.

--
Computer Systems Laboratory, Fujitsu Labs.
[email protected]

2000-10-27 10:25:05

[permalink] [raw]

Subject: Re: Negative scalability by removal of lock_kernel()?(Was: Strange performance behavior of 2.4.0-test9)

Andi Kleen wrote:
>
> When you have two CPUs contending on common paths it is better to do:
> [ spinlock stuff ]

Andi, if the lock_kernel() is removed then the first time the CPUs will butt heads is on a semaphore. This is much more expensive.

I bet if acquire_fl_sem() and release_fl_sem() are turned into lock_kernel()/unlock_kernel() then the scalability will come back.

2000-10-27 10:26:15

[permalink] [raw]

Subject: Re: Negative scalability by removal of lock_kernel()?(Was: Strange performance behavior of 2.4.0-test9)

On Fri, Oct 27, 2000 at 09:23:56PM +1100, Andrew Morton wrote:
> Andi Kleen wrote:
> >
> > When you have two CPUs contending on common paths it is better to do:
> > [ spinlock stuff ]
>
> Andi, if the lock_kernel() is removed then the first time the CPUs will butt heads is on a semaphore. This is much more expensive.
>
> I bet if acquire_fl_sem() and release_fl_sem() are turned into lock_kernel()/unlock_kernel() then the scalability will come back.

To test your theory it would be enough to watch context switch rates in top

-Andi

2000-10-27 12:57:55

by kumon

[permalink] [raw]

Subject: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was: Strange performance behavior of 2.4.0-test9)

Andrew Morton writes:
> I bet if acquire_fl_sem() and release_fl_sem() are turned into lock_kernel()/unlock_kernel() then the scalability will come back.

Change the following two macros:
acquire_fl_sem()->lock_kernel()
release_fl_sem()->unlock_kernel()
then
5192 Req/s @8cpu is got. It is same as test8 within fluctuation.

> 4way->8way
> 2.4.0-test1 2816->3702 (31%up)
> 2.4.0-test8 4006->5287 (63%up)
> 2.4.0-test9 3669->2193 (40%down)

Be cautious, interruptible_sleep_on_locked() uses up()/down directly
without using acquire_fl_sem()/release_fl_sem().

As far as the logic of original test9 is correct, big kernel lock is
not needed now.

Changing lock_kernel() into local-lock, e.g. spin_lock(&mylock), I've
expected the performance be more improved, but it shows the same
performance. This means most of the kernel outdates BKL already.

I put the patch to change all occurrence of semaphore "file_lock_sem"
into spinlock variable "file_lock_lock".

Computer Systems Laboratory, Fujitsu Labs.
[email protected]

diff -ru linux-2.4.0-test9/fs/lockd/clntlock.c linux-2.4.0-test9-test/fs/lockd/clntlock.c
--- linux-2.4.0-test9/fs/lockd/clntlock.c Sat Sep 23 06:21:18 2000
+++ linux-2.4.0-test9-test/fs/lockd/clntlock.c Fri Oct 27 21:26:28 2000
@@ -168,7 +168,7 @@
* reclaim is in progress */
lock_kernel();
lockd_up();
- down(&file_lock_sem);
+ spin_lock(&file_lock_lock);

/* First, reclaim all locks that have been granted previously. */
restart:
@@ -186,7 +186,7 @@
}
tmp = tmp->next;
}
- up(&file_lock_sem);
+ spin_unlock(&file_lock_lock);

host->h_reclaiming = 0;
wake_up(&host->h_gracewait);
diff -ru linux-2.4.0-test9/fs/locks.c linux-2.4.0-test9-test/fs/locks.c
--- linux-2.4.0-test9/fs/locks.c Mon Oct 2 11:45:29 2000
+++ linux-2.4.0-test9-test/fs/locks.c Fri Oct 27 21:30:25 2000
@@ -125,10 +125,10 @@
#include <asm/semaphore.h>
#include <asm/uaccess.h>

-DECLARE_MUTEX(file_lock_sem);
+spinlock_t file_lock_lock = SPIN_LOCK_UNLOCKED;
+#define acquire_fl_lock() spin_lock(&file_lock_lock);
+#define release_fl_lock() spin_unlock(&file_lock_lock);

-#define acquire_fl_sem() down(&file_lock_sem)
-#define release_fl_sem() up(&file_lock_sem)

int leases_enable = 1;
int lease_break_time = 45;
@@ -563,7 +563,7 @@
return (locks_conflict(caller_fl, sys_fl));
}

-int interruptible_sleep_on_locked(wait_queue_head_t *fl_wait, struct semaphore *sem, int timeout)
+int interruptible_sleep_on_locked(wait_queue_head_t *fl_wait, int timeout)
{
int result = 0;
wait_queue_t wait;
@@ -571,14 +571,14 @@

__add_wait_queue(fl_wait, &wait);
current->state = TASK_INTERRUPTIBLE;
- up(sem);
+ release_fl_lock();
if (timeout == 0)
schedule();
else
result = schedule_timeout(timeout);
if (signal_pending(current))
result = -ERESTARTSYS;
- down(sem);
+ acquire_fl_lock();
remove_wait_queue(fl_wait, &wait);
current->state = TASK_RUNNING;
return result;
@@ -588,7 +588,7 @@
{
int result;
locks_insert_block(blocker, waiter);
- result = interruptible_sleep_on_locked(&waiter->fl_wait, &file_lock_sem, 0);
+ result = interruptible_sleep_on_locked(&waiter->fl_wait, 0);
locks_delete_block(waiter);
return result;
}
@@ -597,7 +597,7 @@
{
int result;
locks_insert_block(blocker, waiter);
- result = interruptible_sleep_on_locked(&waiter->fl_wait, &file_lock_sem, time);
+ result = interruptible_sleep_on_locked(&waiter->fl_wait, time);
locks_delete_block(waiter);
return result;
}
@@ -607,14 +607,14 @@
{
struct file_lock *cfl;

- acquire_fl_sem();
+ acquire_fl_lock();
for (cfl = filp->f_dentry->d_inode->i_flock; cfl; cfl = cfl->fl_next) {
if (!(cfl->fl_flags & FL_POSIX))
continue;
if (posix_locks_conflict(cfl, fl))
break;
}
- release_fl_sem();
+ release_fl_lock();

return (cfl);
}
@@ -670,14 +670,14 @@
/*
* Search the lock list for this inode for any POSIX locks.
*/
- acquire_fl_sem();
+ acquire_fl_lock();
for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
if (!(fl->fl_flags & FL_POSIX))
continue;
if (fl->fl_owner != owner)
break;
}
- release_fl_sem();
+ release_fl_lock();
return fl ? -EAGAIN : 0;
}

@@ -698,7 +698,7 @@
new_fl->fl_end = offset + count - 1;

error = 0;
- acquire_fl_sem();
+ acquire_fl_lock();

repeat:
/* Search the lock list for this inode for locks that conflict with
@@ -731,7 +731,7 @@
}
}
locks_free_lock(new_fl);
- release_fl_sem();
+ release_fl_lock();
return error;
}

@@ -849,7 +849,7 @@
if (!(new_fl && new_fl2))
goto out;

- acquire_fl_sem();
+ acquire_fl_lock();
if (caller->fl_type != F_UNLCK) {
repeat:
for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
@@ -994,7 +994,7 @@
locks_wake_up_blocks(left, 0);
}
out:
- release_fl_sem();
+ release_fl_lock();
/*
* Free any unused locks.
*/
@@ -1040,7 +1040,7 @@

alloc_err = lease_alloc(NULL, 0, &new_fl);

- acquire_fl_sem();
+ acquire_fl_lock();
flock = inode->i_flock;
if (flock->fl_type & F_INPROGRESS) {
if ((mode & O_NONBLOCK)
@@ -1109,7 +1109,7 @@
}

out:
- release_fl_sem();
+ release_fl_lock();
if (!alloc_err)
locks_free_lock(new_fl);
return error;
@@ -1212,7 +1212,7 @@

before = &inode->i_flock;

- acquire_fl_sem();
+ acquire_fl_lock();

while ((fl = *before) != NULL) {
if (fl->fl_flags != FL_LEASE)
@@ -1263,7 +1263,7 @@
filp->f_owner.uid = current->uid;
filp->f_owner.euid = current->euid;
out_unlock:
- release_fl_sem();
+ release_fl_lock();
return error;
}

@@ -1309,10 +1309,10 @@
&& !(filp->f_mode & 3))
goto out_putf;

- acquire_fl_sem();
+ acquire_fl_lock();
error = flock_lock_file(filp, type,
(cmd & (LOCK_UN | LOCK_NB)) ? 0 : 1);
- release_fl_sem();
+ release_fl_lock();

out_putf:
fput(filp);
@@ -1645,7 +1645,7 @@
*/
return;
}
- acquire_fl_sem();
+ acquire_fl_lock();
before = &inode->i_flock;
while ((fl = *before) != NULL) {
if ((fl->fl_flags & FL_POSIX) && fl->fl_owner == owner) {
@@ -1654,7 +1654,7 @@
}
before = &fl->fl_next;
}
- release_fl_sem();
+ release_fl_lock();
}

/*
@@ -1669,7 +1669,7 @@
if (!inode->i_flock)
return;

- acquire_fl_sem();
+ acquire_fl_lock();
before = &inode->i_flock;

while ((fl = *before) != NULL) {
@@ -1680,7 +1680,7 @@
}
before = &fl->fl_next;
}
- release_fl_sem();
+ release_fl_lock();
}

/**
@@ -1693,9 +1693,9 @@
void
posix_block_lock(struct file_lock *blocker, struct file_lock *waiter)
{
- acquire_fl_sem();
+ acquire_fl_lock();
locks_insert_block(blocker, waiter);
- release_fl_sem();
+ release_fl_lock();
}

/**
@@ -1707,12 +1707,12 @@
void
posix_unblock_lock(struct file_lock *waiter)
{
- acquire_fl_sem();
+ acquire_fl_lock();
if (!list_empty(&waiter->fl_list)) {
locks_delete_block(waiter);
wake_up(&waiter->fl_wait);
}
- release_fl_sem();
+ release_fl_lock();
}

static void lock_get_status(char* out, struct file_lock *fl, int id, char *pfx)
@@ -1803,7 +1803,7 @@
off_t pos = 0;
int i = 0;

- acquire_fl_sem();
+ acquire_fl_lock();
list_for_each(tmp, &file_lock_list) {
struct list_head *btmp;
struct file_lock *fl = list_entry(tmp, struct file_lock, fl_link);
@@ -1824,7 +1824,7 @@
}
}
done:
- release_fl_sem();
+ release_fl_lock();
*start = buffer;
if(q-buffer < length)
return (q-buffer);
@@ -1849,7 +1849,7 @@
{
struct file_lock *fl;
int result = 1;
- acquire_fl_sem();
+ acquire_fl_lock();
for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
if (fl->fl_flags == FL_POSIX) {
if (fl->fl_type == F_RDLCK)
@@ -1866,7 +1866,7 @@
result = 0;
break;
}
- release_fl_sem();
+ release_fl_lock();
return result;
}

@@ -1887,7 +1887,7 @@
{
struct file_lock *fl;
int result = 1;
- acquire_fl_sem();
+ acquire_fl_lock();
for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
if (fl->fl_flags == FL_POSIX) {
if ((fl->fl_end < start) || (fl->fl_start > (start + len)))
@@ -1902,7 +1902,7 @@
result = 0;
break;
}
- release_fl_sem();
+ release_fl_lock();
return result;
}
#endif
diff -ru linux-2.4.0-test9/include/linux/fs.h linux-2.4.0-test9-test/include/linux/fs.h
--- linux-2.4.0-test9/include/linux/fs.h Tue Oct 17 14:21:53 2000
+++ linux-2.4.0-test9-test/include/linux/fs.h Fri Oct 27 21:27:15 2000
@@ -547,7 +547,7 @@
#endif

extern struct list_head file_lock_list;
-extern struct semaphore file_lock_sem;
+extern spinlock_t file_lock_lock;

#include <linux/fcntl.h>

diff -ru linux-2.4.0-test9/kernel/ksyms.c linux-2.4.0-test9-test/kernel/ksyms.c
--- linux-2.4.0-test9/kernel/ksyms.c Tue Sep 26 08:18:55 2000
+++ linux-2.4.0-test9-test/kernel/ksyms.c Fri Oct 27 21:27:32 2000
@@ -215,7 +215,7 @@
EXPORT_SYMBOL(page_hash_bits);
EXPORT_SYMBOL(page_hash_table);
EXPORT_SYMBOL(file_lock_list);
-EXPORT_SYMBOL(file_lock_sem);
+EXPORT_SYMBOL(file_lock_lock);
EXPORT_SYMBOL(locks_init_lock);
EXPORT_SYMBOL(locks_copy_lock);
EXPORT_SYMBOL(posix_lock_file);

2000-10-28 15:47:08

[permalink] [raw]

Subject: Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was: Strange performance behavior of 2.4.0-test9)

[email protected] wrote:
>
> Change the following two macros:
> acquire_fl_sem()->lock_kernel()
> release_fl_sem()->unlock_kernel()
> then
> 5192 Req/s @8cpu is got. It is same as test8 within fluctuation.

hmm.. BKL increases scalability. News at 11.

The big question is: why is Apache using file locking so
much? Is this normal behaviour for Apache?

Because if so, the file locking code will be significantly
bad for the scalability of Apache on SMP (of all things!).
It basically grabs a big global lock for _anything_. It
looks like it could be a lot more granular.

>
> I put the patch to change all occurrence of semaphore "file_lock_sem"
> into spinlock variable "file_lock_lock".

Unfortunately this is deadlocky - in several places functions
which can sleep are called when file_lock_lock is held.

More unfortunately I think we _already_ have a deadlock in
the file locking:

locks_wake_up_blocks() is always called with file_lock_sem held.
locks_wake_up_blocks() calls waiter->fl_notify(), which is really nlmsvc_notify_blocked()
nlmsvc_notify_blocked() calls posix_unblock_lock()
posix_unblock_lock() calls acquire_fl_sem()

Can someone please confirm this?

Anyway, here is a modified version of your patch which
fixes everything.

But I'm not sure that it should be applied. I think it's
more expedient at this time to convert acquire_fl_sem/release_fl_sem
into lock_kernel/unlock_kernel (so we _can_ sleep) and to fix the
above alleged deadlock via the creation of __posix_unblock_lock() as in
this patch.

--- linux-2.4.0-test10-pre5/fs/lockd/clntlock.c Tue Oct 24 21:34:13 2000
+++ linux-akpm/fs/lockd/clntlock.c Sun Oct 29 00:39:15 2000
@@ -168,10 +168,10 @@
* reclaim is in progress */
lock_kernel();
lockd_up();
- down(&file_lock_sem);

/* First, reclaim all locks that have been granted previously. */
restart:
+ spin_lock(&file_lock_lock);
tmp = file_lock_list.next;
while (tmp != &file_lock_list) {
struct file_lock *fl = list_entry(tmp, struct file_lock, fl_link);
@@ -181,12 +181,13 @@
fl->fl_u.nfs_fl.state != host->h_state &&
(fl->fl_u.nfs_fl.flags & NFS_LCK_GRANTED)) {
fl->fl_u.nfs_fl.flags &= ~ NFS_LCK_GRANTED;
+ spin_unlock(&file_lock_lock);
nlmclnt_reclaim(host, fl);
goto restart;
}
tmp = tmp->next;
}
- up(&file_lock_sem);
+ spin_unlock(&file_lock_lock);

host->h_reclaiming = 0;
wake_up(&host->h_gracewait);
--- linux-2.4.0-test10-pre5/fs/locks.c Tue Oct 24 21:34:13 2000
+++ linux-akpm/fs/locks.c Sun Oct 29 02:31:10 2000
@@ -125,10 +125,9 @@
#include <asm/semaphore.h>
#include <asm/uaccess.h>

-DECLARE_MUTEX(file_lock_sem);
-
-#define acquire_fl_sem() down(&file_lock_sem)
-#define release_fl_sem() up(&file_lock_sem)
+spinlock_t file_lock_lock = SPIN_LOCK_UNLOCKED;
+#define acquire_fl_lock() spin_lock(&file_lock_lock);
+#define release_fl_lock() spin_unlock(&file_lock_lock);

int leases_enable = 1;
int lease_break_time = 45;
@@ -352,29 +351,27 @@
#endif

/* Allocate a file_lock initialised to this type of lease */
-static int lease_alloc(struct file *filp, int type, struct file_lock **flp)
+static struct file_lock *lease_alloc(struct file *filp, int type)
{
struct file_lock *fl = locks_alloc_lock(1);
- if (fl == NULL)
- return -ENOMEM;
-
- fl->fl_owner = current->files;
- fl->fl_pid = current->pid;
+ if (fl != NULL) {
+ fl->fl_owner = current->files;
+ fl->fl_pid = current->pid;

- fl->fl_file = filp;
- fl->fl_flags = FL_LEASE;
- if (assign_type(fl, type) != 0) {
- locks_free_lock(fl);
- return -EINVAL;
+ fl->fl_file = filp;
+ fl->fl_flags = FL_LEASE;
+ if (assign_type(fl, type) != 0) {
+ locks_free_lock(fl);
+ fl = NULL;
+ } else {
+ fl->fl_start = 0;
+ fl->fl_end = OFFSET_MAX;
+ fl->fl_notify = NULL;
+ fl->fl_insert = NULL;
+ fl->fl_remove = NULL;
+ }
}
- fl->fl_start = 0;
- fl->fl_end = OFFSET_MAX;
- fl->fl_notify = NULL;
- fl->fl_insert = NULL;
- fl->fl_remove = NULL;
-
- *flp = fl;
- return 0;
+ return fl;
}

/* Check if two locks overlap each other.
@@ -431,6 +428,9 @@
/* Wake up processes blocked waiting for blocker.
* If told to wait then schedule the processes until the block list
* is empty, otherwise empty the block list ourselves.
+ *
+ * waiter->fl_notify() is not allowed to block. Otherwise
+ * we could deadlock on file_lock_lock - AKPM.
*/
static void locks_wake_up_blocks(struct file_lock *blocker, unsigned int wait)
{
@@ -457,6 +457,8 @@

/* Insert file lock fl into an inode's lock list at the position indicated
* by pos. At the same time add the lock to the global file lock list.
+ *
+ * Called under file_lock_lock: fl->fl_insert may not sleep - AKPM.
*/
static void locks_insert_lock(struct file_lock **pos, struct file_lock *fl)
{
@@ -562,22 +564,22 @@
return (locks_conflict(caller_fl, sys_fl));
}

-int interruptible_sleep_on_locked(wait_queue_head_t *fl_wait, struct semaphore *sem, int timeout)
+static int interruptible_sleep_on_locked(wait_queue_head_t *fl_wait, int timeout)
{
int result = 0;
wait_queue_t wait;
init_waitqueue_entry(&wait, current);

__add_wait_queue(fl_wait, &wait);
- current->state = TASK_INTERRUPTIBLE;
- up(sem);
+ set_current_state(TASK_INTERRUPTIBLE);
+ release_fl_lock();
if (timeout == 0)
schedule();
else
result = schedule_timeout(timeout);
if (signal_pending(current))
result = -ERESTARTSYS;
- down(sem);
+ acquire_fl_lock();
remove_wait_queue(fl_wait, &wait);
current->state = TASK_RUNNING;
return result;
@@ -587,7 +589,7 @@
{
int result;
locks_insert_block(blocker, waiter);
- result = interruptible_sleep_on_locked(&waiter->fl_wait, &file_lock_sem, 0);
+ result = interruptible_sleep_on_locked(&waiter->fl_wait, 0);
locks_delete_block(waiter);
return result;
}
@@ -596,7 +598,7 @@
{
int result;
locks_insert_block(blocker, waiter);
- result = interruptible_sleep_on_locked(&waiter->fl_wait, &file_lock_sem, time);
+ result = interruptible_sleep_on_locked(&waiter->fl_wait, time);
locks_delete_block(waiter);
return result;
}
@@ -606,14 +608,14 @@
{
struct file_lock *cfl;

- acquire_fl_sem();
+ acquire_fl_lock();
for (cfl = filp->f_dentry->d_inode->i_flock; cfl; cfl = cfl->fl_next) {
if (!(cfl->fl_flags & FL_POSIX))
continue;
if (posix_locks_conflict(cfl, fl))
break;
}
- release_fl_sem();
+ release_fl_lock();

return (cfl);
}
@@ -668,14 +670,14 @@
/*
* Search the lock list for this inode for any POSIX locks.
*/
- acquire_fl_sem();
+ acquire_fl_lock();
for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
if (!(fl->fl_flags & FL_POSIX))
continue;
if (fl->fl_owner != owner)
break;
}
- release_fl_sem();
+ release_fl_lock();
return fl ? -EAGAIN : 0;
}

@@ -696,7 +698,7 @@
new_fl->fl_end = offset + count - 1;

error = 0;
- acquire_fl_sem();
+ acquire_fl_lock();

repeat:
/* Search the lock list for this inode for locks that conflict with
@@ -729,7 +731,7 @@
}
}
locks_free_lock(new_fl);
- release_fl_sem();
+ release_fl_lock();
return error;
}

@@ -757,6 +759,8 @@
return error;
}

+ acquire_fl_lock();
+
error = 0;
search:
change = 0;
@@ -803,6 +807,7 @@
error = 0;

out:
+ release_fl_lock();
if (new_fl)
locks_free_lock(new_fl);
return error;
@@ -847,7 +852,7 @@
if (!(new_fl && new_fl2))
goto out;

- acquire_fl_sem();
+ acquire_fl_lock();
if (caller->fl_type != F_UNLCK) {
repeat:
for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
@@ -992,7 +997,7 @@
locks_wake_up_blocks(left, 0);
}
out:
- release_fl_sem();
+ release_fl_lock();
/*
* Free any unused locks.
*/
@@ -1034,11 +1039,14 @@
int error = 0, future;
struct file_lock *new_fl, *flock;
struct file_lock *fl;
- int alloc_err;

- alloc_err = lease_alloc(NULL, 0, &new_fl);
+ new_fl = lease_alloc(NULL, 0);
+ if (new_fl == NULL) {
+ error = -ENOMEM;
+ goto out_no_fl;
+ }

- acquire_fl_sem();
+ acquire_fl_lock();
flock = inode->i_flock;
if (flock->fl_type & F_INPROGRESS) {
if ((mode & O_NONBLOCK)
@@ -1046,10 +1054,6 @@
error = -EWOULDBLOCK;
goto out;
}
- if (alloc_err != 0) {
- error = alloc_err;
- goto out;
- }
do {
error = locks_block_on(flock, new_fl);
if (error != 0)
@@ -1071,11 +1075,6 @@
goto out;
}

- if (alloc_err && (flock->fl_owner != current->files)) {
- error = alloc_err;
- goto out;
- }
-
fl = flock;
do {
fl->fl_type = future;
@@ -1107,9 +1106,9 @@
}

out:
- release_fl_sem();
- if (!alloc_err)
- locks_free_lock(new_fl);
+ release_fl_lock();
+ locks_free_lock(new_fl);
+out_no_fl:
return error;
}

@@ -1152,7 +1151,12 @@
return fl->fl_type & ~F_INPROGRESS;
}

-/* We already had a lease on this file; just change its type */
+/*
+ * We already had a lease on this file; just change its type
+ *
+ * This function may not sleep.
+ */
+
static int lease_modify(struct file_lock **before, int arg, int fd, struct file *filp)
{
struct file_lock *fl = *before;
@@ -1187,7 +1191,7 @@
*/
int fcntl_setlease(unsigned int fd, struct file *filp, long arg)
{
- struct file_lock *fl, **before, **my_before = NULL;
+ struct file_lock *fl, *new_fl, **before, **my_before = NULL;
struct dentry *dentry;
struct inode *inode;
int error, rdlease_count = 0, wrlease_count = 0;
@@ -1210,7 +1214,13 @@

before = &inode->i_flock;

- acquire_fl_sem();
+ new_fl = lease_alloc(filp, arg);
+ if (new_fl == NULL) {
+ error = -ENOMEM;
+ goto out;
+ }
+
+ acquire_fl_lock();

while ((fl = *before) != NULL) {
if (fl->fl_flags != FL_LEASE)
@@ -1245,9 +1255,8 @@
goto out_unlock;
}

- error = lease_alloc(filp, arg, &fl);
- if (error)
- goto out_unlock;
+ fl = new_fl;
+ new_fl = NULL;

error = fasync_helper(fd, filp, 1, &fl->fl_fasync);
if (error < 0) {
@@ -1261,7 +1270,10 @@
filp->f_owner.uid = current->uid;
filp->f_owner.euid = current->euid;
out_unlock:
- release_fl_sem();
+ release_fl_lock();
+out:
+ if (new_fl)
+ locks_free_lock(new_fl);
return error;
}

@@ -1307,10 +1319,8 @@
&& !(filp->f_mode & 3))
goto out_putf;

- acquire_fl_sem();
error = flock_lock_file(filp, type,
(cmd & (LOCK_UN | LOCK_NB)) ? 0 : 1);
- release_fl_sem();

out_putf:
fput(filp);
@@ -1643,7 +1653,7 @@
*/
return;
}
- acquire_fl_sem();
+ acquire_fl_lock();
before = &inode->i_flock;
while ((fl = *before) != NULL) {
if ((fl->fl_flags & FL_POSIX) && fl->fl_owner == owner) {
@@ -1652,7 +1662,7 @@
}
before = &fl->fl_next;
}
- release_fl_sem();
+ release_fl_lock();
}

/*
@@ -1667,7 +1677,7 @@
if (!inode->i_flock)
return;

- acquire_fl_sem();
+ acquire_fl_lock();
before = &inode->i_flock;

while ((fl = *before) != NULL) {
@@ -1678,7 +1688,7 @@
}
before = &fl->fl_next;
}
- release_fl_sem();
+ release_fl_lock();
}

/**
@@ -1691,9 +1701,9 @@
void
posix_block_lock(struct file_lock *blocker, struct file_lock *waiter)
{
- acquire_fl_sem();
+ acquire_fl_lock();
locks_insert_block(blocker, waiter);
- release_fl_sem();
+ release_fl_lock();
}

/**
@@ -1701,16 +1711,32 @@
* @waiter: the lock which was waiting
*
* lockd needs to block waiting for locks.
+ * This function locks the locklist and calls
+ * __posix_unblock_lock()
*/
void
posix_unblock_lock(struct file_lock *waiter)
{
- acquire_fl_sem();
+ acquire_fl_lock();
+ __posix_unblock_lock(waiter);
+ release_fl_lock();
+}
+
+/**
+ * __posix_unblock_lock - stop waiting for a file lock
+ * @waiter: the lock which was waiting
+ *
+ * lockd needs to block waiting for locks.
+ * Does not lock the locklist. This is for use within
+ * the file_lock.fl_notify() callback.
+ */
+void
+__posix_unblock_lock(struct file_lock *waiter)
+{
if (!list_empty(&waiter->fl_block)) {
locks_delete_block(waiter);
wake_up(&waiter->fl_wait);
}
- release_fl_sem();
}

static void lock_get_status(char* out, struct file_lock *fl, int id, char *pfx)
@@ -1801,7 +1827,7 @@
off_t pos = 0;
int i = 0;

- acquire_fl_sem();
+ acquire_fl_lock();
list_for_each(tmp, &file_lock_list) {
struct list_head *btmp;
struct file_lock *fl = list_entry(tmp, struct file_lock, fl_link);
@@ -1822,7 +1848,7 @@
}
}
done:
- release_fl_sem();
+ release_fl_lock();
*start = buffer;
if(q-buffer < length)
return (q-buffer);
@@ -1847,7 +1873,7 @@
{
struct file_lock *fl;
int result = 1;
- acquire_fl_sem();
+ acquire_fl_lock();
for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
if (fl->fl_flags == FL_POSIX) {
if (fl->fl_type == F_RDLCK)
@@ -1864,7 +1890,7 @@
result = 0;
break;
}
- release_fl_sem();
+ release_fl_lock();
return result;
}

@@ -1885,7 +1911,7 @@
{
struct file_lock *fl;
int result = 1;
- acquire_fl_sem();
+ acquire_fl_lock();
for (fl = inode->i_flock; fl != NULL; fl = fl->fl_next) {
if (fl->fl_flags == FL_POSIX) {
if ((fl->fl_end < start) || (fl->fl_start > (start + len)))
@@ -1900,7 +1926,7 @@
result = 0;
break;
}
- release_fl_sem();
+ release_fl_lock();
return result;
}
#endif
--- linux-2.4.0-test10-pre5/include/linux/fs.h Tue Oct 24 21:34:13 2000
+++ linux-akpm/include/linux/fs.h Sun Oct 29 01:45:46 2000
@@ -546,7 +546,7 @@
#endif

extern struct list_head file_lock_list;
-extern struct semaphore file_lock_sem;
+extern spinlock_t file_lock_lock;

#include <linux/fcntl.h>

@@ -564,6 +564,7 @@
extern struct file_lock *posix_test_lock(struct file *, struct file_lock *);
extern int posix_lock_file(struct file *, struct file_lock *, unsigned int);
extern void posix_block_lock(struct file_lock *, struct file_lock *);
+extern void __posix_unblock_lock(struct file_lock *);
extern void posix_unblock_lock(struct file_lock *);
extern int __get_lease(struct inode *inode, unsigned int flags);
extern time_t lease_get_mtime(struct inode *);
--- linux-2.4.0-test10-pre5/kernel/ksyms.c Sun Oct 15 01:27:46 2000
+++ linux-akpm/kernel/ksyms.c Sat Oct 28 23:14:13 2000
@@ -215,7 +215,7 @@
EXPORT_SYMBOL(page_hash_bits);
EXPORT_SYMBOL(page_hash_table);
EXPORT_SYMBOL(file_lock_list);
-EXPORT_SYMBOL(file_lock_sem);
+EXPORT_SYMBOL(file_lock_lock);
EXPORT_SYMBOL(locks_init_lock);
EXPORT_SYMBOL(locks_copy_lock);
EXPORT_SYMBOL(posix_lock_file);
--- linux-2.4.0-test10-pre5/fs/lockd/svclock.c Tue Oct 24 21:34:13 2000
+++ linux-akpm/fs/lockd/svclock.c Sun Oct 29 01:42:47 2000
@@ -456,7 +456,7 @@
struct nlm_block **bp, *block;

dprintk("lockd: VFS unblock notification for block %p\n", fl);
- posix_unblock_lock(fl);
+ __posix_unblock_lock(fl);
for (bp = &nlm_blocked; (block = *bp); bp = &block->b_next) {
if (nlm_compare_locks(&block->b_call.a_args.lock.fl, fl)) {
svc_wake_up(block->b_daemon);
--- linux-2.4.0-test10-pre5/Documentation/filesystems/Locking Fri Aug 11 19:06:11 2000
+++ linux-akpm/Documentation/filesystems/Locking Sun Oct 29 02:25:03 2000
@@ -159,10 +159,10 @@
void (*fl_remove)(struct file_lock *); /* lock removal callback */

locking rules:
- BKL may block
-fl_notify: yes no
-fl_insert: yes maybe
-fl_remove: yes maybe
+ BKL file_lock_lock may block
+fl_notify: no already held no
+fl_insert: no already held no
+fl_remove: no already held no
Currently only NLM provides instances of this class. None of the
them block. If you have out-of-tree instances - please, show up. Locking
in that area will change.

2000-10-28 15:58:32

[permalink] [raw]

Subject: Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was: Strange performance behavior of 2.4.0-test9)

On Sun, Oct 29, 2000 at 02:46:14AM +1100, Andrew Morton wrote:
> [email protected] wrote:
> >
> > Change the following two macros:
> > acquire_fl_sem()->lock_kernel()
> > release_fl_sem()->unlock_kernel()
> > then
> > 5192 Req/s @8cpu is got. It is same as test8 within fluctuation.
>
> hmm.. BKL increases scalability. News at 11.
>
> The big question is: why is Apache using file locking so
> much? Is this normal behaviour for Apache?

It serializes accept() to avoid the thundering herd from the wake-all
semantics.

With the 2.4 stack that is probably not needed anymore (it was in 2.2),
it may just work to remove the file locking (it should always be correct,
just on 2.2 it may be slower to remove it)

> Because if so, the file locking code will be significantly
> bad for the scalability of Apache on SMP (of all things!).
> It basically grabs a big global lock for _anything_. It
> looks like it could be a lot more granular.

iirc everybody who looked at the code agrees that it needs a rewrite
badly.

-Andi

2000-10-28 16:07:24

by Jeff Garzik

[permalink] [raw]

Subject: Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was: Strange performance behavior of 2.4.0-test9)

Andrew Morton wrote:
> --- linux-2.4.0-test10-pre5/fs/locks.c Tue Oct 24 21:34:13 2000
> +++ linux-akpm/fs/locks.c Sun Oct 29 02:31:10 2000
> @@ -125,10 +125,9 @@
> #include <asm/semaphore.h>
> #include <asm/uaccess.h>
>
> -DECLARE_MUTEX(file_lock_sem);
> -
> -#define acquire_fl_sem() down(&file_lock_sem)
> -#define release_fl_sem() up(&file_lock_sem)
> +spinlock_t file_lock_lock = SPIN_LOCK_UNLOCKED;
> +#define acquire_fl_lock() spin_lock(&file_lock_lock);
> +#define release_fl_lock() spin_unlock(&file_lock_lock);

It seems like better concurrency could be achieved with reader-writer
locks. Some of the lock test routines simply scan the list, without
modifying it.

--
Jeff Garzik | "Mind if I drive?" -Sam
Building 1024 | "Not if you don't mind me clawing at the
MandrakeSoft | dash and screaming like a cheerleader."
| -Max

2000-10-28 16:21:08

by Alan

[permalink] [raw]

Subject: Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was:

> The big question is: why is Apache using file locking so
> much? Is this normal behaviour for Apache?

Apache uses file locking to serialize accept on hosts where accept either has
bad thundering heard problems or was simply broken with multiple acceptors

2000-10-28 16:47:03

[permalink] [raw]

Subject: Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was: Strange performance behavior of 2.4.0-test9)

Andrew Morton wrote:
>
> I think it's more expedient at this time to convert
> acquire_fl_sem/release_fl_sem into lock_kernel/unlock_kernel
> (so we _can_ sleep) and to fix the above alleged deadlock
> via the creation of __posix_unblock_lock()

I agree with me. Could you please test the scalability
of this?

--- linux-2.4.0-test10-pre5/fs/locks.c Tue Oct 24 21:34:13 2000
+++ linux-akpm/fs/locks.c Sun Oct 29 03:34:15 2000
@@ -1706,11 +1706,25 @@
posix_unblock_lock(struct file_lock *waiter)
{
acquire_fl_sem();
+ __posix_unblock_lock(waiter);
+ release_fl_sem();
+}
+
+/**
+ * __posix_unblock_lock - stop waiting for a file lock
+ * @waiter: the lock which was waiting
+ *
+ * lockd needs to block waiting for locks.
+ * Like posix_unblock_lock(), except it doesn't
+ * acquire the file lock semaphore.
+ */
+void
+__posix_unblock_lock(struct file_lock *waiter)
+{
if (!list_empty(&waiter->fl_block)) {
locks_delete_block(waiter);
wake_up(&waiter->fl_wait);
}
- release_fl_sem();
}

static void lock_get_status(char* out, struct file_lock *fl, int id, char *pfx)
--- linux-2.4.0-test10-pre5/include/linux/fs.h Tue Oct 24 21:34:13 2000
+++ linux-akpm/include/linux/fs.h Sun Oct 29 03:32:00 2000
@@ -564,6 +564,7 @@
extern struct file_lock *posix_test_lock(struct file *, struct file_lock *);
extern int posix_lock_file(struct file *, struct file_lock *, unsigned int);
extern void posix_block_lock(struct file_lock *, struct file_lock *);
+extern void __posix_unblock_lock(struct file_lock *);
extern void posix_unblock_lock(struct file_lock *);
extern int __get_lease(struct inode *inode, unsigned int flags);
extern time_t lease_get_mtime(struct inode *);
--- linux-2.4.0-test10-pre5/kernel/ksyms.c Sun Oct 15 01:27:46 2000
+++ linux-akpm/kernel/ksyms.c Sun Oct 29 03:32:38 2000
@@ -221,6 +221,7 @@
EXPORT_SYMBOL(posix_lock_file);
EXPORT_SYMBOL(posix_test_lock);
EXPORT_SYMBOL(posix_block_lock);
+EXPORT_SYMBOL(__posix_unblock_lock);
EXPORT_SYMBOL(posix_unblock_lock);
EXPORT_SYMBOL(locks_mandatory_area);
EXPORT_SYMBOL(dput);
--- linux-2.4.0-test10-pre5/fs/lockd/svclock.c Tue Oct 24 21:34:13 2000
+++ linux-akpm/fs/lockd/svclock.c Sun Oct 29 03:32:22 2000
@@ -456,7 +456,7 @@
struct nlm_block **bp, *block;

dprintk("lockd: VFS unblock notification for block %p\n", fl);
- posix_unblock_lock(fl);
+ __posix_unblock_lock(fl);
for (bp = &nlm_blocked; (block = *bp); bp = &block->b_next) {
if (nlm_compare_locks(&block->b_call.a_args.lock.fl, fl)) {
svc_wake_up(block->b_daemon);
--- linux-2.4.0-test10-pre5/fs/fcntl.c Sun Oct 15 01:27:45 2000
+++ linux-akpm/fs/fcntl.c Sun Oct 29 03:35:47 2000
@@ -254,11 +254,15 @@
unlock_kernel();
break;
case F_GETLK:
+ lock_kernel();
err = fcntl_getlk(fd, (struct flock *) arg);
+ unlock_kernel();
break;
case F_SETLK:
case F_SETLKW:
+ lock_kernel();
err = fcntl_setlk(fd, cmd, (struct flock *) arg);
+ unlock_kernel();
break;
case F_GETOWN:
/*

2000-10-29 19:46:13

[permalink] [raw]

Subject: Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was:

On Sat, 28 Oct 2000, Alan Cox wrote:

> > The big question is: why is Apache using file locking so
> > much? Is this normal behaviour for Apache?
>
> Apache uses file locking to serialize accept on hosts where accept either has
> bad thundering heard problems or was simply broken with multiple acceptors

if apache 1.3 is compiled with -DSINGLE_LISTEN_UNSERIALIZED_ACCEPT it'll
avoid the fcntl() serialisation when there is only one listening port.
(it still uses it for multiple listeners... you can read all about my
logic for that at <http://www.apache.org/docs/misc/perf-tuning.html>.)

is it appropriate for this to be defined for newer linux kernels? i
haven't kept track, sorry. tell me what versions to conditionalize it on.

-dean

2000-10-30 06:31:12

[permalink] [raw]

Subject: Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was:

On Sun, Oct 29, 2000 at 11:45:49AM -0800, dean gaudet wrote:
> On Sat, 28 Oct 2000, Alan Cox wrote:
>
> > > The big question is: why is Apache using file locking so
> > > much? Is this normal behaviour for Apache?
> >
> > Apache uses file locking to serialize accept on hosts where accept either has
> > bad thundering heard problems or was simply broken with multiple acceptors
>
> if apache 1.3 is compiled with -DSINGLE_LISTEN_UNSERIALIZED_ACCEPT it'll
> avoid the fcntl() serialisation when there is only one listening port.
> (it still uses it for multiple listeners... you can read all about my
> logic for that at <http://www.apache.org/docs/misc/perf-tuning.html>.)
>
> is it appropriate for this to be defined for newer linux kernels? i
> haven't kept track, sorry. tell me what versions to conditionalize it on.

It should not be needed anymore for 2.4, because the accept() wakeup has been
fixed.

-Andi

2000-10-30 09:28:38

by kumon

[permalink] [raw]

Subject: Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was: Strange performance behavior of 2.4.0-test9)

Andrew Morton writes:
>
> I agree with me. Could you please test the scalability
> of this?

Here is the result, measured on 8-way profusion.

Andrew posted two paches, so called P1 and P2.

Req/s
test10-pre5: 2255 bad performance
----
test9+P2: 5243
test10-pre5+P1: 5187
test10-pre5+P2: 5258

P2 may be a little bit better.

----------
Data summary sorted by the performance:
test8 5287 <-- best performance
test10-pre5+P2: 5258
test9+P2: 5243
test9+mypatch: 5192 <-- a little bit worse
test10-pre5+P1: 5187
test1 3702 <-- no good scalability
test10-pre5: 2255 <-- negative scalability
test9 2193

The value changes within 30-50 seems fluctuations.

--
Computer Systems Laboratory, Fujitsu Labs.
[email protected]

2000-10-30 15:00:58

[permalink] [raw]

Subject: Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was: Strange performance behavior of 2.4.0-test9)

[email protected] wrote:
>
> Andrew Morton writes:
> >
> > I agree with me. Could you please test the scalability
> > of this?
>
> Here is the result, measured on 8-way profusion.

Thank you!

> Andrew posted two paches, so called P1 and P2.

Was `P2' the shorter one? It looks like it.

> Req/s
> test10-pre5: 2255 bad performance
> ----
> test9+P2: 5243
> test10-pre5+P1: 5187
> test10-pre5+P2: 5258
>
> P2 may be a little bit better.

I'd be interested in seeing the -DSINGLE_LISTEN_UNSERIALIZED_ACCEPT
figures.

Dean, it looks like the same problem will occur with flock()-based
serialisation. Does Apache/Linux ever use that option?

2000-10-30 15:32:54

by Andrea Arcangeli

[permalink] [raw]

Subject: Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was:

On Mon, Oct 30, 2000 at 07:29:51AM +0100, Andi Kleen wrote:
> It should not be needed anymore for 2.4, because the accept() wakeup has been
> fixed.

Certainly sleeping in accept will be just way better than file any locking.

OTOH accept() is still _wrong_ as it wake-one FIFO while it should wake-one
LIFO (so that we optimize the cache usage skip TLB flushes and allow the
redundand tasks to be paged out). I can only see cons in doing FIFO wake-one.

Andrea

2000-10-30 16:39:13

by Rik van Riel

[permalink] [raw]

Subject: Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was:

On Mon, 30 Oct 2000, Andrea Arcangeli wrote:
> On Mon, Oct 30, 2000 at 07:29:51AM +0100, Andi Kleen wrote:
> > It should not be needed anymore for 2.4, because the accept() wakeup has been
> > fixed.
>
> Certainly sleeping in accept will be just way better than file any locking.
>
> OTOH accept() is still _wrong_ as it wake-one FIFO while it
> should wake-one LIFO (so that we optimize the cache usage skip
> TLB flushes and allow the redundand tasks to be paged out). I
> can only see cons in doing FIFO wake-one.

LIFO wakeup is indeed the way to go for things like accept().

For stuff like ___wait_on_page(), OTOH, you really want FIFO
wakeup to avoid starvation (yes, I know we're currently doing
wake_all in ___wait_on_page ;))...

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
-- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/ http://www.surriel.com/

2000-10-30 18:03:14

by Andrea Arcangeli

[permalink] [raw]

Subject: Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was:

On Mon, Oct 30, 2000 at 02:36:39PM -0200, Rik van Riel wrote:
> For stuff like ___wait_on_page(), OTOH, you really want FIFO
> wakeup to avoid starvation (yes, I know we're currently doing

Sure agreed. In my _whole_ previous email I was only talking about accept.
Semaphores file locking etc.. all needs FIFO for fairness as you said.

Andrea

2000-10-30 23:24:53

[permalink] [raw]

Subject: Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was: Strange performance behavior of 2.4.0-test9)

On Tue, 31 Oct 2000, Andrew Morton wrote:

> Dean, it looks like the same problem will occur with flock()-based
> serialisation. Does Apache/Linux ever use that option?

from apache/src/include/ap_config.h in the linux section there's
this:

/* flock is faster ... but hasn't been tested on 1.x systems */
/* PR#3531 indicates flock() may not be stable, probably depends on
* kernel version. Go back to using fcntl, but provide a way for
* folks to tweak their Configuration to get flock.
*/
#ifndef USE_FLOCK_SERIALIZED_ACCEPT
#define USE_FCNTL_SERIALIZED_ACCEPT
#endif

so you should be able to -DUSE_FLOCK_SERIALIZED_ACCEPT to try it.

(flock is less desirable than fcntl in general because it requires the
lock file to be kept around for the duration of the server, you can't
unlock() it immediately after creat().)

-dean

2000-10-31 15:37:13

[permalink] [raw]

Subject: Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was: Strange performance behavior of 2.4.0-test9)

Kouichi,

how many Apache processes are you running? MaxSpareServers?

This patch is a moderate rewrite of __wake_up_common. I'd be
interested in seeing how much difference it makes to the
performance of Apache when the file-locking serialisation is
disabled.

- It implements last-in/first-out semantics for waking
TASK_EXCLUSIVE tasks.

- It fixes what was surely a bug wherein __wake_up_common
scans the entire wait queue even when it has found the
task which it wants to run.

On a dual-CPU box it dramatically increases the max connection
rate when there are a large number of waiters:

#waiters conn/sec (t10-p5+patch) conn/sec (t10-p5)
30 5525 4990
100 5700 4100
1000 5600 1500

This will be due entirely to the queue scanning fix - my
test app has a negligible cache footprint.

It's stable, but it's a work-in-progress.

- __wake_up_common does a fair amount of SMP-specific
stuff when compiled for UP which needs sorting out

- it seems there's somebody in the networking code who
changes a task state incorrectly when it's on a wait
queue. This used to be OK, but it's not OK now that
I'm relying upon the wait queue being in the state
which it should be.

Thanks.

--- linux-2.4.0-test10-pre5/kernel/sched.c Sun Oct 15 01:27:46 2000
+++ linux-akpm/kernel/sched.c Wed Nov 1 01:54:44 2000
@@ -697,6 +697,53 @@
return;
}

+#if WAITQUEUE_DEBUG
+/*
+ * Check that the wait queue is in the correct order:
+ * !TASK_EXCLUSIVE at the head
+ * TASK_EXCLUSIVE at the tail
+ * The list is locked.
+ */
+
+static void check_wq_sanity(wait_queue_head_t *q)
+{
+ struct list_head *probe, *head;
+
+ head = &q->task_list;
+ probe = head->next;
+ while (probe != head) {
+ wait_queue_t *curr = list_entry(probe, wait_queue_t, task_list);
+ if (curr->task->state & TASK_EXCLUSIVE)
+ break;
+ probe = probe->next;
+ }
+ while (probe != head) {
+ wait_queue_t *curr = list_entry(probe, wait_queue_t, task_list);
+ if (!(curr->task->state & TASK_EXCLUSIVE)) {
+ printk("check_wq_sanity: mangled wait queue\n");
+#ifdef CONFIG_X86
+ show_stack(0);
+#endif
+ break;
+ }
+ probe = probe->next;
+ }
+}
+#endif /* WAITQUEUE_DEBUG */
+
+/*
+ * Wake up some tasks which are on *q.
+ *
+ * All tasks which are !TASK_EXCLUSIVE are woken.
+ * Only one TASK_EXCLUSIVE task is woken.
+ * The !TASK_EXCLUSIVE tasks start at q->task_list.next
+ * The TASK_EXCLUSIVE tasks start at q->task_list.prev
+ *
+ * When waking a TASK_EXCLUSIVE task we search backward,
+ * so we find the most-recently-added task (it will have the
+ * hottest cache)
+ */
+
static inline void __wake_up_common (wait_queue_head_t *q, unsigned int mode,
const int sync)
{
@@ -714,6 +761,7 @@
wq_write_lock_irqsave(&q->lock, flags);

#if WAITQUEUE_DEBUG
+ check_wq_sanity(q);
CHECK_MAGIC_WQHEAD(q);
#endif

@@ -722,6 +770,11 @@
if (!head->next || !head->prev)
WQ_BUG();
#endif
+
+ /*
+ * if (mode & TASK_EXCLUSIVE) Wake all the !TASK_EXCLUSIVE tasks
+ * if !(mode & TASK_EXCLUSIVE) Wake all the tasks
+ */
tmp = head->next;
while (tmp != head) {
unsigned int state;
@@ -734,40 +787,69 @@
#endif
p = curr->task;
state = p->state;
+ if (state & mode & TASK_EXCLUSIVE)
+ break; /* Finished all !TASK_EXCLUSIVEs */
if (state & (mode & ~TASK_EXCLUSIVE)) {
#if WAITQUEUE_DEBUG
curr->__waker = (long)__builtin_return_address(0);
#endif
- /*
- * If waking up from an interrupt context then
- * prefer processes which are affine to this
- * CPU.
- */
- if (irq && (state & mode & TASK_EXCLUSIVE)) {
+ if (sync)
+ wake_up_process_synchronous(p);
+ else
+ wake_up_process(p);
+ }
+ }
+
+ /*
+ * Now wake one TASK_EXCLUSIVE task.
+ */
+ if (mode & TASK_EXCLUSIVE) {
+ tmp = head->prev;
+ while (tmp != head) {
+ unsigned int state;
+ wait_queue_t *curr = list_entry(tmp, wait_queue_t, task_list);
+
+ tmp = tmp->prev;
+#if WAITQUEUE_DEBUG
+ CHECK_MAGIC(curr->__magic);
+#endif
+ p = curr->task;
+ state = p->state;
+ if (!(state & TASK_EXCLUSIVE))
+ break; /* Exhausted all TASK_EXCLUSIVEs */
+
+ if (state & mode) {
+ /*
+ * If waking up from an interrupt context then
+ * prefer processes which are affine to this
+ * CPU.
+ */
if (!best_exclusive)
best_exclusive = p;
- else if ((p->processor == best_cpu) &&
- (best_exclusive->processor != best_cpu))
+ if (irq) {
+ if (p->processor == best_cpu) {
best_exclusive = p;
- } else {
- if (sync)
- wake_up_process_synchronous(p);
- else
- wake_up_process(p);
- if (state & mode & TASK_EXCLUSIVE)
+ break;
+ }
+ } else {
break;
+ }
}
}
- }
- if (best_exclusive)
- best_exclusive->state = TASK_RUNNING;
- wq_write_unlock_irqrestore(&q->lock, flags);
-
- if (best_exclusive) {
- if (sync)
- wake_up_process_synchronous(best_exclusive);
- else
- wake_up_process(best_exclusive);
+
+ if (best_exclusive)
+ best_exclusive->state = TASK_RUNNING;
+
+ wq_write_unlock_irqrestore(&q->lock, flags);
+
+ if (best_exclusive) {
+ if (sync)
+ wake_up_process_synchronous(best_exclusive);
+ else
+ wake_up_process(best_exclusive);
+ }
+ } else {
+ wq_write_unlock_irqrestore(&q->lock, flags);
}
out:
return;
--- linux-2.4.0-test10-pre5/kernel/fork.c Sat Sep 9 16:19:30 2000
+++ linux-akpm/kernel/fork.c Wed Nov 1 02:08:00 2000
@@ -39,6 +39,14 @@
unsigned long flags;

wq_write_lock_irqsave(&q->lock, flags);
+#if WAITQUEUE_DEBUG
+ if (wait->task.state & TASK_EXCLUSIVE) {
+ printk("add_wait_queue: exclusive task added at head!\n");
+#ifdef CONFIG_X86
+ show_stack(0);
+#endif
+ }
+#endif
__add_wait_queue(q, wait);
wq_write_unlock_irqrestore(&q->lock, flags);
}
@@ -48,6 +56,14 @@
unsigned long flags;

wq_write_lock_irqsave(&q->lock, flags);
+#if WAITQUEUE_DEBUG
+ if (!(wait->task.state & TASK_EXCLUSIVE)) {
+ printk("add_wait_queue: non-exclusive task added at tail!\n");
+#ifdef CONFIG_X86
+ show_stack(0);
+#endif
+ }
+#endif
__add_wait_queue_tail(q, wait);
wq_write_unlock_irqrestore(&q->lock, flags);
}

2000-11-01 01:03:15

by kumon

[permalink] [raw]

Subject: Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was: Strange performance behavior of 2.4.0-test9)

Andrew Morton writes:
> how many Apache processes are you running? MaxSpareServers?

MinSpareServers 16
MaxSpareServers 20
StartServers 16
MaxClients 64
MaxRequestsPerChild 0

During busy duration, 64 server processes are working if requests are
processed without trouble.
32 client systems are used and 2 request threads are run on each.

> performance of Apache when the file-locking serialisation is
> disabled.

Making apache without locking is now on-going.
# LinuxWorld conference held in Tokyo these two days, may be an
# obstacle.

--
Computer Systems Laboratory, Fujitsu Labs.
[email protected]

2000-11-02 11:09:42

by kumon

[permalink] [raw]

Subject: Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was: Strange performance behavior of 2.4.0-test9)

Andrew Morton writes:
> This patch is a moderate rewrite of __wake_up_common. I'd be
> interested in seeing how much difference it makes to the
> performance of Apache when the file-locking serialisation is
> disabled.
> - It implements last-in/first-out semantics for waking
> TASK_EXCLUSIVE tasks.
> - It fixes what was surely a bug wherein __wake_up_common
> scans the entire wait queue even when it has found the
> task which it wants to run.

We've measured Apache w/ and w/o serialize_accept on
several kernel configurations.

Apache compilation settings are those two:
* without option (conventional setting)
(w/ serialize)
* with -DSINGLE_LISTEN_UNSERIALIZED_ACCEPT
(w/o serialize)

We compared the performance of distributed binary and our binary with
default setting, the performance is almost equivalent. All following
data are based on our binaries.

w/ serialize w/o serialize
2.40-t10-pre5 2237 5358
2.40-t10-pre5+P2 5253 5355**
2.40-t10-pre5+P3 --- NG

** with this configuration, once we observed the machine completely
deadlocked with the following message:

Unable to handle kernel NULL pointer dereferenceNMI watchdog detected LOCKUP on CPU1.

Actually, the NMI message followed few seconds after the NULL
dereference message.
The virtual address nor eip are not displayed.

> It's stable, but it's a work-in-progress.
> - __wake_up_common does a fair amount of SMP-specific
> stuff when compiled for UP which needs sorting out
> - it seems there's somebody in the networking code who
> changes a task state incorrectly when it's on a wait
> queue. This used to be OK, but it's not OK now that
> I'm relying upon the wait queue being in the state
> which it should be.

Your last patch makes the problem very clear.

2.4.0-test10-pre5 with the LIFO patch (P3), we can't get the values.
It always deadlock with same manner. Perhaps, it failed to get
console lock then deadlock.

What we can now is combining several printk's in the null-pointer
dereference message into one printk and make it display much usefull
data.

--
Computer Systems Laboratory, Fujitsu Labs.
[email protected]

2000-11-02 12:50:35

by kumon

[permalink] [raw]

Subject: Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was: Strange performance behavior of 2.4.0-test9)

[email protected] writes:
> Your last patch makes the problem very clear.
>
> 2.4.0-test10-pre5 with the LIFO patch (P3), we can't get the values.

It dies at the following line in kernel/sched.c:schedule()

move_rr_back:

switch (prev->state & ~TASK_EXCLUSIVE) {
case TASK_INTERRUPTIBLE:
if (signal_pending(prev)) {
prev->state = TASK_RUNNING;
break;
}
default:
here==> del_from_runqueue(prev);
case TASK_RUNNING:
}

"prev" contains a run_list structure and prev->run_list.next pointed
to NULL. This caused the bus-error in __list_del().

# actually the CPU accesses address 0x4 (4 is offset to prev within
# run_list).

BTW, why this switch statement has less breaks at everywhere.

--
Computer Systems Laboratory, Fujitsu Labs.
[email protected]

2000-11-04 05:07:43

[permalink] [raw]

Subject: Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was: Strange performance behavior of 2.4.0-test9)

--- linux-2.4.0-test10-pre5/include/linux/wait.h Sun Oct 15 01:27:46 2000
+++ linux-akpm/include/linux/wait.h Fri Nov 3 21:14:30 2000
@@ -14,45 +14,24 @@
#include <linux/list.h>
#include <linux/stddef.h>
#include <linux/spinlock.h>
+#include <linux/cache.h>

#include <asm/page.h>
#include <asm/processor.h>

/*
- * Temporary debugging help until all code is converted to the new
- * waitqueue usage.
+ * Debugging control. Slow and useful.
*/
#define WAITQUEUE_DEBUG 0

-#if WAITQUEUE_DEBUG
-extern int printk(const char *fmt, ...);
-#define WQ_BUG() do { \
- printk("wq bug, forcing oops.\n"); \
- BUG(); \
-} while (0)
-
-#define CHECK_MAGIC(x) if (x != (long)&(x)) \
- { printk("bad magic %lx (should be %lx), ", (long)x, (long)&(x)); WQ_BUG(); }
-
-#define CHECK_MAGIC_WQHEAD(x) do { \
- if (x->__magic != (long)&(x->__magic)) { \
- printk("bad magic %lx (should be %lx, creator %lx), ", \
- x->__magic, (long)&(x->__magic), x->__creator); \
- WQ_BUG(); \
- } \
-} while (0)
-#endif
-
-struct __wait_queue {
- unsigned int compiler_warning;
- struct task_struct * task;
- struct list_head task_list;
-#if WAITQUEUE_DEBUG
- long __magic;
- long __waker;
+/*
+ * Used for ifdef reduction
+ */
+#ifdef CONFIG_SMP
+# define WQ_SMP 1
+#else
+# define WQ_SMP 0
#endif
-};
-typedef struct __wait_queue wait_queue_t;

/*
* 'dual' spinlock architecture. Can be switched between spinlock_t and
@@ -88,38 +67,105 @@
# define wq_write_unlock spin_unlock
#endif

+/*
+ * Data types
+ */
+
+struct __wait_queue {
+ struct task_struct * task;
+ struct list_head tasklist;
+#if WAITQUEUE_DEBUG
+ long __magic;
+ long __waker;
+#endif
+} ____cacheline_aligned;
+typedef struct __wait_queue wait_queue_t;
+
struct __wait_queue_head {
wq_lock_t lock;
- struct list_head task_list;
+ struct list_head tasklist;
+ struct list_head separator;
#if WAITQUEUE_DEBUG
long __magic;
long __creator;
#endif
-};
+} ____cacheline_aligned;
typedef struct __wait_queue_head wait_queue_head_t;

+/*
+ * Debugging macros. We eschew `do { } while (0)' because gcc can generate
+ * spurious .aligns.
+ */
#if WAITQUEUE_DEBUG
-# define __WAITQUEUE_DEBUG_INIT(name) \
- , (long)&(name).__magic, 0
-# define __WAITQUEUE_HEAD_DEBUG_INIT(name) \
- , (long)&(name).__magic, (long)&(name).__magic
+#define WQ_BUG() BUG()
+#define CHECK_MAGIC(x) \
+ do { \
+ if ((x) != (long)&(x)) { \
+ printk("bad magic %lx (should be %lx), ", \
+ (long)x, (long)&(x)); \
+ WQ_BUG(); \
+ } \
+ } while (0)
+#define CHECK_MAGIC_WQHEAD(x) \
+ do { \
+ if ((x)->__magic != (long)&((x)->__magic)) { \
+ printk("bad magic %lx (should be %lx, creator %lx), ", \
+ (x)->__magic, (long)&((x)->__magic), (x)->__creator); \
+ WQ_BUG(); \
+ } \
+ } while (0)
+#define WQ_CHECK_LIST_HEAD(list) \
+ do { \
+ if (!list->next || !list->prev) \
+ WQ_BUG(); \
+ } while(0)
+#define WQ_NOTE_WAKER(tsk) \
+ do { \
+ tsk->__waker = (long)__builtin_return_address(0); \
+ } while (0)
+#else
+#define WQ_BUG()
+#define CHECK_MAGIC(x)
+#define CHECK_MAGIC_WQHEAD(x)
+#define WQ_CHECK_LIST_HEAD(list)
+#define WQ_NOTE_WAKER(tsk)
+#define check_wq_sanity(q)
+#endif
+
+/*
+ * Macros for declaration and initialisaton of the datatypes
+ */
+
+#if WAITQUEUE_DEBUG
+# define __WAITQUEUE_DEBUG_INIT(name) (long)&(name).__magic, 0
+# define __WAITQUEUE_HEAD_DEBUG_INIT(name) (long)&(name).__magic, (long)&(name).__magic
#else
# define __WAITQUEUE_DEBUG_INIT(name)
# define __WAITQUEUE_HEAD_DEBUG_INIT(name)
#endif

-#define __WAITQUEUE_INITIALIZER(name,task) \
- { 0x1234567, task, { NULL, NULL } __WAITQUEUE_DEBUG_INIT(name)}
-#define DECLARE_WAITQUEUE(name,task) \
- wait_queue_t name = __WAITQUEUE_INITIALIZER(name,task)
-
-#define __WAIT_QUEUE_HEAD_INITIALIZER(name) \
-{ WAITQUEUE_RW_LOCK_UNLOCKED, { &(name).task_list, &(name).task_list } \
- __WAITQUEUE_HEAD_DEBUG_INIT(name)}
+#define __WAITQUEUE_INITIALIZER(name, tsk) { \
+ task: tsk, \
+ tasklist: { NULL, NULL }, \
+ __WAITQUEUE_DEBUG_INIT(name)}
+
+#define DECLARE_WAITQUEUE(name, tsk) \
+ wait_queue_t name = __WAITQUEUE_INITIALIZER(name, tsk)
+
+#define __WAIT_QUEUE_HEAD_INITIALIZER(name) { \
+ lock: WAITQUEUE_RW_LOCK_UNLOCKED, \
+ tasklist: { &(name).separator, &(name).separator }, \
+ separator: { &(name).tasklist, &(name).tasklist }, \
+ __WAITQUEUE_HEAD_DEBUG_INIT(name)}

#define DECLARE_WAIT_QUEUE_HEAD(name) \
wait_queue_head_t name = __WAIT_QUEUE_HEAD_INITIALIZER(name)

+
+/*
+ * Inline functions
+ */
+
static inline void init_waitqueue_head(wait_queue_head_t *q)
{
#if WAITQUEUE_DEBUG
@@ -127,7 +173,10 @@
WQ_BUG();
#endif
q->lock = WAITQUEUE_RW_LOCK_UNLOCKED;
- INIT_LIST_HEAD(&q->task_list);
+ q->tasklist.next = &q->separator;
+ q->tasklist.prev = &q->separator;
+ q->separator.next = &q->tasklist;
+ q->separator.prev = &q->tasklist;
#if WAITQUEUE_DEBUG
q->__magic = (long)&q->__magic;
q->__creator = (long)current_text_addr();
@@ -155,7 +204,7 @@
CHECK_MAGIC_WQHEAD(q);
#endif

- return !list_empty(&q->task_list);
+ return q->tasklist.next != q->tasklist.prev;
}

static inline void __add_wait_queue(wait_queue_head_t *head, wait_queue_t *new)
@@ -165,10 +214,12 @@
WQ_BUG();
CHECK_MAGIC_WQHEAD(head);
CHECK_MAGIC(new->__magic);
- if (!head->task_list.next || !head->task_list.prev)
+ if (!head->tasklist.next || !head->tasklist.prev)
+ WQ_BUG();
+ if (!head->separator.next || !head->separator.prev)
WQ_BUG();
#endif
- list_add(&new->task_list, &head->task_list);
+ list_add(&new->tasklist, &head->tasklist);
}

/*
@@ -182,10 +233,12 @@
WQ_BUG();
CHECK_MAGIC_WQHEAD(head);
CHECK_MAGIC(new->__magic);
- if (!head->task_list.next || !head->task_list.prev)
+ if (!head->tasklist.next || !head->tasklist.prev)
+ WQ_BUG();
+ if (!head->separator.next || !head->separator.prev)
WQ_BUG();
#endif
- list_add_tail(&new->task_list, &head->task_list);
+ list_add_tail(&new->tasklist, &head->tasklist);
}

static inline void __remove_wait_queue(wait_queue_head_t *head,
@@ -196,7 +249,8 @@
WQ_BUG();
CHECK_MAGIC(old->__magic);
#endif
- list_del(&old->task_list);
+ list_del(&old->tasklist);
+ old->task = 0; /* AKPM: this is temporary */
}

#endif /* __KERNEL__ */
--- linux-2.4.0-test10-pre5/kernel/sched.c Sun Oct 15 01:27:46 2000
+++ linux-akpm/kernel/sched.c Fri Nov 3 20:26:23 2000
@@ -697,78 +697,166 @@
return;
}

+#if WAITQUEUE_DEBUG
+static void dump_wq(wait_queue_head_t *q)
+{
+ struct list_head *probe, *head, *separator;
+ int i = 0;
+
+ head = &q->tasklist;
+ separator = &q->separator;
+ probe = head->next;
+ while (probe != head) {
+ if (probe == separator) {
+ printk("%d: separator\n", i);
+ } else {
+ wait_queue_t *curr = list_entry(probe, wait_queue_t, tasklist);
+ struct task_struct *tsk = curr->task;
+ printk("%d: pid=%d, state=0x%lx\n", i, tsk->pid, tsk->state);
+ }
+ probe = probe->next;
+ i++;
+ }
+}
+
+/*
+ * Check that the wait queue is in the correct order:
+ * !TASK_EXCLUSIVE at the head
+ * TASK_EXCLUSIVE at the tail
+ * We know that the list is locked.
+ * This function generates bogus results because of harmless races elsewhere.
+ */
+
+static void check_wq_sanity(wait_queue_head_t *q)
+{
+ struct list_head *probe, *head, *separator;
+
+ head = &q->tasklist;
+ separator = &q->separator;
+ probe = head->next;
+ while (probe != head) {
+ if (probe != separator) {
+ wait_queue_t *curr = list_entry(probe, wait_queue_t, tasklist);
+ if (curr->task->state & TASK_EXCLUSIVE)
+ break;
+ }
+ probe = probe->next;
+ }
+ while (probe != head) {
+ if (probe != separator) {
+ wait_queue_t *curr = list_entry(probe, wait_queue_t, tasklist);
+ if (!(curr->task->state & TASK_EXCLUSIVE)) {
+ printk("check_wq_sanity: mangled wait queue\n");
+#ifdef CONFIG_X86
+ show_stack(0);
+#endif
+ dump_wq(q);
+ }
+ }
+ probe = probe->next;
+ }
+}
+#endif /* WAITQUEUE_DEBUG */
+
+/*
+ * Wake up some tasks which are on *q.
+ *
+ * All tasks which are non-exclusive are woken.
+ * Only one exclusive task is woken.
+ * The non-eclusive tasks start at q->tasklist.next and end at q->separator
+ * The exclusive tasks start at q->tasklist.prev and end at q->separator
+ *
+ * When waking an exclusive task we search backward,
+ * so we find the most-recently-added task (it will have the
+ * hottest cache)
+ */
+
static inline void __wake_up_common (wait_queue_head_t *q, unsigned int mode,
const int sync)
{
- struct list_head *tmp, *head;
- struct task_struct *p, *best_exclusive;
+ struct list_head *tmp, *head, *separator;
+ struct task_struct *p;
unsigned long flags;
int best_cpu, irq;
+ unsigned int exclusive_mode;

if (!q)
goto out;

- best_cpu = smp_processor_id();
- irq = in_interrupt();
- best_exclusive = NULL;
+ if (WQ_SMP) {
+ best_cpu = smp_processor_id();
+ irq = in_interrupt();
+ }
+ exclusive_mode = mode & TASK_EXCLUSIVE;
+ mode &= ~TASK_EXCLUSIVE;
+
wq_write_lock_irqsave(&q->lock, flags);

-#if WAITQUEUE_DEBUG
+ check_wq_sanity(q);
CHECK_MAGIC_WQHEAD(q);
-#endif

- head = &q->task_list;
-#if WAITQUEUE_DEBUG
- if (!head->next || !head->prev)
- WQ_BUG();
-#endif
+ head = &q->tasklist;
+ separator = &q->separator;
+ WQ_CHECK_LIST_HEAD(head);
+
+ /*
+ * Wake all the !TASK_EXCLUSIVE tasks
+ */
tmp = head->next;
- while (tmp != head) {
- unsigned int state;
- wait_queue_t *curr = list_entry(tmp, wait_queue_t, task_list);
+ while (tmp != separator) {
+ wait_queue_t *curr = list_entry(tmp, wait_queue_t, tasklist);

tmp = tmp->next;
+ CHECK_MAGIC(curr->__magic);
+ p = curr->task;
+ if (p->state & mode) {
+ WQ_NOTE_WAKER(curr);
+ if (sync)
+ wake_up_process_synchronous(p);
+ else
+ wake_up_process(p);
+ }
+ }

-#if WAITQUEUE_DEBUG
+ /*
+ * If (mode & TASK_EXCLUSIVE) we wake one exclusive task.
+ * If !(mode & TASK_EXCLUSIVE) we wake all exclusive tasks.
+ *
+ * If waking up exclusively from an interrupt context then
+ * prefer tasks which are affine to this CPU.
+ * If we can't find an affine task then wake up the
+ * least-recently queued one.
+ */
+ tmp = head->prev;
+ while (tmp != separator) {
+ wait_queue_t *curr = list_entry(tmp, wait_queue_t, tasklist);
+
+ tmp = tmp->prev;
CHECK_MAGIC(curr->__magic);
-#endif
p = curr->task;
- state = p->state;
- if (state & (mode & ~TASK_EXCLUSIVE)) {
-#if WAITQUEUE_DEBUG
- curr->__waker = (long)__builtin_return_address(0);
-#endif
- /*
- * If waking up from an interrupt context then
- * prefer processes which are affine to this
- * CPU.
- */
- if (irq && (state & mode & TASK_EXCLUSIVE)) {
- if (!best_exclusive)
- best_exclusive = p;
- else if ((p->processor == best_cpu) &&
- (best_exclusive->processor != best_cpu))
- best_exclusive = p;
+ if (p->state & mode) {
+ struct task_struct *to_wake = NULL;
+
+ if (WQ_SMP && exclusive_mode && irq) {
+ if (p->processor == best_cpu || tmp == separator)
+ to_wake = p;
} else {
+ to_wake = p;
+ }
+
+ if (to_wake) {
if (sync)
- wake_up_process_synchronous(p);
+ wake_up_process_synchronous(to_wake);
else
- wake_up_process(p);
- if (state & mode & TASK_EXCLUSIVE)
+ wake_up_process(to_wake);
+
+ if (exclusive_mode)
break;
}
}
}
- if (best_exclusive)
- best_exclusive->state = TASK_RUNNING;
- wq_write_unlock_irqrestore(&q->lock, flags);

- if (best_exclusive) {
- if (sync)
- wake_up_process_synchronous(best_exclusive);
- else
- wake_up_process(best_exclusive);
- }
+ wq_write_unlock_irqrestore(&q->lock, flags);
out:
return;
}

Attachments:

flock.patch (454.00 B)
task_exclusive.patch (11.48 kB)
Download all attachments

2000-11-04 05:09:03

[permalink] [raw]

Subject: Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was:Strange performance behavior of 2.4.0-test9)

dean gaudet wrote:
>
> On Tue, 31 Oct 2000, Andrew Morton wrote:
>
> > Dean, it looks like the same problem will occur with flock()-based
> > serialisation. Does Apache/Linux ever use that option?
>
> from apache/src/include/ap_config.h in the linux section there's
> this:
>
> /* flock is faster ... but hasn't been tested on 1.x systems */
> /* PR#3531 indicates flock() may not be stable, probably depends on
> * kernel version. Go back to using fcntl, but provide a way for
> * folks to tweak their Configuration to get flock.
> */
> #ifndef USE_FLOCK_SERIALIZED_ACCEPT
> #define USE_FCNTL_SERIALIZED_ACCEPT
> #endif
>
> so you should be able to -DUSE_FLOCK_SERIALIZED_ACCEPT to try it.
>

Dean,

neither flock() nor fcntl() serialisation are effective
on linux 2.2 or linux 2.4. This is because the file
locking code still wakes up _all_ waiters. In my testing
with fcntl serialisation I have seen a single Apache
instance get woken and put back to sleep 1,500 times
before the poor thing actually got to service a request.

For kernel 2.2 I recommend that Apache consider using
sysv semaphores for serialisation. They use wake-one.

For kernel 2.4 I recommend that Apache use unserialised
accept.

This means that you'll need to make a runtime decision
on whether to use unserialised, serialised with sysv or
serialised with fcntl (if sysv IPC isn't installed).

In my testing I launched 3, 10, 30 or 150 Apache instances and then used

httperf --num-conns=2000 --num-calls=1 --uri=/index.html

to open, use and close 2000 connections.

Here are the (terrible) results on 2.4 SMP with fcntl
serialisation:

fcntl accept, 3 servers, vanilla: 938.0 req/s
fcntl accept, 30 servers, vanilla: 697.1 req/s
fcntl accept, 150 servers, vanilla: 99.9 req/s (sic)

2.4 SMP with no serialisation:

unserialised accept, 3 servers, vanilla: 1049.0 req/s
unserialised accept, 10 servers, vanilla: 968.8 req/s
unserialised accept, 30 servers, vanilla: 1040.2 req/s
unserialised accept, 150 servers, vanilla: 1091.4 req/s

2.4 SMP with no serialisation and my patch to the
wakeup and waitqueue code:

unserialised accept, 3 servers, task_exclusive: 1117.4 req/s
unserialised accept, 10 servers, task_exclusive: 1118.6 req/s
unserialised accept, 30 servers, task_exclusive: 1105.6 req/s
unserialised accept, 150 servers, task_exclusive: 1077.1 req/s

2.4 SMP with sysv semaphore serialisation:

sysvsem accept, 3 servers: 1001.2 req/s
sysvsem accept, 10 servers: 1061.0 req/s
sysvsem accept, 30 servers: 1021.2 req/s
sysvsem accept, 150 servers: 943.6 req/s

2.2.14 SMP with fcntl serialisation:

fcntl accept, 3 servers: 1053.8 req/s
fcntl accept, 10 servers: 996.2 req/s
fcntl accept, 30 servers: 934.3 req/s
fcntl accept, 150 servers: 141.4 req/s (sic)

2.2.14 SMP with no serialisation:

unserialised accept, 3 servers: 1039.9 req/s
unserialised accept, 10 servers: 983.1 req/s
unserialised accept, 30 servers: 775.7 req/s
unserialised accept, 150 servers: 220.7 req/s (sic)

2.2.14 SMP with sysv sem serialisation:

sysv accept, 3 servers: 932.2 req/s
sysv accept, 10 servers: 910.6 req/s
sysv accept, 30 servers: 1026.6 req/s
sysv accept, 150 servers: 927.2 req/s

Note that the first test (2.4 with fcntl serialisation) was
with an unpatched 2.4.0-test10-pre5. Once the simple
flock.patch is applied, the performance with 150 servers
doubles. But it's still sucky. The flock.patch change
is effective in increasing scalability wiht a large number
of CPUs, not a large number of httpd's.

Here's the silly patch I used to turn on sysv sem serialisation
in Apache. There's probably a better way than this :)

--- apache_1.3.14.orig/src/main/http_main.c Fri Sep 29 00:32:36 2000
+++ apache_1.3.14/src/main/http_main.c Sat Nov 4 15:01:41 2000
@@ -172,6 +172,13 @@

#include "explain.h"

+/* AKPM */
+#if 1
+#define NEED_UNION_SEMUN
+#define USE_SYSVSEM_SERIALIZED_ACCEPT
+#define USE_FCNTL_SERIALIZED_ACCEPT
+#endif
+
#if !defined(max)
#define max(a,b) (a > b ? a : b)
#endif

2000-11-04 05:55:59

by kumon

[permalink] [raw]

Subject: Preemptive scheduling of woken-up processes

The current schduler has a problem handling waken up processes.

In the current kernel woken up processes probably try to preempt CPU
from the currently running process.

That is the fundamental reason of the negative scalability of Apache
on a large SMP system, which has been reported already.
The modification of scheduler is introduced between test8 and test9.
So, test9 owes two problems for the Apache performance.
One for flock, another for scheduler.

Detailed Description:

When a process releases the semaphore, it wakes up the slept process
through wake_up_process(). Then, schedule_idle() has the final duty.

In the current scheduler logic, the woken up process may preempt CPU
from the current process if the goodness tells to do so. This is done
even if there are lots of idle CPUs. This severy disables parallel
execution on a MP system. The previously runnning process is runnable
but wating on run queue until the next scheduling timing, which
possibly is few milliseconds later.

# We've not yet proved above senario by experimentation. We'll do in
# next week if it is needed.

I think this scheduling policy is bad for MP systems.
Basically, the waken up process should run on the previous CPU ,
or, should run on an idle CPU.
This policy has been adopted in test8, and that is better, I think.

# Sorry to whom get this mail twice. I mistook LK-ML address.

--
Computer Systems Laboratory, Fujitsu Labs.
[email protected]

2000-11-04 06:24:04

by Linus Torvalds

[permalink] [raw]

Subject: Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was:Strange performance behavior of 2.4.0-test9)

In article <[email protected]>,
Andrew Morton <[email protected]> wrote:
>
>neither flock() nor fcntl() serialisation are effective
>on linux 2.2 or linux 2.4. This is because the file
>locking code still wakes up _all_ waiters. In my testing
>with fcntl serialisation I have seen a single Apache
>instance get woken and put back to sleep 1,500 times
>before the poor thing actually got to service a request.

Indeed.

flock() is the absolute worst case, and always has been. I guess nobody
every actually bothered to benchmark it.

>For kernel 2.2 I recommend that Apache consider using
>sysv semaphores for serialisation. They use wake-one.
>
>For kernel 2.4 I recommend that Apache use unserialised
>accept.

No.

Please use unserialized accept() _always_, because we can fix that.

Even 2.2.x can be fixed to do the wake-one for accept(), if required.
It's not going to be any worse than the current apache config, and
basically the less games apache plays, the better the kernel can try to
accomodate what apache _really_ wants done. When playing games, you
hide what you really want done, and suddenly kernel profiles etc end up
being completely useless, because they no longer give the data we needed
to fix the problem.

Basically, the whole serialization crap is all about the Apache people
saying the equivalent of "the OS does a bad job on something we consider
to be incredibly important, so we do something else instead to hide it".

And regardless of _what_ workaround Apache does, whether it is the sucky
fcntl() thing or using SysV semaphores, it's going to hide the real
issue and mean that it never gets fixed properly.

And in the end it will result in really really bad performance.

Instead, if apache had just done the thing it wanted to do in the first
place, the wake-one accept() semantics would have happened a hell of a
lot earlier.

Now it's there in 2.4.x. Please use it. PLEASE PLEASE PLEASE don't play
games trying to outsmart the OS, it will just hurt Apache in the long run.

Linus

2000-11-04 10:54:27

by Alan

[permalink] [raw]

Subject: Re: [PATCH] Re: Negative scalability by removal of

> Even 2.2.x can be fixed to do the wake-one for accept(), if required.

Do we really want to retrofit wake_one to 2.2. I know Im not terribly keen to
try and backport all the mechanism. I think for 2.2 using the semaphore is a
good approach. Its a hack to fix an old OS kernel. For 2.4 its not needed

2000-11-04 17:23:44

by Linus Torvalds

[permalink] [raw]

Subject: Re: [PATCH] Re: Negative scalability by removal of

On Sat, 4 Nov 2000, Alan Cox wrote:
>
> > Even 2.2.x can be fixed to do the wake-one for accept(), if required.
>
> Do we really want to retrofit wake_one to 2.2. I know Im not terribly keen to
> try and backport all the mechanism. I think for 2.2 using the semaphore is a
> good approach. Its a hack to fix an old OS kernel. For 2.4 its not needed

We don't need to backport of the full exclusive wait queues: we could do
the equivalent of the semaphore inside the kernel around just accept(). It
wouldn't be a generic thing, but it would fix the specific case of
accept().

Otherwise we're going to have old binaries of apache lying around forever
that do the wrong thing..

Linus

2000-11-04 20:03:48

[permalink] [raw]

Subject: Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was:Strange performance behavior of 2.4.0-test9)

On Fri, 3 Nov 2000, Linus Torvalds wrote:

> Please use unserialized accept() _always_, because we can fix that.

i can unserialise the single socket case, but the multiple socket case is
not so simple.

the executive summary is that when you've got multiple sockets you have to
use select(). select is necessarily wake-all. remember there's N
children trying to do select/accept. if the listening socket is
non-blocking then you spin in N-1 children; if it's blocking then you
starve other sockets.

see http://www.apache.org/docs/misc/perf-tuning.html, search for "multiple
sockets" for my full analysis of the problem.

> Instead, if apache had just done the thing it wanted to do in the first
> place, the wake-one accept() semantics would have happened a hell of a
> lot earlier.

counter-example: freebsd had wake-one semantics a few years before linux.

revision 1.237
date: 1998/09/29 01:22:57; author: marc; state: Exp; lines: +1 -0
Unserialized accept() should be safe (in all versions) and efficient
(in anything vaguely recent) on FreeBSD.

ok, we done finger pointing? :)

-dean

2000-11-04 20:11:40

[permalink] [raw]

Subject: Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was:Strange performance behavior of 2.4.0-test9)

On Sat, 4 Nov 2000, Andrew Morton wrote:

> Dean,
>
> neither flock() nor fcntl() serialisation are effective
> on linux 2.2 or linux 2.4.

i have to admit the last time i timed any of the methods on linux was in
2.0.x days. thanks for the updated data!

> For kernel 2.2 I recommend that Apache consider using
> sysv semaphores for serialisation. They use wake-one.

sysv semaphores have a very unfortunate negative feature -- if the admin
kill -9's the server (impatient admins do this all the time) then you end
up leaving a semaphore lying around. sysvsem don't have the usual unix
unlink semantics. actually flock has the same problem... which is why i
generally preferred fcntl whenever it was a performance wash, as it was
back in 2.0.x days.

however given the vast performance difference i think it warrants the
change. i'll include your results with the commit.

> For kernel 2.4 I recommend that Apache use unserialised
> accept.

per linus' request i'll unserialise 2.2 as well.

i'll leave 2.0.x settings alone.

(oh yeah, and compile-time only detection.)

-dean

2000-11-04 20:43:07

by Alan

[permalink] [raw]

Subject: Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was:Strange

> > Instead, if apache had just done the thing it wanted to do in the first
> > place, the wake-one accept() semantics would have happened a hell of a
> > lot earlier.
>
> counter-example: freebsd had wake-one semantics a few years before linux.

And Im sure apache authors can use the utsname() syscall 8)

2000-11-04 20:43:47

by Alan

[permalink] [raw]

Subject: Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was:Strange

> sysv semaphores have a very unfortunate negative feature -- if the admin
> kill -9's the server (impatient admins do this all the time) then you end
> up leaving a semaphore lying around. sysvsem don't have the usual unix

Umm they have SEM_UNDO. Its a case of deeper magic

2000-11-05 04:52:34

[permalink] [raw]

Subject: Re: [PATCH] Re: Negative scalability by removal of lock_kernel()?(Was:Strange

On Sat, 4 Nov 2000, Alan Cox wrote:

> > sysv semaphores have a very unfortunate negative feature -- if the admin
> > kill -9's the server (impatient admins do this all the time) then you end
> > up leaving a semaphore lying around. sysvsem don't have the usual unix
>
> Umm they have SEM_UNDO. Its a case of deeper magic

we use SEM_UNDO, that's not quite what i was worrying about. i was
worrying about leaving a stale semaphore in the global semaphore table.

IPC_RMID causes the semaphore to be destroyed immediately, rather than
after all the users are done.

-dean

2000-11-05 16:23:25

by Andrea Arcangeli

[permalink] [raw]

Subject: Re: [PATCH] Re: Negative scalability by removal of

On Sat, Nov 04, 2000 at 09:22:58AM -0800, Linus Torvalds wrote:
> We don't need to backport of the full exclusive wait queues: we could do
> the equivalent of the semaphore inside the kernel around just accept(). It
> wouldn't be a generic thing, but it would fix the specific case of
> accept().

The first wake-one patch floating around was against 2.2.x waitqueues and
it's a very simple patch and it fixes the problem (it also gives LIFO
to accept with the downside that it needs to do an O(N) browse on the
waitqueue before doing the exclusive wakeup compared to 2.4.x that does
the wake-one task selection in O(1) if everybody is sleeping in accept, but it
does that FIFO unfortunately).

The real problem that DaveM knows well is that TCP in 2.2.x will end doing
three wakeups every time the socket moves from LISTEN to ESTABLISHED state, so
it was really doing a wake-three not a wake-one :). So the brainer part
is to fix TCP and not the scheduler/waitqueue part.

Andrea

2000-11-05 20:22:16