2003-01-03 16:48:42

by jak

Subject: 2.4.21-pre2 stalls out when running unixbench

Hi Everyone,
2.4.21-pre2 and -pre1 stall when running unixbench 4.1.0. 2.4.20
works perfectly. Stall is detected by running

while true; do sync; sleep 1; echo -e '.\c'; done

in another window while unixbench is running. A stall is assumed if
the dots stop printing for 10 minutes. The stall will happen anywhere
from 40 seconds to 6-7 minutes into the run.

Rather than run all of unixbench, I have been running the parts that
seem to trigger the issue:

for i in 0 1 2 3 4 5 6 7 8 9; do
    ./Run fstime fsbuffer fsdisk
done


Other windows continue to operate and accept new commands, as long as
one stays away from sync(1).

Another 2.4.21-pre2 kernel, loaded up with patches (kdb, lcrash,
preempt, O1, etc), also exhibits the same behavior. I used this
kernel to get kdb stack tracebacks of all processes sitting in
uninterruptible sleep. This may be interesting, since generally one
cannot catch any process sleeping uninterruptibly unless the system
is suffering from a sleep-semaphore deadlock.

Random data points: ext3, scsi, dual cpu, problem did not appear when
booted singleuser (one attempt), kernel built with gcc version 2.96
20000731 (Red Hat Linux 7.3 2.96-110).

Joe - Concurrent Computer Corporation


[0]kdb> ps
Task Addr Pid Parent [*] cpu State Thread Command
0xc282e000 00000001 00000000 0 000 islp 0xc282e400 init
0xc282a000 00000002 00000001 0 000 islp 0xc282a400 migration/0
0xc2828000 00000003 00000001 0 001 islp 0xc2828400 migration/1
0xf7bf4000 00000004 00000001 0 001 islp 0xf7bf4400 keventd
0xf7bf2000 00000005 00000001 0 000 islp 0xf7bf2400 ksoftirqd/0
0xf7bee000 00000006 00000001 0 001 islp 0xf7bee400 ksoftirqd/1
0xf7bda000 00000007 00000001 0 000 islp 0xf7bda400 kswapd
0xf7bb6000 00000008 00000001 0 000 islp 0xf7bb6400 bdflush
0xf7bb2000 00000009 00000001 0 000 islp 0xf7bb2400 kupdated
0xf7b42000 00000010 00000001 0 000 islp 0xf7b42400 scsi_eh_0
0xf7b40000 00000011 00000001 0 000 islp 0xf7b40400 scsi_eh_1
0xf7b3c000 00000012 00000001 0 000 islp 0xf7b3c400 scsi_eh_2
0xf7af8000 00000013 00000001 0 001 islp 0xf7af8400 khubd
0xf7abe000 00000014 00000001 0 001 islp 0xf7abe400 kjournald
0xf76de000 00000111 00000001 0 000 islp 0xf76de400 kjournald
0xf76ca000 00000112 00000001 0 000 islp 0xf76ca400 kjournald
0xf76c2000 00000113 00000001 0 000 uslp 0xf76c2400 kjournald
0xf77ba000 00000222 00000001 0 000 islp 0xf77ba400 login
0xf74a4000 00000397 00000001 0 000 islp 0xf74a4400 syslogd
0xf7496000 00000401 00000001 0 000 islp 0xf7496400 klogd
0xf7488000 00000412 00000001 0 000 islp 0xf7488400 portmap
[0]more>
0xf7474000 00000431 00000001 0 001 islp 0xf7474400 rpc.statd
0xf7370000 00000498 00000001 0 000 islp 0xf7370400 sshd
0xf7338000 00000512 00000001 0 001 islp 0xf7338400 xinetd
0xf7190000 00000525 00000001 0 001 islp 0xf7190400 rpc.rquotad
0xf7186000 00000529 00000001 0 001 islp 0xf7186400 nfsd
0xf7180000 00000530 00000001 0 001 islp 0xf7180400 nfsd
0xf7176000 00000531 00000001 0 001 islp 0xf7176400 nfsd
0xf716e000 00000532 00000001 0 000 islp 0xf716e400 nfsd
0xf7166000 00000533 00000001 0 000 islp 0xf7166400 nfsd
0xf715e000 00000534 00000001 0 000 islp 0xf715e400 nfsd
0xf715c000 00000535 00000001 0 000 islp 0xf715c400 nfsd
0xf7154000 00000536 00000001 0 001 islp 0xf7154400 nfsd
0xf719c000 00000538 00000001 0 001 islp 0xf719c400 lockd
0xf714c000 00000539 00000538 0 001 islp 0xf714c400 rpciod
0xf7136000 00000544 00000001 0 001 islp 0xf7136400 rpc.mountd
0xf7110000 00000554 00000001 0 000 islp 0xf7110400 elmd
0xf7046000 00000640 00000001 0 000 islp 0xf7046400 gpm
0xf7030000 00000649 00000001 0 001 islp 0xf7030400 crond
0xf6c9a000 00000722 00000001 0 000 islp 0xf6c9a400 xfs
0xf76e4000 00000740 00000001 0 000 islp 0xf76e4400 atd
0xf6c5e000 00000750 00000001 0 000 islp 0xf6c5e400 nstar.d
0xf6cae000 00000757 00000001 0 001 islp 0xf6cae400 mingetty
0xf6c5c000 00000758 00000001 0 000 islp 0xf6c5c400 mingetty
[0]more>
0xf6bd4000 00000761 00000222 0 000 islp 0xf6bd4400 bash
0xf6b98000 00000807 00000761 0 000 islp 0xf6b98400 Run
0xf6b2e000 00000867 00000512 0 001 islp 0xf6b2e400 in.telnetd
0xf6b0e000 00000868 00000867 0 001 islp 0xf6b0e400 login
0xf693e000 00000869 00000868 0 000 islp 0xf693e400 bash
0xf694a000 00001363 00000807 0 000 islp 0xf694a400 time
0xf4fbe000 00001364 00001363 0 001 uslp 0xf4fbe400 fsdisk
0xf49f2000 00001468 00000869 0 000 uslp 0xf49f2400 sync

[0]kdb> btp 113
EBP EIP Function(args)
0xf76c3e18 0xc011c065 do_schedule+0x4f5 (0xf76c3e24, 0xc014d1b8, 0xf4cf3c80, 0x0, 0xf76c3fb8)
kernel .text 0xc0100000 0xc011bb70 0xc011c130
0xf76c3e60 0xc014bb4e __wait_on_buffer+0xce (0xf481bc00, 0x2, 0x0, 0x2ee0, 0xf76c2000)
kernel .text 0xc0100000 0xc014ba80 0xc014bbe0
0xf76c3fb8 0xc0181edf journal_commit_transaction+0x4ef (0xf76ce800, 0x0, 0x0, 0x1d27f, 0x0)
kernel .text 0xc0100000 0xc01819f0 0xc0182c9b
0xf76c3fec 0xc0184bc6 kjournald+0x176
kernel .text 0xc0100000 0xc0184a50 0xc0184ca0
0xc010745d kernel_thread+0x3d
kernel .text 0xc0100000 0xc0107420 0xc01074d0

[0]kdb> btp 1364
EBP EIP Function(args)
0xf4fbfd74 0xc011c065 do_schedule+0x4f5 (0xf4fbfd80, 0xf4fbfdb4, 0xc01ef37e, 0x805, 0xf76ce800)
kernel .text 0xc0100000 0xc011bb70 0xc011c130
0xf4fbfdbc 0xc014bb4e __wait_on_buffer+0xce (0xf481bc00, 0x1, 0xf76ce800, 0xc01801e6, 0xf6bb7d80)
kernel .text 0xc0100000 0xc014ba80 0xc014bbe0
0xf4fbfdec 0xc0180626 journal_dirty_data+0x116 (0xf6c20280, 0xf481bc00, 0x0, 0xc1aeb360, 0x0)
kernel .text 0xc0100000 0xc0180510 0xc0180700
0xf4fbfe4c 0xc0178068 journal_dirty_sync_data+0x18 (0xf6c20280, 0xf6b5d580, 0xf481bc00, 0x1000, 0x0)
kernel .text 0xc0100000 0xc0178050 0xc01780c0
0xf4fbfe70 0xc0177d74 walk_page_buffers+0x54 (0xf6c20280, 0xf6b5d580, 0xf481bc00, 0x0, 0x1000)
kernel .text 0xc0100000 0xc0177d20 0xc0177da0
0xf4fbfebc 0xc017831a ext3_commit_write+0x18a (0xf6961d80, 0xc1aeb360, 0x0, 0x1000, 0xf4fbff00)
kernel .text 0xc0100000 0xc0178190 0xc0178450
0xf4fbff54 0xc013c5af generic_file_write+0x4af (0xf6961d80, 0x80491e0, 0x1000, 0xf6961da0, 0xc0175790)
kernel .text 0xc0100000 0xc013c100 0xc013c810
0xf4fbff78 0xc01757b1 ext3_file_write+0x21 (0xf6961d80, 0x80491e0, 0x1000, 0xf6961da0, 0x1)
kernel .text 0xc0100000 0xc0175790 0xc0175840
0xf4fbffbc 0xc014a592 sys_write+0x112
kernel .text 0xc0100000 0xc014a480 0xc014a650
0xc0109234 trace_syscall_entry_ret+0x2d
kernel .text 0xc0100000 0xc0109207 0xc0109268

[0]kdb> btp 1468
EBP EIP Function(args)
0xf49f3f28 0xc011c065 do_schedule+0x4f5 (0x0, 0xf49f2000, 0xf76ce854, 0xf76ce854, 0xf76ce800)
kernel .text 0xc0100000 0xc011bb70 0xc011c130
0xf49f3f4c 0xc011c695 sleep_on+0x65 (0xf76cea00, 0x1)
kernel .text 0xc0100000 0xc011c630 0xc011c6f0
0xf49f3f5c 0xc01851da log_wait_commit+0x6a (0xf76ce800, 0x4b5df, 0xf76ce800, 0x0, 0xf76cea00)
kernel .text 0xc0100000 0xc0185170 0xc0185230
0xf49f3f78 0xc017efa6 ext3_sync_fs+0x26 (0xf76cea00, 0xf49f2000, 0x0)
kernel .text 0xc0100000 0xc017ef80 0xc017efb0
0xf49f3f8c 0xc01508ce sync_supers+0x13e (0x0, 0x1, 0x0, 0xffffffff, 0x0)
kernel .text 0xc0100000 0xc0150790 0xc0150920
0xf49f3fb0 0xc014c108 fsync_dev+0x58 (0x0)
kernel .text 0xc0100000 0xc014c0b0 0xc014c150
0xf49f3fbc 0xc014c16a sys_sync+0xa
kernel .text 0xc0100000 0xc014c160 0xc014c170
0xc0109234 trace_syscall_entry_ret+0x2d
kernel .text 0xc0100000 0xc0109207 0xc0109268


2003-01-03 19:58:53

by Andrew Morton

Subject: Re: 2.4.21-pre2 stalls out when running unixbench

Joe Korty wrote:
>
> Hi Everyone,
> 2.4.21-pre2 and -pre1 stall when running unixbench 4.1.0. 2.4.20
> works perfectly.

You don't have enough CPUs :) Unpatched 2.4.20 does the same thing.

Thanks for picking this up. Let me crunch on it a bit.

2003-01-03 20:23:47

by Andrew Morton

Subject: Re: 2.4.21-pre2 stalls out when running unixbench

In the previous episode of "travails of a geriatric kernel jock",
Andrew Morton wrote:
>
> Unpatched 2.4.20 does the same thing.
>

No it doesn't.

Good news is that 2.4.20 plus recent ext3 fixes doesn't lock up
either.

2003-01-04 01:03:30

by jak

Subject: Re: 2.4.21-pre2 stalls out when running unixbench

> In the previous episode of "travails of a geriatric kernel jock",
> Andrew Morton wrote:
> >
> > Unpatched 2.4.20 does the same thing.
> >
>
> No it doesn't.
>
> Good news is that 2.4.20 plus recent ext3 fixes doesn't lock up
> either.


Hi Andrew,
I have continued to play with the problem, and have gathered more evidence.

Everything works (kinda) when I back out the attached 2.4.21-pre1
patches. I am not sure how to uniquely identify patches in the BK
tree, so I have enclosed the full text of each at the end of this
letter.

A remaining problem is long stallouts not present in 2.4.20. The
stallouts are of uniform duration, 20 seconds, triggered aperiodically.

To show the aperiodic nature of the stallouts, I've replaced my simple dot test:

while true; do sync; sleep 1; echo -e '.\c'; done

with a new version that prints a newline each time sync(1) takes more than
three seconds to run:

#!/bin/bash
(
    while true; do
        sleep 1
        (sleep 3; echo) &
        child=$!
        sync
        printf .
        kill $child >/dev/null
    done
) 2>/dev/null

The dot output pattern for a long run (unmeasured but between 1/2 and
1 hr) looks like:

...............
.........................................................
......
.........................................................
......
.........................................................
........
.........................................................
......
.........................................................
......
.........................................................
........
.........................................................
......
.........................................................
......
.........................................................
........
.........................................................
......
.........................................................
......
.........................................................
.........
.........................................................
......
.........................................................
......
.........................................................
........
.........................................................
......
.........................................................
......
.........................................................
.........
.........................................................
......
.........................................................
......
.........................................................
........
.........................................................
......
..........................................................



# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
# ChangeSet 1.757.33.12 -> 1.757.33.13
# kernel/ksyms.c 1.65 -> 1.66
# include/linux/fs.h 1.70 -> 1.71
# net/socket.c 1.12 -> 1.12.1.1
# fs/super.c 1.46 -> 1.47
# fs/inode.c 1.35 -> 1.36
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 02/11/29 [email protected] 1.757.33.13
# [PATCH] backport 2.5 inode allocation changes
#
# This patch adds ->alloc_inode and ->destroy_inode super operations to
# allow the filesystem to control the allocation of struct inode, e.g. to
# have its private inode and the VFS one in the same slab cache.
#
# It allows breaking worst offenders like NFS out of the big inode union
# and makes VM balancing better by wasting less RAM on inodes. It also
# speeds up filesystems that don't want to touch that union in struct
# inode, like JFS, XFS or FreeVxFS (once switched over). It is a straight
# backport of Al's code from 2.5 and has proven stable in Red Hat's
# recent beta releases (limbo, null). Al has ACKed my patch submission.
#
# Credits go to Daniel Phillips for the initial design.
#
# NOTE: you want my b_inode removal patch applied before this one.
# --------------------------------------------
#
diff -Nru a/fs/inode.c b/fs/inode.c
--- a/fs/inode.c Fri Jan 3 18:08:59 2003
+++ b/fs/inode.c Fri Jan 3 18:08:59 2003
@@ -75,13 +75,59 @@

static kmem_cache_t * inode_cachep;

-#define alloc_inode() \
- ((struct inode *) kmem_cache_alloc(inode_cachep, SLAB_KERNEL))
+static struct inode *alloc_inode(struct super_block *sb)
+{
+ static struct address_space_operations empty_aops;
+ static struct inode_operations empty_iops;
+ static struct file_operations empty_fops;
+ struct inode *inode;
+
+ if (sb->s_op->alloc_inode)
+ inode = sb->s_op->alloc_inode(sb);
+ else {
+ inode = (struct inode *) kmem_cache_alloc(inode_cachep, SLAB_KERNEL);
+ /* will die */
+ if (inode)
+ memset(&inode->u, 0, sizeof(inode->u));
+ }
+
+ if (inode) {
+ struct address_space * const mapping = &inode->i_data;
+
+ inode->i_sb = sb;
+ inode->i_dev = sb->s_dev;
+ inode->i_blkbits = sb->s_blocksize_bits;
+ inode->i_flags = 0;
+ atomic_set(&inode->i_count, 1);
+ inode->i_sock = 0;
+ inode->i_op = &empty_iops;
+ inode->i_fop = &empty_fops;
+ inode->i_nlink = 1;
+ atomic_set(&inode->i_writecount, 0);
+ inode->i_size = 0;
+ inode->i_blocks = 0;
+ inode->i_generation = 0;
+ memset(&inode->i_dquot, 0, sizeof(inode->i_dquot));
+ inode->i_pipe = NULL;
+ inode->i_bdev = NULL;
+ inode->i_cdev = NULL;
+
+ mapping->a_ops = &empty_aops;
+ mapping->host = inode;
+ mapping->gfp_mask = GFP_HIGHUSER;
+ inode->i_mapping = mapping;
+ }
+ return inode;
+}
+
static void destroy_inode(struct inode *inode)
{
if (inode_has_buffers(inode))
BUG();
- kmem_cache_free(inode_cachep, (inode));
+ if (inode->i_sb->s_op->destroy_inode)
+ inode->i_sb->s_op->destroy_inode(inode);
+ else
+ kmem_cache_free(inode_cachep, inode);
}


@@ -90,27 +136,30 @@
* once, because the fields are idempotent across use
* of the inode, so let the slab aware of that.
*/
+void inode_init_once(struct inode *inode)
+{
+ memset(inode, 0, sizeof(*inode));
+ init_waitqueue_head(&inode->i_wait);
+ INIT_LIST_HEAD(&inode->i_hash);
+ INIT_LIST_HEAD(&inode->i_data.clean_pages);
+ INIT_LIST_HEAD(&inode->i_data.dirty_pages);
+ INIT_LIST_HEAD(&inode->i_data.locked_pages);
+ INIT_LIST_HEAD(&inode->i_dentry);
+ INIT_LIST_HEAD(&inode->i_dirty_buffers);
+ INIT_LIST_HEAD(&inode->i_dirty_data_buffers);
+ INIT_LIST_HEAD(&inode->i_devices);
+ sema_init(&inode->i_sem, 1);
+ sema_init(&inode->i_zombie, 1);
+ spin_lock_init(&inode->i_data.i_shared_lock);
+}
+
static void init_once(void * foo, kmem_cache_t * cachep, unsigned long flags)
{
struct inode * inode = (struct inode *) foo;

if ((flags & (SLAB_CTOR_VERIFY|SLAB_CTOR_CONSTRUCTOR)) ==
SLAB_CTOR_CONSTRUCTOR)
- {
- memset(inode, 0, sizeof(*inode));
- init_waitqueue_head(&inode->i_wait);
- INIT_LIST_HEAD(&inode->i_hash);
- INIT_LIST_HEAD(&inode->i_data.clean_pages);
- INIT_LIST_HEAD(&inode->i_data.dirty_pages);
- INIT_LIST_HEAD(&inode->i_data.locked_pages);
- INIT_LIST_HEAD(&inode->i_dentry);
- INIT_LIST_HEAD(&inode->i_dirty_buffers);
- INIT_LIST_HEAD(&inode->i_dirty_data_buffers);
- INIT_LIST_HEAD(&inode->i_devices);
- sema_init(&inode->i_sem, 1);
- sema_init(&inode->i_zombie, 1);
- spin_lock_init(&inode->i_data.i_shared_lock);
- }
+ inode_init_once(inode);
}

/*
@@ -757,72 +806,28 @@
return inode;
}

-/*
- * This just initializes the inode fields
- * to known values before returning the inode..
- *
- * i_sb, i_ino, i_count, i_state and the lists have
- * been initialized elsewhere..
- */
-static void clean_inode(struct inode *inode)
-{
- static struct address_space_operations empty_aops;
- static struct inode_operations empty_iops;
- static struct file_operations empty_fops;
- memset(&inode->u, 0, sizeof(inode->u));
- inode->i_sock = 0;
- inode->i_op = &empty_iops;
- inode->i_fop = &empty_fops;
- inode->i_nlink = 1;
- atomic_set(&inode->i_writecount, 0);
- inode->i_size = 0;
- inode->i_blocks = 0;
- inode->i_generation = 0;
- memset(&inode->i_dquot, 0, sizeof(inode->i_dquot));
- inode->i_pipe = NULL;
- inode->i_bdev = NULL;
- inode->i_cdev = NULL;
- inode->i_data.a_ops = &empty_aops;
- inode->i_data.host = inode;
- inode->i_data.gfp_mask = GFP_HIGHUSER;
- inode->i_mapping = &inode->i_data;
-}
-
/**
- * get_empty_inode - obtain an inode
+ * new_inode - obtain an inode
+ * @sb: superblock
*
- * This is called by things like the networking layer
- * etc that want to get an inode without any inode
- * number, or filesystems that allocate new inodes with
- * no pre-existing information.
- *
- * On a successful return the inode pointer is returned. On a failure
- * a %NULL pointer is returned. The returned inode is not on any superblock
- * lists.
+ * Allocates a new inode for given superblock.
*/

-struct inode * get_empty_inode(void)
+struct inode * new_inode(struct super_block *sb)
{
static unsigned long last_ino;
struct inode * inode;

spin_lock_prefetch(&inode_lock);

- inode = alloc_inode();
- if (inode)
- {
+ inode = alloc_inode(sb);
+ if (inode) {
spin_lock(&inode_lock);
inodes_stat.nr_inodes++;
list_add(&inode->i_list, &inode_in_use);
- inode->i_sb = NULL;
- inode->i_dev = 0;
- inode->i_blkbits = 0;
inode->i_ino = ++last_ino;
- inode->i_flags = 0;
- atomic_set(&inode->i_count, 1);
inode->i_state = 0;
spin_unlock(&inode_lock);
- clean_inode(inode);
}
return inode;
}
@@ -837,7 +842,7 @@
{
struct inode * inode;

- inode = alloc_inode();
+ inode = alloc_inode(sb);
if (inode) {
struct inode * old;

@@ -848,16 +853,9 @@
inodes_stat.nr_inodes++;
list_add(&inode->i_list, &inode_in_use);
list_add(&inode->i_hash, head);
- inode->i_sb = sb;
- inode->i_dev = sb->s_dev;
- inode->i_blkbits = sb->s_blocksize_bits;
inode->i_ino = ino;
- inode->i_flags = 0;
- atomic_set(&inode->i_count, 1);
inode->i_state = I_LOCK;
spin_unlock(&inode_lock);
-
- clean_inode(inode);

/* reiserfs specific hack right here. We don't
** want this to last, and are looking for VFS changes
diff -Nru a/fs/super.c b/fs/super.c
--- a/fs/super.c Fri Jan 3 18:08:59 2003
+++ b/fs/super.c Fri Jan 3 18:08:59 2003
@@ -262,6 +262,7 @@
*/
static struct super_block *alloc_super(void)
{
+ static struct super_operations empty_sops = {};
struct super_block *s = kmalloc(sizeof(struct super_block), GFP_USER);
if (s) {
memset(s, 0, sizeof(struct super_block));
@@ -279,6 +280,7 @@
sema_init(&s->s_dquot.dqio_sem, 1);
sema_init(&s->s_dquot.dqoff_sem, 1);
s->s_maxbytes = MAX_NON_LFS;
+ s->s_op = &empty_sops;
}
return s;
}
diff -Nru a/include/linux/fs.h b/include/linux/fs.h
--- a/include/linux/fs.h Fri Jan 3 18:08:59 2003
+++ b/include/linux/fs.h Fri Jan 3 18:08:59 2003
@@ -879,6 +879,9 @@
* without the big kernel lock held in all filesystems.
*/
struct super_operations {
+ struct inode *(*alloc_inode)(struct super_block *sb);
+ void (*destroy_inode)(struct inode *);
+
void (*read_inode) (struct inode *);

/* reiserfs kludge. reiserfs needs 64 bits of information to
@@ -1367,6 +1370,7 @@
#define user_path_walk(name,nd) __user_walk(name, LOOKUP_FOLLOW|LOOKUP_POSITIVE, nd)
#define user_path_walk_link(name,nd) __user_walk(name, LOOKUP_POSITIVE, nd)

+extern void inode_init_once(struct inode *);
extern void iput(struct inode *);
extern void force_delete(struct inode *);
extern struct inode * igrab(struct inode *);
@@ -1380,18 +1384,7 @@
}

extern void clear_inode(struct inode *);
-extern struct inode * get_empty_inode(void);
-
-static inline struct inode * new_inode(struct super_block *sb)
-{
- struct inode *inode = get_empty_inode();
- if (inode) {
- inode->i_sb = sb;
- inode->i_dev = sb->s_dev;
- inode->i_blkbits = sb->s_blocksize_bits;
- }
- return inode;
-}
+extern struct inode *new_inode(struct super_block *sb);
extern void remove_suid(struct inode *inode);

extern void insert_inode_hash(struct inode *);
diff -Nru a/kernel/ksyms.c b/kernel/ksyms.c
--- a/kernel/ksyms.c Fri Jan 3 18:08:59 2003
+++ b/kernel/ksyms.c Fri Jan 3 18:08:59 2003
@@ -142,6 +142,7 @@
EXPORT_SYMBOL(iunique);
EXPORT_SYMBOL(iget4);
EXPORT_SYMBOL(iput);
+EXPORT_SYMBOL(inode_init_once);
EXPORT_SYMBOL(force_delete);
EXPORT_SYMBOL(follow_up);
EXPORT_SYMBOL(follow_down);
@@ -523,7 +524,7 @@
EXPORT_SYMBOL(init_special_inode);
EXPORT_SYMBOL(read_ahead);
EXPORT_SYMBOL(get_hash_table);
-EXPORT_SYMBOL(get_empty_inode);
+EXPORT_SYMBOL(new_inode);
EXPORT_SYMBOL(insert_inode_hash);
EXPORT_SYMBOL(remove_inode_hash);
EXPORT_SYMBOL(buffer_insert_list);
diff -Nru a/net/socket.c b/net/socket.c
--- a/net/socket.c Fri Jan 3 18:08:59 2003
+++ b/net/socket.c Fri Jan 3 18:08:59 2003
@@ -436,11 +436,11 @@
struct inode * inode;
struct socket * sock;

- inode = get_empty_inode();
+ inode = new_inode(sock_mnt->mnt_sb);
if (!inode)
return NULL;

- inode->i_sb = sock_mnt->mnt_sb;
+ inode->i_dev = NODEV;
sock = socki_lookup(inode);

inode->i_mode = S_IFSOCK|S_IRWXUGO;















# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
# ChangeSet 1.757.33.11 -> 1.757.33.12
# kernel/ksyms.c 1.64 -> 1.65
# fs/reiserfs/journal.c 1.24 -> 1.25
# include/linux/fs.h 1.69 -> 1.70
# fs/reiserfs/inode.c 1.39 -> 1.40
# include/linux/reiserfs_fs_sb.h 1.14 -> 1.15
# fs/buffer.c 1.76 -> 1.77
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 02/11/29 [email protected] 1.757.33.12
# [PATCH] cleanup b_inode usage and fix onstack inode abuse
#
# Currently the b_inode field of struct buffer_head is a pointer to an
# inode, but it is only ever used as a boolean value.
#
# This patch changes it to be a flag bit in the bh state. This allows
# us to shave 32 bits off struct buffer_head for other uses, like the
# lower-layer private data that LVM2 needs or an increase of b_size
# that IA64 boxes with 64k pages need.
#
# It also cleans up buffer.c by removing lots of duplicated code and
# enables the alloc_inode patch by removing the last on-stack inodes.
#
# The patch has been ACKed by akpm and sct and a very similar change
# went into early 2.5.
# --------------------------------------------
#
diff -Nru a/fs/buffer.c b/fs/buffer.c
--- a/fs/buffer.c Fri Jan 3 18:09:07 2003
+++ b/fs/buffer.c Fri Jan 3 18:09:07 2003
@@ -583,37 +583,29 @@
return bh;
}

-void buffer_insert_inode_queue(struct buffer_head *bh, struct inode *inode)
+void buffer_insert_list(struct buffer_head *bh, struct list_head *list)
{
spin_lock(&lru_list_lock);
- if (bh->b_inode)
+ if (buffer_attached(bh))
list_del(&bh->b_inode_buffers);
- bh->b_inode = inode;
- list_add(&bh->b_inode_buffers, &inode->i_dirty_buffers);
+ set_buffer_attached(bh);
+ list_add(&bh->b_inode_buffers, list);
spin_unlock(&lru_list_lock);
}

-void buffer_insert_inode_data_queue(struct buffer_head *bh, struct inode *inode)
-{
- spin_lock(&lru_list_lock);
- if (bh->b_inode)
- list_del(&bh->b_inode_buffers);
- bh->b_inode = inode;
- list_add(&bh->b_inode_buffers, &inode->i_dirty_data_buffers);
- spin_unlock(&lru_list_lock);
-}
-
-/* The caller must have the lru_list lock before calling the
- remove_inode_queue functions. */
+/*
+ * The caller must have the lru_list lock before calling the
+ * remove_inode_queue functions.
+ */
static void __remove_inode_queue(struct buffer_head *bh)
{
- bh->b_inode = NULL;
list_del(&bh->b_inode_buffers);
+ clear_buffer_attached(bh);
}

static inline void remove_inode_queue(struct buffer_head *bh)
{
- if (bh->b_inode)
+ if (buffer_attached(bh))
__remove_inode_queue(bh);
}

@@ -831,10 +823,10 @@
int fsync_buffers_list(struct list_head *list)
{
struct buffer_head *bh;
- struct inode tmp;
+ struct list_head tmp;
int err = 0, err2;

- INIT_LIST_HEAD(&tmp.i_dirty_buffers);
+ INIT_LIST_HEAD(&tmp);

spin_lock(&lru_list_lock);

@@ -842,10 +834,10 @@
bh = BH_ENTRY(list->next);
list_del(&bh->b_inode_buffers);
if (!buffer_dirty(bh) && !buffer_locked(bh))
- bh->b_inode = NULL;
+ clear_buffer_attached(bh);
else {
- bh->b_inode = &tmp;
- list_add(&bh->b_inode_buffers, &tmp.i_dirty_buffers);
+ set_buffer_attached(bh);
+ list_add(&bh->b_inode_buffers, &tmp);
if (buffer_dirty(bh)) {
get_bh(bh);
spin_unlock(&lru_list_lock);
@@ -865,8 +857,8 @@
}
}

- while (!list_empty(&tmp.i_dirty_buffers)) {
- bh = BH_ENTRY(tmp.i_dirty_buffers.prev);
+ while (!list_empty(&tmp)) {
+ bh = BH_ENTRY(tmp.prev);
remove_inode_queue(bh);
get_bh(bh);
spin_unlock(&lru_list_lock);
@@ -1138,7 +1130,7 @@
*/
static void __put_unused_buffer_head(struct buffer_head * bh)
{
- if (bh->b_inode)
+ if (unlikely(buffer_attached(bh)))
BUG();
if (nr_unused_buffer_heads >= MAX_UNUSED_BUFFERS) {
kmem_cache_free(bh_cachep, bh);
diff -Nru a/fs/reiserfs/inode.c b/fs/reiserfs/inode.c
--- a/fs/reiserfs/inode.c Fri Jan 3 18:09:07 2003
+++ b/fs/reiserfs/inode.c Fri Jan 3 18:09:07 2003
@@ -102,9 +102,9 @@
}

static void add_to_flushlist(struct inode *inode, struct buffer_head *bh) {
- struct inode *jinode = &(SB_JOURNAL(inode->i_sb)->j_dummy_inode) ;
+ struct reiserfs_journal *j = SB_JOURNAL(inode->i_sb) ;

- buffer_insert_inode_queue(bh, jinode) ;
+ buffer_insert_list(bh, &j->j_dirty_buffers) ;
}

//
diff -Nru a/fs/reiserfs/journal.c b/fs/reiserfs/journal.c
--- a/fs/reiserfs/journal.c Fri Jan 3 18:09:07 2003
+++ b/fs/reiserfs/journal.c Fri Jan 3 18:09:07 2003
@@ -1937,7 +1937,7 @@
memset(journal_writers, 0, sizeof(char *) * 512) ; /* debug code */

INIT_LIST_HEAD(&SB_JOURNAL(p_s_sb)->j_bitmap_nodes) ;
- INIT_LIST_HEAD(&(SB_JOURNAL(p_s_sb)->j_dummy_inode.i_dirty_buffers)) ;
+ INIT_LIST_HEAD(&SB_JOURNAL(p_s_sb)->j_dirty_buffers) ;
reiserfs_allocate_list_bitmaps(p_s_sb, SB_JOURNAL(p_s_sb)->j_list_bitmap,
SB_BMAP_NR(p_s_sb)) ;
allocate_bitmap_nodes(p_s_sb) ;
@@ -2933,7 +2933,7 @@
SB_JOURNAL_LIST_INDEX(p_s_sb) = jindex ;

/* write any buffers that must hit disk before this commit is done */
- fsync_inode_buffers(&(SB_JOURNAL(p_s_sb)->j_dummy_inode)) ;
+ fsync_buffers_list(&(SB_JOURNAL(p_s_sb)->j_dirty_buffers)) ;

/* honor the flush and async wishes from the caller */
if (flush) {
diff -Nru a/include/linux/fs.h b/include/linux/fs.h
--- a/include/linux/fs.h Fri Jan 3 18:09:07 2003
+++ b/include/linux/fs.h Fri Jan 3 18:09:07 2003
@@ -219,6 +219,7 @@
BH_Async, /* 1 if the buffer is under end_buffer_io_async I/O */
BH_Wait_IO, /* 1 if we should write out this buffer */
BH_Launder, /* 1 if we can throttle on this buffer */
+ BH_Attached, /* 1 if b_inode_buffers is linked into a list */
BH_JBD, /* 1 if it has an attached journal_head */

BH_PrivateStart,/* not a state bit, but the first bit available
@@ -266,7 +267,6 @@
unsigned long b_rsector; /* Real buffer location on disk */
wait_queue_head_t b_wait;

- struct inode * b_inode;
struct list_head b_inode_buffers; /* doubly linked list of inode dirty buffers */
};

@@ -1170,8 +1170,18 @@
extern void FASTCALL(__mark_dirty(struct buffer_head *bh));
extern void FASTCALL(__mark_buffer_dirty(struct buffer_head *bh));
extern void FASTCALL(mark_buffer_dirty(struct buffer_head *bh));
-extern void FASTCALL(buffer_insert_inode_queue(struct buffer_head *, struct inode *));
-extern void FASTCALL(buffer_insert_inode_data_queue(struct buffer_head *, struct inode *));
+
+extern void FASTCALL(buffer_insert_list(struct buffer_head *, struct list_head *));
+
+static inline void buffer_insert_inode_queue(struct buffer_head *bh, struct inode *inode)
+{
+ buffer_insert_list(bh, &inode->i_dirty_buffers);
+}
+
+static inline void buffer_insert_inode_data_queue(struct buffer_head *bh, struct inode *inode)
+{
+ buffer_insert_list(bh, &inode->i_dirty_data_buffers);
+}

static inline int atomic_set_buffer_dirty(struct buffer_head *bh)
{
@@ -1184,6 +1194,21 @@
set_bit(BH_Async, &bh->b_state);
else
clear_bit(BH_Async, &bh->b_state);
+}
+
+static inline void set_buffer_attached(struct buffer_head *bh)
+{
+ __set_bit(BH_Attached, &bh->b_state);
+}
+
+static inline void clear_buffer_attached(struct buffer_head *bh)
+{
+ clear_bit(BH_Attached, &bh->b_state);
+}
+
+static inline int buffer_attached(struct buffer_head *bh)
+{
+ return test_bit(BH_Attached, &bh->b_state);
}

/*
diff -Nru a/include/linux/reiserfs_fs_sb.h b/include/linux/reiserfs_fs_sb.h
--- a/include/linux/reiserfs_fs_sb.h Fri Jan 3 18:09:07 2003
+++ b/include/linux/reiserfs_fs_sb.h Fri Jan 3 18:09:07 2003
@@ -312,7 +312,7 @@
int j_free_bitmap_nodes ;
int j_used_bitmap_nodes ;
struct list_head j_bitmap_nodes ;
- struct inode j_dummy_inode ;
+ struct list_head j_dirty_buffers ;
struct reiserfs_list_bitmap j_list_bitmap[JOURNAL_NUM_BITMAPS] ; /* array of bitmaps to record the deleted blocks */
struct reiserfs_journal_list j_journal_list[JOURNAL_LIST_COUNT] ; /* array of all the journal lists */
struct reiserfs_journal_cnode *j_hash_table[JOURNAL_HASH_SIZE] ; /* hash table for real buffer heads in current trans */
diff -Nru a/kernel/ksyms.c b/kernel/ksyms.c
--- a/kernel/ksyms.c Fri Jan 3 18:09:07 2003
+++ b/kernel/ksyms.c Fri Jan 3 18:09:07 2003
@@ -526,8 +526,7 @@
EXPORT_SYMBOL(get_empty_inode);
EXPORT_SYMBOL(insert_inode_hash);
EXPORT_SYMBOL(remove_inode_hash);
-EXPORT_SYMBOL(buffer_insert_inode_queue);
-EXPORT_SYMBOL(buffer_insert_inode_data_queue);
+EXPORT_SYMBOL(buffer_insert_list);
EXPORT_SYMBOL(make_bad_inode);
EXPORT_SYMBOL(is_bad_inode);
EXPORT_SYMBOL(event);

2003-01-04 11:03:21

by Andrew Morton

Subject: Re: 2.4.21-pre2 stalls out when running unixbench

Joe Korty wrote:
>
> > In the previous episode of "travails of a geriatric kernel jock",
> > Andrew Morton wrote:
> > >
> > > Unpatched 2.4.20 does the same thing.
> > >
> >
> > No it doesn't.
> >
> > Good news is that 2.4.20 plus recent ext3 fixes doesn't lock up
> > either.
>
> Hi Andrew,
> I have continued to play with the problem, and have gathered more evidence.
>
> Everything works (kinda) when I back out the attached 2.4.21-pre1
> patches. I am not sure how to uniquely identify patches in the BK
> tree, so I have enclosed the full text of each at the end of this
> letter.
>
> A remaining problem is long stallouts not present in 2.4.20. The
> stallouts are of uniform duration, 20 seconds, triggered aperiodically.

This is because of differences in how sync() is handled between 2.4.20's
ext3 and 2.4.21-pre2's.

2.4.20:

sync() will _start_ a commit, but will not wait on it. This was done
to fix a throughput problem with kupdate periodic writeback, which
used to use the same code as sync.

Not waiting on the writeout is OK by the letter of the POSIX spec,
but is not traditional Linux behaviour, and is undesirable IMO.

2.4.21-pre2:

sync() will start the commit, and will wait on it. So you know that
when it returns, everything which was dirty is now tight on disk.

So yes, running a looping sync while someone else is writing stuff
will take much longer in 2.4.21-pre2, because that kernel actually
waits on the writeout.


With respect to the lockup problem: it is due to a non-atomic __set_bit()
in the new buffer_attached() implementation.

Sure, we don't need atomic semantics for the BH_Attached bit because
it is always read and modified under a global spinlock. But *other*
users of buffer_head.b_state do not run under that lock so the nonatomic
RMW will stomp on their changes. 2.4.20 does not have this bug.

Here is a patch:


--- 24/include/linux/fs.h~buffer_attached-fix Sat Jan 4 03:09:13 2003
+++ 24-akpm/include/linux/fs.h Sat Jan 4 03:09:16 2003
@@ -1202,7 +1202,7 @@ static inline void mark_buffer_async(str

static inline void set_buffer_attached(struct buffer_head *bh)
{
- __set_bit(BH_Attached, &bh->b_state);
+ set_bit(BH_Attached, &bh->b_state);
}

static inline void clear_buffer_attached(struct buffer_head *bh)

_

2003-01-05 02:50:28

by jak

Subject: Re: 2.4.21-pre2 stalls out when running unixbench

> With respect to the lockup problem: it is due to a non-atomic __set_bit()
> in the new buffer_attached() implementation.
>
> Sure, we don't need atomic semantics for the BH_Attached bit because
> it is always read and modified under a global spinlock. But *other*
> users of buffer_head.b_state do not run under that lock so the nonatomic
> RMW will stomp on their changes. 2.4.20 does not have this bug.
>
> Here is a patch:


Hi Andrew,
The patch works (been running the unixbench subset for an hour now).
Your time and effort and very clear explanations are much appreciated.
Thanks,
Joe

2003-01-06 11:58:34

by Stephen C. Tweedie

Subject: Re: 2.4.21-pre2 stalls out when running unixbench

Hi,

On Sat, 2003-01-04 at 11:11, Andrew Morton wrote:

> This is because of differences in how sync() is handled between 2.4.20's
> ext3 and 2.4.21-pre2's.
>
> 2.4.21-pre2:
>
> sync() will start the commit, and will wait on it. So you know that
> when it returns, everything which was dirty is now safely on disk.
>
> So yes, running a looping sync while someone else is writing stuff
> will take much longer in 2.4.21-pre2, because that kernel actually
> waits on the writeout.

Actually, I'm wondering if we should back that particular bit out. For
a user with a hundred mounted filesystems, syncing each one in order,
sequentially, is going to suck (and we don't currently have a simple way
in 2.4 to detect which syncs are on separate spindles and so can be
parallelised.)

--Stephen

2003-01-06 12:08:14

by Andrew Morton

Subject: Re: 2.4.21-pre2 stalls out when running unixbench

"Stephen C. Tweedie" wrote:
>
> Hi,
>
> On Sat, 2003-01-04 at 11:11, Andrew Morton wrote:
>
> > This is because of differences in how sync() is handled between 2.4.20's
> > ext3 and 2.4.21-pre2's.
> >
> > 2.4.21-pre2:
> >
> > sync() will start the commit, and will wait on it. So you know that
> > when it returns, everything which was dirty is now safely on disk.
> >
> > So yes, running a looping sync while someone else is writing stuff
> > will take much longer in 2.4.21-pre2, because that kernel actually
> > waits on the writeout.
>
> Actually, I'm wondering if we should back that particular bit out. For
> a user with a hundred mounted filesystems, syncing each one in order,
> sequentially, is going to suck (and we don't currently have a simple way
> in 2.4 to detect which syncs are on separate spindles and so can be
> parallelised.)
>

Well personally I prefer slow-and-safe. But we could make 2.4
do what 2.5 is doing - one pass through the superblocks to start
the syncs and a second pass to wait on them all.
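The two-pass shape Andrew describes can be sketched with toy stand-ins (none of these are real kernel symbols; the real code walks the superblock list under its own locking). Pass 1 starts writeback on every filesystem so the I/O to separate devices can overlap; pass 2 then waits for each in turn:

```c
/* Toy model of two-pass sync: start all writeback first, wait second. */
struct toy_sb {
	int dirty;             /* has data needing writeback */
	int writeback_started; /* pass 1 has kicked off its I/O */
};

void start_writeback(struct toy_sb *sb)
{
	if (sb->dirty)
		sb->writeback_started = 1;
}

void wait_on_writeback(struct toy_sb *sb)
{
	if (sb->writeback_started) {
		sb->dirty = 0;             /* the queued I/O has completed */
		sb->writeback_started = 0;
	}
}

/* Unlike a sequential sync, every device is busy at once: filesystem 2's
 * I/O is already in flight while we are still waiting on filesystem 1. */
void two_pass_sync(struct toy_sb *sbs, int n)
{
	for (int i = 0; i < n; i++)
		start_writeback(&sbs[i]);  /* pass 1: kick off all the I/O */
	for (int i = 0; i < n; i++)
		wait_on_writeback(&sbs[i]); /* pass 2: wait for completion */
}
```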

This is fragile stuff though....

2003-01-06 12:07:41

by John Bradford

Subject: Re: 2.4.21-pre2 stalls out when running unixbench

> > This is because of differences in how sync() is handled between 2.4.20's
> > ext3 and 2.4.21-pre2's.
> >
> > 2.4.21-pre2:
> >
> > sync() will start the commit, and will wait on it. So you know that
> > when it returns, everything which was dirty is now safely on disk.
> >
> > So yes, running a looping sync while someone else is writing stuff
> > will take much longer in 2.4.21-pre2, because that kernel actually
> > waits on the writeout.
>
> Actually, I'm wondering if we should back that particular bit out. For
> a user with a hundred mounted filesystems, syncing each one in order,
> sequentially, is going to suck (and we don't currently have a simple way
> in 2.4 to detect which syncs are on separate spindles and so can be
> parallelised.)

What!? I'm surprised that no userspace applications were broken by
sync returning before the data is safely on oxide, even though that
doesn't violate the POSIX spec.

What about userspace media-changers, (if such a thing exists)?
Presumably they would assume that they can eject the media after a sync.

I think sync should definitely wait until it's completed before it
returns.

John.

2003-01-06 13:08:33

by Stephen C. Tweedie

Subject: Re: 2.4.21-pre2 stalls out when running unixbench

Hi,

On Mon, 2003-01-06 at 12:15, John Bradford wrote:

> What!? I'm surprised that no userspace applications were broken by
> sync returning before the data is safely on oxide, even though that
> doesn't violate the POSIX spec.

sync(2) syncs _everything_ --- every M/O or floppy disk, every
filesystem no matter if it's FAT or native Unix. It's way too
heavyweight for most applications which have synchronisation
requirements, when fsync() or O_SYNC are much more precise.
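For an application that only needs its own file on disk, the precise tool looks like this (the path is illustrative; error handling is kept minimal for the sketch):

```c
#include <fcntl.h>
#include <unistd.h>

/* Flush one file with fsync() instead of the global sync(): fsync()
 * waits until this file's data and metadata have reached the disk,
 * without touching any other mounted filesystem. */
int write_and_flush(const char *path, const char *buf, size_t len)
{
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;
	if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
		close(fd);
		return -1;
	}
	return close(fd);
}
```

This is why most applications never noticed sync()'s exact semantics: the ones that care about durability of specific data already use fsync() or O_SYNC on the file in question.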

I think I can recall one or two mutterings about people worried about
theoretical risks of doing things like "lilo; sync; reboot", but that's
a different class of risk altogether. (It's complicated by the problem
that after a sync, ext3 guarantees that the data and metadata is on
disk, but it may still be in the journal, so lilo won't necessarily see
the right stuff with sync() alone --- ext3 plays extra tricks when it
sees bmap to try to solve this.)

--Stephen

2003-01-06 13:11:20

by Stephen C. Tweedie

Subject: Re: 2.4.21-pre2 stalls out when running unixbench

Hi,

On Mon, 2003-01-06 at 12:16, Andrew Morton wrote:

> Well personally I prefer slow-and-safe. But we could make 2.4
> do what 2.5 is doing - one pass through the superblocks to start
> the syncs and a second pass to wait on them all.

The 2.5 approach has the problem that it can start queuing writeback for
multiple fs'es on the same disk at the same time --- I wouldn't be
surprised if it increases thrashing in that case. But I guess I'm not
too concerned about sync() performance itself, as long as the in-kernel
background writeback is being done sensibly.

> This is fragile stuff though....

Yep.

--Stephen

2003-01-11 08:01:51

by Christoph Hellwig

Subject: Re: 2.4.21-pre2 stalls out when running unixbench

On Sat, Jan 04, 2003 at 03:11:45AM -0800, Andrew Morton wrote:
> Sure, we don't need atomic semantics for the BH_Attached bit because
> it is always read and modified under a global spinlock. But *other*
> users of buffer_head.b_state do not run under that lock so the nonatomic
> RMW will stomp on their changes. 2.4.20 does not have this bug.

Thanks, it seems I still had something to learn about *_bit() semantics.

And sorry for introducing that bug.