Hi,
Running any kernel from the 2.6.6-rc* series (and a few previous
-mm*'s), the pdflush process starts using near 100% CPU indefinitely after
a few minutes of initial NFS traffic, as far as I can tell.
To trigger this, I can either compile a kernel with the source residing on
an NFS share, or compile something bigger, like KDE.
I get the same results running on a PIII with an i815 chipset with
ReiserFS 3, or a newer nForce2 board with an Athlon XP on ext3, so I don't
think it has to do with the IDE chipsets or filesystems. pdflush has
something to do with writing back FS data, and NFS is just the common
factor between systems that experience this problem. pdflush just seems to
hang when the system is heavily loaded and eat up all CPU resources even
when the system is otherwise idle. Renice fails to reschedule pdflush.
This didn't seem to be a problem with 2.6.5 or 2.4. Is there something I
can do to control pdflush or to provide more information?
Thanks
- Brent
Brent Cook <[email protected]> wrote:
>
> Running any kernel from the 2.6.6-rc* series (and a few previous
> -mm*'s),
It's a shame this wasn't reported earlier.
> the pdflush process starts using near 100% CPU indefinitely after
> a few minutes of initial NFS traffic, as far as I can tell.
Please confirm that the problem is observed on the NFS client and not the
NFS server? I'll assume the client.
What other filesystems are in use on the client?
Please describe the NFS mount options and the number of CPUs and the amount
of memory in the machine. And please send me your .config, off-list.
Thanks.
Brent Cook wrote:
> This didn't seem to be a problem with 2.6.5 or 2.4. Is there something I
> can do to control pdflush or to provide more information?
>
> Thanks
> - Brent
>
Yes, there is something you can do to control pdflush; I don't know about
providing more information. From Andrew Morton (thanks, Andrew!):
The tunables in /proc/sys/vm are documented in
Documentation/filesystems/proc.txt.
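For example, the writeback-related knobs look something like this (check
proc.txt for the authoritative list; names are as of 2.6, and the value in
the last line is only an example):

cat /proc/sys/vm/dirty_writeback_centisecs   # how often pdflush wakes up
cat /proc/sys/vm/dirty_expire_centisecs      # age at which dirty data must be written out
cat /proc/sys/vm/dirty_background_ratio      # % of memory dirty before pdflush kicks in
cat /proc/sys/vm/dirty_ratio                 # % dirty before writers do their own writeback
echo 1000 > /proc/sys/vm/dirty_writeback_centisecs   # e.g. wake pdflush every 10 seconds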
If you guys could please either cc or keep this on the
mailing list I would greatly appreciate it. I am
currently planning to deploy a dual Xeon FS with
2.6.5+akpm-mm6 and would like to know what the issue
is that Brent is seeing.
Thanks!
--- Andrew Morton <[email protected]> wrote:
> Brent Cook <[email protected]> wrote:
> >
> > Running any kernel from the 2.6.6-rc* series (and a few previous
> > -mm*'s),
>
> It's a shame this wasn't reported earlier.
>
> > the pdflush process starts using near 100% CPU indefinitely after
> > a few minutes of initial NFS traffic, as far as I can tell.
>
> Please confirm that the problem is observed on the NFS client and not the
> NFS server? I'll assume the client.
>
> What other filesystems are in use on the client?
>
> Please describe the NFS mount options and the number of CPUs and the amount
> of memory in the machine. And please send me your .config, off-list.
>
> Thanks.
Andrew Morton <[email protected]> wrote:
> It's a shame this wasn't reported earlier.
I did report this behaviour in <[email protected]> a few days
ago.
> Please confirm that the problem is observed on the NFS client and not the
> NFS server? I'll assume the client.
Sure! The problem is observed on the NFS client, a diskless machine in my
case.
> What other filesystems are in use on the client?
None in my case!
> Please describe the NFS mount options and the number of CPUs and the amount
> of memory in the machine.
NFS mount options are default and my machine does not use an SMP Kernel (AMD
Athlon 2000+, Single CPU).
Sven
On Tue, 27 Apr 2004, Andrew Morton wrote:
> Brent Cook <[email protected]> wrote:
> >
> > Running any kernel from the 2.6.6-rc* series (and a few previous
> > -mm*'s),
>
> It's a shame this wasn't reported earlier.
Since it was a pretty big deal on my system, I just assumed it was for
other people's too, and that someone else would have reported it by
now. I only got concerned when it persisted between rc's.
> > the pdflush process starts using near 100% CPU indefinitely after
> > a few minutes of initial NFS traffic, as far as I can tell.
>
> Please confirm that the problem is observed on the NFS client and not the
> NFS server? I'll assume the client.
Yes, both affected machines had the issue when connecting as a client to a
2.4.25-based NFS server.
> What other filesystems are in use on the client?
One uses Reiser on /, the other uses ext3 on /. Here is the mount table
for one machine:
/dev/hda3 on / type ext3 (rw)
none on /dev/pts type devpts (rw,gid=5,mode=620)
none on /proc type proc (rw)
none on /sys type sysfs (rw)
usbfs on /proc/bus/usb type usbfs (rw)
ozma:/home on /home type nfs (rw,addr=192.168.1.1)
Running 2.6.6-rc2-mm1, here is a snapshot from top while compiling KDE
with the source on the NFS mount, -j2. This is the initial state:
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
12091 busterb 25 0 63524 59M 5140 R 38.4 23.8 0:19 0 cc1plus
12199 busterb 25 0 55660 52M 5140 R 38.0 20.8 0:07 0 cc1plus
7 root 16 0 0 0 0 SW 4.9 0.0 0:03 0 pdflush
About 10 minutes into the process, pdflush starts taking over:
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
7 root 25 0 0 0 0 RW 34.4 0.0 3:05 0 pdflush
17856 busterb 25 0 69400 65M 5140 R 34.4 26.1 0:31 0 cc1plus
19466 busterb 25 0 43732 39M 5140 R 26.3 15.5 0:03 0 cc1plus
After stopping the compile, pdflush remains until a reboot:
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
7 root 25 0 0 0 0 RW 98.0 0.0 3:21 0 pdflush
The network light will flash continually on each machine once pdflush
gets into this state, which makes me think NFS is involved. The machines
have 512 and 256 MB of RAM respectively, each with a single CPU.
- Brent
Brent Cook <[email protected]> wrote:
>
> On Tue, 27 Apr 2004, Andrew Morton wrote:
>
> > Brent Cook <[email protected]> wrote:
> > >
> > > Running any kernel from the 2.6.6-rc* series (and a few previous
> > > -mm*'s),
> >
> > It's a shame this wasn't reported earlier.
>
> Since it was a pretty big deal on my system, I just assumed it was for
> other people's too, and that someone else would have reported it by
> now. I only got concerned when it persisted between rc's.
I think three people have reported it now.
> > > the pdflush process starts using near 100% CPU indefinitely after
> > > a few minutes of initial NFS traffic, as far as I can tell.
> >
> > Please confirm that the problem is observed on the NFS client and not the
> > NFS server? I'll assume the client.
>
> Yes, both affected machines had the issue when connecting as a client to a
> 2.4.25-based NFS server.
>
> > What other filesystems are in use on the client?
>
> One uses Reiser on /, the other uses ext3 on /. Here is the mount table
> for one machine:
Both machines are exhibiting the problem?
> About 10 minutes into the process, pdflush starts taking over:
>
> PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
> 7 root 25 0 0 0 0 RW 34.4 0.0 3:05 0 pdflush
> 17856 busterb 25 0 69400 65M 5140 R 34.4 26.1 0:31 0 cc1plus
> 19466 busterb 25 0 43732 39M 5140 R 26.3 15.5 0:03 0 cc1plus
> ...
> The network light will flash continually on each machine once pdflush
> gets into this state, which makes me think NFS is involved. The machines
> have 512 and 256 MB of RAM respectively, each with a single CPU.
ok.. I spent a couple of hours yesterday trying to get this to happen. No
joy. Can't make it happen with your .config either. I'll set up a 2.4.25
server later on.
What version of gcc are you using?
Could you please capture the contents of /proc/meminfo and /proc/vmstat
when it's happening?
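For example, something like this (illustrative only; any way of grabbing
periodic snapshots, and any filename, will do):

while true; do date; cat /proc/meminfo /proc/vmstat; sleep 10; done > vm.log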
Thanks.
On Wed, 28 Apr 2004, Andrew Morton wrote:
> Brent Cook <[email protected]> wrote:
> >
> > On Tue, 27 Apr 2004, Andrew Morton wrote:
> >
> > > Brent Cook <[email protected]> wrote:
>
> > > > the pdflush process starts using near 100% CPU indefinitely after
> > > > a few minutes of initial NFS traffic, as far as I can tell.
> > >
> > > Please confirm that the problem is observed on the NFS client and not the
> > > NFS server? I'll assume the client.
> >
> > Yes, both affected machines had the issue when connecting as a client to a
> > 2.4.25-based NFS server.
> >
> > > What other filesystems are in use on the client?
> >
> > One uses Reiser on /, the other uses ext3 on /. Here is the mount table
> > for one machine:
>
> Both machines are exhibiting the problem?
Yes. They both exhibit the problem about 5-10 minutes after I start
compiling from the NFS share. I'm also going to try a big compile from a
local directory to exercise the local FS and eliminate it as a possibility.
> > About 10 minutes into the process, pdflush starts taking over:
> >
> > PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
> > 7 root 25 0 0 0 0 RW 34.4 0.0 3:05 0 pdflush
> > 17856 busterb 25 0 69400 65M 5140 R 34.4 26.1 0:31 0 cc1plus
> > 19466 busterb 25 0 43732 39M 5140 R 26.3 15.5 0:03 0 cc1plus
> > ...
> > The network light will flash continually on each machine once pdflush
> > gets into this state, which makes me think NFS is involved. The machines
> > have 512 and 256 MB of RAM respectively, each with a single CPU.
>
> ok.. I spent a couple of hours yesterday trying to get this to happen. No
> joy. Can't make it happen with your .config either. I'll set up a 2.4.25
> server later on.
>
> What version of gcc are you using?
3.3.3 previously. I just did a build with 3.4 with the same results; I'm
tracking Slackware-current, so my build environment reflects this.
> Could you please capture the contents of /proc/meminfo and /proc/vmstat
> when it's happening?
Will do. I can also provide a test account to you offline if it helps so
you can see it for yourself.
- Brent
> Could you please capture the contents of /proc/meminfo and /proc/vmstat
> when it's happening?
>
> Thanks.
>
Here is the top of top for one machine:
15:36:55 up 7:09, 1 user, load average: 1.00, 1.00, 1.00
48 processes: 46 sleeping, 2 running, 0 zombie, 0 stopped
CPU states: 0.1% user 99.8% system 0.0% nice 0.0% iowait 0.0% idle
Mem: 256992k av, 117644k used, 139348k free, 0k shrd, 36464k buff
50968k active, 51592k inactive
Swap: 514040k av, 0k used, 514040k free 61644k cached
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
7 root 25 0 0 0 0 RW 99.4 0.0 415:26 0 pdflush
Here are memory stats for the same machine. The other machine's stats
are similar; there doesn't appear to be anything out of the ordinary and
it's not even touching swap, if these numbers are to be believed.
busterb@snowball2:~$ cat /proc/meminfo
MemTotal: 256992 kB
MemFree: 139700 kB
Buffers: 36464 kB
Cached: 61516 kB
SwapCached: 0 kB
Active: 50536 kB
Inactive: 51672 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 256992 kB
LowFree: 139700 kB
SwapTotal: 514040 kB
SwapFree: 514040 kB
Dirty: 1876 kB
Writeback: 0 kB
Mapped: 8552 kB
Slab: 10924 kB
Committed_AS: 14612 kB
PageTables: 356 kB
VmallocTotal: 778164 kB
VmallocUsed: 2936 kB
VmallocChunk: 774708 kB
busterb@snowball2:~$ cat /proc/vmstat
nr_dirty 469
nr_writeback 0
nr_unstable 0
nr_page_table_pages 89
nr_mapped 2138
nr_slab 2730
pgpgin 79849
pgpgout 121656
pswpin 0
pswpout 0
pgalloc_high 0
pgalloc_normal 1812796
pgalloc_dma 18991
pgfree 1866775
pgactivate 55529
pgdeactivate 14634
pgfault 3622942
pgmajfault 2322
pgrefill_high 0
pgrefill_normal 16726
pgrefill_dma 49791
pgsteal_high 0
pgsteal_normal 11781
pgsteal_dma 203
pgscan_kswapd_high 0
pgscan_kswapd_normal 10065
pgscan_kswapd_dma 231
pgscan_direct_high 0
pgscan_direct_normal 2310
pgscan_direct_dma 0
pginodesteal 0
slabs_scanned 4349
kswapd_steal 9758
kswapd_inodesteal 0
pageoutrun 32
allocstall 35
pgrotated 0
On Wed, 2004-04-28 at 16:39, Brent Cook wrote:
> > Could you please capture the contents of /proc/meminfo and /proc/vmstat
> > when it's happening?
> >
> > Thanks.
> >
>
> Here is the top of top for one machine:
>
> 15:36:55 up 7:09, 1 user, load average: 1.00, 1.00, 1.00
> 48 processes: 46 sleeping, 2 running, 0 zombie, 0 stopped
> CPU states: 0.1% user 99.8% system 0.0% nice 0.0% iowait 0.0% idle
> Mem: 256992k av, 117644k used, 139348k free, 0k shrd, 36464k buff
> 50968k active, 51592k inactive
> Swap: 514040k av, 0k used, 514040k free 61644k cached
>
> PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
> 7 root 25 0 0 0 0 RW 99.4 0.0 415:26 0 pdflush
Could you please also supply the mount options you are using as well as
the contents of /proc/mounts corresponding to your NFS partition.
Cheers,
Trond
Brent Cook <[email protected]> wrote:
>
> PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
> 7 root 25 0 0 0 0 RW 99.4 0.0 415:26 0 pdflush
This is getting very irritating. Cannot reproduce it with a 2.4 server, gcc
3.4.0, 2.6.6-rc3 client. grrr.
Could you please apply the below two patches, then wait for pdflush to go
nuts, then do:
echo 1 > /proc/sys/debug/0
echo 0 > /proc/sys/debug/0
dmesg -s 1000000 > foo
then mail me foo? It probably won't tell me much, but one has to start
somewhere.
Thanks.
25-akpm/include/linux/kernel.h | 2 ++
25-akpm/kernel/sysctl.c | 19 +++++++++++++++++++
2 files changed, 21 insertions(+)
diff -puN include/linux/kernel.h~proc-sys-debug include/linux/kernel.h
--- 25/include/linux/kernel.h~proc-sys-debug Tue Apr 27 17:11:39 2004
+++ 25-akpm/include/linux/kernel.h Tue Apr 27 17:11:39 2004
@@ -220,6 +220,8 @@ extern void dump_stack(void);
1; \
})
+extern int proc_sys_debug[8];
+
#endif /* __KERNEL__ */
#define SI_LOAD_SHIFT 16
diff -puN kernel/sysctl.c~proc-sys-debug kernel/sysctl.c
--- 25/kernel/sysctl.c~proc-sys-debug Tue Apr 27 17:11:39 2004
+++ 25-akpm/kernel/sysctl.c Tue Apr 27 17:11:39 2004
@@ -888,7 +888,26 @@ static ctl_table fs_table[] = {
{ .ctl_name = 0 }
};
+int proc_sys_debug[8];
+EXPORT_SYMBOL(proc_sys_debug);
+
static ctl_table debug_table[] = {
+ {1, "0", &proc_sys_debug[0], sizeof(int), 0644, NULL,
+ &proc_dointvec_minmax, &sysctl_intvec, NULL, NULL, NULL},
+ {2, "1", &proc_sys_debug[1], sizeof(int), 0644, NULL,
+ &proc_dointvec_minmax, &sysctl_intvec, NULL, NULL, NULL},
+ {3, "2", &proc_sys_debug[2], sizeof(int), 0644, NULL,
+ &proc_dointvec_minmax, &sysctl_intvec, NULL, NULL, NULL},
+ {4, "3", &proc_sys_debug[3], sizeof(int), 0644, NULL,
+ &proc_dointvec_minmax, &sysctl_intvec, NULL, NULL, NULL},
+ {5, "4", &proc_sys_debug[4], sizeof(int), 0644, NULL,
+ &proc_dointvec_minmax, &sysctl_intvec, NULL, NULL, NULL},
+ {6, "5", &proc_sys_debug[5], sizeof(int), 0644, NULL,
+ &proc_dointvec_minmax, &sysctl_intvec, NULL, NULL, NULL},
+ {7, "6", &proc_sys_debug[6], sizeof(int), 0644, NULL,
+ &proc_dointvec_minmax, &sysctl_intvec, NULL, NULL, NULL},
+ {8, "7", &proc_sys_debug[7], sizeof(int), 0644, NULL,
+ &proc_dointvec_minmax, &sysctl_intvec, NULL, NULL, NULL},
{ .ctl_name = 0 }
};
_
---
25-akpm/fs/fs-writeback.c | 22 ++++++++++++++++++++++
25-akpm/fs/mpage.c | 6 ++++++
2 files changed, 28 insertions(+)
diff -puN fs/fs-writeback.c~pdflush-debug fs/fs-writeback.c
--- 25/fs/fs-writeback.c~pdflush-debug Tue Apr 27 17:12:11 2004
+++ 25-akpm/fs/fs-writeback.c Tue Apr 27 17:19:21 2004
@@ -152,7 +152,23 @@ __sync_single_inode(struct inode *inode,
spin_unlock(&inode_lock);
+ if (proc_sys_debug[0]) {
+ printk("%s: writepages in nr_pages:%lu nr_to_write:%ld"
+ " pages_skipped:%ld en:%d\n",
+ __FUNCTION__,
+ mapping->nrpages, wbc->nr_to_write,
+ wbc->pages_skipped,
+ wbc->encountered_congestion);
+ }
ret = do_writepages(mapping, wbc);
+ if (proc_sys_debug[0]) {
+ printk("%s: writepages in nr_pages:%lu nr_to_write:%ld"
+ " pages_skipped:%ld en:%d\n",
+ __FUNCTION__,
+ mapping->nrpages, wbc->nr_to_write,
+ wbc->pages_skipped,
+ wbc->encountered_congestion);
+ }
/* Don't write the inode if only I_DIRTY_PAGES was set */
if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC))
@@ -328,6 +344,9 @@ sync_sb_inodes(struct super_block *sb, s
if (current_is_pdflush() && !writeback_acquire(bdi))
break;
+ if (proc_sys_debug[0]) {
+ printk("%s: write inode %p\n", __FUNCTION__, inode);
+ }
BUG_ON(inode->i_state & I_FREEING);
__iget(inode);
pages_skipped = wbc->pages_skipped;
@@ -384,6 +403,9 @@ writeback_inodes(struct writeback_contro
for (; sb != sb_entry(&super_blocks); sb = sb_entry(sb->s_list.prev)) {
if (!list_empty(&sb->s_dirty) || !list_empty(&sb->s_io)) {
spin_unlock(&sb_lock);
+ if (proc_sys_debug[0]) {
+ printk("%s: sync sb %p\n", __FUNCTION__, sb);
+ }
sync_sb_inodes(sb, wbc);
spin_lock(&sb_lock);
}
diff -puN fs/mpage.c~pdflush-debug fs/mpage.c
--- 25/fs/mpage.c~pdflush-debug Tue Apr 27 17:19:41 2004
+++ 25-akpm/fs/mpage.c Tue Apr 27 17:20:57 2004
@@ -658,6 +658,12 @@ retry:
if (writepage) {
ret = (*writepage)(page, wbc);
if (ret) {
+ if (proc_sys_debug[0]) {
+ printk("%s: writepage "
+ "returned %d\n",
+ __FUNCTION__,
+ ret);
+ }
if (ret == -ENOSPC)
set_bit(AS_ENOSPC,
&mapping->flags);
_
On Wed, 28 Apr 2004, Andrew Morton wrote:
> Brent Cook <[email protected]> wrote:
> >
> > PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
> > 7 root 25 0 0 0 0 RW 99.4 0.0 415:26 0 pdflush
>
> This is getting very irritating. Cannot reproduce it with a 2.4 server, gcc
> 3.4.0, 2.6.6-rc3 client. grrr.
>
> Could you please apply the below two patches, then wait for pdflush to go
> nuts, then do:
>
> echo 1 > /proc/sys/debug/0
> echo 0 > /proc/sys/debug/0
> dmesg -s 1000000 > foo
>
> then mail me foo? It probably won't tell me much, but one has to start
> somewhere.
>
That seems like a good start. I compiled arts from KDE from an NFS
directory, and about 4 minutes into it, pdflush appears to be hung up
writing a single inode. dmesg just contains this over and over:
...
sync_sb_inodes: write inode c55d25bc
__sync_single_inode: writepages in nr_pages:25 nr_to_write:949 pages_skipped:0 en:0
__sync_single_inode: writepages in nr_pages:25 nr_to_write:949 pages_skipped:0 en:0
sync_sb_inodes: write inode c55d25bc
__sync_single_inode: writepages in nr_pages:25 nr_to_write:949 pages_skipped:0 en:0
__sync_single_inode: writepages in nr_pages:25 nr_to_write:949 pages_skipped:0 en:0
sync_sb_inodes: write inode c55d25bc
....
After rebooting, I tried repeating the experiment compiling from /tmp and
pdflush behaved. It didn't matter whether the NFS share was mounted at the
time, just whether the source was compiled from the share or elsewhere.
FYI, my fstab on this test machine (the PIII with 256MB/i815 chipset) is
pretty boring:
/dev/hda1 swap swap defaults 0 0
/dev/hda2 / reiserfs defaults 1 1
ozma:/home /home nfs rw 0 0
/dev/cdrom /mnt/cdrom iso9660 noauto,owner,ro 0 0
/dev/fd0 /mnt/floppy auto noauto,owner 0 0
devpts /dev/pts devpts gid=5,mode=620 0 0
proc /proc proc defaults 0 0
none /sys sysfs defaults 0 0
The share is exported from the 2.4.25 server as follows:
/home snowball2(rw,async,no_root_squash)
- Brent
Brent Cook <[email protected]> wrote:
>
> Here are memory stats for the same machine. The other machine's stats
> are similar; there doesn't appear to be anything out of the ordinary and
> it's not even touching swap, if these numbers are to be believed.
>
> busterb@snowball2:~$ cat /proc/meminfo
> MemTotal: 256992 kB
> MemFree: 139700 kB
> Buffers: 36464 kB
> Cached: 61516 kB
> SwapCached: 0 kB
> Active: 50536 kB
> Inactive: 51672 kB
> HighTotal: 0 kB
> HighFree: 0 kB
> LowTotal: 256992 kB
> LowFree: 139700 kB
> SwapTotal: 514040 kB
> SwapFree: 514040 kB
> Dirty: 1876 kB
> Writeback: 0 kB
> Mapped: 8552 kB
> Slab: 10924 kB
> Committed_AS: 14612 kB
> PageTables: 356 kB
> VmallocTotal: 778164 kB
> VmallocUsed: 2936 kB
> VmallocChunk: 774708 kB
Was this captured while pdflush was misbehaving?
On Wed, 28 Apr 2004, Trond Myklebust wrote:
> On Wed, 2004-04-28 at 16:39, Brent Cook wrote:
> > > Could you please capture the contents of /proc/meminfo and /proc/vmstat
> > > when it's happening?
> > >
> > > Thanks.
> > >
> >
> > Here is the top of top for one machine:
> >
> > 15:36:55 up 7:09, 1 user, load average: 1.00, 1.00, 1.00
> > 48 processes: 46 sleeping, 2 running, 0 zombie, 0 stopped
> > CPU states: 0.1% user 99.8% system 0.0% nice 0.0% iowait 0.0% idle
> > Mem: 256992k av, 117644k used, 139348k free, 0k shrd, 36464k buff
> > 50968k active, 51592k inactive
> > Swap: 514040k av, 0k used, 514040k free 61644k cached
> >
> > PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
> > 7 root 25 0 0 0 0 RW 99.4 0.0 415:26 0 pdflush
>
> Could you please also supply the mount options you are using as well as
> the contents of /proc/mounts corresponding to your NFS partition.
>
> Cheers,
> Trond
Here is /proc/mounts on the aforementioned test system:
rootfs / rootfs rw 0 0
/dev/root / reiserfs rw 0 0
devpts /dev/pts devpts rw 0 0
proc /proc proc rw,nodiratime 0 0
none /sys sysfs rw 0 0
usbfs /proc/bus/usb usbfs rw 0 0
ozma:/home /home nfs rw,v3,rsize=8192,wsize=8192,hard,udp,lock,addr=ozma 0 0
- Brent
Brent Cook <[email protected]> wrote:
>
> sync_sb_inodes: write inode c55d25bc
> __sync_single_inode: writepages in nr_pages:25 nr_to_write:949 pages_skipped:0 en:0
> __sync_single_inode: writepages in nr_pages:25 nr_to_write:949 pages_skipped:0 en:0
uh-huh.
Does this fix it?
25-akpm/fs/fs-writeback.c | 2 +-
1 files changed, 1 insertion(+), 1 deletion(-)
diff -puN fs/fs-writeback.c~a fs/fs-writeback.c
--- 25/fs/fs-writeback.c~a 2004-04-28 21:01:37.012603336 -0700
+++ 25-akpm/fs/fs-writeback.c 2004-04-28 21:02:00.701002152 -0700
@@ -191,8 +191,8 @@ __sync_single_inode(struct inode *inode,
*/
inode->i_state |= I_DIRTY_PAGES;
inode->dirtied_when = jiffies;
- list_move(&inode->i_list, &sb->s_dirty);
}
+ list_move(&inode->i_list, &sb->s_dirty);
} else if (inode->i_state & I_DIRTY) {
/*
* Someone redirtied the inode while were writing back
_
Andrew Morton <[email protected]> wrote:
>
> Does this fix it?
If not, please try this new debug patch, against -rc3.
diff -puN include/linux/kernel.h~proc-sys-debug include/linux/kernel.h
--- 25/include/linux/kernel.h~proc-sys-debug Tue Apr 27 17:11:39 2004
+++ 25-akpm/include/linux/kernel.h Tue Apr 27 17:11:39 2004
@@ -220,6 +220,8 @@ extern void dump_stack(void);
1; \
})
+extern int proc_sys_debug[8];
+
#endif /* __KERNEL__ */
#define SI_LOAD_SHIFT 16
diff -puN kernel/sysctl.c~proc-sys-debug kernel/sysctl.c
--- 25/kernel/sysctl.c~proc-sys-debug Tue Apr 27 17:11:39 2004
+++ 25-akpm/kernel/sysctl.c Tue Apr 27 17:11:39 2004
@@ -888,7 +888,26 @@ static ctl_table fs_table[] = {
{ .ctl_name = 0 }
};
+int proc_sys_debug[8];
+EXPORT_SYMBOL(proc_sys_debug);
+
static ctl_table debug_table[] = {
+ {1, "0", &proc_sys_debug[0], sizeof(int), 0644, NULL,
+ &proc_dointvec_minmax, &sysctl_intvec, NULL, NULL, NULL},
+ {2, "1", &proc_sys_debug[1], sizeof(int), 0644, NULL,
+ &proc_dointvec_minmax, &sysctl_intvec, NULL, NULL, NULL},
+ {3, "2", &proc_sys_debug[2], sizeof(int), 0644, NULL,
+ &proc_dointvec_minmax, &sysctl_intvec, NULL, NULL, NULL},
+ {4, "3", &proc_sys_debug[3], sizeof(int), 0644, NULL,
+ &proc_dointvec_minmax, &sysctl_intvec, NULL, NULL, NULL},
+ {5, "4", &proc_sys_debug[4], sizeof(int), 0644, NULL,
+ &proc_dointvec_minmax, &sysctl_intvec, NULL, NULL, NULL},
+ {6, "5", &proc_sys_debug[5], sizeof(int), 0644, NULL,
+ &proc_dointvec_minmax, &sysctl_intvec, NULL, NULL, NULL},
+ {7, "6", &proc_sys_debug[6], sizeof(int), 0644, NULL,
+ &proc_dointvec_minmax, &sysctl_intvec, NULL, NULL, NULL},
+ {8, "7", &proc_sys_debug[7], sizeof(int), 0644, NULL,
+ &proc_dointvec_minmax, &sysctl_intvec, NULL, NULL, NULL},
{ .ctl_name = 0 }
};
_
---
25-akpm/fs/fs-writeback.c | 32 ++++++++++++++++++++++++++++++++
25-akpm/fs/mpage.c | 6 ++++++
2 files changed, 38 insertions(+)
diff -puN fs/fs-writeback.c~pdflush-debug fs/fs-writeback.c
--- 25/fs/fs-writeback.c~pdflush-debug 2004-04-28 20:52:05.820437744 -0700
+++ 25-akpm/fs/fs-writeback.c 2004-04-28 21:01:05.062460496 -0700
@@ -152,7 +152,23 @@ __sync_single_inode(struct inode *inode,
spin_unlock(&inode_lock);
+ if (proc_sys_debug[0]) {
+ printk("%s: writepages in nr_pages:%lu nr_to_write:%ld"
+ " pages_skipped:%ld en:%d\n",
+ __FUNCTION__,
+ mapping->nrpages, wbc->nr_to_write,
+ wbc->pages_skipped,
+ wbc->encountered_congestion);
+ }
ret = do_writepages(mapping, wbc);
+ if (proc_sys_debug[0]) {
+ printk("%s: writepages out nr_pages:%lu nr_to_write:%ld"
+ " pages_skipped:%ld en:%d\n",
+ __FUNCTION__,
+ mapping->nrpages, wbc->nr_to_write,
+ wbc->pages_skipped,
+ wbc->encountered_congestion);
+ }
/* Don't write the inode if only I_DIRTY_PAGES was set */
if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC))
@@ -180,6 +196,8 @@ __sync_single_inode(struct inode *inode,
* writeout as soon as the queue becomes
* uncongested.
*/
+ if (proc_sys_debug[0])
+ printk("%s:%d\n", __FILE__, __LINE__);
inode->i_state |= I_DIRTY_PAGES;
} else {
/*
@@ -192,22 +210,30 @@ __sync_single_inode(struct inode *inode,
inode->i_state |= I_DIRTY_PAGES;
inode->dirtied_when = jiffies;
list_move(&inode->i_list, &sb->s_dirty);
+ if (proc_sys_debug[0])
+ printk("%s:%d\n", __FILE__, __LINE__);
}
} else if (inode->i_state & I_DIRTY) {
/*
* Someone redirtied the inode while were writing back
* the pages: nothing to do.
*/
+ if (proc_sys_debug[0])
+ printk("%s:%d\n", __FILE__, __LINE__);
} else if (atomic_read(&inode->i_count)) {
/*
* The inode is clean, inuse
*/
list_move(&inode->i_list, &inode_in_use);
+ if (proc_sys_debug[0])
+ printk("%s:%d\n", __FILE__, __LINE__);
} else {
/*
* The inode is clean, unused
*/
list_move(&inode->i_list, &inode_unused);
+ if (proc_sys_debug[0])
+ printk("%s:%d\n", __FILE__, __LINE__);
}
}
wake_up_inode(inode);
@@ -328,6 +354,9 @@ sync_sb_inodes(struct super_block *sb, s
if (current_is_pdflush() && !writeback_acquire(bdi))
break;
+ if (proc_sys_debug[0]) {
+ printk("%s: write inode %p\n", __FUNCTION__, inode);
+ }
BUG_ON(inode->i_state & I_FREEING);
__iget(inode);
pages_skipped = wbc->pages_skipped;
@@ -384,6 +413,9 @@ writeback_inodes(struct writeback_contro
for (; sb != sb_entry(&super_blocks); sb = sb_entry(sb->s_list.prev)) {
if (!list_empty(&sb->s_dirty) || !list_empty(&sb->s_io)) {
spin_unlock(&sb_lock);
+ if (proc_sys_debug[0]) {
+ printk("%s: sync sb %p\n", __FUNCTION__, sb);
+ }
sync_sb_inodes(sb, wbc);
spin_lock(&sb_lock);
}
diff -puN fs/mpage.c~pdflush-debug fs/mpage.c
--- 25/fs/mpage.c~pdflush-debug 2004-04-28 20:52:05.821437592 -0700
+++ 25-akpm/fs/mpage.c 2004-04-28 20:52:05.825436984 -0700
@@ -658,6 +658,12 @@ retry:
if (writepage) {
ret = (*writepage)(page, wbc);
if (ret) {
+ if (proc_sys_debug[0]) {
+ printk("%s: writepage "
+ "returned %d\n",
+ __FUNCTION__,
+ ret);
+ }
if (ret == -ENOSPC)
set_bit(AS_ENOSPC,
&mapping->flags);
_
On Wed, 2004-04-28 at 23:55, Brent Cook wrote:
> ozma:/home /home nfs rw,v3,rsize=8192,wsize=8192,hard,udp,lock,addr=ozma 0 0
OK, then it's not the case that it is doing synchronous I/O.
I see that we're failing to set wbc->encountered_congestion in the case
where a nonblocking writeback is forced to exit due to congestion. Could
that be causing pdflush to loop, Andrew?
If so, does the following patch help?
--- linux-2.6.6-rc3/fs/nfs/write.c.orig 2004-04-28 22:25:46.000000000 -0400
+++ linux-2.6.6-rc3/fs/nfs/write.c 2004-04-29 00:06:25.000000000 -0400
@@ -347,8 +347,10 @@ int nfs_writepages(struct address_space
if (err)
return err;
while (test_and_set_bit(BDI_write_congested, &bdi->state) != 0) {
- if (wbc->nonblocking)
+ if (wbc->nonblocking) {
+ wbc->encountered_congestion = 1;
return 0;
+ }
nfs_wait_on_write_congestion(mapping, 0);
}
err = nfs_flush_inode(inode, 0, 0, wb_priority(wbc));
On Wed, 28 Apr 2004, Andrew Morton wrote:
> Brent Cook <[email protected]> wrote:
> >
> > sync_sb_inodes: write inode c55d25bc
> > __sync_single_inode: writepages in nr_pages:25 nr_to_write:949 pages_skipped:0 en:0
> > __sync_single_inode: writepages in nr_pages:25 nr_to_write:949 pages_skipped:0 en:0
>
> uh-huh.
>
> Does this fix it?
I'm going to run a compile/load test overnight, but the test that
triggered it every time previously failed to do so with this patch.
pdflush is behaving so far, and I'll say you've figured it out for now,
with the final verdict in about 8 hours.
Does this mean that, if there were too many dirty pages and not enough
time to write them all back, the dirty page list just stopped being
traversed, stuck on a single page? That would explain why this was
easier to trigger on NFS than on a local FS, due to the extra latency, etc.
Maybe I should look into tuning my NFS server while we're at it, as yours
must be much faster ;)
- Brent
> 25-akpm/fs/fs-writeback.c | 2 +-
> 1 files changed, 1 insertion(+), 1 deletion(-)
>
> diff -puN fs/fs-writeback.c~a fs/fs-writeback.c
> --- 25/fs/fs-writeback.c~a 2004-04-28 21:01:37.012603336 -0700
> +++ 25-akpm/fs/fs-writeback.c 2004-04-28 21:02:00.701002152 -0700
> @@ -191,8 +191,8 @@ __sync_single_inode(struct inode *inode,
> */
> inode->i_state |= I_DIRTY_PAGES;
> inode->dirtied_when = jiffies;
> - list_move(&inode->i_list, &sb->s_dirty);
> }
> + list_move(&inode->i_list, &sb->s_dirty);
> } else if (inode->i_state & I_DIRTY) {
> /*
> * Someone redirtied the inode while were writing back
>
> _
>
Trond Myklebust <[email protected]> wrote:
>
> On Wed, 2004-04-28 at 23:55, Brent Cook wrote:
>
> > ozma:/home /home nfs rw,v3,rsize=8192,wsize=8192,hard,udp,lock,addr=ozma 0 0
>
> OK, then it's not the case that it is doing synchronous I/O.
>
> I see that we're failing to set wbc->encountered_congestion in the case
> where a nonblocking writeback is forced to exit due to congestion. Could
> that be causing pdflush to loop, Andrew?
It should be OK - after writepages completes the inode-level writeback code
will inspect BDI_write_congested and will propagate that into the
writeback_control for the top-level writeback code to ponder.
The pdflush silliness has been seen on smbfs and reiserfs too, so...
It looks like the problem is that the inode is stuck on the superblock's
s_io list, and we keep calling writepage(s) and writepage(s) keeps on
redirtying the page rather than writing it. If so, moving the page off
s_io should fix it up.
Brent Cook <[email protected]> wrote:
>
> On Wed, 28 Apr 2004, Andrew Morton wrote:
>
> > Brent Cook <[email protected]> wrote:
> > >
> > > sync_sb_inodes: write inode c55d25bc
> > > __sync_single_inode: writepages in nr_pages:25 nr_to_write:949 pages_skipped:0 en:0
> > > __sync_single_inode: writepages in nr_pages:25 nr_to_write:949 pages_skipped:0 en:0
> >
> > uh-huh.
> >
> > Does this fix it?
>
> I'm going to run a compile/load test overnight, but the test that
> triggered it every time previously failed to do so with this patch.
OK, thanks. A better patch would be:
diff -puN fs/fs-writeback.c~writeback-livelock-fix-2 fs/fs-writeback.c
--- 25/fs/fs-writeback.c~writeback-livelock-fix-2 2004-04-28 21:19:32.779061976 -0700
+++ 25-akpm/fs/fs-writeback.c 2004-04-28 21:20:11.080239312 -0700
@@ -176,11 +176,12 @@ __sync_single_inode(struct inode *inode,
if (wbc->for_kupdate) {
/*
* For the kupdate function we leave the inode
- * where it is on sb_dirty so it will get more
+ * at the head of sb_dirty so it will get more
* writeout as soon as the queue becomes
* uncongested.
*/
inode->i_state |= I_DIRTY_PAGES;
+ list_move_tail(&inode->i_list, &sb->s_dirty);
} else {
/*
* Otherwise fully redirty the inode so that
_
> pdflush is behaving so far, and I'll say you've figured it out for now,
> with the final verdict in about 8 hours.
>
> Does this mean that, if there were too many dirty pages and not enough
> > time to write them all back, the dirty page list just stopped being
> traversed, stuck on a single page?
No.. There's all sorts of livelock avoidance code in there and I keep on
forgetting that sometimes writepage won't write the dang page at all -
instead it just redirties the page (and hence the inode).
Now, that redirtying of the inode _should_ have moved the inode off the
s_io list and onto the s_dirty list. But for some reason it looks like it
didn't, so we get stuck in a loop. I need to think about it a bit more.
Andrew Morton <[email protected]> wrote:
>
> Now, that redirtying of the inode _should_ have moved the inode off the
> s_io list and onto the s_dirty list. But for some reason it looks like it
> didn't, so we get stuck in a loop. I need to think about it a bit more.
OK, it looks like nfs_writepages() might have encountered the
congested flag and bailed out without doing anything - the inode is still
on the temporary s_io list, and no pages were redirtied, hence the inode
wasn't redirtied, hence it remains stuck on the s_io list. The patch I
sent will fix that up.
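Schematically, the loop looks like this (a toy userspace sketch of the
logic described above, not the real fs/fs-writeback.c or fs/nfs/write.c
code):

#include <stdio.h>

struct toy_inode {
        int dirty_pages;        /* stands in for dirty pagecache pages */
        int on_s_io;            /* 1 while parked on the s_io list */
};

/* models nfs_writepages() bailing out on a congested queue:
 * nothing is written and no page is redirtied */
static void toy_writepages(struct toy_inode *inode, int congested)
{
        if (congested)
                return;
        inode->dirty_pages = 0; /* uncongested case: pages get written */
}

int main(void)
{
        struct toy_inode inode = { .dirty_pages = 25, .on_s_io = 1 };
        int pass;

        /* pdflush keeps retrying while dirty pages remain; because the
         * inode is never moved back to s_dirty, every pass finds the
         * same inode on s_io and makes no progress */
        for (pass = 1; inode.dirty_pages && pass <= 5; pass++) {
                toy_writepages(&inode, 1 /* congested */);
                printf("pass %d: inode still on s_io with %d dirty pages\n",
                       pass, inode.dirty_pages);
        }
        printf("...and without requeueing to s_dirty this never ends\n");
        return 0;
}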
On Wed, 28 Apr 2004, Andrew Morton wrote:
> Brent Cook <[email protected]> wrote:
> >
> > On Wed, 28 Apr 2004, Andrew Morton wrote:
> >
> > > Brent Cook <[email protected]> wrote:
> > > >
> > > > sync_sb_inodes: write inode c55d25bc
> > > > __sync_single_inode: writepages in nr_pages:25 nr_to_write:949 pages_skipped:0 en:0
> > > > __sync_single_inode: writepages in nr_pages:25 nr_to_write:949 pages_skipped:0 en:0
> > >
> > > uh-huh.
> > >
> > > Does this fix it?
> >
> > I'm going to run a compile/load test overnight, but the test that
> > triggered it every time previously failed to do so with this patch.
>
> OK, thanks. A better patch would be:
No, thank you! The overnight test was successful. I have been running this
better patch for a little while, and it is no worse. I think you have
solved the bigger problem, which was the runaway process, at least for me.
So, moving it to the tail of the s_dirty list now gives that inode a
higher priority to be written back next time? That sounds better than just
redirtying it; the poor inode has been through enough as it is without
having to wait even longer.
If you want to think about it a little more, pdflush on 2.6.6-rc3 with
this patch still seems to use more resources than it did on 2.6.5. With
heavy NFS traffic, it still uses about 2-3% CPU on 2.6.6-rc3, but on 2.6.5
it averages about 0.1%. Maybe it just wasn't being used to its full
potential in 2.6.5?
Thanks
- Brent
>
> diff -puN fs/fs-writeback.c~writeback-livelock-fix-2 fs/fs-writeback.c
> --- 25/fs/fs-writeback.c~writeback-livelock-fix-2 2004-04-28 21:19:32.779061976 -0700
> +++ 25-akpm/fs/fs-writeback.c 2004-04-28 21:20:11.080239312 -0700
> @@ -176,11 +176,12 @@ __sync_single_inode(struct inode *inode,
> if (wbc->for_kupdate) {
> /*
> * For the kupdate function we leave the inode
> - * where it is on sb_dirty so it will get more
> + * at the head of sb_dirty so it will get more
> * writeout as soon as the queue becomes
> * uncongested.
> */
> inode->i_state |= I_DIRTY_PAGES;
> + list_move_tail(&inode->i_list, &sb->s_dirty);
> } else {
> /*
> * Otherwise fully redirty the inode so that
>
> _
>
> > pdflush is behaving so far, and I'll say you've figured it out for now,
> > with the final verdict in about 8 hours.
> >
> > Does this mean that, if there were too many dirty pages and not enough
> > time to write them all back, the dirty page list just stopped being
> > traversed, stuck on a single page?
>
> No.. There's all sorts of livelock avoidance code in there and I keep on
> forgetting that sometimes writepage won't write the dang page at all -
> instead it just redirties the page (and hence the inode).
>
> Now, that redirtying of the inode _should_ have moved the inode off the
> s_io list and onto the s_dirty list. But for some reason it looks like it
> didn't, so we get stuck in a loop. I need to think about it a bit more.