I seem to have hit the same issue as Fabio Brugnara
http://www.ussg.iu.edu/hypermail/linux/kernel/0505.0/1061.html
but although he said the problem went away for him in 2.6.11, I saw the problem
in 2.6.8, and still see it in 2.6.14.
When using a sliding mmap window over a file on nfs, the system time increases
with time, and the throughput drops.
Using the same program on a local file I get about 30MB/s with very little
system time... just a bunch of iowait.
I am using a gigabit connection to a netapp, and initially get 30+MB/s, but
after about 30 seconds, this decays to only about 8MB/s with the system time
close to 80%.
I did a quick Oprofile run, and here is the top of the list:
samples % image name app name symbol name
1739810 16.8926 vmlinux-2.6.14 vmlinux-2.6.14 find_get_pages_tag
1305104 12.6718 vmlinux-2.6.14 vmlinux-2.6.14 mpage_writepages
1041147 10.1089 vmlinux-2.6.14 vmlinux-2.6.14 unlock_page
998692 9.6967 vmlinux-2.6.14 vmlinux-2.6.14 clear_page_dirty_for_io
966458 9.3838 vmlinux-2.6.14 vmlinux-2.6.14 release_pages
674434 6.5484 vmlinux-2.6.14 vmlinux-2.6.14 pci_conf1_write
486345 4.7221 vmlinux-2.6.14 vmlinux-2.6.14 __lookup_tag
399627 3.8801 vmlinux-2.6.14 vmlinux-2.6.14 page_waitqueue
134250 1.3035 vmlinux-2.6.14 vmlinux-2.6.14 _spin_lock_irqsave
130594 1.2680 vmlinux-2.6.14 vmlinux-2.6.14 _read_lock_irqsave
I'm attaching the config.
Other details:
Dual Xeon 2.66 w/ 2GB RAM
Debian sarge + 2.6.14 kernel + glibc 2.3.5
nvidia drivers
broadcom gigabit driver (8.2.18)
thanks,
-Kenny
Just another data point....
If I open the file with O_DIRECT.. not much changes:
samples % image name app name symbol name
12585321 18.9373 vmlinux-2.6.14 vmlinux-2.6.14 find_get_pages_tag
8608887 12.9539 vmlinux-2.6.14 vmlinux-2.6.14 mpage_writepages
6870600 10.3383 vmlinux-2.6.14 vmlinux-2.6.14 unlock_page
6605417 9.9393 vmlinux-2.6.14 vmlinux-2.6.14 clear_page_dirty_for_io
6259207 9.4183 vmlinux-2.6.14 vmlinux-2.6.14 release_pages
3249493 4.8896 vmlinux-2.6.14 vmlinux-2.6.14 __lookup_tag
3248871 4.8886 vmlinux-2.6.14 vmlinux-2.6.14 pci_conf1_write
2677914 4.0295 vmlinux-2.6.14 vmlinux-2.6.14 page_waitqueue
982811 1.4789 vmlinux-2.6.14 vmlinux-2.6.14 _read_lock_irqsave
917165 1.3801 vmlinux-2.6.14 vmlinux-2.6.14 _read_unlock_irq
758960 1.1420 vmlinux-2.6.14 vmlinux-2.6.14 __wake_up_bit
706607 1.0632 vmlinux-2.6.14 vmlinux-2.6.14 _spin_lock_irqsave
-Kenny
On Tue, 2005-11-08 at 11:25 -0800, Kenny Simpson wrote:
> Just another data point....
> If I open the file with O_DIRECT.. not much changes:
Hmm... Are you mounting using the -osync or -onoac options? Doing
synchronous writes will tend to slow down flushing considerably, and the
VM appears to be very fragile w.r.t. slow filesystems.
Cheers,
Trond
> samples % image name app name symbol name
> 12585321 18.9373 vmlinux-2.6.14 vmlinux-2.6.14 find_get_pages_tag
> 8608887 12.9539 vmlinux-2.6.14 vmlinux-2.6.14 mpage_writepages
> 6870600 10.3383 vmlinux-2.6.14 vmlinux-2.6.14 unlock_page
> 6605417 9.9393 vmlinux-2.6.14 vmlinux-2.6.14 clear_page_dirty_for_io
> 6259207 9.4183 vmlinux-2.6.14 vmlinux-2.6.14 release_pages
> 3249493 4.8896 vmlinux-2.6.14 vmlinux-2.6.14 __lookup_tag
> 3248871 4.8886 vmlinux-2.6.14 vmlinux-2.6.14 pci_conf1_write
> 2677914 4.0295 vmlinux-2.6.14 vmlinux-2.6.14 page_waitqueue
> 982811 1.4789 vmlinux-2.6.14 vmlinux-2.6.14 _read_lock_irqsave
> 917165 1.3801 vmlinux-2.6.14 vmlinux-2.6.14 _read_unlock_irq
> 758960 1.1420 vmlinux-2.6.14 vmlinux-2.6.14 __wake_up_bit
> 706607 1.0632 vmlinux-2.6.14 vmlinux-2.6.14 _spin_lock_irqsave
>
>
> -Kenny
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
I knew I forgot something important in the info:
type nfs (rw,tcp,rsize=32768,wsize=32768,hard,intr,vers=3,tcp,rsize=32768,wsize=32768,hard,intr
-Kenny
--- Trond Myklebust <[email protected]> wrote:
> On Tue, 2005-11-08 at 11:25 -0800, Kenny Simpson wrote:
> > Just another data point....
> > If I open the file with O_DIRECT.. not much changes:
>
> Hmm... Are you mounting using the -osync or -onoac options? Doing
> synchronous writes will tend to slow down flushing considerably, and the
> VM appears to be very fragile w.r.t. slow filesystems.
>
> Cheers,
> Trond
I am attaching a sample piece of code to show the behavior.
This simply tries to grow a file as fast as it can using different methods.
Use with caution as it does not stop, and will fill the disk if you let it run.
Run as "nfstest -m <filename>" where <filename> is on an nfs mount.
-Kenny
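A minimal sketch of the kind of test described above, for readers without the attachment. It is not the attached nfstest program: the function name, the O_TRUNC flag, and the bounded loop are assumptions made for illustration, and the 2MB window is the size mentioned elsewhere in the thread.

```c
#define _FILE_OFFSET_BITS 64   /* 64-bit off_t so the file can pass 4GB */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define WINDOW (2u * 1024 * 1024)  /* 2MB sliding window */

/*
 * Grow `path` by `nwindows` windows: extend with ftruncate(), map only the
 * new window, dirty it, and unmap it again. Returns the final file size,
 * or -1 on error. Unlike the original test, this stops after `nwindows`.
 */
static off_t grow_with_mmap(const char *path, int nwindows)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    off_t size = 0;
    for (int i = 0; i < nwindows; i++) {
        /* Extend the file by one window. */
        if (ftruncate(fd, size + WINDOW) < 0)
            goto fail;
        /* Map just the newly added window. */
        void *p = mmap(NULL, WINDOW, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, size);
        if (p == MAP_FAILED)
            goto fail;
        memset(p, 'x', WINDOW);   /* dirty every page in the window */
        munmap(p, WINDOW);
        size += WINDOW;
    }
    close(fd);
    return size;
fail:
    close(fd);
    return -1;
}
```

Calling grow_with_mmap() with a path on the NFS mount and a large window count, while watching vmstat, should be enough to see whether system time climbs as the file grows.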
I ran the same test again against 2.6.15-rc, and got pretty much the same thing. It starts nice
and fast (30+MB/s), but drops down to under 10MB/s, with the system time pegging one CPU.
Here is the oprofile result:
CPU: P4 / Xeon with 2 hyper-threads, speed 2658.47 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask
of 0x01 (mandatory) count 100000
samples % symbol name
412585 14.6687 find_get_pages_tag
343898 12.2267 mpage_writepages
290144 10.3155 release_pages
288631 10.2617 unlock_page
286181 10.1746 pci_conf1_write
267619 9.5147 clear_page_dirty_for_io
128128 4.5554 __lookup_tag
120895 4.2982 page_waitqueue
52739 1.8750 _spin_lock_irqsave
43623 1.5509 skb_copy_bits
30157 1.0722 __wake_up_bit
29973 1.0656 _read_lock_irqsave
-Kenny
On Tue, Nov 15, 2005 at 03:47:30PM -0800, Kenny Simpson wrote:
> CPU: P4 / Xeon with 2 hyper-threads, speed 2658.47 MHz (estimated)
> Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask
> of 0x01 (mandatory) count 100000
> samples % symbol name
> 412585 14.6687 find_get_pages_tag
> 343898 12.2267 mpage_writepages
> 290144 10.3155 release_pages
> 288631 10.2617 unlock_page
> 286181 10.1746 pci_conf1_write
> 267619 9.5147 clear_page_dirty_for_io
> 128128 4.5554 __lookup_tag
> 120895 4.2982 page_waitqueue
> 52739 1.8750 _spin_lock_irqsave
> 43623 1.5509 skb_copy_bits
> 30157 1.0722 __wake_up_bit
> 29973 1.0656 _read_lock_irqsave
67%, or 2/3 of the samples, are in the top 6 functions. Have you tried
instruction-level profiling? It would be interesting to see what
codepaths within the functions are the largest offenders.
-- wli
--- William Lee Irwin III <[email protected]> wrote:
> 67%, or 2/3 of the samples, are in the top 6 functions. Have you tried
> instruction-level profiling? It would be interesting to see what
> codepaths within the functions are the largest offenders.
>
>
> -- wli
>
I'm a little new to oprofile, but I'm willing to try any configuration or set of flags that could
be useful.
Are you referring to the -d option in opreport?
--details / -d
Show per-instruction details for all selected symbols.
I'll give it a go when I get back to work.
-Kenny
Kenny Simpson <[email protected]> wrote:
>
> I ran the same test again against 2.6.15-rc, and got pretty much the same thing. It starts nice
> and fast (30+MB/s), but drops down to under 10MB/s, with the system time pegging one CPU.
>
> Here is the oprofile result:
>
> CPU: P4 / Xeon with 2 hyper-threads, speed 2658.47 MHz (estimated)
> Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask
> of 0x01 (mandatory) count 100000
> samples % symbol name
> 412585 14.6687 find_get_pages_tag
> 343898 12.2267 mpage_writepages
> 290144 10.3155 release_pages
> 288631 10.2617 unlock_page
> 286181 10.1746 pci_conf1_write
> 267619 9.5147 clear_page_dirty_for_io
> 128128 4.5554 __lookup_tag
> 120895 4.2982 page_waitqueue
> 52739 1.8750 _spin_lock_irqsave
> 43623 1.5509 skb_copy_bits
> 30157 1.0722 __wake_up_bit
> 29973 1.0656 _read_lock_irqsave
>
Your application walks the file in 2MB hunks, doing ftruncate() each time
to expand the file by another 2MB.
nfs_setattr() implements the truncate. It syncs the whole file, using
filemap_write_and_wait() (that seems a bit suboptimal. All we're doing is
increasing i_size??)
So filemap_write_and_wait() has to write 2MB's worth of pages. Problem is,
_all_ the pages, even the 99% which are clean are tagged as dirty in the
pagecache radix tree. So find_get_pages_tag() ends up visiting each page
in the file, and blows much CPU doing so.
The writeout happens in mpage_writepages(), which uses
clear_page_dirty_for_io() to clear PG_dirty. But it doesn't clear the
dirty tag in the radix tree. It relies upon the filesystem to do the right
thing later on. Which is all very unpleasant, sorry. See the explanatory
comment over clear_page_dirty_for_io().
nfs_writepage() doesn't do any of the things which that comment says it
should, hence the radix tree tags are getting out of sync, hence this
problem.
NFS does strange, incomprehensible-to-little-akpms things in its writeout
path. Ideally, it should run set_page_writeback() prior to unlocking the
page and end_page_writeback() when I/O completes. That'll keep the VM
happier while fixing this performance glitch.
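The cost of the stale tags can be illustrated with a small user-space toy. This is only a sketch under loud assumptions: struct toy_page and both functions are invented for illustration, a flat array stands in for the pagecache radix tree, and a linear scan stands in for the tagged lookup. It counts how many pages a "write out everything tagged dirty" pass has to visit when the tag is, or is not, cleared along with the dirty flag.

```c
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 4096          /* toy file size in pages */
#define WINDOW_PAGES 512     /* 2MB of 4k pages */

struct toy_page {
    bool dirty;      /* stands in for PG_dirty */
    bool tag_dirty;  /* stands in for PAGECACHE_TAG_DIRTY */
};

/*
 * Write out everything tagged dirty in pages[0..npages). Returns how many
 * pages the tagged lookup had to visit. If `clear_tag` is false, the tag
 * stays set after the dirty flag is cleared, which is the mismatch
 * described above.
 */
static size_t toy_writeout(struct toy_page *pages, size_t npages, bool clear_tag)
{
    size_t visited = 0;
    for (size_t i = 0; i < npages; i++) {
        if (!pages[i].tag_dirty)
            continue;              /* a tagged lookup skips untagged pages */
        visited++;
        pages[i].dirty = false;    /* like clear_page_dirty_for_io() */
        if (clear_tag)
            pages[i].tag_dirty = false;
    }
    return visited;
}

/* Dirty one window at a time, sync after each, and total the visits. */
static size_t toy_run(bool clear_tag)
{
    static struct toy_page pages[NPAGES];
    size_t total = 0;

    for (size_t i = 0; i < NPAGES; i++)
        pages[i].dirty = pages[i].tag_dirty = false;

    for (size_t start = 0; start < NPAGES; start += WINDOW_PAGES) {
        for (size_t i = start; i < start + WINDOW_PAGES; i++)
            pages[i].dirty = pages[i].tag_dirty = true;
        total += toy_writeout(pages, start + WINDOW_PAGES, clear_tag);
    }
    return total;
}
```

With the tags kept in sync, toy_run(true) visits each page once (4096 visits). With stale tags, toy_run(false) revisits every earlier window on every sync (18432 visits for this small file), and the gap grows quadratically with file size, which matches the "fast at first, then slower and slower" behaviour reported.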
On Tue, 2005-11-15 at 23:45 -0800, Andrew Morton wrote:
> So filemap_write_and_wait() has to write 2MB's worth of pages. Problem is,
> _all_ the pages, even the 99% which are clean are tagged as dirty in the
> pagecache radix tree. So find_get_pages_tag() ends up visiting each page
> in the file, and blows much CPU doing so.
>
> The writeout happens in mpage_writepages(), which uses
> clear_page_dirty_for_io() to clear PG_dirty. But it doesn't clear the
> dirty tag in the radix tree. It relies upon the filesystem to do the right
> thing later on. Which is all very unpleasant, sorry. See the explanatory
> comment over clear_page_dirty_for_io().
> nfs_writepage() doesn't do any of the things which that comment says it
> should, hence the radix tree tags are getting out of sync, hence this
> problem.
>
> NFS does strange, incomprehensible-to-little-akpms things in its writeout
> path. Ideally, it should run set_page_writeback() prior to unlocking the
> page and end_page_writeback() when I/O completes. That'll keep the VM
> happier while fixing this performance glitch.
Actually that will screw over performance even further by forcing us to
send out loads of little RPC requests to write 4k pages instead of
allowing us to gather those writes into 32k (or larger) chunks.
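As a rough illustration of the difference, the number of WRITE calls needed for one window can be counted directly. The helper below is a hypothetical name, and the arithmetic ignores everything except payload size.

```c
#include <stddef.h>

/*
 * Number of WRITE calls needed to send `bytes` when each call carries at
 * most `per_call` bytes (rounded up).
 */
static size_t rpc_count(size_t bytes, size_t per_call)
{
    return (bytes + per_call - 1) / per_call;
}
```

For a 2MB window, rpc_count(2097152, 4096) gives 512 calls when every 4k page goes out on its own, against rpc_count(2097152, 32768) giving 64 calls with wsize=32768.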
Anyhow, does the following patch help?
Cheers,
Trond
------
NFS: resync to yet more writepage() changes...
Ensure that we call clear_page_dirty() for pages that have been written
via writepage().
Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfs/write.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 8f71e76..ea77da5 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -213,6 +213,7 @@ static int nfs_writepage_sync(struct nfs
} while (count);
/* Update file length */
nfs_grow_file(page, offset, written);
+ clear_page_dirty(page);
/* Set the PG_uptodate flag? */
nfs_mark_uptodate(page, offset, written);
@@ -238,6 +239,7 @@ static int nfs_writepage_async(struct nf
goto out;
/* Update file length */
nfs_grow_file(page, offset, count);
+ clear_page_dirty(page);
/* Set the PG_uptodate flag? */
nfs_mark_uptodate(page, offset, count);
nfs_unlock_request(req);
--- Trond Myklebust <[email protected]> wrote:
> Anyhow, does the following patch help?
Unfortunately, not:
samples % symbol name
545009 15.2546 find_get_pages_tag
450595 12.6120 mpage_writepages
383196 10.7255 release_pages
381479 10.6775 unlock_page
351513 9.8387 clear_page_dirty_for_io
317784 8.8947 pci_conf1_write
167918 4.7000 __lookup_tag
160701 4.4980 page_waitqueue
59142 1.6554 _spin_lock_irqsave
47655 1.3338 skb_copy_bits
39136 1.0954 __wake_up_bit
38143 1.0676 _read_lock_irqsave
Reducing the window size to 32k doesn't change the profile much:
samples % symbol name
474589 21.2001 find_get_pages_tag
370512 16.5509 mpage_writepages
310556 13.8727 release_pages
302571 13.5160 unlock_page
286541 12.7999 clear_page_dirty_for_io
119717 5.3478 page_waitqueue
109920 4.9102 __lookup_tag
33313 1.4881 pci_conf1_write
29198 1.3043 __wake_up_bit
27075 1.2095 _read_lock_irqsave
25009 1.1172 _read_unlock_irq
... except the performance is much worse than with the 2M buffer (hence the 2M choice). With the
smaller buffer, the throughput starts at 8M/sec and quickly drops to 1M/sec.
-Kenny
On Wed, 2005-11-16 at 07:01 -0800, Kenny Simpson wrote:
> --- Trond Myklebust <[email protected]> wrote:
> > Anyhow, does the following patch help?
>
> Unfortunately, not:
>
> samples % symbol name
> 545009 15.2546 find_get_pages_tag
Argh... I totally missed the point there with the last patch. We should
be resyncing the page tag with the value of the PG_dirty flag...
OK, please back out the patch that I sent you, and try this one instead.
Cheers,
Trond
------
NFS: resync to yet more writepage() changes...
Ensure that we call clear_page_dirty() for pages that have been written
via writepage().
Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfs/write.c | 2 ++
include/linux/mm.h | 1 +
mm/page-writeback.c | 20 ++++++++++++++++++++
3 files changed, 23 insertions(+), 0 deletions(-)
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 8f71e76..61ec355 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -213,6 +213,7 @@ static int nfs_writepage_sync(struct nfs
} while (count);
/* Update file length */
nfs_grow_file(page, offset, written);
+ clear_page_dirty_tag(page);
/* Set the PG_uptodate flag? */
nfs_mark_uptodate(page, offset, written);
@@ -238,6 +239,7 @@ static int nfs_writepage_async(struct nf
goto out;
/* Update file length */
nfs_grow_file(page, offset, count);
+ clear_page_dirty_tag(page);
/* Set the PG_uptodate flag? */
nfs_mark_uptodate(page, offset, count);
nfs_unlock_request(req);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1013a42..cb1cfe1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -730,6 +730,7 @@ int redirty_page_for_writepage(struct wr
int FASTCALL(set_page_dirty(struct page *page));
int set_page_dirty_lock(struct page *page);
int clear_page_dirty_for_io(struct page *page);
+int clear_page_dirty_tag(struct page *page);
extern unsigned long do_mremap(unsigned long addr,
unsigned long old_len, unsigned long new_len,
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 74138c9..65c58fa 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -751,6 +751,26 @@ int clear_page_dirty_for_io(struct page
return TestClearPageDirty(page);
}
+/*
+ * Clears the page dirty tag. See comment in clear_page_dirty_for_io()
+ */
+int clear_page_dirty_tag(struct page *page)
+{
+ struct address_space *mapping = page_mapping(page);
+
+ if (mapping) {
+ unsigned long flags;
+
+ write_lock_irqsave(&mapping->tree_lock, flags);
+ if (!PageDirty(page))
+ radix_tree_tag_clear(&mapping->page_tree,
+ page_index(page),
+ PAGECACHE_TAG_DIRTY);
+ write_unlock_irqrestore(&mapping->tree_lock, flags);
+ }
+}
+EXPORT_SYMBOL(clear_page_dirty_tag);
+
int test_clear_page_writeback(struct page *page)
{
struct address_space *mapping = page_mapping(page);
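The core of the check in clear_page_dirty_tag() can be sketched in user space. This is a toy under stated assumptions: struct toy_page and toy_clear_dirty_tag() are invented names, two booleans stand in for PG_dirty and the radix tree tag, and the tree_lock locking in the patch is left out entirely.

```c
#include <stdbool.h>

struct toy_page {
    bool dirty;      /* stands in for PG_dirty */
    bool tag_dirty;  /* stands in for PAGECACHE_TAG_DIRTY */
};

/*
 * Same test as in the patch: drop the tag only if the page is no longer
 * dirty, so a page that was redirtied in the meantime keeps its tag.
 * Returns true if the tag is clear on return.
 */
static bool toy_clear_dirty_tag(struct toy_page *page)
{
    if (page->dirty)
        return false;
    page->tag_dirty = false;
    return true;
}
```

The point of testing the flag rather than clearing the tag unconditionally is that the tag is resynced to the current value of the dirty flag, whatever that value is.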
Trond Myklebust <[email protected]> wrote:
>
> On Wed, 2005-11-16 at 07:01 -0800, Kenny Simpson wrote:
> > --- Trond Myklebust <[email protected]> wrote:
> > > Anyhow, does the following patch help?
> >
> > Unfortunately, not:
> >
> > samples % symbol name
> > 545009 15.2546 find_get_pages_tag
>
> Argh... I totally missed the point there with the last patch. We should
> be resyncing the page tag with the value of the PG_dirty flag...
>
> OK, please back out the patch that I sent you, and try this one instead.
>
> ...
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index 8f71e76..61ec355 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -213,6 +213,7 @@ static int nfs_writepage_sync(struct nfs
> } while (count);
> /* Update file length */
> nfs_grow_file(page, offset, written);
> + clear_page_dirty_tag(page);
> /* Set the PG_uptodate flag? */
> nfs_mark_uptodate(page, offset, written);
>
> ....
> +int clear_page_dirty_tag(struct page *page)
> +{
> + struct address_space *mapping = page_mapping(page);
> +
> + if (mapping) {
> + unsigned long flags;
> +
> + write_lock_irqsave(&mapping->tree_lock, flags);
> + if (!PageDirty(page))
> + radix_tree_tag_clear(&mapping->page_tree,
> + page_index(page),
> + PAGECACHE_TAG_DIRTY);
> + write_unlock_irqrestore(&mapping->tree_lock, flags);
> + }
> +}
That will fix it, but the PageWriteback accounting is still wrong.
Is it not possible to use set_page_writeback()/end_page_writeback()?
Are these pages marked "unstable" at this time?
On Wed, 2005-11-16 at 10:00 -0800, Andrew Morton wrote:
> That will fix it, but the PageWriteback accounting is still wrong.
>
> Is it not possible to use set_page_writeback()/end_page_writeback()?
Not really. The pages aren't flushed at this time. The point is to
gather several pages and coalesce them into one over-the-wire RPC call.
That means we cannot really do it from inside ->writepage().
We do start the actual RPC calls in ->writepages(), though.
> Are these pages marked "unstable" at this time?
No. "unstable" means that the RPC call to send the pages to the server
has completed, but the pages have not been flushed to disk by the
server. In this case we haven't even sent the pages to the server.
Instead the pages are accounted for in nr_dirty, and are tracked by the
internal NFS 'dirty request' lists. We also mark the inode as being
dirty in order to ensure that pdflush will kick off the actual RPC calls
if nobody else does so first.
Cheers,
Trond
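The states described above can be laid out as a small enum. This is a simplified sketch, not the NFS client's data structures: all names are invented, and the "in flight" and "clean" states are assumptions added to complete the picture around the two states the message actually describes.

```c
enum toy_nfs_page_state {
    TOY_DIRTY_QUEUED,   /* on the NFS dirty request list, counted in nr_dirty */
    TOY_RPC_IN_FLIGHT,  /* assumption: WRITE sent, reply not yet received */
    TOY_UNSTABLE,       /* WRITE completed, server has not flushed to disk */
    TOY_CLEAN           /* assumption: server has committed the data */
};

/* Advance a page to the next state in this simplified lifecycle. */
static enum toy_nfs_page_state toy_next(enum toy_nfs_page_state s)
{
    switch (s) {
    case TOY_DIRTY_QUEUED:  return TOY_RPC_IN_FLIGHT;
    case TOY_RPC_IN_FLIGHT: return TOY_UNSTABLE;
    case TOY_UNSTABLE:      return TOY_CLEAN;
    default:                return TOY_CLEAN;
    }
}
```

In these terms, a page returned from ->writepage() is still in TOY_DIRTY_QUEUED, which is why it is neither under writeback nor unstable at that point.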
On Wed, Nov 16, 2005 at 01:34:22PM -0500, Trond Myklebust wrote:
> Not really. The pages aren't flushed at this time. The point is to
> gather several pages and coalesce them into one over-the-wire RPC call.
> That means we cannot really do it from inside ->writepage().
>
> We do start the actual RPC calls in ->writepages(), though.
This is a problem we have in various filesystems. Except for really
bad OOM situations the filesystem should never get a writeout request
for a single file. We should really stop having ->writepage called by
the VM and move this kind of batching code into the VM. I'm running into
similar issues for XFS and unwritten/delayed extent conversion once again.
--- Trond Myklebust <[email protected]> wrote:
> OK, please back out the patch that I sent you, and try this one instead.
THAT'S IT!
Very nice.. 30MB+/sec sustained for several minutes..
only 25% system CPU and the new profile is:
samples % symbol name
1047754 32.9054 pci_conf1_write
193876 6.0888 _spin_lock_irqsave
152897 4.8018 skb_copy_bits
74745 2.3474 _spin_lock
73273 2.3012 __copy_from_user_ll
69273 2.1756 __lookup_tag
65084 2.0440 _spin_unlock_irqrestore
43803 1.3757 sub_preempt_count
32047 1.0065 tcp_v4_rcv
30895 0.9703 schedule
26161 0.8216 kfree
Thank you!
-Kenny
--- Trond Myklebust <[email protected]> wrote:
> OK, please back out the patch that I sent you, and try this one instead.
With jumbo frames, the profile is even happier:
(throughput is a little higher and CPU usage is a little lower too)
samples % symbol name
74463 12.1129 skb_copy_bits
30351 4.9372 __lookup_tag
24520 3.9887 _spin_lock
20353 3.3108 _spin_lock_irqsave
19306 3.1405 __copy_from_user_ll
15393 2.5040 __copy_user_zeroing_intel
10014 1.6290 isolate_lru_pages
9002 1.4644 sub_preempt_count
7997 1.3009 debug_smp_processor_id
7691 1.2511 schedule
6999 1.1385 shrink_list
6699 1.0897 tcp_sendmsg
6669 1.0848 radix_tree_delete
6532 1.0626 _write_lock_irqsave
6413 1.0432 __mod_page_state
6170 1.0037 acpi_safe_halt
Again... this is excellent.
So will this make 2.6.16? or can this be called a bug fix for 2.6.15?
-Kenny
Trond Myklebust <[email protected]> wrote:
>
> On Wed, 2005-11-16 at 10:00 -0800, Andrew Morton wrote:
>
> > That will fix it, but the PageWriteback accounting is still wrong.
> >
> > Is it not possible to use set_page_writeback()/end_page_writeback()?
>
> Not really. The pages aren't flushed at this time. The point is to
> gather several pages and coalesce them into one over-the-wire RPC call.
> That means we cannot really do it from inside ->writepage().
>
I still don't get it.
Once nfs_writepage() has been called, the page is conceptually "under
writeback", yes? In that, at some point in the future, it will be written
to backing store.
Hence it's perfectly appropriate to run set_page_writeback() within
nfs_writepage(). It's a matter of finding the right place for the
end_page_writeback().
On Wed, 2005-11-16 at 11:09 -0800, Andrew Morton wrote:
> Trond Myklebust <[email protected]> wrote:
> >
> > On Wed, 2005-11-16 at 10:00 -0800, Andrew Morton wrote:
> >
> > > That will fix it, but the PageWriteback accounting is still wrong.
> > >
> > > Is it not possible to use set_page_writeback()/end_page_writeback()?
> >
> > Not really. The pages aren't flushed at this time. The point is to
> > gather several pages and coalesce them into one over-the-wire RPC call.
> > That means we cannot really do it from inside ->writepage().
> >
>
> I still don't get it.
>
> Once nfs_writepage() has been called, the page is conceptually "under
> writeback", yes? In that, at some point in the future, it will be written
> to backing store.
>
> Hence it's perfectly appropriate to run set_page_writeback() within
> nfs_writepage(). It's a matter of finding the right place for the
> end_page_writeback().
The point is that the process of flushing has not been started at that
time, so anybody that calls wait_on_page_writeback() immediately after
calling writepage() may end up waiting for a very long time indeed
(probably until the next pdflush).
Cheers,
Trond
I tried the same test, but instead of ftruncate64, I simply did a pwrite64 to get the file
extended... and got 40M+ with mostly outbound traffic, and much less CPU usage.....
Unfortunately, once my test file hit 4295065601, Bad Things (TM) started to happen. The system
time went to 100% of a CPU, and the nfs traffic on that mount stopped.
I got an oprofile of the spinning system:
samples % symbol name
301039 27.9748 zap_pte_range
156234 14.5184 unmap_vmas
111760 10.3856 __bitmap_weight
103624 9.6295 _spin_lock
97063 9.0198 unmap_page_range
67011 6.2272 unmap_mapping_range
59382 5.5182 sub_preempt_count
51258 4.7633 zap_page_range
25235 2.3450 page_address
16768 1.5582 unmap_mapping_range_vma
13257 1.2319 debug_smp_processor_id
11594 1.0774 add_preempt_count
I also seem unable to kill the test process.
Any ideas? (2**32 file size issue somewhere?)
-Kenny
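For reference, the arithmetic on that file size is easy to check. The helpers below are hypothetical and only show where 4295065601 sits relative to 2**32 and what it would become if squeezed through a 32-bit variable; they make no claim about where, or whether, such a truncation happens in the kernel.

```c
#include <stdint.h>

/* How far past 2**32 a file size is (0 if it is not past it). */
static uint64_t bytes_past_4g(uint64_t size)
{
    const uint64_t four_g = UINT64_C(1) << 32;
    return size > four_g ? size - four_g : 0;
}

/* What the size becomes if it is stored in a 32-bit variable. */
static uint32_t truncated_to_32(uint64_t size)
{
    return (uint32_t)size;
}
```

Both functions return 98305 for 4295065601, i.e. the file was 96k plus one byte past the 4GB boundary (three 32768-byte writes and one more byte) when things went wrong.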
--- Kenny Simpson <[email protected]> wrote:
> I also seem unable to kill the test process.
Root is also unable to kill the process (even -9).
top shows its status as R.
-Kenny
On Wed, 2005-11-16 at 12:56 -0800, Kenny Simpson wrote:
> I tried the same test, but instead of ftruncate64, I simply did a pwrite64 to get the file
> extended... and got 40M+ with mostly outbound traffic, and much less CPU usage.....
>
> Unfortunately, once my test file hit 4295065601, Bad Things (TM) started to happen. The system
> time went to 100% of a CPU, and the nfs traffic on that mount stopped.
>
> I got an oprofile of the spinning system:
> samples % symbol name
> 301039 27.9748 zap_pte_range
> 156234 14.5184 unmap_vmas
> 111760 10.3856 __bitmap_weight
> 103624 9.6295 _spin_lock
> 97063 9.0198 unmap_page_range
> 67011 6.2272 unmap_mapping_range
> 59382 5.5182 sub_preempt_count
> 51258 4.7633 zap_page_range
> 25235 2.3450 page_address
> 16768 1.5582 unmap_mapping_range_vma
> 13257 1.2319 debug_smp_processor_id
> 11594 1.0774 add_preempt_count
>
> I also seem unable to kill the test process.
>
> Any ideas? (2**32 file size issue somewhere?)
Is this NFSv2?
Cheers,
Trond
--- Trond Myklebust <[email protected]> wrote:
>
> Is this NFSv2?
>
> Cheers,
> Trond
>
Not according to mount(1):
(rw,vers=3,tcp,rsize=32768,wsize=32768,hard,intr,addr=x.x.x.x)
-Kenny
Trond Myklebust <[email protected]> wrote:
>
> On Wed, 2005-11-16 at 11:09 -0800, Andrew Morton wrote:
> > Trond Myklebust <[email protected]> wrote:
> > >
> > > On Wed, 2005-11-16 at 10:00 -0800, Andrew Morton wrote:
> > >
> > > > That will fix it, but the PageWriteback accounting is still wrong.
> > > >
> > > > Is it not possible to use set_page_writeback()/end_page_writeback()?
> > >
> > > Not really. The pages aren't flushed at this time. The point is to
> > > gather several pages and coalesce them into one over-the-wire RPC call.
> > > That means we cannot really do it from inside ->writepage().
> > >
> >
> > I still don't get it.
> >
> > Once nfs_writepage() has been called, the page is conceptually "under
> > writeback", yes? In that, at some point in the future, it will be written
> > to backing store.
> >
> > Hence it's perfectly appropriate to run set_page_writeback() within
> > nfs_writepage(). It's a matter of finding the right place for the
> > end_page_writeback().
>
> The point is that the process of flushing has not been started at that
> time, so anybody that calls wait_on_page_writeback() immediately after
> calling writepage() may end up waiting for a very long time indeed
> (probably until the next pdflush).
But block-backed filesystems have the same concern: we don't want to do a
whole bunch of 4k I/Os. Hence the writepages() interface, which is the
appropriate place to be building up these large I/Os.
NFS does nfs_writepages->mpage_writepages->nfs_writepage and to build the
large I/Os it leaves the I/O pending on return from nfs_writepage(). It
appears to flush any pending pages on the exit path from nfs_writepages().
If that's a correct reading then there doesn't appear to be any way in
which there's dangling I/O left to do after nfs_writepages() completes.
If there _is_ dangling I/O left over then that's problematic, and probably
doesn't buy us much in the way of performance benefit.
--- Trond Myklebust <[email protected]> wrote:
>
> Is this NFSv2?
>
> Cheers,
> Trond
>
This is reproducible with O_DIRECT, but not without.
The profile looks the same:
samples % symbol name
647042 28.4114 zap_pte_range
572195 25.1249 unmap_mapping_range
324291 14.2395 _spin_lock
139259 6.1148 __bitmap_weight
137048 6.0177 zap_page_range
104614 4.5936 unmap_mapping_range_vma
63406 2.7841 debug_smp_processor_id
48906 2.1474 sub_preempt_count
46090 2.0238 unmap_vmas
27966 1.2280 add_preempt_count
23224 1.0198 invalidate_inode_pages2_range
21676 0.9518 unmap_page_range
17825 0.7827 _spin_unlock
I've had mixed results with a local ext3 file with the same test. One run had a 37 second delay
while crossing 4GB, another happily went by without incident.
-Kenny
On Wed, 2005-11-16 at 13:31 -0800, Andrew Morton wrote:
> But block-backed filesystems have the same concern: we don't want to do a
> whole bunch of 4k I/Os. Hence the writepages() interface, which is the
> appropriate place to be building up these large I/Os.
>
> NFS does nfs_writepages->mpage_writepages->nfs_writepage and to build the
> large I/Os it leaves the I/O pending on return from nfs_writepage(). It
> appears to flush any pending pages on the exit path from nfs_writepages().
>
> If that's a correct reading then there doesn't appear to be any way in
> which there's dangling I/O left to do after nfs_writepages() completes.
Agreed. AFAICS, nfs_writepages should be quite OK, however writepage()
on its own _is_ problematic.
Look at the usage in write_one_page(), which calls directly down to
->writepage(), and then immediately does a wait_on_page_writeback().
How is the filesystem supposed to distinguish between the cases
"VM->writepage()", and "VM->writepages->mpage_writepages->writepage()"?
Cheers,
Trond
On Wed, 2005-11-16 at 13:41 -0800, Kenny Simpson wrote:
> --- Trond Myklebust <[email protected]> wrote:
> >
> > Is this NFSv2?
> >
> > Cheers,
> > Trond
> >
> This is reproducible with O_DIRECT, but not without.
I'm getting lost here. Please could you spell out the testcases that are
not working.
Are you saying that the combination mmap() + pwrite64() fails on
O_DIRECT, but works on ordinary open, and that mmap() + ftruncate64()
always works?
Cheers,
Trond
--- Trond Myklebust <[email protected]> wrote:
> I'm getting lost here. Please could you spell out the testcases that are
> not working.
>
> Are you saying that the combination mmap() + pwrite64() fails on
> O_DIRECT, but works on ordinary open, and that mmap() + ftruncate64()
> always works?
>
> Cheers,
> Trond
>
ftruncate64 works with O_DIRECT
ftruncate64 works w/o O_DIRECT
pwrite64 FAILS with O_DIRECT at ~4GB
pwrite64 works w/o O_DIRECT.
I am re-running these tests to confirm (could take a minute).
All opens are with O_RDWR | O_CREAT | O_LARGEFILE.
All tests were run over GbE with jumbo frames (8160 MTU) to a netapp filer (via crossover cable).
-Kenny
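The pwrite64 variant differs from the ftruncate64 one only in how the file is extended before the window is mapped. A minimal sketch of that step follows. It is an assumption about how such an extension is typically written, not the code from the test program, and it ignores any buffer or offset alignment that O_DIRECT may require on a given filesystem.

```c
#define _FILE_OFFSET_BITS 64   /* 64-bit off_t so offsets can pass 4GB */
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Extend the file to `new_size` bytes by writing a single zero byte at its
 * last offset. Returns 0 on success, -1 on error.
 */
static int extend_with_pwrite(int fd, off_t new_size)
{
    const char zero = 0;

    if (new_size <= 0)
        return -1;
    return pwrite(fd, &zero, 1, new_size - 1) == 1 ? 0 : -1;
}
```

Unlike ftruncate(), this sends real data through the write path, so the two variants exercise different code even though the resulting file size is the same.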
Trond Myklebust <[email protected]> wrote:
>
> On Wed, 2005-11-16 at 13:31 -0800, Andrew Morton wrote:
> > But block-backed filesystems have the same concern: we don't want to do a
> > whole bunch of 4k I/Os. Hence the writepages() interface, which is the
> > appropriate place to be building up these large I/Os.
> >
> > NFS does nfs_writepages->mpage_writepages->nfs_writepage and to build the
> > large I/Os it leaves the I/O pending on return from nfs_writepage(). It
> > appears to flush any pending pages on the exit path from nfs_writepages().
> >
> > If that's a correct reading then there doesn't appear to be any way in
> > which there's dangling I/O left to do after nfs_writepages() completes.
>
> Agreed. AFAICS, nfs_writepages should be quite OK, however writepage()
> on its own _is_ problematic.
>
> Look at the usage in write_one_page(), which calls directly down to
> ->writepage(), and then immediately does a wait_on_page_writeback().
>
> How is the filesystem supposed to distinguish between the cases
> "VM->writepage()", and "VM->writepages->mpage_writepages->writepage()"?
>
Via the writeback_control, hopefully.
For write_one_page(), sync_mode==WB_SYNC_ALL, so NFS should start the I/O
immediately (it appears to not do so).
For vmscan->writepage, wbc->for_reclaim is set, so we know that the IO
should be pushed immediately. nfs_writepage() seems to dtrt here.
With the proposed changes, we don't need that iput() in nfs_writepage().
That worries me because I recall from a couple of years back that there are
really subtle races with doing iput() on the vmscan->writepage() path.
Cannot remember what they were though...
On Wed, 2005-11-16 at 14:10 -0800, Andrew Morton wrote:
> > How is the filesystem supposed to distinguish between the cases
> > "VM->writepage()", and "VM->writepages->mpage_writepages->writepage()"?
> >
>
> Via the writeback_control, hopefully.
>
> For write_one_page(), sync_mode==WB_SYNC_ALL, so NFS should start the I/O
> immediately (it appears to not do so).
Sorry, but so does filemap_fdatawrite(). WB_SYNC_ALL clearly does not
discriminate between a writepages() and a single writepage() situation,
whatever the original intention was.
> For vmscan->writepage, wbc->for_reclaim is set, so we know that the IO
> should be pushed immediately. nfs_writepage() seems to dtrt here.
>
> With the proposed changes, we don't need that iput() in nfs_writepage().
> That worries me because I recall from a couple of years back that there are
> really subtle races with doing iput() on the vmscan->writepage() path.
> Cannot remember what they were though...
Possibly to do with block filesystems that may trigger ->writepage()
while inside iput_final()? NFS can't do that.
Cheers,
Trond
On Wed, 2005-11-16 at 17:23 -0500, Trond Myklebust wrote:
> On Wed, 2005-11-16 at 14:10 -0800, Andrew Morton wrote:
>
> > > How is the filesystem supposed to distinguish between the cases
> > > "VM->writepage()", and "VM->writepages->mpage_writepages->writepage()"?
> > >
> >
> > Via the writeback_control, hopefully.
> >
> > For write_one_page(), sync_mode==WB_SYNC_ALL, so NFS should start the I/O
> > immediately (it appears to not do so).
>
> Sorry, but so does filemap_fdatawrite(). WB_SYNC_ALL clearly does not
> discriminate between a writepages() and a single writepage() situation,
> whatever the original intention was.
IMHO, the correct way to distinguish between the two would be to use the
wbc->nr_to_write field. If all the instances of writepage() were to set
that field to '1', then the filesystems could do the right thing.
As it is, you have shrink_list() that sets it to the value
"SWAP_CLUSTER_MAX" for no apparent reason...
Cheers,
Trond
--- Trond Myklebust <[email protected]> wrote:
> I'm getting lost here. Please could you spell out the testcases that are
> not working.
I've redone my test cases and have confirmed that O_DIRECT with pwrite64 triggers the bad
condition.
The cases that are fine are:
pwrite64
ftruncate with O_DIRECT
ftruncate
Also, when the system is in this state, if I try to 'ls' the file,
the 'ls' process becomes stuck in state D in sync_page. stracing the 'ls'
shows it is in a call to stat64.
-Kenny
Trond Myklebust <[email protected]> wrote:
>
> On Wed, 2005-11-16 at 14:10 -0800, Andrew Morton wrote:
>
> > > How is the filesystem supposed to distinguish between the cases
> > > "VM->writepage()", and "VM->writepages->mpage_writepages->writepage()"?
> > >
> >
> > Via the writeback_control, hopefully.
> >
> > For write_one_page(), sync_mode==WB_SYNC_ALL, so NFS should start the I/O
> > immediately (it appears to not do so).
>
> Sorry, but so does filemap_fdatawrite(). WB_SYNC_ALL clearly does not
> discriminate between a writepages() and a single writepage() situation,
> whatever the original intention was.
Could peek at wbc->nr_to_write, or add another boolean to writeback_control
for this.
diff -puN include/linux/writeback.h~writeback_control-flag-writepages include/linux/writeback.h
--- devel/include/linux/writeback.h~writeback_control-flag-writepages 2005-11-16 14:43:52.000000000 -0800
+++ devel-akpm/include/linux/writeback.h 2005-11-16 14:43:52.000000000 -0800
@@ -53,10 +53,11 @@ struct writeback_control {
loff_t start;
loff_t end;
- unsigned nonblocking:1; /* Don't get stuck on request queues */
- unsigned encountered_congestion:1; /* An output: a queue is full */
- unsigned for_kupdate:1; /* A kupdate writeback */
- unsigned for_reclaim:1; /* Invoked from the page allocator */
+ unsigned nonblocking:1; /* Don't get stuck on request queues */
+ unsigned encountered_congestion:1; /* An output: a queue is full */
+ unsigned for_kupdate:1; /* A kupdate writeback */
+ unsigned for_reclaim:1; /* Invoked from the page allocator */
+ unsigned for_writepages:1; /* This is a writepages() call */
};
/*
diff -puN mm/page-writeback.c~writeback_control-flag-writepages mm/page-writeback.c
--- devel/mm/page-writeback.c~writeback_control-flag-writepages 2005-11-16 14:43:52.000000000 -0800
+++ devel-akpm/mm/page-writeback.c 2005-11-16 14:43:52.000000000 -0800
@@ -550,11 +550,17 @@ void __init page_writeback_init(void)
int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
{
+ int ret;
+
if (wbc->nr_to_write <= 0)
return 0;
+ wbc->for_writepages = 1;
if (mapping->a_ops->writepages)
- return mapping->a_ops->writepages(mapping, wbc);
- return generic_writepages(mapping, wbc);
+ ret = mapping->a_ops->writepages(mapping, wbc);
+ else
+ ret = generic_writepages(mapping, wbc);
+ wbc->for_writepages = 0;
+ return ret;
}
/**
_
> > For vmscan->writepage, wbc->for_reclaim is set, so we know that the IO
> > should be pushed immediately. nfs_writepage() seems to dtrt here.
> >
> > With the proposed changes, we don't need that iput() in nfs_writepage().
> > That worries me because I recall from a couple of years back that there are
> > really subtle races with doing iput() on the vmscan->writepage() path.
> > Cannot remember what they were though...
>
> Possibly to do with block filesystems that may trigger ->writepage()
> while inside iput_final()? NFS can't do that.
iput_final() can call truncate_inode_pages - maybe it was a deadlock, but
I'm fairly sure it was a race.
Trond Myklebust <[email protected]> wrote:
>
> On Wed, 2005-11-16 at 17:23 -0500, Trond Myklebust wrote:
> > On Wed, 2005-11-16 at 14:10 -0800, Andrew Morton wrote:
> >
> > > > How is the filesystem supposed to distinguish between the cases
> > > > "VM->writepage()", and "VM->writepages->mpage_writepages->writepage()"?
> > > >
> > >
> > > Via the writeback_control, hopefully.
> > >
> > > For write_one_page(), sync_mode==WB_SYNC_ALL, so NFS should start the I/O
> > > immediately (it appears to not do so).
> >
> > Sorry, but so does filemap_fdatawrite(). WB_SYNC_ALL clearly does not
> > discriminate between a writepages() and a single writepage() situation,
> > whatever the original intention was.
>
> IMHO, the correct way to distinguish between the two would be to use the
> wbc->nr_to_write field. If all the instances of writepage() were to set
> that field to '1', then the filesystems could do the right thing.
yes, except ->writepages is supposed to decrement nr_to_write as it proceeds,
so it'll end up at `1' by accident on the last go around the loop.
I think a separate boolean is better - it's just a single bit.
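
The objection above can be illustrated with a small userspace model. This is only a sketch: struct wbc_model, looks_like_single_writepage() and count_false_positives() are hypothetical names, and the loop merely imitates a ->writepages() implementation that decrements nr_to_write after each page.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the one writeback_control field used here. */
struct wbc_model {
	long nr_to_write;	/* pages the caller still wants written */
};

/* The heuristic under discussion: nr_to_write == 1 means "lone ->writepage()". */
static bool looks_like_single_writepage(const struct wbc_model *wbc)
{
	return wbc->nr_to_write == 1;
}

/*
 * Model of a ->writepages() loop over dirty_pages pages with a given budget.
 * Returns how many pages inside the batch the heuristic mistook for
 * single-page calls.
 */
static int count_false_positives(long budget, int dirty_pages)
{
	struct wbc_model wbc = { .nr_to_write = budget };
	int misclassified = 0;

	assert(budget >= 0 && dirty_pages >= 0);
	for (int i = 0; i < dirty_pages && wbc.nr_to_write > 0; i++) {
		if (looks_like_single_writepage(&wbc))
			misclassified++;
		wbc.nr_to_write--;	/* decrement as the batch proceeds */
	}
	return misclassified;
}
```

In this model a batch that exhausts its budget always presents its last page with nr_to_write == 1, which is the accidental match described above; a batch that runs out of dirty pages first never does.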
> As it is, you have shrink_list() that sets it to the value
> "SWAP_CLUSTER_MAX" for no apparent reason...
How weird. That's presumably wrong, but I'd need to check the changelogs
to doublecheck. ugh, 264 of them.
On Wed, 2005-11-16 at 14:39 -0800, Kenny Simpson wrote:
> --- Trond Myklebust <[email protected]> wrote:
> > I'm getting lost here. Please could you spell out the testcases that are
> > not working.
>
> I've redone my test cases and have confirmed that O_DIRECT with pwrite64 triggers the bad
> condition.
>
> The cases that are fine are:
> pwrite64
> ftruncate with O_DIRECT
> ftruncate
>
> Also, when the system is in this state, if I try to 'ls' the file,
> the 'ls' process becomes stuck in state D in sync_page. stracing the 'ls'
> shows it is in a call to stat64.
>
> -Kenny
Chuck, can you take a look at this?
Kenny is seeing a hang when using pwrite64() on an O_DIRECT file
and the file size exceeds 4GB. Server is a NetApp filer w/ NFSv3.
I had a quick look at nfs_file_direct_write(), and among other things,
it would appear that it is not doing any of the usual overflow checks on
*pos and the count size (see generic_write_checks()). In particular,
checks are missing against overflow vs. MAX_NON_LFS if O_LARGEFILE is
not set (and also against overflow vs. s_maxbytes, but that is less
relevant here).
Cheers,
Trond
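
For readers unfamiliar with the checks mentioned above, here is a simplified userspace sketch of the idea. It is not the kernel's generic_write_checks(): write_checks_model() and MODEL_MAX_NON_LFS are hypothetical names, the rlimit handling and SIGXFSZ delivery are left out, and the 2^31 - 1 limit is stated as an assumption about MAX_NON_LFS.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed value of the non-LFS limit: the largest signed 32-bit offset. */
#define MODEL_MAX_NON_LFS ((INT64_C(1) << 31) - 1)

/*
 * Clamp or reject a write of *count bytes at offset pos.
 * largefile models whether the file was opened with O_LARGEFILE;
 * maxbytes models the filesystem's s_maxbytes.
 * Returns 0 (possibly shortening *count) or a negative errno.
 */
static int write_checks_model(int64_t pos, uint64_t *count,
			      bool largefile, int64_t maxbytes)
{
	if (pos < 0)
		return -EINVAL;
	if (*count == 0)
		return 0;

	if (!largefile) {
		/* Without O_LARGEFILE nothing may be written at or past the limit. */
		if (pos >= MODEL_MAX_NON_LFS)
			return -EFBIG;
		if (*count > (uint64_t)(MODEL_MAX_NON_LFS - pos))
			*count = (uint64_t)(MODEL_MAX_NON_LFS - pos);
	}

	/* Same pattern against the filesystem-wide maximum file size. */
	if (pos >= maxbytes)
		return -EFBIG;
	if (*count > (uint64_t)(maxbytes - pos))
		*count = (uint64_t)(maxbytes - pos);
	return 0;
}
```

With largefile set, a one-byte write at an offset just under 4GB passes this model unchanged; without it, the same write is rejected with -EFBIG.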
On Wed, 2005-11-16 at 14:44 -0800, Andrew Morton wrote:
> Could peek at wbc->nr_to_write, or add another boolean to writeback_control
> for this.
>
> diff -puN include/linux/writeback.h~writeback_control-flag-writepages include/linux/writeback.h
> --- devel/include/linux/writeback.h~writeback_control-flag-writepages 2005-11-16 14:43:52.000000000 -0800
> +++ devel-akpm/include/linux/writeback.h 2005-11-16 14:43:52.000000000 -0800
> @@ -53,10 +53,11 @@ struct writeback_control {
> loff_t start;
> loff_t end;
>
> - unsigned nonblocking:1; /* Don't get stuck on request queues */
> - unsigned encountered_congestion:1; /* An output: a queue is full */
> - unsigned for_kupdate:1; /* A kupdate writeback */
> - unsigned for_reclaim:1; /* Invoked from the page allocator */
> + unsigned nonblocking:1; /* Don't get stuck on request queues */
> + unsigned encountered_congestion:1; /* An output: a queue is full */
> + unsigned for_kupdate:1; /* A kupdate writeback */
> + unsigned for_reclaim:1; /* Invoked from the page allocator */
> + unsigned for_writepages:1; /* This is a writepages() call */
> };
>
> /*
> diff -puN mm/page-writeback.c~writeback_control-flag-writepages mm/page-writeback.c
> --- devel/mm/page-writeback.c~writeback_control-flag-writepages 2005-11-16 14:43:52.000000000 -0800
> +++ devel-akpm/mm/page-writeback.c 2005-11-16 14:43:52.000000000 -0800
> @@ -550,11 +550,17 @@ void __init page_writeback_init(void)
>
> int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
> {
> + int ret;
> +
> if (wbc->nr_to_write <= 0)
> return 0;
> + wbc->for_writepages = 1;
> if (mapping->a_ops->writepages)
> - return mapping->a_ops->writepages(mapping, wbc);
> - return generic_writepages(mapping, wbc);
> + ret = mapping->a_ops->writepages(mapping, wbc);
> + else
> + ret = generic_writepages(mapping, wbc);
> + wbc->for_writepages = 0;
> + return ret;
> }
That would work...
> > > For vmscan->writepage, wbc->for_reclaim is set, so we know that the IO
> > > should be pushed immediately. nfs_writepage() seems to dtrt here.
> > >
> > > With the proposed changes, we don't need that iput() in nfs_writepage().
> > > That worries me because I recall from a couple of years back that there are
> > > really subtle races with doing iput() on the vmscan->writepage() path.
> > > Cannot remember what they were though...
> >
> > Possibly to do with block filesystems that may trigger ->writepage()
> > while inside iput_final()? NFS can't do that.
>
> iput_final() can call truncate_inode_pages - maybe it was a deadlock, but
> I'm fairly sure it was a race.
Doesn't matter. There can be no dirty pages when NFS hits iput_final().
We make sure that we flush them into the filesystem accounting before we
release the file descriptor, then we make sure that we don't release the
dentry before the inode has been synced up.
Cheers,
Trond
On Wed, 2005-11-16 at 14:44 -0800, Andrew Morton wrote:
> diff -puN include/linux/writeback.h~writeback_control-flag-writepages include/linux/writeback.h
> --- devel/include/linux/writeback.h~writeback_control-flag-writepages 2005-11-16 14:43:52.000000000 -0800
> +++ devel-akpm/include/linux/writeback.h 2005-11-16 14:43:52.000000000 -0800
> @@ -53,10 +53,11 @@ struct writeback_control {
> loff_t start;
> loff_t end;
>
> - unsigned nonblocking:1; /* Don't get stuck on request queues */
> - unsigned encountered_congestion:1; /* An output: a queue is full */
> - unsigned for_kupdate:1; /* A kupdate writeback */
> - unsigned for_reclaim:1; /* Invoked from the page allocator */
> + unsigned nonblocking:1; /* Don't get stuck on request queues */
> + unsigned encountered_congestion:1; /* An output: a queue is full */
> + unsigned for_kupdate:1; /* A kupdate writeback */
> + unsigned for_reclaim:1; /* Invoked from the page allocator */
> + unsigned for_writepages:1; /* This is a writepages() call */
> };
>
> /*
> diff -puN mm/page-writeback.c~writeback_control-flag-writepages mm/page-writeback.c
> --- devel/mm/page-writeback.c~writeback_control-flag-writepages 2005-11-16 14:43:52.000000000 -0800
> +++ devel-akpm/mm/page-writeback.c 2005-11-16 14:43:52.000000000 -0800
> @@ -550,11 +550,17 @@ void __init page_writeback_init(void)
>
> int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
> {
> + int ret;
> +
> if (wbc->nr_to_write <= 0)
> return 0;
> + wbc->for_writepages = 1;
> if (mapping->a_ops->writepages)
> - return mapping->a_ops->writepages(mapping, wbc);
> - return generic_writepages(mapping, wbc);
> + ret = mapping->a_ops->writepages(mapping, wbc);
> + else
> + ret = generic_writepages(mapping, wbc);
> + wbc->for_writepages = 0;
> + return ret;
> }
The accompanying NFS patch makes use of this in order to figure out when
to flush the data correctly.
-------------
NFS: Work correctly with single-page ->writepage() calls
Ensure that we use set_page_writeback() in the appropriate places
to help the VM in keeping its page radix_tree in sync.
Ensure that we always initiate flushing of data before we exit
a single-page ->writepage() call.
Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfs/write.c | 22 +++++++++-------------
1 files changed, 9 insertions(+), 13 deletions(-)
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 8f71e76..95d00f9 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -189,6 +189,7 @@ static int nfs_writepage_sync(struct nfs
(long long)NFS_FILEID(inode),
count, (long long)(page_offset(page) + offset));
+ set_page_writeback(page);
nfs_begin_data_update(inode);
do {
if (count < wsize)
@@ -221,6 +222,7 @@ static int nfs_writepage_sync(struct nfs
io_error:
nfs_end_data_update(inode);
+ end_page_writeback(page);
nfs_writedata_free(wdata);
return written ? written : result;
}
@@ -230,19 +232,16 @@ static int nfs_writepage_async(struct nf
unsigned int offset, unsigned int count)
{
struct nfs_page *req;
- int status;
req = nfs_update_request(ctx, inode, page, offset, count);
- status = (IS_ERR(req)) ? PTR_ERR(req) : 0;
- if (status < 0)
- goto out;
+ if (IS_ERR(req))
+ return PTR_ERR(req);
/* Update file length */
nfs_grow_file(page, offset, count);
/* Set the PG_uptodate flag? */
nfs_mark_uptodate(page, offset, count);
nfs_unlock_request(req);
- out:
- return status;
+ return 0;
}
static int wb_priority(struct writeback_control *wbc)
@@ -302,11 +301,8 @@ do_it:
lock_kernel();
if (!IS_SYNC(inode) && inode_referenced) {
err = nfs_writepage_async(ctx, inode, page, 0, offset);
- if (err >= 0) {
- err = 0;
- if (wbc->for_reclaim)
- nfs_flush_inode(inode, 0, 0, FLUSH_STABLE);
- }
+ if (!wbc->for_writepages)
+ nfs_flush_inode(inode, 0, 0, wb_priority(wbc));
} else {
err = nfs_writepage_sync(ctx, inode, page, 0,
offset, priority);
@@ -929,7 +925,7 @@ static int nfs_flush_multi(struct list_h
atomic_set(&req->wb_complete, requests);
ClearPageError(page);
- SetPageWriteback(page);
+ set_page_writeback(page);
offset = 0;
nbytes = req->wb_bytes;
do {
@@ -992,7 +988,7 @@ static int nfs_flush_one(struct list_hea
nfs_list_remove_request(req);
nfs_list_add_request(req, &data->pages);
ClearPageError(req->wb_page);
- SetPageWriteback(req->wb_page);
+ set_page_writeback(req->wb_page);
*pages++ = req->wb_page;
count += req->wb_bytes;
}
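
The interaction between the two patches can be sketched in userspace as follows. This is a toy model under stated assumptions, not kernel code: struct wbc_model, struct inode_model, flush_model(), writepage_model() and do_writepages_model() are hypothetical names, and "flushing" is reduced to counting how often queued requests are pushed out.

```c
#include <assert.h>

/* Hypothetical stand-in for the writeback_control fields used here. */
struct wbc_model {
	long nr_to_write;
	unsigned for_writepages:1;	/* set only inside do_writepages_model() */
};

/* Hypothetical per-inode bookkeeping. */
struct inode_model {
	int pending;	/* requests queued but not yet sent */
	int flushes;	/* how many times queued requests were pushed out */
};

/* Send everything that is queued as one coalesced operation. */
static void flush_model(struct inode_model *inode)
{
	if (inode->pending > 0) {
		inode->pending = 0;
		inode->flushes++;
	}
}

/* Queue one page; flush at once unless a batching caller will do it later. */
static int writepage_model(struct inode_model *inode, struct wbc_model *wbc)
{
	inode->pending++;
	if (!wbc->for_writepages)
		flush_model(inode);
	return 0;
}

/* Batch path: mark the wbc, queue pages, flush once on the way out. */
static int do_writepages_model(struct inode_model *inode,
			       struct wbc_model *wbc, int dirty_pages)
{
	if (wbc->nr_to_write <= 0)
		return 0;
	assert(!wbc->for_writepages);	/* the flag is owned by this function */
	wbc->for_writepages = 1;
	for (int i = 0; i < dirty_pages && wbc->nr_to_write > 0; i++) {
		writepage_model(inode, wbc);
		wbc->nr_to_write--;
	}
	flush_model(inode);
	wbc->for_writepages = 0;
	return 0;
}
```

In this model a batch of eight pages produces a single flush, while a lone writepage_model() call with the flag clear flushes before it returns, so no request is left waiting for a later pass.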
Trond Myklebust <[email protected]> wrote:
>
> The accompanying NFS patch makes use of this in order to figure out when
> to flush the data correctly.
OK. So with that patch, nfs_writepages() may still leave I/O pending,
uninitiated, yes?
I don't understand why NFS hasn't been BUGging as it stands at present. It
has several end_page_writeback() calls but no set_page_writeback()s.
end_page_writeback() or rotate_reclaimable_page() will go BUG if the page
wasn't PageWriteback().
On Wed, 2005-11-16 at 16:25 -0800, Andrew Morton wrote:
> Trond Myklebust <[email protected]> wrote:
> >
> > The accompanying NFS patch makes use of this in order to figure out when
> > to flush the data correctly.
>
> OK. So with that patch, nfs_writepages() may still leave I/O pending,
> uninitiated, yes?
>
> I don't understand why NFS hasn't been BUGging as it stands at present. It
> has several end_page_writeback() calls but no set_page_writeback()s.
> end_page_writeback() or rotate_reclaimable_page() will go BUG if the page
> wasn't PageWriteback().
It does have SetPageWriteback() calls in the asynchronous writeback
path. As you can see from the patch I just sent, I only needed to
replace them with set_page_writeback() calls.
Cheers,
Trond
Trond Myklebust <[email protected]> wrote:
>
> On Wed, 2005-11-16 at 16:25 -0800, Andrew Morton wrote:
> > Trond Myklebust <[email protected]> wrote:
> > >
> > > The accompanying NFS patch makes use of this in order to figure out when
> > > to flush the data correctly.
> >
> > OK. So with that patch, nfs_writepages() may still leave I/O pending,
> > uninitiated, yes?
This?
I don't know if it'll be a problem. One factor is that when the VFS is
doing an fsync() or whatever, it will fail to notice these left-over pages
are "dirty", so it won't launch writepage() against them.
But if they are marked PageWriteback(), sync will notice them on the second
pass and will wait upon them, which apparently could mean a stall until
pdflush kicks off the I/O?
If they're not marked PageDirty() or PageWriteback(), the VFS will miss
them altogether during the sync. But perhaps NFS's own page tracking will
flush them and wait upon the result?
> > I don't understand why NFS hasn't been BUGging as it stands at present. It
> > has several end_page_writeback() calls but no set_page_writeback()s.
> > end_page_writeback() or rotate_reclaimable_page() will go BUG if the page
> > wasn't PageWriteback().
>
> It does have SetPageWriteback() calls in the asynchronous writeback
> path. As you can see from the patch I just sent, I only needed to
> replace them with set_page_writeback() calls.
Ah, OK. Things are improved.
On Wed, 2005-11-16 at 16:38 -0800, Andrew Morton wrote:
> I don't know if it'll be a problem. One factor is that when the VFS is
> doing an fsync() or whatever, it will fail to notice these left-over pages
> are "dirty", so it won't launch writepage() against them.
That doesn't matter. They are being tracked by the NFS client. We don't
want anyone to call writepage() against them again because that will
cause them to be written out twice.
> But if they are marked PageWriteback(), sync will notice them on the second
> pass and will wait upon them, which apparently could mean a stall until
> pdflush kicks off the I/O?
>
> If they're not marked PageDirty() or PageWriteback(), the VFS will miss
> them altogether during the sync. But perhaps NFS's own page tracking will
> flush them and wait upon the result?
Yes. There is no chance of data loss (unless someone physically pulls
the plug on the client - there's no protecting against that).
Note that writepages() will normally end up calling nfs_flush_inode().
It will only fail to do so if
- generic_writepages() returns an error
or
- there is write congestion, and wbc->nonblocking is set.
Cheers,
Trond
Christoph Hellwig writes:
> On Wed, Nov 16, 2005 at 01:34:22PM -0500, Trond Myklebust wrote:
> > Not really. The pages aren't flushed at this time. The point is to
> > gather several pages and coalesce them into one over-the-wire RPC call.
> > That means we cannot really do it from inside ->writepage().
> >
> > We do start the actual RPC calls in ->writepages(), though.
>
> This is a problem we have in various filesystems. Except for really
> bad OOM situations the filesystem should never get a writeout request
> for a single file. We should really stop having ->writepage called by
> the VM and move this kind of batching code into the VM. I'm running into
> similar issues for XFS and unwritten/delayed extent conversion once again.
A simplistic version of such batching is implemented in the patch below
(also available at
http://linuxhacker.ru/~nikita/patches/2.6.15-rc1/05-cluster-pageout.patch;
it depends on the page_referenced-move-dirty patch from the same place).
This version pokes into address_space radix tree to find a cluster of
pages suitable for page-out and then calls ->writepage() on pages in
that cluster in the proper order. This relies on the underlying layer
(e.g., block device) to perform request coalescing.
My earlier attempts to do this through ->writepages() were all racy,
because at some point ->writepages() has to release a lock at the
original page around which the cluster is built, and that lock is the
only thing that protects inode/address_space from the destruction. As
was already noted by Andrew, one cannot use igrab/iput in the VM scanner
to deal with that.
I still think it's possible to do higher layer batching, but that would
require more extensive changes to both VM scanner and ->writepages().
Nikita.
--
Implement pageout clustering at the VM level.
With this patch the VM scanner calls pageout_cluster() instead of
->writepage(). pageout_cluster() tries to find a group of dirty pages around
the target page, called the "pivot" page of the cluster. If a group of suitable
size is found, ->writepages() is called for it; otherwise, pageout_cluster()
falls back to ->writepage().
This is supposed to help in work-loads with significant page-out of
file-system pages from tail of the inactive list (for example, heavy dirtying
through mmap), because file system usually writes multiple pages more
efficiently. Should also be advantageous for file-systems doing delayed
allocation, as in this case they will allocate whole extents at once.
Few points:
- swap-cache pages are not clustered (although they can be, but by
page->private rather than page->index)
- only kswapd does clustering, because direct reclaim path should be low
latency.
- Original version of this patch added new fields to struct writeback_control
and expected ->writepages() to interpret them. This led to hard-to-fix races
against inode reclamation. Current version simply calls ->writepage() in the
"correct" order, i.e., in the order of increasing page indices.
Signed-off-by: Nikita Danilov <[email protected]>
mm/shmem.c | 14 ++++++-
mm/vmscan.c | 112 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 124 insertions(+), 2 deletions(-)
diff -puN mm/vmscan.c~cluster-pageout mm/vmscan.c
--- git-linux/mm/vmscan.c~cluster-pageout 2005-11-13 21:25:15.000000000 +0300
+++ git-linux-nikita/mm/vmscan.c 2005-11-13 21:25:15.000000000 +0300
@@ -360,6 +360,116 @@ static void send_page_to_kpgout(struct p
spin_unlock(&kpgout_queue_lock);
}
+enum {
+ PAGE_CLUSTER_WING = 16,
+ PAGE_CLUSTER_SIZE = 2 * PAGE_CLUSTER_WING,
+};
+
+static int page_fits_cluster(struct address_space *mapping, struct page *page)
+{
+ int result;
+
+ if (page != NULL && !PageActive(page) && !TestSetPageLocked(page)) {
+ /*
+ * unlock ->tree_lock to avoid lock inversion with
+ * ->i_mmap_lock in page_referenced().
+ */
+ read_unlock_irq(&mapping->tree_lock);
+ result =
+ /* try_to_unmap(page) == SWAP_SUCCESS && */
+ PageDirty(page) && !PageWriteback(page) &&
+ !page_referenced(page, 1,
+ page_zone(page)->temp_priority <= 0,
+ 1);
+ if (result == 0)
+ unlock_page(page);
+ read_lock_irq(&mapping->tree_lock);
+ } else
+ result = 0;
+ return result;
+}
+
+static void call_writepage(struct page *page, struct address_space *mapping,
+ struct writeback_control *wbc)
+{
+ if (clear_page_dirty_for_io(page)) {
+ int result;
+
+ BUG_ON(!PageLocked(page));
+ BUG_ON(PageWriteback(page));
+
+ result = mapping->a_ops->writepage(page, wbc);
+ if (result == WRITEPAGE_ACTIVATE)
+ unlock_page(page);
+ }
+}
+
+int __pageout_cluster(struct page *page, struct address_space *mapping,
+ struct writeback_control *wbc)
+{
+ int result;
+ int used;
+
+ pgoff_t punct;
+ pgoff_t start;
+ pgoff_t end;
+
+ struct page *pages_out[PAGE_CLUSTER_WING];
+ struct page *scan;
+
+ BUG_ON(PageAnon(page));
+
+ punct = page->index;
+ read_lock_irq(&mapping->tree_lock);
+ for (start = punct - 1, used = 0;
+ start < punct && punct - start <= PAGE_CLUSTER_WING; start --) {
+ scan = radix_tree_lookup(&mapping->page_tree, start);
+ if (!page_fits_cluster(mapping, scan))
+ /*
+ * no suitable page, stop cluster at this point
+ */
+ break;
+ pages_out[used ++] = scan;
+ if ((start % PAGE_CLUSTER_SIZE) == 0)
+ /*
+ * we reached aligned page.
+ */
+ break;
+ }
+ read_unlock_irq(&mapping->tree_lock);
+
+ while (used > 0)
+ call_writepage(pages_out[--used], mapping, wbc);
+
+ result = mapping->a_ops->writepage(page, wbc);
+
+ for (end = punct + 1;
+ end > punct && end - start < PAGE_CLUSTER_SIZE; ++ end) {
+ int enough;
+
+ /*
+ * XXX nikita: consider find_get_pages_tag()
+ */
+ read_lock_irq(&mapping->tree_lock);
+ scan = radix_tree_lookup(&mapping->page_tree, end);
+ enough = !page_fits_cluster(mapping, scan);
+ read_unlock_irq(&mapping->tree_lock);
+ if (enough)
+ break;
+ call_writepage(scan, mapping, wbc);
+ }
+ return result;
+}
+
+static int pageout_cluster(struct page *page, struct address_space *mapping,
+ struct writeback_control *wbc)
+{
+ if (PageSwapCache(page) || !current_is_kswapd())
+ return mapping->a_ops->writepage(page, wbc);
+ else
+ return __pageout_cluster(page, mapping, wbc);
+}
+
/*
* Called by shrink_list() for each dirty page. Calls ->writepage().
*/
@@ -445,7 +555,7 @@ static pageout_t pageout(struct page *pa
ClearPageSkipped(page);
SetPageReclaim(page);
- res = mapping->a_ops->writepage(page, &wbc);
+ res = pageout_cluster(page, mapping, &wbc);
if (res < 0)
handle_write_error(mapping, page, res);
diff -puN include/linux/writeback.h~cluster-pageout include/linux/writeback.h
diff -puN fs/mpage.c~cluster-pageout fs/mpage.c
diff -puN mm/shmem.c~cluster-pageout mm/shmem.c
--- git-linux/mm/shmem.c~cluster-pageout 2005-11-13 21:25:15.000000000 +0300
+++ git-linux-nikita/mm/shmem.c 2005-11-13 21:25:15.000000000 +0300
@@ -45,6 +45,7 @@
#include <linux/swapops.h>
#include <linux/mempolicy.h>
#include <linux/namei.h>
+#include <linux/rmap.h>
#include <asm/uaccess.h>
#include <asm/div64.h>
#include <asm/pgtable.h>
@@ -813,7 +814,18 @@ static int shmem_writepage(struct page *
struct inode *inode;
BUG_ON(!PageLocked(page));
- BUG_ON(page_mapped(page));
+
+ /*
+ * If shmem_writepage() is called on mapped page, a problem arises for
+ * a tmpfs file mapped shared into different mms. Viz. shmem_writepage
+ * changes the tmpfs-file identity of the page to swap identity: so if
+ * it's unmapped later, the instances would then become private (to be
+ * COWed) instead of shared.
+ *
+ * Just unmap page.
+ */
+ if (page_mapped(page) && try_to_unmap(page) != SWAP_SUCCESS)
+ goto redirty;
mapping = page->mapping;
index = page->index;
_
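
The window that __pageout_cluster() scans around the pivot can be modelled in userspace roughly as below. This is a sketch under assumptions: cluster_around() and struct cluster_range are hypothetical names, the fits[] array stands in for the radix-tree lookup plus page_fits_cluster(), and the explicit npages bound replaces the NULL lookup result the patch relies on. Locking and the actual ->writepage() calls are omitted.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_WING	16			/* mirrors PAGE_CLUSTER_WING */
#define MODEL_CLUSTER	(2 * MODEL_WING)	/* mirrors PAGE_CLUSTER_SIZE */

/* Inclusive range of page indices that would be written out. */
struct cluster_range {
	unsigned long first;
	unsigned long last;
};

/*
 * fits[i] says whether page i would pass page_fits_cluster().
 * The pivot itself is always written.
 */
static struct cluster_range cluster_around(const bool *fits,
					   unsigned long npages,
					   unsigned long pivot)
{
	struct cluster_range r = { pivot, pivot };
	unsigned long start, end;

	assert(pivot < npages);

	/* Walk backwards at most MODEL_WING pages; "start < pivot" guards wraparound. */
	for (start = pivot - 1;
	     start < pivot && pivot - start <= MODEL_WING; start--) {
		if (!fits[start])
			break;		/* no suitable page, stop here */
		r.first = start;
		if (start % MODEL_CLUSTER == 0)
			break;		/* reached an aligned page */
	}

	/* Walk forwards until the span measured from start reaches the cluster size. */
	for (end = pivot + 1;
	     end > pivot && end < npages && end - start < MODEL_CLUSTER; end++) {
		if (!fits[end])
			break;
		r.last = end;
	}
	return r;
}
```

With 64 suitable pages and the pivot at index 40, this model stops the backward walk at the aligned index 32 and extends forward to 63; with the pivot at 20 the backward walk is limited by the wing size instead.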
Trond Myklebust wrote:
> On Wed, 2005-11-16 at 14:39 -0800, Kenny Simpson wrote:
>
>>--- Trond Myklebust <[email protected]> wrote:
>>
>>>I'm getting lost here. Please could you spell out the testcases that are
>>>not working.
>>
>>I've redone my test cases and have confirmed that O_DIRECT with pwrite64 triggers the bad
>>condition.
>>
>>The cases that are fine are:
>> pwrite64
>> ftruncate with O_DIRECT
>> ftruncate
>>
>>Also, when the system is in this state, if I try to 'ls' the file,
>>the 'ls' process becomes stuck in state D in sync_page. stracing the 'ls'
>>shows it is in a call to stat64.
>>
>>-Kenny
>
>
> Chuck, can you take a look at this?
>
> Kenny is seeing a hang when using pwrite64() on an O_DIRECT file
> and the file size exceeds 4Gb. Server is a NetApp filer w/ NFSv3.
>
> I had a quick look at nfs_file_direct_write(), and among other things,
> it would appear that it is not doing any of the usual overflow checks on
> *pos and the count size (see generic_write_checks()). In particular,
> checks are missing against overflow vs. MAX_NON_LFS if O_LARGEFILE is
> not set (and also against overflow vs. s_maxbytes, but that is less
> relevant here).
'uname -a' on the client?
--- Trond Myklebust <[email protected]> wrote:
> Chuck, can you take a look at this?
>
> Kenny is seeing a hang when using pwrite64() on an O_DIRECT file
> and the file size exceeds 4GB. Server is a NetApp filer w/ NFSv3.
>
> I had a quick look at nfs_file_direct_write(), and among other things,
> it would appear that it is not doing any of the usual overflow checks on
> *pos and the count size (see generic_write_checks()). In particular,
> checks are missing against overflow vs. MAX_NON_LFS if O_LARGEFILE is
> not set (and also against overflow vs. s_maxbytes, but that is less
> relevant here).
>
> Cheers,
> Trond
I tried the same test, but starting closer to 4GB... here are the final lines from strace:
remap_file_pages(0xb7b55000, 2097152, PROT_NONE, 1047544, MAP_SHARED) = 0
pwrite(3, "\0", 1, 8564768768) = 1
remap_file_pages(0xb7b55000, 2097152, PROT_NONE, 1048056, MAP_SHARED) = 0
pwrite(3, "\0", 1, 8566865920) = 1
remap_file_pages(0xb7b55000, 2097152, PROT_NONE, 1048568, MAP_SHARED) = 0
pwrite(3, "\0", 1, 8568963072
The pwrite never returns.
So it seems to be a problem NOT with an absolute 4GB, but with a total of 4GB having been written.
Here are the first few lines from the strace to show all the options being used:
open("/mnt/bar", O_RDWR|O_CREAT|O_DIRECT|O_LARGEFILE, 0644) = 3
pwrite(3, "\0", 1, 4280287232) = 1
mmap2(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_SHARED, 3, 0xff000) = 0xb7b8e000
pwrite(3, "\0", 1, 4282384384) = 1
remap_file_pages(0xb7b8e000, 2097152, PROT_NONE, 2552, MAP_SHARED) = 0
pwrite(3, "\0", 1, 4284481536) = 1
remap_file_pages(0xb7b8e000, 2097152, PROT_NONE, 3064, MAP_SHARED) = 0
/mnt is an nfs mount over GbE w/ jumbo frames (8160 mtu) cross-over directly to a netapp filer.
The mount options are: (from /proc/mounts)
/mnt nfs rw,v3,rsize=32768,wsize=32768,hard,intr,lock,proto=tcp,addr=x.x.x.x 0 0
The card is an Intel e1000 - default module options (NAPI-enabled)
on a 64-bit PCIX 100MHz.
Kernel is 2.6.15-rc w/ Trond's nfs patch.
Machine is a 2x Pentium 4 Xeon 2.66GHz (HT enabled), w/ 2GB ram and 4GB swap.
vmstat shows:
r b swpd free buff cache si so bi bo in cs us sy id wa
2 0 0 1336864 123212 203936 0 0 0 20 1111 1129 1 26 73 0
1 0 0 1336608 123212 203936 0 0 0 0 1078 1076 1 25 74 0
1 0 0 1336864 123212 203936 0 0 0 0 1077 1087 1 26 73 0
The sy of 25 corresponds to one of the four virtual CPUs spending 100% of its time in system mode.
Oprofile shows time being spent:
samples % symbol name
303102 42.4732 zap_pte_range
133702 18.7355 _spin_lock
61145 8.5682 __bitmap_weight
43169 6.0492 page_address
42196 5.9129 unmap_vmas
30132 4.2224 unmap_page_range
-Kenny
--- Chuck Lever <[email protected]> wrote:
>
> 'uname -a' on the client?
Linux tux6127 2.6.15-rc1 #6 SMP PREEMPT Wed Nov 16 14:47:14 EST 2005 i686 GNU/Linux
I also sent the .config on a previous posting. I can send it again if you'd like.
-Kenny
Trond Myklebust wrote:
> I had a quick look at nfs_file_direct_write(), and among other things,
> it would appear that it is not doing any of the usual overflow checks on
> *pos and the count size (see generic_write_checks()). In particular,
> checks are missing against overflow vs. MAX_NON_LFS if O_LARGEFILE is
> not set (and also against overflow vs. s_maxbytes, but that is less
> relevant here).
the architecture is to allow the NFS protocol and server to do these checks.
On Thu, 2005-11-17 at 12:02 -0500, Chuck Lever wrote:
> Trond Myklebust wrote:
> > I had a quick look at nfs_file_direct_write(), and among other things,
> > it would appear that it is not doing any of the usual overflow checks on
> > *pos and the count size (see generic_write_checks()). In particular,
> > checks are missing against overflow vs. MAX_NON_LFS if O_LARGEFILE is
> > not set (and also against overflow vs. s_maxbytes, but that is less
> > relevant here).
>
> the architecture is to allow the NFS protocol and server to do these checks.
No it isn't.
The NFS protocol has no clue as to whether or not you opened the file
using O_LARGEFILE. For NFSv2, we do _not_ want file pointers to wrap
once they hit the 32-bit boundary.
The protocol and server cannot be involved in any of those checks. They
must be done on the client.
Cheers,
Trond
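A minimal userspace sketch of the kind of client-side check being discussed, assuming a simplified model of what generic_write_checks() does. The names sketch_write_checks() and SKETCH_MAX_NON_LFS are hypothetical, and the real kernel routine also raises SIGXFSZ and handles rlimits, which are left out here:

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Largest offset reachable without O_LARGEFILE (2^31 - 1); this mirrors the
 * value the kernel calls MAX_NON_LFS. The name here is hypothetical. */
#define SKETCH_MAX_NON_LFS ((int64_t)0x7fffffff)

/*
 * Hypothetical helper: reject or shorten a write so that it cannot run past
 * the applicable limit. Returns 0 (possibly with *count reduced), -EFBIG if
 * the write starts at or beyond the limit, or -EINVAL for a negative offset.
 */
static int sketch_write_checks(int64_t pos, size_t *count,
                               int largefile, int64_t fs_maxbytes)
{
    /* Without O_LARGEFILE the 2GB limit applies; the filesystem limit
     * (s_maxbytes in the kernel) always applies. */
    int64_t limit = largefile ? fs_maxbytes : SKETCH_MAX_NON_LFS;
    if (limit > fs_maxbytes)
        limit = fs_maxbytes;

    if (pos < 0)
        return -EINVAL;
    if (*count == 0)
        return 0;
    if (pos >= limit)
        return -EFBIG;

    /* Truncate the request instead of letting the offset wrap. */
    if ((uint64_t)*count > (uint64_t)(limit - pos))
        *count = (size_t)(limit - pos);
    return 0;
}
```

The point of the sketch is only that the decision depends on the open flags, which the client knows and the server does not.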
Kenny Simpson <[email protected]> wrote:
>
> The pwrite never returns.
> So it seems to be a problem NOT with an absolute 4GB, but with a total of 4GB having been written.
Could you send the test app please? (Apologies if you've already done so
and I missed it).
--- Andrew Morton <[email protected]> wrote:
> Could you send the test app please? (Apologies if you've already done so
> and I missed it).
>
Here it is again... this one skips to just under 4GB before starting.
run with "writetest -m <filename>" for the mmap test.
-Kenny
Same test now done w/ rc1-mm2...
similar result.
However, instead of pegging 1 virtual cpu in system time, two virtual CPUs on the same core now
share the 100% load (so each shows 50% load and 50% idle).
Oprofile shows:
5610635 66.9136 zap_pte_range
711147 8.4813 _raw_spin_trylock
498282 5.9426 unmap_mapping_range
196369 2.3419 add_preempt_count
126243 1.5056 unmap_page_range
111791 1.3332 __bitmap_weight
96284 1.1483 page_address
88998 1.0614 _raw_spin_unlock
And I'm attaching the sysrq dump
-Kenny
Instead of sysrq with 't', here is the sysrq with 'p'... it agrees with oprofile
Has anyone else been able to reproduce this?
-Kenny
Yet another data point:
Under 2.6.8-2 (Debian sarge kernel), the test does not cause a spin.
Instead, the file extension via pwrite does not allow the new pages to be usable by
remap_file_pages.
However, munmap/mmap are happy to use pages introduced by the pwrite...
and happily write more than 4GB.
-Kenny
--- Andrew Morton <[email protected]> wrote:
> Could you send the test app please? (Apologies if you've already done so
> and I missed it).
Here is an strace from a reduced test:
open("/mnt/bar", O_RDWR|O_CREAT|O_DIRECT|O_LARGEFILE, 0644) = 3
pwrite(3, "\0", 1, 4292870144) = 1
mmap2(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_SHARED, 3, 0xffc00) = 0xb7ace000
pwrite(3, "\0", 1, 4294967296) = 1
munmap(0xb7ace000, 2097152) = 0
mmap2(0xb7ace000, 2097152, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, 3, 0xffe00) = 0xb7ace000
pwrite(3, "\0", 1, 4297064448) = 1
munmap(0xb7ace000, 2097152) = 0
mmap2(0xb7ace000, 2097152, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, 3, 0x100000) = 0xb7ace000
pwrite(3, "\0", 1, 4299161600) = 1
munmap(0xb7ace000, 2097152) = 0
mmap2(0xb7ace000, 2097152, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_FIXED, 3, 0x100200) = 0xb7ace000
pwrite(3, "\0", 1, 4301258752
The final pwrite never returns.
This is with 2.6.15-rc1 + Trond's NFS patch (-mm2 would die on strace).
The only change from the previous test is that this one uses munmap/mmap64 instead of
remap_file_pages.
showPc shows:
SysRq : Show Regs
Pid: 4271, comm: writetest
EIP: 0060:[<c029cd2b>] CPU: 0
EIP is at prio_tree_first+0x26/0xb8
EFLAGS: 00000292 Tainted: P (2.6.15-rc1)
EAX: 00000000 EBX: f6765d94 ECX: f6765d94 EDX: 00000200
ESI: 00000000 EDI: f5da475c EBP: f6765db4 DS: 007b ES: 007b
CR0: 8005003b CR2: b7590000 CR3: 36b5d000 CR4: 000006d0
[<c029ce5e>] prio_tree_next+0xa1/0xa3
[<c014822a>] vma_prio_tree_next+0x27/0x51
[<c014b1dc>] unmap_mapping_range+0x18b/0x210
[<c0120d06>] __do_softirq+0x6a/0xd1
[<c0103a24>] apic_timer_interrupt+0x1c/0x24
[<c0146577>] invalidate_inode_pages2_range+0x215/0x24c
[<c01465cd>] invalidate_inode_pages2+0x1f/0x26
[<c01e4031>] nfs_file_direct_write+0x1e1/0x21a
[<c0159ea4>] do_sync_write+0xc7/0x10d
[<c012ff32>] autoremove_wake_function+0x0/0x57
[<c0159f92>] vfs_write+0xa8/0x177
[<c015a275>] sys_pwrite64+0x88/0x8c
[<c0102f51>] syscall_call+0x7/0xb
-Kenny
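For reading the strace above: the last argument of mmap2() is an offset in 4096-byte units, not in bytes. A small sketch of the conversion, with hypothetical helper names, shows that each 2MB window ends exactly at the offset of the pwrite() that precedes it:

```c
#include <stdint.h>

/* mmap2() takes its file offset in units of 4096 bytes. */
#define MMAP2_UNIT 4096ULL

/* Hypothetical helper: convert an mmap2 page offset to a byte offset. */
static uint64_t mmap2_pgoff_to_bytes(uint64_t pgoff)
{
    return pgoff * MMAP2_UNIT;
}

/* Hypothetical helper: first byte offset past the end of a mapping. */
static uint64_t window_end(uint64_t pgoff, uint64_t length)
{
    return mmap2_pgoff_to_bytes(pgoff) + length;
}
```

For example, 0x100000 pages is byte offset 4294967296 (4GB), so the third mapping in the trace is the first one that starts at or above the 4GB mark.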
I have a smaller test case (4 system calls, and a memset), that causes the test case to hang in an
unkillable state*, and makes the system load consume an entire CPU.
*the process is killable if run under strace, but the system load does not drop when the strace is
killed.
Pass this the name of a target file on an NFS mount.
(tested to fail on 2.6.15-rc1).
-Kenny
Here is the test:
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

int main(int argc, char* argv[])
{
    if (argc != 2) {
        printf("usage: %s <filename>\n", argv[0]);
        return 0;
    }
    {
        int fd = open(argv[1], O_RDWR | O_CREAT | O_LARGEFILE | O_DIRECT, 0644);
        if (fd < 0) {
            perror("open");
            return 0;
        }
        int window_size = 2 * 1024 * 1024;
        long long file_size = window_size;
        /* fast-forward to just under 4GB */
        file_size += 2047u * (2 * 1024 * 1024);
        file_size += window_size + window_size;
        /* grow file */
        pwrite64(fd, "", 1, file_size);
        {
            /* map the last window of the file */
            char* mapping_start = (char*)mmap64(0, window_size,
                                                PROT_READ | PROT_WRITE,
                                                MAP_SHARED,
                                                fd, file_size - window_size);
            /* test only fails with this: */
            memset(mapping_start, 0, window_size);
        }
        /* grow file */
        file_size += window_size;
        /* this never returns */
        pwrite64(fd, "", 1, file_size);
    }
    return 0;
}
Kenny Simpson wrote:
> I have a smaller test case (4 system calls, and a memset), that causes the test case to hang in an
> unkillable state*, and makes the system load consume an entire CPU.
>
> *the process is killable if run under strace, but the system load does not drop when the strace is
> killed.
>
> Pass this the name of a target file on an NFS mount.
>
> (tested to fail on 2.6.15-rc1).
kenny-
i'm assuming that because you copied trond, this is only reproducible on
NFS. have you tried this test on other local and remote file system types?
--- Chuck Lever <[email protected]> wrote:
> kenny-
>
> i'm assuming that because you copied trond, this is only reproducible on
> NFS. have you tried this test on other local and remote file system types?
Yes, this only applies to NFS.
ext3 doesn't let you use pwrite with O_DIRECT, nor does NFS from 2.6.8.
These are the only 2 filesystem types to which I have access.
For ext3, using ftruncate works just fine for extending the file, but on NFS, ftruncate causes the
non-existent pages to be read in.
-Kenny
> I have a smaller test case (4 system calls, and a memset), that causes the test case to hang in an
> unkillable state*, and makes the system load consume an entire CPU.
Problem still exists in -rc2, but OProfile shows slightly different results:
samples % symbol name
2919823 86.8716 unmap_mapping_range
163379 4.8609 _raw_spin_trylock
36453 1.0846 prio_tree_first
-Kenny
Kenny Simpson <[email protected]> wrote:
>
> ext3 doesn't let you use pwrite with O_DIRECT
ext3 does permit that. See odwrite.c from
http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz
With gentle beating by a clue-stick from Andrew.. I can run the same test on ext3...
ext3 is happy...
open("/data/foo", O_RDWR|O_CREAT|O_TRUNC|O_DIRECT|O_LARGEFILE, 0666) = 3
pwrite(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 4299161600) = 4096
mmap2(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_SHARED, 3, 0x100200) = 0xb7c7f000
pwrite(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 4301258752) = 4096
exit_group(0) = ?
but NFS is still unhappy....
open("/mnt/bar", O_RDWR|O_CREAT|O_TRUNC|O_DIRECT|O_LARGEFILE, 0666) = 3
pwrite(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 4299161600) = 4096
mmap2(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_SHARED, 3, 0x100200) = 0xb7bc2000
pwrite(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 4301258752 <never returns>
-Kenny
Another data point: In 2.6.8, the test works fine (just like on ext3).
Any suggestions as to where to start poking, or shall I just do a binary search?
-Kenny
On Mon, 2005-11-21 at 13:39 -0800, Kenny Simpson wrote:
> With gentle beating by a clue-stick from Andrew.. I can run the same test on ext3...
>
> ext3 is happy...
>
> open("/data/foo", O_RDWR|O_CREAT|O_TRUNC|O_DIRECT|O_LARGEFILE, 0666) = 3
> pwrite(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 4299161600) = 4096
> mmap2(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_SHARED, 3, 0x100200) = 0xb7c7f000
> pwrite(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 4301258752) = 4096
> exit_group(0) = ?
>
>
> but NFS is still unhappy....
>
> open("/mnt/bar", O_RDWR|O_CREAT|O_TRUNC|O_DIRECT|O_LARGEFILE, 0666) = 3
> pwrite(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 4299161600) = 4096
> mmap2(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_SHARED, 3, 0x100200) = 0xb7bc2000
> pwrite(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 4301258752 <never returns>
Ah... It is the pwrite() _after_ the call to mmap() that fails....
OK, does the following patch fix it?
Cheers,
Trond
-------------
NFS: O_DIRECT cannot call invalidate_inode_pages2().
Anything that calls lock_page() should be avoided in O_DIRECT, however
we should be able to call invalidate_inode_pages() since that doesn't
wait on the page lock.
Signed-off-by: Trond Myklebust <[email protected]>
---
fs/nfs/direct.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index a2d2814..ef299f8 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -774,7 +774,7 @@ nfs_file_direct_write(struct kiocb *iocb
retval = nfs_direct_write(inode, ctx, &iov, pos, 1);
if (mapping->nrpages)
- invalidate_inode_pages2(mapping);
+ invalidate_inode_pages(mapping);
if (retval > 0)
*ppos = pos + retval;
--- Trond Myklebust <[email protected]> wrote:
> Ah... It is the pwrite() _after_ the call to mmap() that fails....
>
> OK, does the following patch fix it?
YES!
open("/mnt/bar", O_RDWR|O_CREAT|O_TRUNC|O_DIRECT|O_LARGEFILE, 0666) = 3
pwrite(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 4299161600) = 4096
mmap2(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_SHARED, 3, 0x100200) = 0xb7c26000
pwrite(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096, 4301258752) = 4096
exit_group(0) = ?
I'll re-run my original test(s) tomorrow.
Thanks again!
-Kenny
Trond Myklebust <[email protected]> wrote:
>
> Anything that calls lock_page() should be avoided in O_DIRECT,
Why?
And it's still doing lock_page():
nfs_file_direct_write()
->filemap_fdatawrite()
->do_writepages()
->nfs_writepages()
->generic_writepages()
->mpage_writepages()
->lock_page()
> however
> we should be able to call invalidate_inode_pages() since that doesn't
> wait on the page lock.
invalidate_inode_pages2() is better. And using generic_file_direct_IO() is
better still, since it handles mmap coherency and only works upon that part
of the file which is actually undergoing IO.
On Mon, 2005-11-21 at 15:34 -0800, Andrew Morton wrote:
> Trond Myklebust <[email protected]> wrote:
> >
> > Anything that calls lock_page() should be avoided in O_DIRECT,
>
> Why?
>
> And it's still doing lock_page():
>
> nfs_file_direct_write()
> ->filemap_fdatawrite()
> ->do_writepages()
> ->nfs_writepages()
> ->generic_writepages()
> ->mpage_writepages()
> ->lock_page()
True.
> > however
> > we should be able to call invalidate_inode_pages() since that doesn't
> > wait on the page lock.
>
> invalidate_inode_pages2() is better.
> And using generic_file_direct_IO() is
> better still, since it handles mmap coherency and only works upon that part
> of the file which is actually undergoing IO.
Unlike local filesystems, we don't want to have to take the i_sem in any
of the direct IO paths. The latter is just a liability as far as
applications are concerned: it doesn't offer any protection for local
data (there _is_ no local data to protect), but gets seriously in the
way of write parallelism.
The only difference I can see between the two paths is the call to
unmap_mapping_range(). What effect would that have?
Cheers,
Trond
Trond Myklebust <[email protected]> wrote:
>
> The only difference I can see between the two paths is the call to
> unmap_mapping_range(). What effect would that have?
It shoots down any mapped pagecache over the affected file region. Because
the direct-io write is about to make that pagecache out-of-date. If the
application tries to use that data again it'll get a major fault and will
re-read the file contents.
On Mon, 2005-11-21 at 16:09 -0800, Andrew Morton wrote:
> Trond Myklebust <[email protected]> wrote:
> >
> > The only difference I can see between the two paths is the call to
> > unmap_mapping_range(). What effect would that have?
>
> It shoots down any mapped pagecache over the affected file region. Because
> the direct-io write is about to make that pagecache out-of-date. If the
> application tries to use that data again it'll get a major fault and will
> re-read the file contents.
I assume then, that this couldn't be the cause of the
invalidate_inode_pages() failing to complete? Unless there is some
method to prevent applications from faulting in the page while we're
inside generic_file_direct_IO(), then the same race would be able to
occur there.
Cheers,
Trond
On Mon, 2005-11-21 at 19:18 -0500, Trond Myklebust wrote:
> On Mon, 2005-11-21 at 16:09 -0800, Andrew Morton wrote:
> > Trond Myklebust <[email protected]> wrote:
> > >
> > > The only difference I can see between the two paths is the call to
> > > unmap_mapping_range(). What effect would that have?
> >
> > It shoots down any mapped pagecache over the affected file region. Because
> > the direct-io write is about to make that pagecache out-of-date. If the
> > application tries to use that data again it'll get a major fault and will
> > re-read the file contents.
>
> I assume then, that this couldn't be the cause of the
> invalidate_inode_pages() failing to complete? Unless there is some
^^^^^^^^^^^^^^^^^^^^^^^^ invalidate_inode_pages2(), sorry....
> method to prevent applications from faulting in the page while we're
> inside generic_file_direct_IO(), then the same race would be able to
> occur there.
Cheers,
Trond
Trond Myklebust <[email protected]> wrote:
>
> On Mon, 2005-11-21 at 16:09 -0800, Andrew Morton wrote:
> > Trond Myklebust <[email protected]> wrote:
> > >
> > > The only difference I can see between the two paths is the call to
> > > unmap_mapping_range(). What effect would that have?
> >
> > It shoots down any mapped pagecache over the affected file region. Because
> > the direct-io write is about to make that pagecache out-of-date. If the
> > application tries to use that data again it'll get a major fault and will
> > re-read the file contents.
>
> I assume then, that this couldn't be the cause of the
> invalidate_inode_pages() failing to complete?
It sounds unlikely. This hang is associated with crossing the 2G boundary
isn't it?
I don't think we've seen a sysrq-T trace from the hang?
> Unless there is some
> method to prevent applications from faulting in the page while we're
> inside generic_file_direct_IO(), then the same race would be able to
> occur there.
Yes, there are still windows.
Another thing the unmap_mapping_range() does is to push pte-dirty bits into
the software-dirty flags, so the modified data does get written. If we
didn't do this, a page which was dirtied via mmap before the direct-io
write would get written back _after_ the direct-io write, arguably causing
corruption.
On Mon, 2005-11-21 at 16:28 -0800, Andrew Morton wrote:
> Trond Myklebust <[email protected]> wrote:
> >
> > On Mon, 2005-11-21 at 16:09 -0800, Andrew Morton wrote:
> > > Trond Myklebust <[email protected]> wrote:
> > > >
> > > > The only difference I can see between the two paths is the call to
> > > > unmap_mapping_range(). What effect would that have?
> > >
> > > It shoots down any mapped pagecache over the affected file region. Because
> > > the direct-io write is about to make that pagecache out-of-date. If the
> > > application tries to use that data again it'll get a major fault and will
> > > re-read the file contents.
> >
> > I assume then, that this couldn't be the cause of the
> > invalidate_inode_pages() failing to complete?
>
> It sounds unlikely. This hang is associated with crossing the 2G boundary
> isn't it?
>
> I don't think we've seen a sysrq-T trace from the hang?
Kenny sent us this trace. The thing appears to be hanging in
unmap_mapping_range() as called by invalidate_inode_pages2()
Pid: 4271, comm: writetest
EIP: 0060:[<c029cd2b>] CPU: 0
EIP is at prio_tree_first+0x26/0xb8
EFLAGS: 00000292 Tainted: P (2.6.15-rc1)
EAX: 00000000 EBX: f6765d94 ECX: f6765d94 EDX: 00000200
ESI: 00000000 EDI: f5da475c EBP: f6765db4 DS: 007b ES: 007b
CR0: 8005003b CR2: b7590000 CR3: 36b5d000 CR4: 000006d0
[<c029ce5e>] prio_tree_next+0xa1/0xa3
[<c014822a>] vma_prio_tree_next+0x27/0x51
[<c014b1dc>] unmap_mapping_range+0x18b/0x210
[<c0120d06>] __do_softirq+0x6a/0xd1
[<c0103a24>] apic_timer_interrupt+0x1c/0x24
[<c0146577>] invalidate_inode_pages2_range+0x215/0x24c
[<c01465cd>] invalidate_inode_pages2+0x1f/0x26
[<c01e4031>] nfs_file_direct_write+0x1e1/0x21a
[<c0159ea4>] do_sync_write+0xc7/0x10d
[<c012ff32>] autoremove_wake_function+0x0/0x57
[<c0159f92>] vfs_write+0xa8/0x177
[<c015a275>] sys_pwrite64+0x88/0x8c
[<c0102f51>] syscall_call+0x7/0xb
> > Unless there is some
> > method to prevent applications from faulting in the page while we're
> > inside generic_file_direct_IO(), then the same race would be able to
> > occur there.
>
> Yes, there are still windows.
>
> Another thing the unmap_mapping_range() does is to push pte-dirty bits into
> the software-dirty flags, so the modified data does get written. If we
> didn't do this, a page which was dirtied via mmap before the direct-io
> write would get written back _after_ the direct-io write, arguably causing
> corruption.
As far as we're concerned, anybody using direct-io is responsible for
enforcing their own ordering. The pagecache is one thing, but in NFS,
direct IO is mainly used in situations where several clients are writing
to the same file. There is no way to ensure fully safe mmap() semantics.
Cheers,
Trond
Hi again... I'm still doing nfs tests.
With 2.6.15-rc3-mm1, a simple program can bring the system to a halt (as it can with previous
kernels).
I ran the test in single user mode, and copied the following output from sysrq-m, sysrq-t by
hand...
sysrq-t:
writetest:
io_schedule
sync_page
__wait_on_bit_lock
__lock_page
filemap_nopage
do_no_page
__handle_mm_fault
error_code
rpciod/0:
io_schedule_timeout
blk_congestion_wait
throttle_vm_writeout
shrink_zone
shrink_caches
try_to_free_pages
__alloc_pages
tcp_sendmsg
inet_sendmsg
sock_sendmsg
kernel_sendmsg
sock_no_sendpage
xs_tcp_send_request
xprt_transmit
call_transmit
__rpc_execute
rpc_async_schedule
worker_thread
kthread
kernel_thread_helper
sysrq-m:
Mem-Info:
DMA per-cpu:
cpu 0 hot: high 12 batch 2 used 0
cpu 0 cold: 4 1 0
cpu 1 hot: 12 2 1
cpu 1 cold: 4 1 0
cpu 2 hot: 12 2 0
cpu 2 cold: 4 1 0
cpu 3 hot: 12 2 0
cpu 3 cold: 4 1 0
DMA32 per-cpu: empty
Normal per-cpu:
cpu 0 hot: high 384 batch 64 used 1
cpu 0 cold: 128 32 0
cpu 1 hot: 384 64 97
cpu 1 cold: 128 32 0
cpu 2 hot: 384 64 63
cpu 2 cold: 128 32 32
cpu 3 hot: 384 64 47
cpu 3 cold: 128 32 0
Highmem per-cpu:
cpu 0: hot: high 384 batch 64 used 0
cpu 0: cold: 128 32 0
cpu 1: hot: high 384 batch 64 used 0
cpu 1: cold: 128 32 0
cpu 2: hot: high 384 batch 64 used 0
cpu 2: cold: 128 32 0
cpu 3: hot: high 384 batch 64 used 0
cpu 3: cold: 128 32 0
free pages: 14088kB (6000kB HighMem)
Active: 453253 inactive: 43719 dirty: 149725 writeback: 310580
unstable: 0 free: 3502 slab: 14870 mapped: 1230
524149 pages of RAM
294773 pages of HIGHMEM
6137 reserved pages
311956 pages shared
0 pages swap cache
149725 pages dirty
310580 pages writeback
1230 pages mapped
14870 pages slab
16 pages pagetables
This is the same test program as before.
It simply opens a file O_RDWR | O_CREAT | O_TRUNC | O_LARGEFILE,
grows the file by doing a pwrite64 of 1 byte,
maps the end of the file with mmap64(PROT_READ | PROT_WRITE, MAP_SHARED)
touches all the bytes by doing a memset
grows the file some more
unmaps, maps the new region, touches memory, ....
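The steps above can be sketched roughly as follows. This is not the attached writetest program, only a minimal illustration under stated assumptions: the function name sliding_window_fill() is hypothetical, the window size must be a multiple of the page size, and the file is grown by writing the last byte of each new window (the original writes one byte at the new end instead). On a 32-bit build it would need to be compiled with -D_FILE_OFFSET_BITS=64 to go past 2GB.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical sketch of a sliding mmap window writer.
 * Returns the final file size in bytes, or -1 on error.
 */
static off_t sliding_window_fill(const char *path, size_t window, int nwindows)
{
    struct stat st;
    off_t size = 0;
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);

    if (fd < 0)
        return -1;

    for (int i = 0; i < nwindows; i++) {
        /* Grow the file by one window by writing its new last byte. */
        size += (off_t)window;
        if (pwrite(fd, "", 1, size - 1) != 1)
            goto fail;

        /* Map the region that was just added at the end of the file. */
        char *p = mmap(NULL, window, PROT_READ | PROT_WRITE, MAP_SHARED,
                       fd, size - (off_t)window);
        if (p == MAP_FAILED)
            goto fail;

        /* Touch every byte, dirtying every page in the window. */
        memset(p, 0, window);

        /* Drop the window before sliding on to the next one. */
        if (munmap(p, window) != 0)
            goto fail;
    }

    if (fstat(fd, &st) != 0)
        goto fail;
    close(fd);
    return st.st_size;

fail:
    close(fd);
    return -1;
}
```

Nothing in the loop waits for writeback, so dirty pages can accumulate as fast as memset() can touch them.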
Once all the free memory on the system is used, no new processes can start, and the system is
effectively hung. Only sysrq and vt switching function (unless running X).
Any further info I could provide? Any ideas? Patches to try out?
thanks,
-Kenny
Here is the test program again (run as writetest -m <file-on-nfs>)
You are still reporting free pages. Do you see the OOM killer killing
processes?
How big is the file you are doing your test on? How big is your
filesize var when the box hangs?
If you run this test without nfs (on a local file system) do you end
up in this low memory state as well or only over a nfs mount?
thanks,
Keith
--- Keith Mannthey wrote:
> You are still reporting free pages. Do you see the OOM killer killing
> processes?
I did not see anything being killed from the run under single user mode.
However, the sysrq does scroll past, so I only see the final 60 lines.
Is there a fool-proof way to check?
>
> How big is the file you are doing your test on? How big is your
> filesize var when the box hangs?
The file starts empty (the program creates it). The box hangs when the file is 5.9GB
(6308233217), or at least this is the file size when the box comes back.
>
> If you run this test without nfs (on a local file system) do you end
> up in this low memory state as well or only over a nfs mount?
I only see problems when running on nfs.
Other details:
nfs options:
rw,v3,rsize=32768,wsize=32768,hard,intr,lock,proto=tcp,addr=x.x.x.x
This is going over a dedicated Gb x-over cable to a clustered NetApp.
-Kenny
--- Keith Mannthey <[email protected]> wrote:
> You are still reporting free pages. Do you see the OOM killer killing
> processes?
Running the test with /proc/sys/vm/overcommit_memory = 2, I get a similar
result. It still hangs after about 5.9GB, but it starts trying to write
out the file sooner.
Here is the stack trace I have for the process (again, by hand, what didn't scroll by as nothing
makes it to logs...)
writetest:
schedule_timeout
io_schedule_timeout
blk_congestion_wait
throttle_vm_writeout
shrink_zone
shrink_caches
try_to_free_pages
__alloc_pages
-> (up to here it matches the previous run's stack from rpciod/0)
kmem_getpages
cache_grow
cache_alloc_refill
kmem_cache_alloc
mempool_alloc_slab
mempool_alloc
nfs_flush_one
nfs_flush_list
nfs_flush_inode
nfs_write_pages
do_writepages
__filemap_fdatawrite_range
filemap_fdatawrite
filemap_write_and_wait
nfs_revalidate_mapping
nfs_file_write
do_sync_write
vfs_write
sys_pwrite64
The memory dump showed there was memory still available, with no swap in use.
-Kenny
Tested with rc5 - same results. It was suggested that I run slabtop when the system froze, so
here is that info: (again, by hand, I'm getting another machine to use either netconsole or a
serial cable).
Active / Total Objects (% used) : 478764 / 485383 (98.6%)
Active / Total Slabs (% used) : 14618 / 14635 (99.9%)
Active / Total Caches (% used) : 79 / 138 (57.2%)
Active / Total Size (% used) : 56663.79K / 57566.41K (98.4%)
Minimum / Average / Maximum Object : 0.01K / 0.12K / 128.00K
OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
403088 403088 100% 0.06K 6832 59 27328K nfs_page
30380 30380 100% 0.50K 4340 7 17360K nfs_write_data
15134 15134 100% 0.27K 1081 14 4324K radix_tree_node
...
The other thing is that the stack trace showed slabtop as being halted in throttle_vm_writeout
while allocating memory, and the writetest was halted waiting to allocate memory.
I'll get more detailed stack traces once I get the second machine set up.
-Kenny
> Tested with rc5 - same results.
Again, this time with serial console help:
The resulting file only grew to 1.9G (1971322881) bytes, so I don't think this is a 32-bit issue.
I'm attaching the result of sysrq-p, t, m
This was a run with writetest and slabtop.
Please let me know if anyone else is able to reproduce this behavior, or if there is some other
information I can/should be providing.
thanks,
-Kenny
On Mon, 2005-12-05 at 10:01 -0800, Kenny Simpson wrote:
> Tested with rc5 - same results. It was suggested that I run slabtop when the system froze, so
> here is that info: (again, by hand, I'm getting another machine to use either netconsole or a
> serial cable).
>
> Active / Total Objects (% used) : 478764 / 485383 (98.6%)
> Active / Total Slabs (% used) : 14618 / 14635 (99.9%)
> Active / Total Caches (% used) : 79 / 138 (57.2%)
> Active / Total Size (% used) : 56663.79K / 57566.41K (98.4%)
> Minimum / Average / Maximum Object : 0.01K / 0.12K / 128.00K
>
> OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
> 403088 403088 100% 0.06K 6832 59 27328K nfs_page
> 30380 30380 100% 0.50K 4340 7 17360K nfs_write_data
> 15134 15134 100% 0.27K 1081 14 4324K radix_tree_node
> ...
>
>
> The other thing is that the stack trace showed slabtop as being halted in throttle_vm_writeout
> while allocating memory, and the writetest was halted waiting to allocate memory.
Can somebody VM-savvy please explain how on earth they expect something
like throttle_vm_writeout() to make progress? Shouldn't that thing at
the very least be kicking pdflush every time it loops?
Cheers,
Trond
> I'll get more detailed stack traces once I get the second machine set up.
>
> -Kenny
Tested with rc5 + the ALL patch from http://linux-nfs.org/Linux-2.6.x/2.6.15-rc5/
- same results.
I'm attaching the sysrq output from that run (no slabtop this time, just the test).
-Kenny
VM: Ensure that throttle_vm_writeout() can make progress
Once a process is in the loop inside throttle_vm_writeout(), it has
no guarantee that it will ever get out, since there is nothing that
will kickstart the flushing of unstable writes.
Signed-off-by: Trond Myklebust <[email protected]>
---
mm/page-writeback.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 5240e42..9a66dee 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -306,6 +306,8 @@ void throttle_vm_writeout(void)
if (wbs.nr_unstable + wbs.nr_writeback <= dirty_thresh)
break;
+ if (wbs.nr_unstable != 0)
+ wakeup_pdflush(wbs.nr_unstable);
blk_congestion_wait(WRITE, HZ/10);
}
}
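The argument in the patch description can be illustrated with a toy userspace model. This is only a sketch of the reasoning, not kernel code: sketch_throttle() and sketch_commit() are hypothetical names, the batch size of 100 is arbitrary, and the model assumes that unstable pages never become clean unless something asks for them to be committed.

```c
#include <stdbool.h>

/* Toy counterpart of the counters the loop looks at. */
struct wb_state {
    long nr_unstable;   /* written to the server, not yet committed */
    long nr_writeback;  /* I/O currently in flight */
};

/* Hypothetical stand-in for a commit kicked off by waking pdflush. */
static void sketch_commit(struct wb_state *wbs, long batch)
{
    long n = wbs->nr_unstable < batch ? wbs->nr_unstable : batch;
    wbs->nr_unstable -= n;
}

/*
 * Toy version of the throttle loop. Returns the number of iterations spent
 * waiting, or -1 if the threshold is still exceeded after max_iters.
 */
static int sketch_throttle(struct wb_state *wbs, long dirty_thresh,
                           bool kick, int max_iters)
{
    for (int i = 0; i < max_iters; i++) {
        if (wbs->nr_unstable + wbs->nr_writeback <= dirty_thresh)
            return i;
        if (kick && wbs->nr_unstable != 0)
            sketch_commit(wbs, 100);
        /* The real loop sleeps in blk_congestion_wait() at this point. */
    }
    return -1;
}
```

With 1000 unstable pages and a threshold of 500, the model never exits without the kick, and exits after a few iterations with it. Whether the real hang follows this pattern is a separate question; it is debated in the replies that follow.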
Trond Myklebust wrote:
> On Mon, 2005-12-05 at 15:13 -0500, Trond Myklebust wrote:
>
>>Can somebody VM-savvy please explain how on earth they expect something
>>like throttle_vm_writeout() to make progress? Shouldn't that thing at
>>the very least be kicking pdflush every time it loops?
>
>
> Can you try something like this patch, Kenny?
>
The VM doesn't expect to have to rely on pdflush to write out pages
for it. ->writepage should be enough. Adding wakeup_pdflush here
actually could do the wrong thing for non-NFS filesystems if it
starts more writeback.
Nick
--- Trond Myklebust <[email protected]> wrote:
> > Can somebody VM-savvy please explain how on earth they expect something
> > like throttle_vm_writeout() to make progress? Shouldn't that thing at
> > the very least be kicking pdflush every time it loops?
>
> Can you try something like this patch, Kenny?
>
> Cheers,
> Trond
>
>
> > VM: Ensure that throttle_vm_writeout() can make progress
No change. :(
-Kenny
On Tue, 2005-12-06 at 07:52 +1100, Nick Piggin wrote:
> The VM doesn't expect to have to rely on pdflush to write out pages
> for it. ->writepage should be enough. Adding wakeup_pdflush here
> actually could do the wrong thing for non-NFS filesystems if it
> starts more writeback.
nr_unstable is not going to be set for non-NFS filesystems. 'unstable'
is a caching state in which pages have been written out to the NFS
server, but the server has not yet flushed the data to disk.
Cheers,
Trond
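For reference, the 'unstable' state corresponds to the stability level the server reports in an NFSv3 WRITE reply. A minimal sketch, using the stable_how values from RFC 1813; the helper names are hypothetical and this is not the client's actual bookkeeping:

```c
/* Stability levels from the NFSv3 protocol (RFC 1813, enum stable_how). */
enum stable_how {
    UNSTABLE  = 0,  /* server may hold the data in volatile cache */
    DATA_SYNC = 1,  /* data is on stable storage */
    FILE_SYNC = 2   /* data and metadata are on stable storage */
};

/* Hypothetical helper: does a page written with this reply still need a
 * COMMIT before the client can treat it as clean? */
static int sketch_needs_commit(enum stable_how committed)
{
    return committed == UNSTABLE;
}

/* Hypothetical helper: count the replies that leave pages 'unstable'. */
static long sketch_count_unstable(const enum stable_how *replies, int n)
{
    long unstable = 0;
    for (int i = 0; i < n; i++)
        if (sketch_needs_commit(replies[i]))
            unstable++;
    return unstable;
}
```

A server that always answers FILE_SYNC would, in this model, never leave any pages in the unstable state.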
On Mon, 2005-12-05 at 16:18 -0500, Trond Myklebust wrote:
> On Tue, 2005-12-06 at 07:52 +1100, Nick Piggin wrote:
>
> > The VM doesn't expect to have to rely on pdflush to write out pages
> > for it. ->writepage should be enough. Adding wakeup_pdflush here
> > actually could do the wrong thing for non-NFS filesystems if it
> > starts more writeback.
>
> nr_unstable is not going to be set for non-NFS filesystems. 'unstable'
> is a caching state in which pages have been written out to the NFS
> server, but the server has not yet flushed the data to disk.
...and most important of all: 'unstable' does _not_ mean that I/O is
active on those pages (unlike the apparent assumption in
throttle_vm_writeout()).
That is why the choice is either to kick pdflush there, or to remove
nr_unstable from the accounting in that loop.
Cheers,
Trond
--- Trond Myklebust <[email protected]> wrote:
> nr_unstable is not going to be set for non-NFS filesystems. 'unstable'
> is a caching state in which pages have been written out to the NFS
> server, but the server has not yet flushed the data to disk.
The NetApp always seems to return stable writes (even when the requests are not).
Either way, I ran the system under minimal user mode (i.e. init=/bin/bash), to try to cut down on
the superfluous processes (cron, ntp, etc..).
I ran the test again.. same general result.
The test program is unkillable - even with sysrq's.
-Kenny
RPC: Do not block on skb allocation
If we get something like the following,
[ 125.300636] [<c04086e1>] schedule_timeout+0x54/0xa5
[ 125.305931] [<c040866e>] io_schedule_timeout+0x29/0x33
[ 125.311495] [<c02880c4>] blk_congestion_wait+0x70/0x85
[ 125.317058] [<c014136b>] throttle_vm_writeout+0x69/0x7d
[ 125.322720] [<c014714d>] shrink_zone+0xe0/0xfa
[ 125.327560] [<c01471d4>] shrink_caches+0x6d/0x6f
[ 125.332581] [<c01472a6>] try_to_free_pages+0xd0/0x1b5
[ 125.338056] [<c013fa4b>] __alloc_pages+0x135/0x2e8
[ 125.343258] [<c03b74ad>] tcp_sendmsg+0xaa0/0xb78
[ 125.348281] [<c03d4666>] inet_sendmsg+0x48/0x53
[ 125.353212] [<c0388716>] sock_sendmsg+0xb8/0xd3
[ 125.358147] [<c0388773>] kernel_sendmsg+0x42/0x4f
[ 125.363259] [<c038bc00>] sock_no_sendpage+0x5e/0x77
[ 125.368556] [<c03ee7af>] xs_tcp_send_request+0x2af/0x375
then the socket is blocked until memory is reclaimed, and no
progress can ever be made.
Try to access the emergency pools by using GFP_ATOMIC.
Signed-off-by: Trond Myklebust <[email protected]>
---
net/sunrpc/xprtsock.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 0a51fd4..77e8800 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -990,6 +990,7 @@ static void xs_udp_connect_worker(void *
sk->sk_data_ready = xs_udp_data_ready;
sk->sk_write_space = xs_udp_write_space;
sk->sk_no_check = UDP_CSUM_NORCV;
+ sk->sk_allocation = GFP_ATOMIC;
xprt_set_connected(xprt);
@@ -1074,6 +1075,7 @@ static void xs_tcp_connect_worker(void *
sk->sk_data_ready = xs_tcp_data_ready;
sk->sk_state_change = xs_tcp_state_change;
sk->sk_write_space = xs_tcp_write_space;
+ sk->sk_allocation = GFP_ATOMIC;
/* socket options */
sk->sk_userlocks |= SOCK_BINDPORT_LOCK;
Trond Myklebust wrote:
> On Mon, 2005-12-05 at 16:18 -0500, Trond Myklebust wrote:
>
>>On Tue, 2005-12-06 at 07:52 +1100, Nick Piggin wrote:
>>
>>
>>>The VM doesn't expect to have to rely on pdflush to write out pages
>>>for it. ->writepage should be enough. Adding wakeup_pdflush here
>>>actually could do the wrong thing for non-NFS filesystems if it
>>>starts more writeback.
>>
>>nr_unstable is not going to be set for non-NFS filesystems. 'unstable'
>>is a caching state in which pages have been written out to the NFS
>>server, but the server has not yet flushed the data to disk.
>
But if you have NFS and non-NFS filesystems, wakeup_pdflush isn't
always going to do the right thing.
>
> ...and most important of all: 'unstable' does _not_ mean that I/O is
> active on those pages (unlike the apparent assumption in
> throttle_vm_writeout).
> That is why the choice is either to kick pdflush there, or to remove
> nr_unstable from the accounting in that loop.
>
Doesn't matter if IO is actually active or not, if you've allocated
memory for these unstable pages, then page reclaim can scan itself
to death, which is what seems to have happened here. And which is
what throttle_vm_writeout is supposed to help.
--
SUSE Labs, Novell Inc.
On Tue, 2005-12-06 at 10:40 +1100, Nick Piggin wrote:
> > ...and most important of all: 'unstable' does _not_ mean that I/O is
> > active on those pages (unlike the apparent assumption in
> > throttle_vm_writeout).
> > That is why the choice is either to kick pdflush there, or to remove
> > nr_unstable from the accounting in that loop.
> >
>
> Doesn't matter if IO is actually active or not, if you've allocated
> memory for these unstable pages, then page reclaim can scan itself
> to death, which is what seems to have happened here. And which is
> what throttle_vm_writeout is supposed to help.
Unless someone somehow triggers an NFS commit, nr_unstable is never
going to decrease, and your process will end up looping forever. In
fact, the nr_writeback pages that belong to NFS will end up being
added to nr_unstable (because they have been written to the server, but
not committed to disk).
Cheers,
Trond
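The point about the loop never terminating can be illustrated with a small
user-space sketch. This is not the kernel code: struct page_counts,
writes_complete(), nfs_commit() and throttle_sim() are hypothetical names, and
the loop only mimics the general shape of a "wait while writeback + unstable
exceeds a threshold" check. It assumes that completed NFS writes move from
writeback to unstable and that only a commit makes unstable pages clean.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's global page-state counters. */
struct page_counts {
    long nr_writeback; /* pages with write I/O in flight */
    long nr_unstable;  /* pages sent to the NFS server but not yet committed */
};

/* One wait interval: in-flight NFS writes complete and become unstable. */
static void writes_complete(struct page_counts *pc)
{
    pc->nr_unstable += pc->nr_writeback;
    pc->nr_writeback = 0;
}

/* An NFS COMMIT: the server flushes to disk and the pages become clean. */
static void nfs_commit(struct page_counts *pc)
{
    pc->nr_unstable = 0;
}

/*
 * Simplified throttle loop. Waits while writeback + unstable exceeds
 * thresh. Returns the number of intervals waited, or -1 if the loop
 * was still stuck after max_ticks intervals.
 */
long throttle_sim(struct page_counts pc, long thresh, bool kick_commit,
                  long max_ticks)
{
    assert(thresh >= 0 && max_ticks >= 0);
    for (long tick = 0; tick < max_ticks; tick++) {
        if (pc.nr_writeback + pc.nr_unstable <= thresh)
            return tick;
        writes_complete(&pc);   /* writeback drains, unstable grows */
        if (kick_commit)
            nfs_commit(&pc);    /* only a commit shrinks nr_unstable */
    }
    return -1;
}
```

With kick_commit false the sum never drops, because pages only move from one
counter to the other; with kick_commit true the loop exits after one interval.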
Trond Myklebust <[email protected]> wrote:
>
> Argh... Not sure entirely how to deal with that... We definitely don't
> want the thing futzing around inside throttle_vm_writeout(), 'cos
> writeout isn't going to happen while the socket blocks.
>
As far as the core VM is concerned, these pages are really "dirty", only it
happens to be a different flavour of dirtiness. So perhaps we should
continue to mark these pages as dirty and let NFS internally take care
of which end of the wire they're dirty at.
Presumably calling writepage() a second time won't be very useful. Or will
it? Perhaps when NFS sees writepage against a PageDirty && PageUnstable
page it can recognise that as a hint to kick off a server-side write.
> ...OTOH, I'm not entirely sure we want to use GFP_ATOMIC and risk
> bleeding the emergency pools dry: we also need those in order to receive
> replies from the server.
You can use (GFP_ATOMIC & ~__GFP_HIGH) to avoid draining emergency pools.
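A minimal C sketch of the mask arithmetic behind that suggestion. The SKETCH_*
names and bit values are made up for illustration, and it assumes that
GFP_ATOMIC is simply the high-priority bit with the may-sleep bit clear, as I
believe it was in kernels of this era. The real definitions are in
include/linux/gfp.h and vary between versions.

```c
#include <assert.h>

/* Illustrative bit values only, not the kernel's definitions. */
#define SKETCH_GFP_WAIT   0x10u  /* allocation may sleep and reclaim */
#define SKETCH_GFP_HIGH   0x20u  /* allocation may dip into emergency pools */
#define SKETCH_GFP_IO     0x40u  /* allocation may start I/O */
#define SKETCH_GFP_ATOMIC (SKETCH_GFP_HIGH)
#define SKETCH_GFP_KERNEL (SKETCH_GFP_WAIT | SKETCH_GFP_IO)

/* Clear only the "use emergency pools" bit, leaving every other bit alone. */
unsigned int sketch_no_reserve_mask(unsigned int gfp)
{
    return gfp & ~SKETCH_GFP_HIGH;
}

/* Nonzero if an allocation with this mask is allowed to block. */
int sketch_may_sleep(unsigned int gfp)
{
    assert((gfp & ~0xffu) == 0);  /* sketch only models the low bits */
    return (gfp & SKETCH_GFP_WAIT) != 0;
}

/* Nonzero if an allocation with this mask may use the emergency pools. */
int sketch_uses_reserves(unsigned int gfp)
{
    return (gfp & SKETCH_GFP_HIGH) != 0;
}
```

Under these assumptions the resulting mask neither sleeps nor touches the
reserves: the allocation fails fast instead of blocking in reclaim, and the
emergency pools stay available for receiving replies from the server.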
On Tue, 2005-12-06 at 14:36 +1100, Andrew Morton wrote:
> Trond Myklebust <[email protected]> wrote:
> >
> > Argh... Not sure entirely how to deal with that... We definitely don't
> > want the thing futzing around inside throttle_vm_writeout(), 'cos
> > writeout isn't going to happen while the socket blocks.
> >
>
> As far as the core VM is concerned, these pages are really "dirty", only it
> happens to be a different flavour of dirtiness. So perhaps we should
> continue to mark these pages as dirty and let NFS internally take care
> of which end of the wire they're dirty at.
>
> Presumably calling writepage() a second time won't be very useful. Or will
> it? Perhaps when NFS sees writepage against a PageDirty && PageUnstable
> page it can recognise that as a hint to kick off a server-side write.
Calling writepages() would actually be better. That will do the right
thing, and trigger a commit if there are unstable writes.
Cheers,
Trond
Trond Myklebust wrote:
> On Tue, 2005-12-06 at 14:36 +1100, Andrew Morton wrote:
>
>>Trond Myklebust <[email protected]> wrote:
>>
>>>Argh... Not sure entirely how to deal with that... We definitely don't
>>> want the thing futzing around inside throttle_vm_writeout(), 'cos
>>> writeout isn't going to happen while the socket blocks.
>>>
>>
>>As far as the core VM is concerned, these pages are really "dirty", only it
>>happens to be a different flavour of dirtiness. So perhaps we should
>>continue to mark these pages as dirty and let NFS internally take care
>>of which end of the wire they're dirty at.
>>
>>Presumably calling writepage() a second time won't be very useful. Or will
>>it? Perhaps when NFS sees writepage against a PageDirty && PageUnstable
>>page it can recognise that as a hint to kick off a server-side write.
>
>
> Calling writepages() would actually be better. That will do the right
> thing, and trigger a commit if there are unstable writes.
>
writepage should as well, then it would have a better chance
of just doing the right thing.
--
SUSE Labs, Novell Inc.
On Tue, 2005-12-06 at 16:42 +1100, Nick Piggin wrote:
> writepage should as well, then it would have a better chance
> of just doing the right thing.
writepage triggers a stable write of the page (i.e. the page is written
directly to disk) if asked to reclaim it.
If the VM wants the unstable writes from the mapping to be committed, it
should call writepages.
Cheers,
Trond
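The distinction between the two calls can be modelled as a tiny state machine.
This is a user-space sketch of the behaviour described above, not NFS client
code: the enum values and sketch_* functions are hypothetical names, and the
model assumes that writepage for reclaim does a stable write of that one page
only, while writepages sends dirty pages unstable and then commits the mapping.

```c
#include <assert.h>
#include <stddef.h>

enum nfs_page_state { PG_CLEAN, PG_DIRTY, PG_UNSTABLE };
enum nfs_write_mode { WRITE_UNSTABLE, WRITE_STABLE };

/* Send one page to the server. Only dirty pages change state. */
enum nfs_page_state sketch_write(enum nfs_page_state s,
                                 enum nfs_write_mode mode)
{
    if (s != PG_DIRTY)
        return s;
    /* A stable write is on disk when the reply arrives. */
    return mode == WRITE_STABLE ? PG_CLEAN : PG_UNSTABLE;
}

/* COMMIT: the server flushes, so an unstable page becomes clean. */
enum nfs_page_state sketch_commit(enum nfs_page_state s)
{
    return s == PG_UNSTABLE ? PG_CLEAN : s;
}

/* writepage for reclaim: stable write of this page, no commit of others. */
enum nfs_page_state sketch_writepage(enum nfs_page_state s)
{
    return sketch_write(s, WRITE_STABLE);
}

/* writepages: write out dirty pages unstable, then commit the mapping. */
void sketch_writepages(enum nfs_page_state *pages, size_t n)
{
    assert(pages != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        pages[i] = sketch_write(pages[i], WRITE_UNSTABLE);
    for (size_t i = 0; i < n; i++)
        pages[i] = sketch_commit(pages[i]);
}
```

In this model a page that is already unstable is left untouched by
sketch_writepage(), which is why only the writepages path drains the unstable
count.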
Hi,
The patch linux-2.6.15-fix_sock_allocation.dif seems
to have helped with this issue.
With this patch applied I have been unable to
reproduce the system freezes I was experiencing in
latest 2.6.x kernels when using nfs-root on my
GameCube (24MB RAM).
Thanks,
Albert
--- Trond Myklebust <[email protected]> wrote:
> Gah... This is the problem:
...
>
> Ah, well... Try GFP_ATOMIC first, and see if that helps.
That DOES IT!!!!
-Kenny
--- Andrew Morton <[email protected]> wrote:
> Trond Myklebust <[email protected]> wrote:
> > ...OTOH, I'm not entirely sure we want to use GFP_ATOMIC and risk
> > bleeding the emergency pools dry: we also need those in order to receive
> > replies from the server.
>
> You can use (GFP_ATOMIC & ~__GFP_HIGH) to avoid draining emergency pools.
>
>
After beating on this for a while now, it all seems very happy. The writeouts to NFS are a little
choppy, but they make forward progress.
Any chance of this being in 2.6.15?
-Kenny