Hi,
one of my systems is running Debian stable with a self-compiled Linux
kernel. On this system, Debian's aptitude binary is started hourly
from cron to check for new packages (including virus scan definition
packages, this is actually the reason for the update running so often).
After updating to 2.6.19, Debian's apt control file
/var/cache/apt/pkgcache.bin corrupts pretty frequently - like in under
six hours. In that situation, "aptitude update" segfaults. When I
delete the file and have apt recreate it, things are fine again for a
few hours before the file is broken again and the segfaults start over.
In all cases, umounting the file system and doing an fsck does not
show issues with the file system.
I went back to 2.6.18.3 to debug this, and the system ran for three
days without problems and without corrupting
/var/cache/apt/pkgcache.bin. After booting 2.6.19 again, it took three
hours for the file corruption to show again.
I have no idea what could cause this other than the 2.6.19
kernel.
The file system in question is an ext3fs on an LVM LV, which is member
of a VG that only has a single PV, which in turn is on a primary
partition of the first IDE hard disk, hda. The IDE interface is a VIA
Technologies, Inc. VT82C586A/B/VT82C686/A/B/VT823x/A/C PIPC Bus Master
IDE (rev 06). The box is a rented server in a colocation, and I do not
have access to the console or physical access to the box itself.
I'll happily deliver information that might be needed to nail down
this issue. Can anybody give advice about how to solve this?
Greetings
Marc
--
-----------------------------------------------------------------------------
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Mannheim, Germany | lose things." Winona Ryder | Fon: *49 621 72739834
Nordisch by Nature | How to make an American Quilt | Fax: *49 621 72739835
On Fri, Dec 08, 2006 at 10:38:12AM +0900, Fernando Luis Vázquez Cao wrote:
> Does the patch below help?
>
> http://marc.theaimsgroup.com/?l=linux-ext4&m=116483980823714&w=4
No, pkgcache.bin still getting corrupted within two hours of using
2.6.19.
Greetings
Marc, back to 2.6.18.3 for the time being
Marc Haber wrote:
> I went back to 2.6.18.3 to debug this, and the system ran for three
> days without problems and without corrupting
> /var/cache/apt/pkgcache.bin. After booting 2.6.19 again, it took three
> hours for the file corruption to show again.
>
> I do not have an idea what could cause this other than the 2.6.19
> kernel.
<snip>
> I'll happily deliver information that might be needed to nail down
> this issue. Can anybody give advice about how to solve this?
I'd say start git bisecting to track down which commit the problem
starts at.
On Thu, 2006-12-07 at 11:50 -0500, Phillip Susi wrote:
> Marc Haber wrote:
> > I went back to 2.6.18.3 to debug this, and the system ran for three
> > days without problems and without corrupting
> > /var/cache/apt/pkgcache.bin. After booting 2.6.19 again, it took three
> > hours for the file corruption to show again.
> >
> > I do not have an idea what could cause this other than the 2.6.19
> > kernel.
> <snip>
> > I'll happily deliver information that might be needed to nail down
> > this issue. Can anybody give advice about how to solve this?
>
> I'd say start git bisecting to track down which commit the problem
> starts at.
Does the patch below help?
http://marc.theaimsgroup.com/?l=linux-ext4&m=116483980823714&w=4
On Thu, Dec 07, 2006 at 11:50:37AM -0500, Phillip Susi wrote:
> Marc Haber wrote:
> >I went back to 2.6.18.3 to debug this, and the system ran for three
> >days without problems and without corrupting
> >/var/cache/apt/pkgcache.bin. After booting 2.6.19 again, it took three
> >hours for the file corruption to show again.
> >
> >I do not have an idea what could cause this other than the 2.6.19
> >kernel.
> <snip>
> >I'll happily deliver information that might be needed to nail down
> >this issue. Can anybody give advice about how to solve this?
>
> I'd say start git bisecting to track down which commit the problem
> starts at.
Unfortunately, I am lacking the knowledge needed to do this in an
informed way. I am neither familiar enough with git nor do I possess
the necessary C powers.
Greetings
Marc
> On Fri, Dec 08, 2006 at 10:38:12AM +0900, Fernando Luis Vázquez Cao wrote:
> > Does the patch below help?
> >
> > http://marc.theaimsgroup.com/?l=linux-ext4&m=116483980823714&w=4
>
> No, pkgcache.bin still getting corrupted within two hours of using
> 2.6.19.
Hmm, interesting. I'll try to reproduce the problem. In the meantime,
does mounting the filesystem with data=writeback help?
Honza
--
Jan Kara <[email protected]>
SuSE CR Labs
On Fri, 2006-12-08 at 17:42 +0100, Marc Haber wrote:
> On Fri, Dec 08, 2006 at 10:38:12AM +0900, Fernando Luis Vázquez Cao wrote:
> > Does the patch below help?
> >
> > http://marc.theaimsgroup.com/?l=linux-ext4&m=116483980823714&w=4
>
> No, pkgcache.bin still getting corrupted within two hours of using
> 2.6.19.
>
> Greetings
> Marc, back to 2.6.18.3 for the time being
Hi,
I've missed most of this thread, but have cause to be interested. Do
you have a generic recipe for reproducing file corruption? I seem to be
(read pretty darn sure, modulo hw (wish) vs sw testing methods...)
experiencing memory corruption problems with 2.6.19, and am interested
in anything that might be related (trigger!).
-Mike
On Sun, Dec 10, 2006 at 12:46:01AM +0100, Mike Galbraith wrote:
> On Fri, 2006-12-08 at 17:42 +0100, Marc Haber wrote:
> > On Fri, Dec 08, 2006 at 10:38:12AM +0900, Fernando Luis Vázquez Cao wrote:
> > > Does the patch below help?
> > >
> > > http://marc.theaimsgroup.com/?l=linux-ext4&m=116483980823714&w=4
> >
> > No, pkgcache.bin still getting corrupted within two hours of using
> > 2.6.19.
> >
> > Greetings
> > Marc, back to 2.6.18.3 for the time being
>
> Hi,
>
> I've missed most of this thread, but have cause to be interested. Do
> you have a generic recipe for reproducing file corruption?
My recipe is running apt-get update from cron. This needs Debian
though. Maybe a chroot installation will suffice.
I'm going to try data=writeback first.
Greetings
Marc
On Sat, Dec 09, 2006 at 11:47:58AM +0100, Jan Kara wrote:
> In the mean time
> does mounting the filesystem with data=writeback help?
I have now nine hours uptime with data=writeback, and the file is
still OK. Looks good.
By this posting, I'm going to invoke murphy, so I'll report again
tomorrow.
Greetings
Marc
> On Sat, Dec 09, 2006 at 11:47:58AM +0100, Jan Kara wrote:
> > In the mean time
> > does mounting the filesystem with data=writeback help?
>
> I have now nine hours uptime with data=writeback, and the file is
> still OK. Looks good.
>
> By this posting, I'm going to invoke murphy, so I'll report again
> tomorrow.
Since you haven't written till today I assume that data=writeback does
not have a problem. Hmm. I really start to suspect my changes to JBD
commit code. But I was trying to reproduce the problem by copying files
there and back without success :( Also I checked the code and I don't see
how we could lose dirty bits on buffers (which is probably what happens
as one guy has written to me that he also sees the problem when using
rtorrent which does checksum after downloading and that passes fine).
Next I'm going to try to reproduce the problem with heavy mmap load.
Maybe that would trigger it.
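
A heavy mmap load of the kind mentioned above could be sketched roughly
as follows. This is only an illustrative user-space test, not the test
that was actually run: the function name mmap_write_and_verify(), the
fill pattern and the buffer size are assumptions made for the example.
It dirties a file through a MAP_SHARED mapping, flushes it with msync(),
and then re-reads the file with pread() to compare it against the
expected pattern.

```c
#define _POSIX_C_SOURCE 200809L

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: expected byte at a given file offset. */
static unsigned char pattern_at(size_t off, unsigned char seed)
{
	return (unsigned char)(off * 31u + seed);
}

/*
 * Write a known pattern into `path` through a shared mapping, flush it,
 * then read the file back through pread() and compare.
 * `len` must be greater than zero.
 * Returns 0 if the data matches, 1 on a mismatch, -1 on a system error.
 */
int mmap_write_and_verify(const char *path, size_t len, unsigned char seed)
{
	int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t)len) < 0) {
		close(fd);
		return -1;
	}

	unsigned char *map = mmap(NULL, len, PROT_READ | PROT_WRITE,
				  MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		close(fd);
		return -1;
	}

	/* Dirty every page through the page tables, not through write(). */
	for (size_t i = 0; i < len; i++)
		map[i] = pattern_at(i, seed);

	/* Flush the mapping, then drop it regardless of the msync() result. */
	int rc = msync(map, len, MS_SYNC);
	if (munmap(map, len) < 0 || rc < 0) {
		close(fd);
		return -1;
	}

	/* Re-read through the regular file interface and compare. */
	int result = 0;
	size_t off = 0;
	unsigned char buf[4096];
	while (off < len) {
		ssize_t n = pread(fd, buf, sizeof(buf), (off_t)off);
		if (n <= 0) {
			result = -1;
			break;
		}
		for (ssize_t i = 0; i < n && result == 0; i++)
			if (buf[i] != pattern_at(off + (size_t)i, seed))
				result = 1;
		if (result != 0)
			break;
		off += (size_t)n;
	}

	close(fd);
	return result;
}
```

Calling this in a loop on the affected filesystem, ideally under memory
pressure, is one way to exercise the mmap write path. Whether it
triggers the corruption discussed in this thread is not established; on
a healthy kernel it should simply return 0.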
Honza
--
Jan Kara <[email protected]>
SuSE CR Labs
On Thu, Dec 14, 2006 at 01:03:41PM +0100, Jan Kara wrote:
> > On Sat, Dec 09, 2006 at 11:47:58AM +0100, Jan Kara wrote:
> > > In the mean time
> > > does mounting the filesystem with data=writeback help?
> >
> > I have now nine hours uptime with data=writeback, and the file is
> > still OK. Looks good.
> >
> > By this posting, I'm going to invoke murphy, so I'll report again
> > tomorrow.
> Since you haven't written till today I assume that data=writeback does
> not have a problem.
It does not have a problem, right. Additionally, updating to 2.6.19.1
allowed me to remove data=writeback without the issue re-surfacing. I
suspect that the issue is fixed now.
Thanks for helping.
Greetings
Marc
On Fri, Dec 15, 2006 at 10:30:34AM +0100, Marc Haber wrote:
> Additionally, updating to 2.6.19.1
> allowed me to remove data=writeback without the issue re-surfacing. I
> suspect that the issue is fixed now.
Unfortunately, this suspicion proved wrong when the file was corrupted
again this morning.
Greetings
Marc
* Marc Haber <[email protected]> [2006-12-09 10:26]:
> Unfortunately, I am lacking the knowledge needed to do this in an
> informed way. I am neither familiar enough with git nor do I possess
> the necessary C powers.
I wonder if what you're seeing is related to
http://lkml.org/lkml/2006/12/16/73
You said that you don't see any corruption with 2.6.18. Can you try
to apply the patch from
http://www2.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=d08b3851da41d0ee60851f2c75b118e1f7a5fc89
to 2.6.18 to see if the corruption shows up?
--
Martin Michlmayr
[email protected]
* Marc Haber:
> After updating to 2.6.19, Debian's apt control file
> /var/cache/apt/pkgcache.bin corrupts pretty frequently - like in under
> six hours.
I've seen that with Debian's 2.6.18 kernels as well. Perhaps it's
related to this Debian bug?
<http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=401006>
On Sat, 16 Dec 2006, Martin Michlmayr wrote:
> * Marc Haber <[email protected]> [2006-12-09 10:26]:
> > Unfortunately, I am lacking the knowledge needed to do this in an
> > informed way. I am neither familiar enough with git nor do I possess
> > the necessary C powers.
>
> I wonder if what you're seeing is related to
> http://lkml.org/lkml/2006/12/16/73
>
> You said that you don't see any corruption with 2.6.18. Can you try
> to apply the patch from
> http://www2.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=d08b3851da41d0ee60851f2c75b118e1f7a5fc89
> to 2.6.18 to see if the corruption shows up?
I did wonder about the very first hunk of Peter's patch, where the
mapping->private_lock is unlocked earlier now in try_to_free_buffers,
before the clear_page_dirty. I'm not at all familiar with that area,
I wonder if Jan has looked at that change, and might be able to say
whether it's good or not (earlier he worried about his JBD changes,
but they wouldn't be implicated if just 2.6.18+Peter's gives trouble).
Hugh
On Sat, 2006-12-16 at 19:18 +0000, Hugh Dickins wrote:
> On Sat, 16 Dec 2006, Martin Michlmayr wrote:
> > * Marc Haber <[email protected]> [2006-12-09 10:26]:
> > > Unfortunately, I am lacking the knowledge needed to do this in an
> > > informed way. I am neither familiar enough with git nor do I possess
> > > the necessary C powers.
> >
> > I wonder if what you're seeing is related to
> > http://lkml.org/lkml/2006/12/16/73
> >
> > You said that you don't see any corruption with 2.6.18. Can you try
> > to apply the patch from
> > http://www2.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=d08b3851da41d0ee60851f2c75b118e1f7a5fc89
> > to 2.6.18 to see if the corruption shows up?
>
> I did wonder about the very first hunk of Peter's patch, where the
> mapping->private_lock is unlocked earlier now in try_to_free_buffers,
> before the clear_page_dirty. I'm not at all familiar with that area,
> I wonder if Jan has looked at that change, and might be able to say
> whether it's good or not (earlier he worried about his JBD changes,
> but they wouldn't be implicated if just 2.6.18+Peter's gives trouble).
fs/buffer.c:2775
/*
* try_to_free_buffers() checks if all the buffers on this particular page
* are unused, and releases them if so.
*
* Exclusion against try_to_free_buffers may be obtained by either
* locking the page or by holding its mapping's private_lock.
*
* If the page is dirty but all the buffers are clean then we need to
* be sure to mark the page clean as well. This is because the page
* may be against a block device, and a later reattachment of buffers
* to a dirty page will set *all* buffers dirty. Which would corrupt
* filesystem data on the same device.
*
* The same applies to regular filesystem pages: if all the buffers are
* clean then we set the page clean and proceed. To do that, we require
* total exclusion from __set_page_dirty_buffers(). That is obtained with
* private_lock.
*
* try_to_free_buffers() is non-blocking.
*/
Note the third paragraph. Would I have opened up a race by moving that
unlock upwards, such that it is possible to re-attach buffers to the
page before having it marked clean; which according to this text will
mark those buffers dirty and cause data corruption?
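
The suspected window can be illustrated with a small single-threaded toy
model. This is not kernel code: every name in it (toy_page,
toy_drop_buffers(), toy_attach_buffers(), toy_free_then_attach()) is
hypothetical, and the interleaving is forced by a flag instead of coming
from a real race. It only encodes the rule from the comment quoted
above, that buffers reattached to a dirty page are all set dirty.

```c
#include <stdbool.h>

#define BUFFERS_PER_PAGE 4

/* Toy stand-in for a page with attached buffer heads. */
struct toy_page {
	bool dirty;
	bool has_buffers;
	bool buffer_dirty[BUFFERS_PER_PAGE];
};

/* Models drop_buffers(): succeeds only if every buffer is clean. */
static bool toy_drop_buffers(struct toy_page *p)
{
	if (!p->has_buffers)
		return false;
	for (int i = 0; i < BUFFERS_PER_PAGE; i++)
		if (p->buffer_dirty[i])
			return false;
	p->has_buffers = false;
	return true;
}

/* Models reattachment: on a dirty page, *all* buffers become dirty. */
static void toy_attach_buffers(struct toy_page *p)
{
	p->has_buffers = true;
	for (int i = 0; i < BUFFERS_PER_PAGE; i++)
		p->buffer_dirty[i] = p->dirty;
}

static int toy_count_dirty_buffers(const struct toy_page *p)
{
	int n = 0;
	for (int i = 0; i < BUFFERS_PER_PAGE; i++)
		n += p->buffer_dirty[i] ? 1 : 0;
	return n;
}

/*
 * Replays try_to_free_buffers() followed by a reattachment of buffers.
 * If attach_in_window is true, the reattachment happens after the
 * buffers were dropped but before the page is marked clean (the window
 * opened by unlocking early). Otherwise it happens after the page has
 * been marked clean.
 * Returns the number of buffers that end up dirty.
 */
int toy_free_then_attach(bool attach_in_window)
{
	/* Page is dirty, but the filesystem already wrote all its buffers. */
	struct toy_page p = { .dirty = true, .has_buffers = true };

	if (toy_drop_buffers(&p)) {
		if (attach_in_window)
			toy_attach_buffers(&p);
		p.dirty = false;	/* "mark the page clean as well" */
	}
	if (!attach_in_window)
		toy_attach_buffers(&p);

	return toy_count_dirty_buffers(&p);
}
```

In this model toy_free_then_attach(true) leaves every buffer dirty,
while toy_free_then_attach(false) leaves none dirty. It says nothing
about whether such an interleaving can really occur in the kernel.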
Hmm, how to go about something like this:
---
Moving the cleaning of the page out from under the private_lock opened
up a window where newly attached buffers might still see the page dirty
status and were thus marked (incorrectly) dirty themselves; resulting in
filesystem data corruption.
Close this by moving the cleaning of the page inside of the private_lock
scope again. However it is not possible to call page_mkclean() from
within the private_lock (this violates locking order); thus introduce a
variant of test_clear_page_dirty() that does not call page_mkclean(), and
call page_mkclean() ourselves, outside of the private_lock, when we did
clean the page.
This is still safe because the page is still locked by means of
PG_locked.
Signed-off-by: Peter Zijlstra <[email protected]>
---
fs/buffer.c | 11 +++++++++--
include/linux/page-flags.h | 1 +
mm/page-writeback.c | 10 ++++++++--
3 files changed, 18 insertions(+), 4 deletions(-)
Index: linux-2.6-git/fs/buffer.c
===================================================================
--- linux-2.6-git.orig/fs/buffer.c 2006-12-16 22:18:24.000000000 +0100
+++ linux-2.6-git/fs/buffer.c 2006-12-16 22:22:17.000000000 +0100
@@ -42,6 +42,7 @@
#include <linux/bitops.h>
#include <linux/mpage.h>
#include <linux/bit_spinlock.h>
+#include <linux/rmap.h>
static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);
static void invalidate_bh_lrus(void);
@@ -2832,6 +2833,7 @@ int try_to_free_buffers(struct page *pag
struct address_space * const mapping = page->mapping;
struct buffer_head *buffers_to_free = NULL;
int ret = 0;
+ int must_clean = 0;
BUG_ON(!PageLocked(page));
if (PageWriteback(page))
@@ -2844,7 +2846,6 @@ int try_to_free_buffers(struct page *pag
spin_lock(&mapping->private_lock);
ret = drop_buffers(page, &buffers_to_free);
- spin_unlock(&mapping->private_lock);
if (ret) {
/*
* If the filesystem writes its buffers by hand (eg ext3)
@@ -2858,9 +2859,15 @@ int try_to_free_buffers(struct page *pag
* the page's buffers clean. We discover that here and clean
* the page also.
*/
- if (test_clear_page_dirty(page))
+ if (__test_clear_page_dirty(page, 0)) {
task_io_account_cancelled_write(PAGE_CACHE_SIZE);
+ if (mapping_cap_account_dirty(mapping))
+ must_clean = 1;
+ }
}
+ spin_unlock(&mapping->private_lock);
+ if (must_clean)
+ page_mkclean(page);
out:
if (buffers_to_free) {
struct buffer_head *bh = buffers_to_free;
Index: linux-2.6-git/include/linux/page-flags.h
===================================================================
--- linux-2.6-git.orig/include/linux/page-flags.h 2006-12-16 22:19:56.000000000 +0100
+++ linux-2.6-git/include/linux/page-flags.h 2006-12-16 22:20:07.000000000 +0100
@@ -253,6 +253,7 @@ static inline void SetPageUptodate(struc
struct page; /* forward declaration */
+int __test_clear_page_dirty(struct page *page, int do_clean);
int test_clear_page_dirty(struct page *page);
int test_clear_page_writeback(struct page *page);
int test_set_page_writeback(struct page *page);
Index: linux-2.6-git/mm/page-writeback.c
===================================================================
--- linux-2.6-git.orig/mm/page-writeback.c 2006-12-16 22:18:18.000000000 +0100
+++ linux-2.6-git/mm/page-writeback.c 2006-12-16 22:19:42.000000000 +0100
@@ -854,7 +854,7 @@ EXPORT_SYMBOL(set_page_dirty_lock);
* Clear a page's dirty flag, while caring for dirty memory accounting.
* Returns true if the page was previously dirty.
*/
-int test_clear_page_dirty(struct page *page)
+int __test_clear_page_dirty(struct page *page, int do_clean)
{
struct address_space *mapping = page_mapping(page);
unsigned long flags;
@@ -872,7 +872,8 @@ int test_clear_page_dirty(struct page *p
* page is locked, which pins the address_space
*/
if (mapping_cap_account_dirty(mapping)) {
- page_mkclean(page);
+ if (do_clean)
+ page_mkclean(page);
dec_zone_page_state(page, NR_FILE_DIRTY);
}
return 1;
@@ -880,6 +881,11 @@ int test_clear_page_dirty(struct page *p
write_unlock_irqrestore(&mapping->tree_lock, flags);
return 0;
}
+
+int test_clear_page_dirty(struct page *page)
+{
+ return __test_clear_page_dirty(page, 1);
+}
EXPORT_SYMBOL(test_clear_page_dirty);
/*
On Sat, 16 Dec 2006, Peter Zijlstra wrote:
> Moving the cleaning of the page out from under the private_lock opened
> up a window where newly attached buffers might still see the page dirty
> status and were thus marked (incorrectly) dirty themselves; resulting in
> filesystem data corruption.
I'm not going to pretend to understand the buffers issues here:
people thought that change was safe originally, and I can't say
it's not - it just stood out as a potentially weakening change.
The patch you propose certainly looks like a good way out, if
that moved unlock really is a problem: your patch is very well
worth trying by those people seeing their corruption problems,
let's wait to hear their feedback.
Thanks!
Hugh
Hello,
I had filesystem data corruption with rtorrent with 2.6.19.
I tried recent git with Peter Zijlstra patch
http://lkml.org/lkml/2006/12/16/144 and it seems that the problem is
fixed.
Please CC as I am not subscribed to lkml.
Andrei
On Sat, 16 Dec 2006 19:31:25 +0100
Florian Weimer <[email protected]> wrote:
> * Marc Haber:
>
> > After updating to 2.6.19, Debian's apt control file
> > /var/cache/apt/pkgcache.bin corrupts pretty frequently - like in under
> > six hours.
>
> I've seen that with Debian's 2.6.18 kernels as well. Perhaps it's
> related to this Debian bug?
>
> <http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=401006>
ugh, that's pretty damning. And rtorrent uses MAP_SHARED.
On Sun, 17 Dec 2006 02:13:18 +0200
Andrei Popa <[email protected]> wrote:
> Hello,
> I had filesystem data corruption with rtorrent with 2.6.19.
> I tried recent git with Peter Zijlstra patch
> http://lkml.org/lkml/2006/12/16/144 and it seems that the problem is
> fixed.
>
oh crap, I'd forgotten that test_clear_page_dirty() now fiddles with the
ptes.
I'd be really surprised if this was all due to a race though. Is everyone
who has observed this problem running SMP and/or preemptible kernels?
Peter, why isn't that proposed patch's cleaning of the pte racy against
do_wp_page()?
On Sun, Dec 17, 2006 at 04:06:20AM -0800, Andrew Morton wrote:
> I'd be really surprised if this was all due to a race though. Is everyone
> who has observed this problem running SMP and/or preemptible kernels?
Linux torres 2.6.19.1-zgsrv #1 SMP PREEMPT Wed Dec 13 01:31:27 UTC 2006 i686 GNU/Linux
So, it's a "yes" to both counts, and I'll build a kernel without SMP
and without preemption asap.
Greetings
Marc
ierdnac ~ # uname -a
Linux ierdnac 2.6.20-rc1 #1 SMP PREEMPT Sun Dec 17 01:52:28 EET 2006 i686 Genuine Intel(R) CPU T2050 @ 1.60GHz GenuineIntel GNU/Linux
On Sun, 2006-12-17 at 04:06 -0800, Andrew Morton wrote:
> On Sun, 17 Dec 2006 02:13:18 +0200
> Andrei Popa <[email protected]> wrote:
>
> > Hello,
> > I had filesystem data corruption with rtorrent with 2.6.19.
> > I tried recent git with Peter Zijlstra patch
> > http://lkml.org/lkml/2006/12/16/144 and it seems that the problem is
> > fixed.
> >
>
> oh crap, I'd forgotten that test_clear_page_dirty() now fiddles with the
> ptes.
>
> I'd be really surprised if this was all due to a race though. Is everyone
> who has observed this problem running SMP and/or preemptible kernels?
>
> Peter, why isn't that proposed patch's cleaning of the pte racy against
> do_wp_page()?
I was mistaken, I'm still having file corruption with rtorrent.
> On Sat, 16 Dec 2006, Martin Michlmayr wrote:
> > * Marc Haber <[email protected]> [2006-12-09 10:26]:
> > > Unfortunately, I am lacking the knowledge needed to do this in an
> > > informed way. I am neither familiar enough with git nor do I possess
> > > the necessary C powers.
> >
> > I wonder if what you're seeing is related to
> > http://lkml.org/lkml/2006/12/16/73
> >
> > You said that you don't see any corruption with 2.6.18. Can you try
> > to apply the patch from
> > http://www2.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=d08b3851da41d0ee60851f2c75b118e1f7a5fc89
> > to 2.6.18 to see if the corruption shows up?
>
> I did wonder about the very first hunk of Peter's patch, where the
> mapping->private_lock is unlocked earlier now in try_to_free_buffers,
> before the clear_page_dirty. I'm not at all familiar with that area,
> I wonder if Jan has looked at that change, and might be able to say
> whether it's good or not (earlier he worried about his JBD changes,
> but they wouldn't be implicated if just 2.6.18+Peter's gives trouble).
Thanks for the pointer. I was not aware of this change, I'll have a look
at it on Monday. Actually Mickael has checked that he sees corruption
even if all the JBD changes are backed out so I was going to look for
other changes in VFS that could cause that.
Honza
--
Jan Kara <[email protected]>
SuSE CR Labs
On Sun, 17 Dec 2006 15:39:32 +0200
Andrei Popa <[email protected]> wrote:
> I was mistaken, I'm still having file corruption with rtorrent.
>
Well I'm not very optimistic, but if people could try this, please...
From: Andrew Morton <[email protected]>
try_to_free_buffers() clears the page's dirty state if it successfully removed
the page's buffers.
Background for this:
- a process does a one-byte-write to a file on a 64k pagesize, 4k
blocksize ext3 filesystem. The page is now PageDirty, !PageUptodate and
has one dirty buffer and 15 not uptodate buffers.
- kjournald writes the dirty buffer. The page is now PageDirty,
!PageUptodate and has a mix of clean and not uptodate buffers.
- try_to_free_buffers() removes the page's buffers. It MUST now clear
PageDirty. If we were to leave the page dirty then we'd have a dirty, not
uptodate page with no buffer_heads.
We're screwed: we cannot write the page because we don't know which
sections of it contain garbage. We cannot read the page because we don't
know which sections of it contain modified data. We cannot free the page
because it is dirty.
Peter's "mm: tracking shared dirty pages"
(d08b3851da41d0ee60851f2c75b118e1f7a5fc89) modified clear_page_dirty() so that
it also clears the page's pte mapping's dirty flags, arranging for a
subsequent userspace modification of the page to cause a fault.
That change to clear_page_dirty() was correct for when it is called on the
writeback path. Here, we effectively do:
ClearPageDirty()
pte_mkclean()
submit-the-writeout
If a page-dirtying via write() or via pte's happens after the ClearPageDirty()
or the pte_mkclean() then the page is redirtied while writeout is in flight
and the page will again need writing; no probs.
But that change to clear_page_dirty() was incorrect for when it is called on
the try_to_free_buffers() path. Here, we want to preserve any pte-dirtiness
because we're not going to write the page to backing store. We need to keep
a record of any userspace modification to the page.
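
The distinction between the two paths can be restated as a toy model.
This is a sketch under stated assumptions, not kernel code: the
toy_page structure and all toy_*() functions are hypothetical, the page
contents are reduced to a single integer, and the starting state (page
dirty, pte dirty, memory newer than disk) is simply assumed. It only
mirrors the argument made in this description and does not show that
the argument is right.

```c
#include <stdbool.h>

struct toy_page {
	bool page_dirty;	/* models PageDirty */
	bool pte_dirty;		/* models the dirty bit in the mapping pte */
	int mem;		/* page contents in memory */
	int disk;		/* page contents on backing store */
};

/* Models test_clear_page_dirty(page, must_clean_ptes) from the patch. */
static bool toy_test_clear_page_dirty(struct toy_page *p, bool must_clean_ptes)
{
	bool was_dirty = p->page_dirty;

	p->page_dirty = false;
	if (was_dirty && must_clean_ptes)
		p->pte_dirty = false;	/* stands in for page_mkclean() */
	return was_dirty;
}

/* Writeback path: clear the dirty state, then write the page out. */
static void toy_writeback(struct toy_page *p)
{
	if (toy_test_clear_page_dirty(p, true))
		p->disk = p->mem;
}

/* try_to_free_buffers() path: dirty state is cleared, nothing is written. */
static void toy_free_buffers(struct toy_page *p, bool must_clean_ptes)
{
	toy_test_clear_page_dirty(p, must_clean_ptes);
}

/* True if memory differs from disk and nothing records that fact. */
static bool toy_modification_lost(const struct toy_page *p)
{
	return p->mem != p->disk && !p->page_dirty && !p->pte_dirty;
}

/*
 * Start from a page that userspace has modified through its mapping,
 * run one of the two paths, and report whether the record of the
 * modification was lost.
 */
bool toy_scenario(bool via_writeback, bool must_clean_ptes)
{
	struct toy_page p = {
		.page_dirty = true, .pte_dirty = true, .mem = 2, .disk = 1,
	};

	if (via_writeback)
		toy_writeback(&p);
	else
		toy_free_buffers(&p, must_clean_ptes);
	return toy_modification_lost(&p);
}
```

In this model only toy_scenario(false, true) reports a lost
modification: the try_to_free_buffers() path with pte cleaning. The
writeback path and the try_to_free_buffers() path without pte cleaning
both keep either the data or a record that it still needs writing.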
One way of addressing this would be to bail from try_to_free_buffers() if the
page is mapped into pagetables. However that is racy, because the pagefault
path doesn't lock the page when establishing a pte against it (I wish it did
- it would solve a lot of nasties).
So this patch instead arranges for clear_page_dirty() to not clean the pte's
when it is called on the try_to_free_buffers() path.
clear_page_dirty() had several callers and it's not immediately obvious to me
what the appropriate behaviour is in each case. Could maintainers please take
a look?
From my quick reading, all callers of try_to_free_buffers() have already
unmapped the page from pagetables, and given that the reported ext3 corruption
happens on uniprocessor, non-preempt kernels, I doubt if this patch will fix
things.
But even if it is true that try_to_free_buffers() callers unmap the page
first, this fix is still needed, because a minor fault could reestablish pte's
in the meanwhile.
Note that with this change, we can now restore try_to_free_buffers()'s
->private_lock to cover the test_clear_page_dirty(). If we indeed need to do
that, it'll be in a separate patch.
(Need to think about this some more. How can a page be pte-dirty, but not
have dirty buffers? We're supposed to clean the pte's when we write the
page, and we dirty the page and buffers when userspace dirties the pte...)
Cc: Miklos Szeredi <[email protected]>
Cc: <[email protected]>
Cc: Dave Kleikamp <[email protected]>
Cc: David Chinner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
fs/buffer.c | 2 +-
fs/cifs/file.c | 2 +-
fs/fuse/file.c | 2 +-
fs/hugetlbfs/inode.c | 2 +-
fs/jfs/jfs_metapage.c | 2 +-
fs/reiserfs/stree.c | 2 +-
fs/xfs/linux-2.6/xfs_aops.c | 2 +-
include/linux/page-flags.h | 6 +++---
mm/page-writeback.c | 5 +++--
mm/truncate.c | 4 ++--
10 files changed, 15 insertions(+), 14 deletions(-)
diff -puN fs/buffer.c~try_to_free_buffers-dont-clear-pte-dirty-bits fs/buffer.c
--- a/fs/buffer.c~try_to_free_buffers-dont-clear-pte-dirty-bits
+++ a/fs/buffer.c
@@ -2858,7 +2858,7 @@ int try_to_free_buffers(struct page *pag
* the page's buffers clean. We discover that here and clean
* the page also.
*/
- if (test_clear_page_dirty(page))
+ if (test_clear_page_dirty(page, 0))
task_io_account_cancelled_write(PAGE_CACHE_SIZE);
}
out:
diff -puN fs/fuse/file.c~try_to_free_buffers-dont-clear-pte-dirty-bits fs/fuse/file.c
--- a/fs/fuse/file.c~try_to_free_buffers-dont-clear-pte-dirty-bits
+++ a/fs/fuse/file.c
@@ -484,7 +484,7 @@ static int fuse_commit_write(struct file
spin_unlock(&fc->lock);
if (offset == 0 && to == PAGE_CACHE_SIZE) {
- clear_page_dirty(page);
+ clear_page_dirty(page, 0);
SetPageUptodate(page);
}
}
diff -puN fs/hugetlbfs/inode.c~try_to_free_buffers-dont-clear-pte-dirty-bits fs/hugetlbfs/inode.c
--- a/fs/hugetlbfs/inode.c~try_to_free_buffers-dont-clear-pte-dirty-bits
+++ a/fs/hugetlbfs/inode.c
@@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct
static void truncate_huge_page(struct page *page)
{
- clear_page_dirty(page);
+ clear_page_dirty(page, 1);
ClearPageUptodate(page);
remove_from_page_cache(page);
put_page(page);
diff -puN fs/jfs/jfs_metapage.c~try_to_free_buffers-dont-clear-pte-dirty-bits fs/jfs/jfs_metapage.c
--- a/fs/jfs/jfs_metapage.c~try_to_free_buffers-dont-clear-pte-dirty-bits
+++ a/fs/jfs/jfs_metapage.c
@@ -773,7 +773,7 @@ void release_metapage(struct metapage *
/* Retest mp->count since we may have released page lock */
if (test_bit(META_discard, &mp->flag) && !mp->count) {
- clear_page_dirty(page);
+ clear_page_dirty(page, 1);
ClearPageUptodate(page);
}
#else
diff -puN fs/reiserfs/stree.c~try_to_free_buffers-dont-clear-pte-dirty-bits fs/reiserfs/stree.c
--- a/fs/reiserfs/stree.c~try_to_free_buffers-dont-clear-pte-dirty-bits
+++ a/fs/reiserfs/stree.c
@@ -1459,7 +1459,7 @@ static void unmap_buffers(struct page *p
bh = next;
} while (bh != head);
if (PAGE_SIZE == bh->b_size) {
- clear_page_dirty(page);
+ clear_page_dirty(page, 0);
}
}
}
diff -puN fs/xfs/linux-2.6/xfs_aops.c~try_to_free_buffers-dont-clear-pte-dirty-bits fs/xfs/linux-2.6/xfs_aops.c
--- a/fs/xfs/linux-2.6/xfs_aops.c~try_to_free_buffers-dont-clear-pte-dirty-bits
+++ a/fs/xfs/linux-2.6/xfs_aops.c
@@ -343,7 +343,7 @@ xfs_start_page_writeback(
ASSERT(!PageWriteback(page));
set_page_writeback(page);
if (clear_dirty)
- clear_page_dirty(page);
+ clear_page_dirty(page, 1);
unlock_page(page);
if (!buffers) {
end_page_writeback(page);
diff -puN include/linux/page-flags.h~try_to_free_buffers-dont-clear-pte-dirty-bits include/linux/page-flags.h
--- a/include/linux/page-flags.h~try_to_free_buffers-dont-clear-pte-dirty-bits
+++ a/include/linux/page-flags.h
@@ -253,13 +253,13 @@ static inline void SetPageUptodate(struc
struct page; /* forward declaration */
-int test_clear_page_dirty(struct page *page);
+int test_clear_page_dirty(struct page *page, int must_clean_ptes);
int test_clear_page_writeback(struct page *page);
int test_set_page_writeback(struct page *page);
-static inline void clear_page_dirty(struct page *page)
+static inline void clear_page_dirty(struct page *page, int must_clean_ptes)
{
- test_clear_page_dirty(page);
+ test_clear_page_dirty(page, must_clean_ptes);
}
static inline void set_page_writeback(struct page *page)
diff -puN mm/page-writeback.c~try_to_free_buffers-dont-clear-pte-dirty-bits mm/page-writeback.c
--- a/mm/page-writeback.c~try_to_free_buffers-dont-clear-pte-dirty-bits
+++ a/mm/page-writeback.c
@@ -848,7 +848,7 @@ EXPORT_SYMBOL(set_page_dirty_lock);
* Clear a page's dirty flag, while caring for dirty memory accounting.
* Returns true if the page was previously dirty.
*/
-int test_clear_page_dirty(struct page *page)
+int test_clear_page_dirty(struct page *page, int must_clean_ptes)
{
struct address_space *mapping = page_mapping(page);
unsigned long flags;
@@ -866,7 +866,8 @@ int test_clear_page_dirty(struct page *p
* page is locked, which pins the address_space
*/
if (mapping_cap_account_dirty(mapping)) {
- page_mkclean(page);
+ if (must_clean_ptes)
+ page_mkclean(page);
dec_zone_page_state(page, NR_FILE_DIRTY);
}
return 1;
diff -puN mm/truncate.c~try_to_free_buffers-dont-clear-pte-dirty-bits mm/truncate.c
--- a/mm/truncate.c~try_to_free_buffers-dont-clear-pte-dirty-bits
+++ a/mm/truncate.c
@@ -70,7 +70,7 @@ truncate_complete_page(struct address_sp
if (PagePrivate(page))
do_invalidatepage(page, 0);
- if (test_clear_page_dirty(page))
+ if (test_clear_page_dirty(page, 1))
task_io_account_cancelled_write(PAGE_CACHE_SIZE);
ClearPageUptodate(page);
ClearPageMappedToDisk(page);
@@ -386,7 +386,7 @@ int invalidate_inode_pages2_range(struct
PAGE_CACHE_SIZE, 0);
}
}
- was_dirty = test_clear_page_dirty(page);
+ was_dirty = test_clear_page_dirty(page, 0);
if (!invalidate_complete_page2(mapping, page)) {
if (was_dirty)
set_page_dirty(page);
diff -puN fs/cifs/file.c~try_to_free_buffers-dont-clear-pte-dirty-bits fs/cifs/file.c
--- a/fs/cifs/file.c~try_to_free_buffers-dont-clear-pte-dirty-bits
+++ a/fs/cifs/file.c
@@ -1245,7 +1245,7 @@ retry:
wait_on_page_writeback(page);
if (PageWriteback(page) ||
- !test_clear_page_dirty(page)) {
+ !test_clear_page_dirty(page, 1)) {
unlock_page(page);
break;
}
_
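To make the effect of the new must_clean_ptes argument concrete, here is
a minimal user-space sketch. It is not kernel code: struct toy_page and
toy_test_clear_page_dirty() are hypothetical stand-ins, and the
mapping_cap_account_dirty() check from the real function is left out.

```c
#include <stdbool.h>

/* Toy stand-in for a page; these fields are illustrative, not struct page. */
struct toy_page {
	bool pg_dirty;      /* models PG_dirty */
	bool pte_dirty;     /* models a dirty bit left in some pte */
	long nr_file_dirty; /* models the NR_FILE_DIRTY counter */
};

/*
 * Models test_clear_page_dirty(page, must_clean_ptes) from the patch:
 * returns 1 if the page was dirty, and only cleans the pte when asked to.
 */
int toy_test_clear_page_dirty(struct toy_page *page, int must_clean_ptes)
{
	if (!page->pg_dirty)
		return 0;
	page->pg_dirty = false;
	if (must_clean_ptes)
		page->pte_dirty = false; /* page_mkclean() in the real code */
	page->nr_file_dirty--;           /* accounting is dropped either way */
	return 1;
}
```

With must_clean_ptes == 0 the toy page ends up with pte_dirty still set
but pg_dirty cleared and the counter decremented, so the dirty data is no
longer visible to the accounting.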
On Sun, 17 Dec 2006, Andrew Morton wrote:
>
> So this patch instead arranges for clear_page_dirty() to not clean the pte's
> when it is called on the try_to_free_buffers() path.
No. This is wrong.
It's wrong exactly because it now _breaks_ the whole thing that the 2.6.19
PG_dirty changes were all about: keeping track of dirty pages. Now you
have a page that is dirty, but it's no longer marked PG_dirty, and thus it
doesn't participate in the dirty accounting.
> From my quick reading, all callers of try_to_free_buffers() have already
> unmapped the page from pagetables, and given that the reported ext3 corruption
> happens on uniprocessor, non-preempt kernels, I doubt if this patch will fix
> things.
So not only are you breaking this, you also claim that it cannot happen in
the first place. So either the patch is buggy, or it's pointless. In
neither case does it seem to be a good idea to do.
Linus
On Sun, 17 Dec 2006, Andrew Morton wrote:
>
> From my quick reading, all callers of try_to_free_buffers() have already
> unmapped the page from pagetables, and given that the reported ext3 corruption
> happens on uniprocessor, non-preempt kernels, I doubt if this patch will fix
> things.
Hmm. One possible explanation: maybe the page actually _did_ get unmapped
from the page tables, but got added back?
I don't think we lock the page when faulting it in (we want it to be
uptodate, but not necessarily locked). So assuming the pageout sequence
always _does_ follow the rule that it only does try_to_free_buffers() on
pages that aren't mapped, what actually protects them from becoming
mapped (and dirtied) during that sequence?
So we should probably do a "wait_for_page()" in do_no_page()?
Or maybe only do it for write accesses (since we don't really care about
getting mapped readably)? If so, we need to do it in the write case of
do_no_page() _and_ in the do_wp_page() case. Hmm?
Linus
On Sun, 17 Dec 2006, Linus Torvalds wrote:
>
> So we should probably do a "wait_for_page()" in do_no_page()?
>
> Or maybe only do it for write accesses (since we don't really care about
> getting mapped readably)? If so, we need to do it in the write case of
> do_no_page() _and_ in the do_wp_page() case. Hmm?
I think we discussed doing exactly this at some earlier time, actually,
just to try to throttle people who do lots of page dirtying.
Maybe we even do it somewhere, but I tried to see it, and in the normal
"nopage()" routine we very much try to _avoid_ locking the page (ie if
it's marked PageUptodate() we'll return it whether locked or not). Which
is fine - especially for readers, there really isn't any reason to ever
delay them getting access to a page just because it's locked for write-out
or something (once it's mapped, they'll have access to it regardless of
any locked state in the kernel anyway).
So I don't actually see any serialization at all that would keep a random
page from being paged back in.
Linus
[ Replying to myself - a sure sign that I don't get out enough ]
On Sun, 17 Dec 2006, Linus Torvalds wrote:
>
> So I don't actually see any serialization at all that would keep a random
> page from being paged back in.
We do actually serialize, but we do it _after_ the page has already been
mapped back. Ie we do it for the dirty page case at the end of
do_wp_page() and do_no_page() when we do the "set_page_dirty_balance()",
but that's potentially too late - we've already mapped the page read-write
into the address space.
That said, this means that only threaded apps should ever trigger any
problems, which would seem to make it unlikely that this is the issue.
But Andrew: I don't think it's necessarily true that
"try_to_free_buffers()" callers have all unmapped the page.
That seems to be true for vmscan.c (ie the shrink_page_list ->
try_to_release_page -> try_to_free_buffers callchain), but what about
the other callchains (through filesystems, or through "pagevec_strip()" or
similar?) That pagevec_strip() is called from shrink_active_list(), I
don't see that unmapping the pages..
Linus
Linus Torvalds wrote:
> [ Replying to myself - a sure sign that I don't get out enough ]
>
> On Sun, 17 Dec 2006, Linus Torvalds wrote:
>
>>So I don't actually see any serialization at all that would keep a random
>>page from being paged back in.
>
>
> We do actually serialize, but we do it _after_ the page has already been
> mapped back. Ie we do it for the dirty page case at the end of
> do_wp_page() and do_no_page() when we do the "set_page_dirty_balance()",
> but that's potentially too late - we've already mapped the page read-write
> into the address space.
I can't see how that's exactly a problem -- so long as the page does not
get reclaimed (it won't, because we have a ref on it) then all that matters
is that the page eventually gets marked dirty.
> That said, this means that only threaded apps should ever trigger any
> problems, which would seem to make it unlikely that this is the issue.
>
> But Andrew: I don't think it's necessarily true that
> "try_to_free_buffers()" callers have all unmapped the page.
>
> That seems to be true for vmscan.c (ie the shrink_page_list ->
> try_to_release_page -> try_to_free_buffers callchain), but what about
> the other callchains (through filesystems, or through "pagevec_strip()" or
> similar?) That pagevec_strip() is called from shrink_active_list(), I
> don't see that unmapping the pages..
Right. But would it really matter whether they are currently mapped or
not, given that we agree it may become mapped at any point?
I think the problem Andrew identified is real.
The issue is the disconnect between the pte dirtiness and a filesystem
bringing buffers clean. But I disagree with his fix, because we don't
actually want to just throw out that pte dirtiness information: we're
just trying to get the PG_dirty bit into synch with what the buffers are
telling us, not actually clean or dirty anything, as such.
Can we clear the page dirty bit, then run set_page_dirty afterwards, if
any dirty ptes are found?
The other thing we might be able to do is to skip doing the
clear_page_dirty if the page is uptodate. This feels more hackish but it
might be faster?
--
SUSE Labs, Novell Inc.
On Mon, 18 Dec 2006 15:51:52 +1100
Nick Piggin <[email protected]> wrote:
> I think the problem Andrew identified is real.
I don't. In fact I don't think I described any problem (well, I tried to,
but then I contradicted myself).
Six hours here of fsx-linux plus high memory pressure on SMP on 1k
blocksize ext3, mainline. Zero failures. It's unlikely that this testing
would pass, yet people running normal workloads are able to easily trigger
failures. I suspect we're looking in the wrong place.
> The issue is the disconnect between the pte dirtiness and a filesystem
> bringing buffers clean.
Really? The dirtying direction goes pte_dirty->PG_dirty->BH_Dirty and the
cleaning direction goes !BH_Dirty->!PG_dirty->!pte_dirty. That's pretty
simple, setting aside races.
In the try_to_free_buffers case there's a large time interval between
!BH_Dirty and !PG_dirty, but that shouldn't affect anything.
I don't think we even have a theory as to what's gone wrong yet.
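The two directions described above can be written down as a small
user-space model. This is only a sketch under simplifying assumptions
(one pte and one buffer_head per page, no races); struct toy_page and the
toy_propagate_*() helpers are hypothetical names, not kernel interfaces.

```c
#include <stdbool.h>

/* One page with a single pte and a single buffer_head (illustrative). */
struct toy_page {
	bool pte_dirty; /* dirty bit set by hardware in the pte */
	bool pg_dirty;  /* models PG_dirty */
	bool bh_dirty;  /* models BH_Dirty */
};

/* Dirtying direction: pte_dirty -> PG_dirty -> BH_Dirty. */
void toy_propagate_dirty(struct toy_page *page)
{
	if (page->pte_dirty)
		page->pg_dirty = true;
	if (page->pg_dirty)
		page->bh_dirty = true;
}

/* Cleaning direction: !BH_Dirty -> !PG_dirty -> !pte_dirty. */
void toy_propagate_clean(struct toy_page *page)
{
	if (!page->bh_dirty)
		page->pg_dirty = false;
	if (!page->pg_dirty)
		page->pte_dirty = false;
}
```

The model deliberately ignores the time that can pass between the
individual steps, which is exactly where any race would have to hide.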
On Mon, 18 Dec 2006, Nick Piggin wrote:
>
> I can't see how that's exactly a problem -- so long as the page does not
> get reclaimed (it won't, because we have a ref on it) then all that matters
> is that the page eventually gets marked dirty.
But the point being that "try_to_free_buffers()" marks it clean
AFTERWARDS.
So yes, the page gets marked dirty in the pte's - the hardware generally
does that for us, so we don't have to worry about that part going on.
But "try_to_free_buffers()" seems to clear those dirty bits without
serializing it really any way. It just says "ok, I will now clear them".
Without knowing whether the dirty bits got set before the IO that cleared
the buffer head dirty bits or not.
What is _that_ serialization? As far as I can see, the only way to
guarantee that to happen (since the dirty bits in the page tables will get
set without us ever even being notified) is that the page tables
themselves must simply never contain that page in a writable form at all.
And that seems to be lacking.
Anyway, I have what I consider a much simpler solution: just don't DO all
that crap in try_to_free_buffers() at all. I sent it out to some people
already, but not very widely.
I reproduce my suggestion here for you (and maybe others too who weren't
cc'd in that other discussion group) to comment on..
Linus
---
So I think your patch is really broken, how about this one instead?
It's really my previous patch, BUT it also adds a
if (PageDirty(page) ..
return 0;
case, on the assumption that since PageDirty() means that one of the
buffers should be dirty, there's no point in even _trying_ drop_buffers,
since that should just fail anyway.
Now, that assumption is obviously wrong _if_ the buffers have been cleaned
by something else. So in that case, we now don't remove the buffer heads,
but who really cares? The page will remain on the dirty list, and
something should be trying to write it out, but since now all the buffers
are clean, once that happens, there is no actual IO to happen.
Hmm? So this means that we simply don't remove the buffers early from such
pages, but there shouldn't be any real downside.
Now, the only question would be if the page is marked dirty _while_ this
is running. We do hold the page lock, but page dirtying doesn't get the
lock, does it? But at least we won't mark the page _clean_ when it
shouldn't be.. And we still are atomic wrt the actual buffer lists
(mapping->private_lock), so I think this should all be ok, and
drop_buffers() will do the right thing.
So no race possible either.
At least as far as I can see. And the patch certainly is simple.
Now the question whether this actually _fixes_ any problems does remain,
but I think this should be a pretty good solution if the bug really is
here. Andrew?
Linus
----
diff --git a/fs/buffer.c b/fs/buffer.c
index d1f1b54..263f88e 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *page)
int ret = 0;
BUG_ON(!PageLocked(page));
- if (PageWriteback(page))
+ if (PageDirty(page) || PageWriteback(page))
return 0;
if (mapping == NULL) { /* can this still happen? */
@@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *page)
spin_lock(&mapping->private_lock);
ret = drop_buffers(page, &buffers_to_free);
spin_unlock(&mapping->private_lock);
- if (ret) {
- /*
- * If the filesystem writes its buffers by hand (eg ext3)
- * then we can have clean buffers against a dirty page. We
- * clean the page here; otherwise later reattachment of buffers
- * could encounter a non-uptodate page, which is unresolvable.
- * This only applies in the rare case where try_to_free_buffers
- * succeeds but the page is not freed.
- *
- * Also, during truncate, discard_buffer will have marked all
- * the page's buffers clean. We discover that here and clean
- * the page also.
- */
- if (test_clear_page_dirty(page))
- task_io_account_cancelled_write(PAGE_CACHE_SIZE);
- }
out:
if (buffers_to_free) {
struct buffer_head *bh = buffers_to_free;
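The behaviour of the patched function for a page that is PG_dirty but
whose buffers are already clean can be modelled in user space as follows.
This is a sketch, not the kernel implementation: struct toy_page and
toy_try_to_free_buffers() are hypothetical, and drop_buffers() is
reduced to "fail if the single buffer is dirty".

```c
#include <assert.h>
#include <stdbool.h>

/* Toy page with a single buffer; field names are illustrative only. */
struct toy_page {
	bool locked;       /* models PageLocked() */
	bool pg_dirty;     /* models PageDirty() */
	bool pg_writeback; /* models PageWriteback() */
	bool has_buffers;  /* buffer heads still attached */
	bool bh_dirty;     /* models BH_Dirty on the one buffer */
};

/* Models try_to_free_buffers() with the patch applied. */
int toy_try_to_free_buffers(struct toy_page *page)
{
	assert(page->locked); /* BUG_ON(!PageLocked(page)) */

	/* New early return: never touch a dirty page. */
	if (page->pg_dirty || page->pg_writeback)
		return 0;

	/* Simplified drop_buffers(): refuse if the buffer is dirty. */
	if (page->bh_dirty)
		return 0;

	page->has_buffers = false;
	return 1; /* note: PG_dirty is never cleared in here any more */
}
```

In this model a dirty page keeps both its buffers and its PG_dirty bit,
so the function can no longer mark a page clean behind the back of a
concurrent writer; the cost is that the buffers stay attached longer.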
On Sun, 17 Dec 2006 21:50:43 -0800 (PST)
Linus Torvalds <[email protected]> wrote:
>
>
> On Mon, 18 Dec 2006, Nick Piggin wrote:
> >
> > I can't see how that's exactly a problem -- so long as the page does not
> > get reclaimed (it won't, because we have a ref on it) then all that matters
> > is that the page eventually gets marked dirty.
>
> But the point being that "try_to_free_buffers()" marks it clean
> AFTERWARDS.
>
> So yes, the page gets marked dirty in the pte's - the hardware generally
> does that for us, so we don't have to worry about that part going on.
>
> But "try_to_free_buffers()" seems to clear those dirty bits without
> serializing it really any way. It just says "ok, I will now clear them".
> Without knowing whether the dirty bits got set before the IO that cleared
> the buffer head dirty bits or not.
Yes, I can't see anything correct about the current behaviour.
But I'm going blue in the face here trying to feed try_to_free_buffers() a
page_mapped(page), without success. pagevec_strip() presumably isn't
triggering.
> What is _that_ serialization? As far as I can see, the only way to
> guarantee that to happen (since the dirty bits in the page tables will get
> set without us ever even being notified) is that the page tables
> themselves must simply never contain that page in a writable form at all.
>
> And that seems to be lacking.
>
> Anyway, I have what I consider a much simpler solution: just don't DO all
> that crap in try_to_free_buffers() at all. I sent it out to some people
> already, but not very widely.
>
> I reproduce my suggestion here for you (and maybe others too who weren't
> cc'd in that other discussion group) to comment on..
>
> ...
>
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *page)
> int ret = 0;
>
> BUG_ON(!PageLocked(page));
> - if (PageWriteback(page))
> + if (PageDirty(page) || PageWriteback(page))
> return 0;
>
> if (mapping == NULL) { /* can this still happen? */
> @@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *page)
> spin_lock(&mapping->private_lock);
> ret = drop_buffers(page, &buffers_to_free);
> spin_unlock(&mapping->private_lock);
> - if (ret) {
> - /*
> - * If the filesystem writes its buffers by hand (eg ext3)
> - * then we can have clean buffers against a dirty page. We
> - * clean the page here; otherwise later reattachment of buffers
> - * could encounter a non-uptodate page, which is unresolvable.
> - * This only applies in the rare case where try_to_free_buffers
> - * succeeds but the page is not freed.
> - *
> - * Also, during truncate, discard_buffer will have marked all
> - * the page's buffers clean. We discover that here and clean
> - * the page also.
> - */
> - if (test_clear_page_dirty(page))
> - task_io_account_cancelled_write(PAGE_CACHE_SIZE);
> - }
> out:
> if (buffers_to_free) {
> struct buffer_head *bh = buffers_to_free;
This will (at least) cause truncate to do peculiar things.
do_invalidatepage() runs discard_buffer() against the dirty page and will
then expect try_to_free_buffers() to remove those buffers and then clean
the page. truncate_complete_page() will clean the page, but it still has
those invalidated buffers. We'll end up with a large number of clean,
unused pages on the LRU, with attached buffers. These should eventually
get reaped, but it'll change the page aging dynamics.
On Sun, 17 Dec 2006 23:16:17 -0800
Andrew Morton <[email protected]> wrote:
> > out:
> > if (buffers_to_free) {
> > struct buffer_head *bh = buffers_to_free;
>
> This will (at least) cause truncate to do peculiar things.
> do_invalidatepage() runs discard_buffer() against the dirty page and will
> then expect try_to_free_buffers() to remove those buffers and then clean
> the page. truncate_complete_page() will clean the page, but it still has
> those invalidated buffers. We'll end up with a large number of clean,
> unused pages on the LRU, with attached buffers. These should eventually
> get reaped, but it'll change the page aging dynamics.
That being said, it'd be great to get this tested by someone who can
trigger this bug, please.
Linus Torvalds wrote:
>
> On Mon, 18 Dec 2006, Nick Piggin wrote:
>
>>I can't see how that's exactly a problem -- so long as the page does not
>>get reclaimed (it won't, because we have a ref on it) then all that matters
>>is that the page eventually gets marked dirty.
>
>
> But the point being that "try_to_free_buffers()" marks it clean
> AFTERWARDS.
For some reason I thought you were suggesting it is a problem on its own :P
Yes I agree there is a pagefault vs ttfb race.
--
SUSE Labs, Novell Inc.
Andrew Morton wrote:
> On Mon, 18 Dec 2006 15:51:52 +1100
> Nick Piggin <[email protected]> wrote:
>
>
>>I think the problem Andrew identified is real.
>
>
> I don't. In fact I don't think I described any problem (well, I tried to,
> but then I contradicted myself).
By saying that there shouldn't be any dirty ptes if there are no
dirty buffers? But in that case the _page_ shouldn't be dirty either,
so that clear_page_dirty would be redundant. But presumably it isn't.
> Six hours here of fsx-linux plus high memory pressure on SMP on 1k
> blocksize ext3, mainline. Zero failures. It's unlikely that this testing
> would pass, yet people running normal workloads are able to easily trigger
> failures. I suspect we're looking in the wrong place.
Yes, I could believe that the corruption is caused by something else
completely.
>>The issue is the disconnect between the pte dirtiness and a filesystem
>>bringing buffers clean.
>
>
> Really? The dirtying direction goes pte_dirty->PG_dirty->BH_Dirty and the
> cleaning direction goes !BH_Dirty->!PG_dirty->!pte_dirty. That's pretty
> simple, setting aside races.
>
> In the try_to_free_buffers case there's a large time interval between
> !BH_Dirty and !PG_dirty, but that shouldn't affect anything.
After try_to_free_buffers detaches the buffers from the page, a
pagefault can come in, and mark the pte writeable, then set_page_dirty
(which finds no buffers, so only sets PG_dirty).
The page can now get dirtied through this mapping.
try_to_free_buffers then goes on to clean the page and ptes.
I really thought you were the one who identified this race, and I didn't
see where you showed it is safe.
It may be very unlikely with small SMPs, but less so with preempt. All
we have to do is preempt at spin_unlock in try_to_free_buffers AFAIKS.
Were you testing with preempt?
--
SUSE Labs, Novell Inc.
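The interleaving described in the mail above can be replayed step by step
in a user-space model. This is a sketch of the claimed sequence only, not
a proof that the kernel can actually reach it; struct toy_page and all
toy_*() functions are hypothetical names invented for the illustration.

```c
#include <stdbool.h>

/* Toy page state; the fields are illustrative, not struct page. */
struct toy_page {
	bool has_buffers;   /* buffer heads attached */
	bool pte_writable;  /* page mapped writable in some pte */
	bool pte_dirty;     /* hardware dirty bit in that pte */
	bool pg_dirty;      /* models PG_dirty */
	bool data_modified; /* memory differs from what is on disk */
};

/* Step 1: try_to_free_buffers detaches the (clean) buffers. */
void toy_ttfb_drop_buffers(struct toy_page *page)
{
	page->has_buffers = false;
}

/* Step 2: a write fault maps the page writable and sets PG_dirty. */
void toy_write_fault(struct toy_page *page)
{
	page->pte_writable = true;
	page->pg_dirty = true; /* set_page_dirty finds no buffers */
}

/* Step 3: user space stores through the writable mapping. */
void toy_store(struct toy_page *page)
{
	if (page->pte_writable) {
		page->pte_dirty = true;
		page->data_modified = true;
	}
}

/* Step 4: try_to_free_buffers goes on to clean the page and ptes. */
void toy_ttfb_clean_page(struct toy_page *page)
{
	page->pg_dirty = false;
	page->pte_dirty = false;
	page->pte_writable = false;
}

/* True if modified data is no longer tracked as dirty anywhere. */
bool toy_data_lost(const struct toy_page *page)
{
	return page->data_modified && !page->pg_dirty && !page->pte_dirty;
}
```

Running steps 1 to 4 in that order leaves toy_data_lost() true, while
running step 4 before steps 2 and 3 does not, which is why the window
after the spin_unlock matters in this scenario.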
On Mon, 18 Dec 2006 18:22:42 +1100
Nick Piggin <[email protected]> wrote:
> Andrew Morton wrote:
> > On Mon, 18 Dec 2006 15:51:52 +1100
> > Nick Piggin <[email protected]> wrote:
> >
> >
> >>I think the problem Andrew identified is real.
> >
> >
> > I don't. In fact I don't think I described any problem (well, I tried to,
> > but then I contradicted myself).
>
> By saying that there shouldn't be any dirty ptes if there are no
> dirty buffers? But in that case the _page_ shouldn't be dirty either,
> so that clear_page_dirty would be redundant. But presumably it isn't.
I don't follow that.
The linkage between pte-dirtiness and buffer_heads is a bit hard to follow
without also considering page-dirtiness.
> > Six hours here of fsx-linux plus high memory pressure on SMP on 1k
> > blocksize ext3, mainline. Zero failures. It's unlikely that this testing
> > would pass, yet people running normal workloads are able to easily trigger
> > failures. I suspect we're looking in the wrong place.
>
> Yes, I could believe that the corruption is caused by something else
> completely.
Think so. We do have a problem here, but only on threaded apps, I believe.
rtorrent doesn't appear to be threaded, and the bug is hit on non-preempt
UP.
> >>The issue is the disconnect between the pte dirtiness and a filesystem
> >>bringing buffers clean.
> >
> >
> > Really? The dirtying direction goes pte_dirty->PG_dirty->BH_Dirty and the
> > cleaning direction goes !BH_Dirty->!PG_dirty->!pte_dirty. That's pretty
> > simple, setting aside races.
> >
> > In the try_to_free_buffers case there's a large time interval between
> > !BH_Dirty and !PG_dirty, but that shouldn't affect anything.
>
> After try_to_free_buffers detaches the buffers from the page, a
> pagefault can come in, and mark the pte writeable, then set_page_dirty
> (which finds no buffers, so only sets PG_dirty).
>
> The page can now get dirtied through this mapping.
>
> try_to_free_buffers then goes on to clean the page and ptes.
try_to_free_buffers() isn't called against a page which doesn't have
buffers. It'll oops.
> Were you testing with preempt?
nope, just SMP.
I tried the latest git with the patch from this email and I still get
file content corruption. If I can help you debug the problem further,
tell me what to do.
On Sun, 2006-12-17 at 21:50 -0800, Linus Torvalds wrote:
>
> On Mon, 18 Dec 2006, Nick Piggin wrote:
> >
> > I can't see how that's exactly a problem -- so long as the page does not
> > get reclaimed (it won't, because we have a ref on it) then all that matters
> > is that the page eventually gets marked dirty.
>
> But the point being that "try_to_free_buffers()" marks it clean
> AFTERWARDS.
>
> So yes, the page gets marked dirty in the pte's - the hardware generally
> does that for us, so we don't have to worry about that part going on.
>
> But "try_to_free_buffers()" seems to clear those dirty bits without
> serializing it really any way. It just says "ok, I will now clear them".
> Without knowing whether the dirty bits got set before the IO that cleared
> the buffer head dirty bits or not.
>
> What is _that_ serialization? As far as I can see, the only way to
> guarantee that to happen (since the dirty bits in the page tables will get
> set without us ever even being notified) is that the page tables
> themselves must simply never contain that page in a writable form at all.
>
> And that seems to be lacking.
>
> Anyway, I have what I consider a much simpler solution: just don't DO all
> that crap in try_to_free_buffers() at all. I sent it out to some people
> already, but not very widely.
>
> I reproduce my suggestion here for you (and maybe others too who weren't
> cc'd in that other discussion group) to comment on..
>
> Linus
>
> ---
>
> So I think your patch is really broken, how about this one instead?
>
> It's really my previous patch, BUT it also adds a
>
> if (PageDirty(page) ..
> return 0;
>
> case, on the assumption that since PageDirty() means that one of the
> buffers should be dirty, there's no point in even _trying_ drop_buffers,
> since that should just fail anyway.
>
> Now, that assumption is obviously wrong _if_ the buffers have been cleaned
> by something else. So in that case, we now don't remove the buffer heads,
> but who really cares? The page will remain on the dirty list, and
> something should be trying to write it out, but since now all the buffers
> are clean, once that happens, there is no actual IO to happen.
>
> Hmm? So this means that we simply don't remove the buffers early from such
> pages, but there shouldn't be any real downside.
>
> Now, the only question would be if the page is marked dirty _while_ this
> is running. We do hold the page lock, but page dirtying doesn't get the
> lock, does it? But at least we won't mark the page _clean_ when it
> shouldn't be.. And we still are atomic wrt the actual buffer lists
> (mapping->private_lock), so I think this should all be ok, and
> drop_buffers() will do the right thing.
>
> So no race possible either.
>
> At least as far as I can see. And the patch certainly is simple.
>
> Now the question whether this actually _fixes_ any problems does remain,
> but I think this should be a pretty good solution if the bug really is
> here. Andrew?
>
> Linus
>
> ----
> diff --git a/fs/buffer.c b/fs/buffer.c
> index d1f1b54..263f88e 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *page)
> int ret = 0;
>
> BUG_ON(!PageLocked(page));
> - if (PageWriteback(page))
> + if (PageDirty(page) || PageWriteback(page))
> return 0;
>
> if (mapping == NULL) { /* can this still happen? */
> @@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *page)
> spin_lock(&mapping->private_lock);
> ret = drop_buffers(page, &buffers_to_free);
> spin_unlock(&mapping->private_lock);
> - if (ret) {
> - /*
> - * If the filesystem writes its buffers by hand (eg ext3)
> - * then we can have clean buffers against a dirty page. We
> - * clean the page here; otherwise later reattachment of buffers
> - * could encounter a non-uptodate page, which is unresolvable.
> - * This only applies in the rare case where try_to_free_buffers
> - * succeeds but the page is not freed.
> - *
> - * Also, during truncate, discard_buffer will have marked all
> - * the page's buffers clean. We discover that here and clean
> - * the page also.
> - */
> - if (test_clear_page_dirty(page))
> - task_io_account_cancelled_write(PAGE_CACHE_SIZE);
> - }
> out:
> if (buffers_to_free) {
> struct buffer_head *bh = buffers_to_free;
>
On Mon, 2006-12-18 at 01:18 -0800, Andrew Morton wrote:
> On Mon, 18 Dec 2006 18:22:42 +1100
> Nick Piggin <[email protected]> wrote:
>
> > Andrew Morton wrote:
> > > On Mon, 18 Dec 2006 15:51:52 +1100
> > > Nick Piggin <[email protected]> wrote:
> > >
> > >
> > >>I think the problem Andrew identified is real.
> > >
> > >
> > > I don't. In fact I don't think I described any problem (well, I tried to,
> > > but then I contradicted myself).
> >
> > By saying that there shouldn't be any dirty ptes if there are no
> > dirty buffers? But in that case the _page_ shouldn't be dirty either,
> > so that clear_page_dirty would be redundant. But presumably it isn't.
>
> I don't follow that.
>
> The linkage between pte-dirtiness and buffer_heads is a bit hard to follow
> without also considering page-dirtiness.
>
> > > Six hours here of fsx-linux plus high memory pressure on SMP on 1k
> > > blocksize ext3, mainline. Zero failures. It's unlikely that this testing
> > > would pass, yet people running normal workloads are able to easily trigger
> > > failures. I suspect we're looking in the wrong place.
> >
> > Yes, I could believe that the corruption is caused by something else
> > completely.
>
> Think so. We do have a problem here, but only on threaded apps, I believe.
> rtorrent doesn't appear to be threaded, and the bug is hit on non-preempt
> UP.
ierdnac ~ # uname -a
Linux ierdnac 2.6.20-rc1 #2 SMP PREEMPT Mon Dec 18 11:01:52 EET 2006
i686 Genuine Intel(R) CPU T2050 @ 1.60GHz GenuineIntel
GNU/Linux
and the other person who had corruption with rtorrent also has SMP and
PREEMPT.
>
> > >>The issue is the disconnect between the pte dirtiness and a filesystem
> > >>bringing buffers clean.
> > >
> > >
> > > Really? The dirtying direction goes pte_dirty->PG_dirty->BH_Dirty and the
> > > cleaning direction goes !BH_Dirty->!PG_dirty->!pte_dirty. That's pretty
> > > simple, setting aside races.
> > >
> > > In the try_to_free_buffers case there's a large time interval between
> > > !BH_Dirty and !PG_dirty, but that shouldn't affect anything.
> >
> > After try_to_free_buffers detaches the buffers from the page, a
> > pagefault can come in, and mark the pte writeable, then set_page_dirty
> > (which finds no buffers, so only sets PG_dirty).
> >
> > The page can now get dirtied through this mapping.
> >
> > try_to_free_buffers then goes on to clean the page and ptes.
>
> try_to_free_buffers() isn't called against a page which doesn't have
> buffers. It'll oops.
>
> > Were you testing with preempt?
>
> nope, just SMP.
>
On Mon, 18 Dec 2006 11:19:04 +0200
Andrei Popa <[email protected]> wrote:
>
> I tried the latest git with the patch from this email and I still get
> file content corruption. If I can help you debug the problem further,
> tell me what to do.
Can you please tell us all the steps which we need to take to reproduce this?
On Mon, 2006-12-18 at 01:38 -0800, Andrew Morton wrote:
> On Mon, 18 Dec 2006 11:19:04 +0200
> Andrei Popa <[email protected]> wrote:
>
> >
> > I tried the latest git with the patch from this email and I still get
> > file content corruption. If I can help you debug the problem further,
> > tell me what to do.
>
> Can you please tell us all the steps which we need to take to reproduce this?
I'm using rtorrent-0.7.0 and libtorrent-0.11.0. Just download a torrent
with multiple files (I downloaded 84 rar files). When the download
finishes, rtorrent does a hash check, and at the end of the check it says
"Hash check on download completion found bad chunks, consider using
"safe_sync"." and stops; most of the downloaded files are broken. With
Peter Zijlstra's patch this error doesn't show, but there is still file
corruption (although fewer files are corrupted); after the hash check,
rtorrent downloads the bad chunks again, does another hash check, and
all files are ok.
On Mon, 2006-12-18 at 12:00 +0200, Andrei Popa wrote:
> On Mon, 2006-12-18 at 01:38 -0800, Andrew Morton wrote:
> > On Mon, 18 Dec 2006 11:19:04 +0200
> > Andrei Popa <[email protected]> wrote:
> >
> > >
> > > I tried the latest git with the patch from this email and I still get
> > > file content corruption. If I can help you debug the problem further,
> > > tell me what to do.
> >
> > Can you please tell us all the steps which we need to take to reproduce this?
>
> I'm using rtorrent-0.7.0 and libtorrent-0.11.0. Just download a torrent
> with multiple files (I downloaded 84 rar files). When the download
> finishes, rtorrent does a hash check, and at the end of the check it says
> "Hash check on download completion found bad chunks, consider using
> "safe_sync"." and stops; most of the downloaded files are broken. With
> Peter Zijlstra's patch this error doesn't show, but there is still file
> corruption (although fewer files are corrupted); after the hash check,
> rtorrent downloads the bad chunks again, does another hash check, and
> all files are ok.
OK, I'll try this on an ext3 box. BTW, what data mode are you using ext3
in?
Also, for testing's sake, could you give this a go:
It's a total hack but I guess worth testing.
---
mm/rmap.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Index: linux-2.6-git/mm/rmap.c
===================================================================
--- linux-2.6-git.orig/mm/rmap.c 2006-12-18 11:06:29.000000000 +0100
+++ linux-2.6-git/mm/rmap.c 2006-12-18 11:07:16.000000000 +0100
@@ -448,7 +448,7 @@ static int page_mkclean_one(struct page
goto unlock;
entry = ptep_get_and_clear(mm, address, pte);
- entry = pte_mkclean(entry);
+ /* entry = pte_mkclean(entry); */
entry = pte_wrprotect(entry);
ptep_establish(vma, address, pte, entry);
lazy_mmu_prot_update(entry);
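What the hack above changes can be shown on a plain bit mask. The sketch
below is a user-space illustration only: toy_pte_t, the TOY_PTE_* values
and the toy_*() helpers are hypothetical, and the hack parameter simply
selects between the original and the patched behaviour.

```c
#include <stdint.h>

/* Illustrative flag values; not taken from any architecture header. */
#define TOY_PTE_RW    0x02u /* mapping is writable */
#define TOY_PTE_DIRTY 0x40u /* page was written through this pte */

typedef uint32_t toy_pte_t;

/* Clear the dirty bit, as pte_mkclean() does on a real pte. */
toy_pte_t toy_pte_mkclean(toy_pte_t entry)
{
	return entry & ~TOY_PTE_DIRTY;
}

/* Clear the writable bit, as pte_wrprotect() does on a real pte. */
toy_pte_t toy_pte_wrprotect(toy_pte_t entry)
{
	return entry & ~TOY_PTE_RW;
}

/*
 * Models the body of page_mkclean_one(): with hack != 0 the
 * pte_mkclean() step is skipped, so the dirty bit survives while the
 * pte is still write-protected.
 */
toy_pte_t toy_page_mkclean_one(toy_pte_t entry, int hack)
{
	if (!hack)
		entry = toy_pte_mkclean(entry);
	return toy_pte_wrprotect(entry);
}
```

With the hack the pte keeps its dirty bit, so a later scan of the page
tables can still find the page dirty, and the next write still faults
because the pte is read-only.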
Andrew Morton wrote:
> On Sun, 17 Dec 2006 21:50:43 -0800 (PST)
> Linus Torvalds <[email protected]> wrote:
>
>
>>
>>On Mon, 18 Dec 2006, Nick Piggin wrote:
>>
>>>I can't see how that's exactly a problem -- so long as the page does not
>>>get reclaimed (it won't, because we have a ref on it) then all that matters
>>>is that the page eventually gets marked dirty.
>>
>>But the point being that "try_to_free_buffers()" marks it clean
>>AFTERWARDS.
>>
>>So yes, the page gets marked dirty in the pte's - the hardware generally
>>does that for us, so we don't have to worry about that part going on.
>>
>>But "try_to_free_buffers()" seems to clear those dirty bits without
>>serializing it really any way. It just says "ok, I will now clear them".
>>Without knowing whether the dirty bits got set before the IO that cleared
>>the buffer head dirty bits or not.
>
>
> Yes, I can't see anything correct about the current behaviour.
>
> But I'm going blue in the face here trying to feed try_to_free_buffers() a
> page_mapped(page), without success. pagevec_strip() presumably isn't
> triggering.
I can trigger it here, with a kernel patch to call pagevec_strip
unconditionally. I am seeing it clearing pte dirty bits, which is surely
a dataloss bug.
BUG: warning at mm/page-writeback.c:862/clear_page_dirty_warn()
[<c013f65a>] clear_page_dirty_warn+0xdb/0xdd
[<c0174309>] try_to_free_buffers+0x6b/0x7e
[<c01937ec>] ext3_releasepage+0x0/0x74
[<c013bb48>] try_to_release_page+0x2c/0x40
[<c0140925>] pagevec_strip+0x52/0x54
[<c0141580>] shrink_active_list+0x2a0/0x3c8
[<c0142100>] shrink_zone+0xcd/0xea
[<c014266d>] kswapd+0x311/0x41e
[<c012c6aa>] autoremove_wake_function+0x0/0x37
[<c014235c>] kswapd+0x0/0x41e
[<c012c527>] kthread+0xde/0xe2
[<c012c449>] kthread+0x0/0xe2
[<c010395b>] kernel_thread_helper+0x7/0x1c
=======================
(clear_page_dirty_warn() is test_clear_page_dirty which WARN_ON()s the
result of page_mkclean)
> This will (at least) cause truncate to do peculiar things.
> do_invalidatepage() runs discard_buffer() against the dirty page and will
> then expect try_to_free_buffers() to remove those buffers and then clean
> the page. truncate_complete_page() will clean the page, but it still has
> those invalidated buffers. We'll end up with a large number of clean,
> unused pages on the LRU, with attached buffers. These should eventually
> get reaped, but it'll change the page aging dynamics.
This isn't so nice. I wonder if you could just ClearPageDirty before
calling try_to_free_buffers in this case, or is that too much of a
hack? Ideally I guess you want a variant that is happy to discard
dirtiness (alternatively, my proposal to redirty the page if we find
a dirty pte should also handle this).
--
SUSE Labs, Novell Inc.
Andrew Morton wrote:
> On Mon, 18 Dec 2006 18:22:42 +1100
> Nick Piggin <[email protected]> wrote:
>>Yes, I could believe the corruption is caused by something else
>>completely.
>
>
> Think so. We do have a problem here, but only on threaded apps, I believe.
> rtorrent doesn't appear to be threaded, and the bug is hit on non-preempt
> UP.
I think (see below) that it does not apply only to threaded apps. But
it would need one of SMP or PREEMPT to trigger.
>>After try_to_free_buffers detaches the buffers from the page, a
>>pagefault can come in, and mark the pte writeable, then set_page_dirty
>>(which finds no buffers, so only sets PG_dirty).
>>
>>The page can now get dirtied through this mapping.
>>
>>try_to_free_buffers then goes on to clean the page and ptes.
>
>
> try_to_free_buffers() isn't called against a page which doesn't have
> buffers. It'll oops.
Sure. But I think the race exists... I'll try spelling it out in
the conventional way:
try_to_free_buffers()
drop_buffers() (succeeds)
** preempt here or run right-hand thread on 2nd CPU in SMP **
do_no_page()
set_page_dirty()
[now modify the page via this mapping
(from this process or a concurrent thread)]
clear_page_dirty() (clears PG_dirty + pte dirty, oops)
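The interleaving above can be replayed as a single-threaded toy model in
user space. This is only a sketch: struct toy_race_page and the toy_*
helpers are made-up stand-ins, not kernel code, and the model assumes a
single pte and a writeback path that looks at nothing but the two dirty bits.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a page cache page; not a kernel type. */
struct toy_race_page {
    bool pg_dirty;    /* models PG_dirty in struct page */
    bool pte_dirty;   /* models the dirty bit of the single pte mapping it */
    bool has_buffers; /* models attached buffer_heads */
    int  mem_data;    /* page contents in memory */
    int  disk_data;   /* what has reached the disk */
};

/* Left-hand column, first step: drop_buffers() succeeds. */
static void toy_drop_buffers(struct toy_race_page *p)
{
    p->has_buffers = false;
}

/* Right-hand column: fault, set_page_dirty(), then a store via the mapping. */
static void toy_fault_and_store(struct toy_race_page *p, int value)
{
    p->pg_dirty = true;  /* no buffers are attached, so only PG_dirty is set */
    p->mem_data = value; /* store through the shared mapping */
    p->pte_dirty = true; /* the hardware marks the pte dirty */
}

/* Left-hand column, last step: both dirty bits are cleared. */
static void toy_clear_page_dirty(struct toy_race_page *p)
{
    p->pg_dirty = false;
    p->pte_dirty = false;
}

/* Writeback only touches pages that still look dirty. */
static void toy_writeback(struct toy_race_page *p)
{
    if (p->pg_dirty || p->pte_dirty) {
        p->disk_data = p->mem_data;
        p->pg_dirty = false;
        p->pte_dirty = false;
    }
}

/*
 * Replays the sequence. With fault_in_window set, the fault lands between
 * drop_buffers() and clear_page_dirty(), as in the diagram; otherwise it
 * lands after clear_page_dirty(). Returns true if the store never reaches
 * the disk.
 */
bool toy_store_is_lost(bool fault_in_window)
{
    struct toy_race_page p = { .pg_dirty = true, .has_buffers = true };

    toy_drop_buffers(&p);
    if (fault_in_window)
        toy_fault_and_store(&p, 42);
    toy_clear_page_dirty(&p);
    if (!fault_in_window)
        toy_fault_and_store(&p, 42);
    toy_writeback(&p);
    return p.disk_data != p.mem_data;
}
```

In this model the store is lost only when the fault lands inside the window;
a fault after clear_page_dirty() re-dirties the page and is written back
normally.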
--
SUSE Labs, Novell Inc.
> OK, I'll try this on a ext3 box. BTW, what data mode are you using ext3
> in?
>
ordered
>
> Also, for testings sake, could you give this a go:
> It's a total hack but I guess worth testing.
>
> ---
> mm/rmap.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> Index: linux-2.6-git/mm/rmap.c
> ===================================================================
> --- linux-2.6-git.orig/mm/rmap.c 2006-12-18 11:06:29.000000000 +0100
> +++ linux-2.6-git/mm/rmap.c 2006-12-18 11:07:16.000000000 +0100
> @@ -448,7 +448,7 @@ static int page_mkclean_one(struct page
> goto unlock;
>
> entry = ptep_get_and_clear(mm, address, pte);
> - entry = pte_mkclean(entry);
> + /* entry = pte_mkclean(entry); */
> entry = pte_wrprotect(entry);
> ptep_establish(vma, address, pte, entry);
> lazy_mmu_prot_update(entry);
>
With the latest git and this patch there is no corruption!
On Monday 18 December 2006 05:49, Andrei Popa wrote:
>> OK, I'll try this on a ext3 box. BTW, what data mode are you using
>> ext3 in?
>
>ordered
>
>> Also, for testings sake, could you give this a go:
>> It's a total hack but I guess worth testing.
>>
>> ---
>> mm/rmap.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> Index: linux-2.6-git/mm/rmap.c
>> ===================================================================
>> --- linux-2.6-git.orig/mm/rmap.c 2006-12-18 11:06:29.000000000 +0100
>> +++ linux-2.6-git/mm/rmap.c 2006-12-18 11:07:16.000000000 +0100
>> @@ -448,7 +448,7 @@ static int page_mkclean_one(struct page
>> goto unlock;
>>
>> entry = ptep_get_and_clear(mm, address, pte);
>> - entry = pte_mkclean(entry);
>> + /* entry = pte_mkclean(entry); */
>> entry = pte_wrprotect(entry);
>> ptep_establish(vma, address, pte, entry);
>> lazy_mmu_prot_update(entry);
>
>with latest git and this patch there is no corruption !
>
I've not run a torrent app here recently. Should this patch be applied to
a plain 2.6.20-rc1 before I run Azureus or similar apps?
--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Yahoo.com and AOL/TW attorneys please note, additions to the above
message by Gene Heskett are:
Copyright 2006 by Maurice Eugene Heskett, all rights reserved.
On Mon, 2006-12-18 at 10:24 -0500, Gene Heskett wrote:
> On Monday 18 December 2006 05:49, Andrei Popa wrote:
> >> OK, I'll try this on a ext3 box. BTW, what data mode are you using
> >> ext3 in?
> >
> >ordered
> >
> >> Also, for testings sake, could you give this a go:
> >> It's a total hack but I guess worth testing.
> >>
> >> ---
> >> mm/rmap.c | 2 +-
> >> 1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> Index: linux-2.6-git/mm/rmap.c
> >> ===================================================================
> >> --- linux-2.6-git.orig/mm/rmap.c 2006-12-18 11:06:29.000000000 +0100
> >> +++ linux-2.6-git/mm/rmap.c 2006-12-18 11:07:16.000000000 +0100
> >> @@ -448,7 +448,7 @@ static int page_mkclean_one(struct page
> >> goto unlock;
> >>
> >> entry = ptep_get_and_clear(mm, address, pte);
> >> - entry = pte_mkclean(entry);
> >> + /* entry = pte_mkclean(entry); */
> >> entry = pte_wrprotect(entry);
> >> ptep_establish(vma, address, pte, entry);
> >> lazy_mmu_prot_update(entry);
> >
> >with latest git and this patch there is no corruption !
> >
> I've not run a torrent app here recently. Should this patch be applied to
> a plain 2.6.20-rc1 before I run Azureus or similar apps?
Depends on what the blue frog does; if it uses MAP_SHARED like rtorrent
does, then yeah, probably. This patch really should not be the final one;
I'm currently still trying to wrap my head around the issue. That said,
it should be safe to use :-)
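For anyone unsure what "uses MAP_SHARED" means in practice, here is a
minimal sketch of the access pattern in question: file data is modified
through a shared, writable mapping rather than through write(). The function
name write_via_shared_mapping() is made up for illustration, and this is not
taken from rtorrent's source; only standard POSIX calls are used.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Store `len` bytes from `src` at the start of `path` through a MAP_SHARED
 * mapping. Note that the file is resized to exactly `len` bytes first.
 * Returns 0 on success, -1 on failure.
 */
int write_via_shared_mapping(const char *path, const void *src, size_t len)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;

    /* The mapping must not extend past the end of the file. */
    if (len == 0 || ftruncate(fd, (off_t)len) < 0) {
        close(fd);
        return -1;
    }

    void *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    /* The store goes straight into the page cache page; the kernel only
     * learns about it through the pte dirty bit (or a write fault). */
    memcpy(map, src, len);

    /* Ask the kernel to write the dirty pages back before returning. */
    int rc = msync(map, len, MS_SYNC);

    munmap(map, len);
    close(fd);
    return rc;
}
```

Because the kernel never sees a write() call here, it depends entirely on
the pte and page dirty tracking to know that the data has to go to disk,
which is the tracking the page_mkclean() path manipulates.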
On Monday 18 December 2006 10:32, Peter Zijlstra wrote:
[...]
>>
>> I've not run a torrent app here recently. Should this patch be
>> applied to a plain 2.6.20-rc1 before I run Azureus or similar apps?
>
>depends on what the blue frog does, if it uses MAP_SHARED like rtorrent
>does then yeah, probably. This patch really should not be the final one,
>I'm currently still trying to wrap my head around the issue. That said,
>it should be safe to use :-)
>
Thanks, I'll do it.
On Sun, 2006-12-17 at 15:40 -0800, Andrew Morton wrote:
> On Sun, 17 Dec 2006 15:39:32 +0200
> Andrei Popa <[email protected]> wrote:
>
> > I was mistaken, I'm still having file corruption with rtorrent.
> >
>
> Well I'm not very optimistic, but if people could try this, please...
>
>
>
> From: Andrew Morton <[email protected]>
>
> try_to_free_buffers() clears the page's dirty state if it successfully removed
> the page's buffers.
>
> Background for this:
>
> - a process does a one-byte-write to a file on a 64k pagesize, 4k
> blocksize ext3 filesystem. The page is now PageDirty, !PageUptodate and
> has one dirty buffer and 15 not uptodate buffers.
>
> - kjournald writes the dirty buffer. The page is now PageDirty,
> !PageUptodate and has a mix of clean and not uptodate buffers.
>
> - try_to_free_buffers() removes the page's buffers. It MUST now clear
> PageDirty. If we were to leave the page dirty then we'd have a dirty, not
> uptodate page with no buffer_heads.
>
> We're screwed: we cannot write the page because we don't know which
> sections of it contain garbage. We cannot read the page because we don't
> know which sections of it contain modified data. We cannot free the page
> because it is dirty.
>
How about we stick something like this on top of that patch. It should
preserve the dirty state as required.
I tried to tinker with avoiding the clear/set thing but could not
convince myself it was close to safe.
This should be safe; page_mkclean walks the rmap and flips the pte's
under the pte lock and records the dirty state while iterating.
Concurrent faults will either do set_page_dirty() before we get around
to doing it or vice versa, but dirty state is not lost.
---
mm/page-writeback.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
Index: linux-2.6-git/mm/page-writeback.c
===================================================================
--- linux-2.6-git.orig/mm/page-writeback.c 2006-12-18 17:24:41.000000000 +0100
+++ linux-2.6-git/mm/page-writeback.c 2006-12-18 17:26:56.000000000 +0100
@@ -872,8 +872,9 @@ int test_clear_page_dirty(struct page *p
* page is locked, which pins the address_space
*/
if (mapping_cap_account_dirty(mapping)) {
- if (must_clean_ptes)
- page_mkclean(page);
+ int cleaned = page_mkclean(page);
+ if (!must_clean_ptes && cleaned)
+ set_page_dirty(page);
dec_zone_page_state(page, NR_FILE_DIRTY);
}
return 1;
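The intended effect of the hunk above, reduced to a user-space toy: the pte
dirty bit is always cleared, but when the caller did not ask for the ptes to
be cleaned, a dirty pte is folded back into the page's dirty flag. This is a
sketch under simplifying assumptions (one pte per page, no locking, no
accounting); struct toy_dirty_page and the toy_* functions are invented for
illustration and only mirror the names used in the patch.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a page and the one pte that maps it. */
struct toy_dirty_page {
    bool pg_dirty;  /* models PG_dirty */
    bool pte_dirty; /* models the pte dirty bit */
};

/* Models page_mkclean(): clean the pte and report whether it was dirty. */
static int toy_page_mkclean(struct toy_dirty_page *p)
{
    int cleaned = p->pte_dirty;

    p->pte_dirty = false;
    return cleaned;
}

/*
 * Models the patched test_clear_page_dirty(). Returns 1 if the page was
 * dirty on entry, 0 otherwise (in which case nothing is touched).
 */
int toy_test_clear_page_dirty(struct toy_dirty_page *p, int must_clean_ptes)
{
    if (!p->pg_dirty)
        return 0;
    p->pg_dirty = false;

    int cleaned = toy_page_mkclean(p);
    if (!must_clean_ptes && cleaned)
        p->pg_dirty = true; /* stands in for set_page_dirty(page) */
    return 1;
}
```

With must_clean_ptes == 0 and a dirty pte, the page ends up dirty again, so
the modification made through the mapping is still scheduled for writeback.
With must_clean_ptes == 1 both bits end up clear.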
Andrei,
could you try Peter's patch (on top of Andrew's patch - it depends on
it, and wouldn't work on an unmodified -git kernel), but add the WARN_ON()
I mention in this email? You seem to be able to reproduce this easily.
Thanks.
On Mon, 18 Dec 2006, Peter Zijlstra wrote:
>
> This should be safe; page_mkclean walks the rmap and flips the pte's
> under the pte lock and records the dirty state while iterating.
> Concurrent faults will either do set_page_dirty() before we get around
> to doing it or vice versa, but dirty state is not lost.
Ok, I really liked this patch, but the more I thought about it, the more I
started to doubt the reasons for liking it.
I think we have some core fundamental problem here that this patch is
needed at all.
So let's think about this: we apparently have two cases of
"clear_page_dirty()":
- the one that really wants to clear the bit unconditionally (Andrew
calls this the "must_clean_ptes" case, which I personally find to be a
really confusing name, but whatever)
- the other case. The case that doesn't want to really clear the pte
dirty bits.
and I thought your patch made sense, because it saved away the pte state
in the page dirty state, and that matches my mental model, but the more I
think about it, the less sense that whole "the other case" situation makes
AT ALL.
Why does "the other case" exist at all? If you want to clear the dirty
page flag, what is _ever_ the reason for not wanting to drop PTE dirty
information? In other words, what possible reason can there ever be for
saying "I want this page to be clean", while at the same time saying "but
if it was dirty in the page tables, don't forget about that state".
So I absolutely detested Andrew's original patch, because that one made
zero sense at all even from a code standpoint. With your patch on top, it
all suddenly makes sense: at least you don't just leave dirty pages in the
PTE's with a "struct page" that is marked clean, and the end result is
undeniably at least _consistent_.
So Andrew's patch I can't stand, because the whole point of it seems to be
to leave the system in an inconsistent state (dirty in the pte's but
marked "clean"), and if we want to have that state, then we should just
revert _everything_ to the 2.6.18 situation, and not play these games at
all.
Andrew's patch with your patch on top makes me happy, because now we're
at least honoring all the basic rules (we don't get into an inconsistent
state), so on a local level it all makes sense. HOWEVER, I then don't
actually understand how it could ever actually make sense to ask for
"please clean the page, but don't actually clean it".
So _I_ think that we should add a honking huge WARN_ON() for this case.
Ie, do your patch, but instead of re-dirtying the page:
+ if (!must_clean_ptes && cleaned)
+ set_page_dirty(page);
we would do
+ if (!must_clean_ptes && cleaned) {
+ WARN_ON(1);
+ set_page_dirty(page);
+ }
and ask the people who see this problem to see if they get the WARN_ON()
(assuming it _fixes_ their data corruption).
Because whoever calls "clear_page_dirty()" without actually wanting to
clean the PTE's is really buggy: those dirty PTE's had better not exist.
Or maybe the WARN_ON() just points out _why_ somebody would want to do
something this insane. Right now I just can't see why it's a valid thing
to do.
Maybe I'm still confused.
Linus
On Mon, 2006-12-18 at 10:03 -0800, Linus Torvalds wrote:
> Andrei,
> could you try Peter's patch (on top of Andrew's patch - it depends on
> it, and wouldn't work on an unmodified -git kernel), but add the WARN_ON()
> I mention in this email? You seem to be able to reproduce this easily.
> Thanks.
I finally beat yum into submission and I hope to have rtorrent compiled
shortly.
> On Mon, 18 Dec 2006, Peter Zijlstra wrote:
> >
> > This should be safe; page_mkclean walks the rmap and flips the pte's
> > under the pte lock and records the dirty state while iterating.
> > Concurrent faults will either do set_page_dirty() before we get around
> > to doing it or vice versa, but dirty state is not lost.
>
> Ok, I really liked this patch, but the more I thought about it, the more I
> started to doubt the reasons for liking it.
>
> I think we have some core fundamental problem here that this patch is
> needed at all.
I agree, but I suspect this is like the buffered write deadlock Nick is
working on, in that it will require some proper filesystem surgery to
get right. Having the kernel working in the meantime has my
preference ;-)
> So let's think about this: we apparently have two cases of
> "clear_page_dirty()":
>
> - the one that really wants to clear the bit unconditionally (Andrew
> calls this the "must_clean_ptes" case, which I personally find to be a
> really confusing name, but whatever)
I'm probably worse with names so I'm not even going to try and fix that.
> - the other case. The case that doesn't want to really clear the pte
> dirty bits.
>
> and I thought your patch made sense, because it saved away the pte state
> in the page dirty state, and that matches my mental model, but the more I
> think about it, the less sense that whole "the other case" situation makes
> AT ALL.
>
> Why does "the other case" exist at all? If you want to clear the dirty
> page flag, what is _ever_ the reason for not wanting to drop PTE dirty
> information? In other words, what possible reason can there ever be for
> saying "I want this page to be clean", while at the same time saying "but
> if it was dirty in the page tables, don't forget about that state".
I have tried to get my head around this, and have so far failed. Andrew's
mail with the patch (great-grandparent to this mail) was the one that
made the most sense explaining it, afaics.
> So I absolutely detested Andrew's original patch, because that one made
> zero sense at all even from a code standpoint. With your patch on top, it
> all suddenly makes sense: at least you don't just leave dirty pages in the
> PTE's with a "struct page" that is marked clean, and the end result is
> undeniably at least _consistent_.
>
> So Andrew's patch I can't stand, because the whole point of it seems to be
> to leave the system in an inconsistent state (dirty in the pte's but
> marked "clean"), and if we want to have that state, then we should just
> revert _everything_ to the 2.6.18 situation, and not play these games at
> all.
>
> Andrew's patch with your patch on top makes me happy, because now we're
> at least honoring all the basic rules (we don't get into an inconsistent
> state), so on a local level it all makes sense. HOWEVER, I then don't
> actually understand how it could ever actually make sense to ask for
> "please clean the page, but don't actually clean it".
Somehow it loses track of the actual page content dirtiness when it does
the page buffer game.
Is this because page buffers are used to do sub-page sized writes
without RMW cycles?
Can't this case be avoided when the page is mapped, because at that
point the whole page will be resident anyway?
> So _I_ think that we should add a honking huge WARN_ON() for this case.
> Ie, do your patch, but instead of re-dirtying the page:
>
> + if (!must_clean_ptes && cleaned)
> + set_page_dirty(page);
>
> we would do
>
> + if (!must_clean_ptes && cleaned) {
> + WARN_ON(1);
> + set_page_dirty(page);
> + }
>
> and ask the people who see this problem to see if they get the WARN_ON()
> (assuming it _fixes_ their data corruption).
>
> Because whoever calls "clear_page_dirty()" without actually wanting to
> clean the PTE's is really buggy: those dirty PTE's had better not exist.
>
> Or maybe the WARN_ON() just points out _why_ somebody would want to do
> something this insane. Right now I just can't see why it's a valid thing
> to do.
Maybe, but I think Nick's mail here:
http://lkml.org/lkml/2006/12/18/59
shows a trace like that. I'm guessing that if we do the WARN_ON() some
folks might get a lot of output, perhaps WARN_ON_ONCE() ?
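To make the trade-off concrete, here is a rough user-space imitation of the
two behaviours. The TOY_WARN_ON macros below are NOT the kernel's WARN_ON()
and WARN_ON_ONCE() (for one, they do not return the condition); they are
simplified stand-ins, and count_warnings() is a made-up helper that counts
how often each variant fires.

```c
#include <stdio.h>

static int warn_count; /* warnings emitted since the last reset */

static void toy_warn(const char *file, int line)
{
    fprintf(stderr, "WARNING at %s:%d\n", file, line);
    warn_count++;
}

/* Warn every time the condition is true. */
#define TOY_WARN_ON(cond) \
    do { \
        if (cond) \
            toy_warn(__FILE__, __LINE__); \
    } while (0)

/* Warn only the first time the condition is true at this call site. */
#define TOY_WARN_ON_ONCE(cond) \
    do { \
        static int warned; \
        if ((cond) && !warned) { \
            warned = 1; \
            toy_warn(__FILE__, __LINE__); \
        } \
    } while (0)

/*
 * Hit the warning `hits` times and return how many messages were emitted.
 * The "once" state is per call site and is never reset, so only the first
 * call with once != 0 can report a warning.
 */
int count_warnings(int hits, int once)
{
    warn_count = 0;
    for (int i = 0; i < hits; i++) {
        if (once)
            TOY_WARN_ON_ONCE(1);
        else
            TOY_WARN_ON(1);
    }
    return warn_count;
}
```

The plain variant reports every hit, which is noisy but shows every path
that reaches the check; the once variant stays quiet after the first report
from a given call site.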
On Mon, 18 Dec 2006, Peter Zijlstra wrote:
> >
> > Or maybe the WARN_ON() just points out _why_ somebody would want to do
> > something this insane. Right now I just can't see why it's a valid thing
> > to do.
>
> Maybe, but I think Nick's mail here:
> http://lkml.org/lkml/2006/12/18/59
>
> shows a trace like that.
Sure, but I actually think that "try_to_free_buffers()" was buggy in the
first place, shouldn't have done what it did at all (it has NO business
clearing dirty data), and should be fixed with my other simple and clean
patch that just removes the crap.
But sadly, Andrei said that he still saw data corruption, which implies
that the problem had nothing to do with "try_to_free_buffers()" at all.
(On that note: Andrei - if you do test this out, I'd suggest applying my
patch too - the one that you already tested. It won't apply cleanly on top
of Andrew's patch, but it should be trivial to apply by hand, since you
really just want to remove the whole "if (ret) {...}" sequence. I realize
that it didn't make any difference for you, but applying that patch is
probably a good idea just to remove the noise for a codepath that you
already showed to not matter)
> I'm guessing that if we do the WARN_ON() some folks might get a lot of
> output, perhaps WARN_ON_ONCE() ?
Well, I'd rather get lots of noise to see all the paths that can cause
this. We've been concentrating mainly on one (try_to_free_buffers()), but
that one was already shown not to matter or at least not to be the _whole_
issue, so..
Linus
> (On that note: Andrei - if you do test this out, I'd suggest applying my
> patch too - the one that you already tested. It won't apply cleanly on top
> of Andrew's patch, but it should be trivial to apply by hand, since you
> really just want to remove the whole "if (ret) {...}" sequence. I realize
> that it didn't make any difference for you, but applying that patch is
> probably a good idea just to remove the noise for a codepath that you
> already showed to not matter)
I applied Linus' patch, Andrew's patch, and Peter Zijlstra's patches (the
last two). The combined unified patch is attached. I tested it and I have
no corruption.
diff --git a/fs/buffer.c b/fs/buffer.c
index d1f1b54..263f88e 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *pag
int ret = 0;
BUG_ON(!PageLocked(page));
- if (PageWriteback(page))
+ if (PageDirty(page) || PageWriteback(page))
return 0;
if (mapping == NULL) { /* can this still happen? */
@@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *pag
spin_lock(&mapping->private_lock);
ret = drop_buffers(page, &buffers_to_free);
spin_unlock(&mapping->private_lock);
- if (ret) {
- /*
- * If the filesystem writes its buffers by hand (eg ext3)
- * then we can have clean buffers against a dirty page. We
- * clean the page here; otherwise later reattachment of buffers
- * could encounter a non-uptodate page, which is unresolvable.
- * This only applies in the rare case where try_to_free_buffers
- * succeeds but the page is not freed.
- *
- * Also, during truncate, discard_buffer will have marked all
- * the page's buffers clean. We discover that here and clean
- * the page also.
- */
- if (test_clear_page_dirty(page))
- task_io_account_cancelled_write(PAGE_CACHE_SIZE);
- }
out:
if (buffers_to_free) {
struct buffer_head *bh = buffers_to_free;
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 0f05cab..760442f 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -1245,7 +1245,7 @@ retry:
wait_on_page_writeback(page);
if (PageWriteback(page) ||
- !test_clear_page_dirty(page)) {
+ !test_clear_page_dirty(page, 1)) {
unlock_page(page);
break;
}
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 1387749..da2bdb1 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -484,7 +484,7 @@ static int fuse_commit_write(struct file
spin_unlock(&fc->lock);
if (offset == 0 && to == PAGE_CACHE_SIZE) {
- clear_page_dirty(page);
+ clear_page_dirty(page, 0);
SetPageUptodate(page);
}
}
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index ed2c223..7b87875 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct
static void truncate_huge_page(struct page *page)
{
- clear_page_dirty(page);
+ clear_page_dirty(page, 1);
ClearPageUptodate(page);
remove_from_page_cache(page);
put_page(page);
diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c
index b1a1c72..47a6b62 100644
--- a/fs/jfs/jfs_metapage.c
+++ b/fs/jfs/jfs_metapage.c
@@ -773,7 +773,7 @@ #if MPS_PER_PAGE == 1
/* Retest mp->count since we may have released page lock */
if (test_bit(META_discard, &mp->flag) && !mp->count) {
- clear_page_dirty(page);
+ clear_page_dirty(page, 1);
ClearPageUptodate(page);
}
#else
diff --git a/fs/reiserfs/stree.c b/fs/reiserfs/stree.c
index 47e7027..a97e198 100644
--- a/fs/reiserfs/stree.c
+++ b/fs/reiserfs/stree.c
@@ -1459,7 +1459,7 @@ static void unmap_buffers(struct page *p
bh = next;
} while (bh != head);
if (PAGE_SIZE == bh->b_size) {
- clear_page_dirty(page);
+ clear_page_dirty(page, 0);
}
}
}
diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index b56eb75..d65ba84 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -343,7 +343,7 @@ xfs_start_page_writeback(
ASSERT(!PageWriteback(page));
set_page_writeback(page);
if (clear_dirty)
- clear_page_dirty(page);
+ clear_page_dirty(page, 1);
unlock_page(page);
if (!buffers) {
end_page_writeback(page);
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 4830a3b..175ab3c 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -253,13 +253,13 @@ #define ClearPageUncached(page) clear_bi
struct page; /* forward declaration */
-int test_clear_page_dirty(struct page *page);
+int test_clear_page_dirty(struct page *page, int must_clean_ptes);
int test_clear_page_writeback(struct page *page);
int test_set_page_writeback(struct page *page);
-static inline void clear_page_dirty(struct page *page)
+static inline void clear_page_dirty(struct page *page, int must_clean_ptes)
{
- test_clear_page_dirty(page);
+ test_clear_page_dirty(page, must_clean_ptes);
}
static inline void set_page_writeback(struct page *page)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 237107c..561d702 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -848,7 +848,7 @@ EXPORT_SYMBOL(set_page_dirty_lock);
* Clear a page's dirty flag, while caring for dirty memory accounting.
* Returns true if the page was previously dirty.
*/
-int test_clear_page_dirty(struct page *page)
+int test_clear_page_dirty(struct page *page, int must_clean_ptes)
{
struct address_space *mapping = page_mapping(page);
unsigned long flags;
@@ -866,7 +866,9 @@ int test_clear_page_dirty(struct page *p
* page is locked, which pins the address_space
*/
if (mapping_cap_account_dirty(mapping)) {
- page_mkclean(page);
+ int cleaned = page_mkclean(page);
+ if (!must_clean_ptes && cleaned)
+ set_page_dirty(page);
dec_zone_page_state(page, NR_FILE_DIRTY);
}
return 1;
diff --git a/mm/rmap.c b/mm/rmap.c
index d8a842a..3f9061e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -448,7 +448,7 @@ static int page_mkclean_one(struct page
goto unlock;
entry = ptep_get_and_clear(mm, address, pte);
- entry = pte_mkclean(entry);
+ /*entry = pte_mkclean(entry);*/
entry = pte_wrprotect(entry);
ptep_establish(vma, address, pte, entry);
lazy_mmu_prot_update(entry);
diff --git a/mm/truncate.c b/mm/truncate.c
index 9bfb8e8..cafa843 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -70,7 +70,7 @@ truncate_complete_page(struct address_sp
if (PagePrivate(page))
do_invalidatepage(page, 0);
- if (test_clear_page_dirty(page))
+ if (test_clear_page_dirty(page, 1))
task_io_account_cancelled_write(PAGE_CACHE_SIZE);
ClearPageUptodate(page);
ClearPageMappedToDisk(page);
@@ -386,7 +386,7 @@ int invalidate_inode_pages2_range(struct
PAGE_CACHE_SIZE, 0);
}
}
- was_dirty = test_clear_page_dirty(page);
+ was_dirty = test_clear_page_dirty(page, 0);
if (!invalidate_complete_page2(mapping, page)) {
if (was_dirty)
set_page_dirty(page);
>
> > I'm guessing that if we do the WARN_ON() some folks might get a lot of
> > output, perhaps WARN_ON_ONCE() ?
>
> Well, I'd rather get lots of noise to see all the paths that can cause
> this. We've been concentrating mainly on one (try_to_free_buffers()), but
> that one was already shown not to matter or at least not to be the _whole_
> issue, so..
>
> Linus
On Mon, 2006-12-18 at 21:04 +0200, Andrei Popa wrote:
> diff --git a/mm/rmap.c b/mm/rmap.c
> index d8a842a..3f9061e 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -448,7 +448,7 @@ static int page_mkclean_one(struct page
> goto unlock;
>
> entry = ptep_get_and_clear(mm, address, pte);
> - entry = pte_mkclean(entry);
> + /*entry = pte_mkclean(entry);*/
> entry = pte_wrprotect(entry);
> ptep_establish(vma, address, pte, entry);
> lazy_mmu_prot_update(entry);
Please drop this chunk; it will always make the problem go away.
On Mon, 18 Dec 2006, Andrei Popa wrote:
>
> I applied Linus' patch, Andrew's patch, and Peter Zijlstra's patches (the
> last two). The combined unified patch is attached. I tested it and I have
> no corruption.
That wasn't very interesting, because you also had the patch that just
disabled "page_mkclean_one()" entirely:
> diff --git a/mm/rmap.c b/mm/rmap.c
> index d8a842a..3f9061e 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -448,7 +448,7 @@ static int page_mkclean_one(struct page
> goto unlock;
>
> entry = ptep_get_and_clear(mm, address, pte);
> - entry = pte_mkclean(entry);
> + /*entry = pte_mkclean(entry);*/
> entry = pte_wrprotect(entry);
> ptep_establish(vma, address, pte, entry);
> lazy_mmu_prot_update(entry);
The above patch is bad. It's always going to hide the bug, but it hides it
by just not doing anything at all. So any patch combination that contains
that patch will probably _always_ fix your problem, but it won't be an
interesting patch..
So can you remove that small fragment? Also, it would be nice if you added
the WARN_ON() to this sequence in mm/page-writeback.c:
+ if (!must_clean_ptes && cleaned)
+ set_page_dirty(page);
just make it do a WARN_ON() if this ever triggers.
Then, IF the corruption is gone, we'd love to see the WARN_ON results..
Linus
On Mon, 2006-12-18 at 11:18 -0800, Linus Torvalds wrote:
>
> On Mon, 18 Dec 2006, Andrei Popa wrote:
> >
> > I applied Linus' patch, Andrew's patch, and Peter Zijlstra's patches (the
> > last two). The combined unified patch is attached. I tested it and I have
> > no corruption.
>
> That wasn't very interesting, because you also had the patch that just
> disabled "page_mkclean_one()" entirely:
>
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index d8a842a..3f9061e 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -448,7 +448,7 @@ static int page_mkclean_one(struct page
> > goto unlock;
> >
> > entry = ptep_get_and_clear(mm, address, pte);
> > - entry = pte_mkclean(entry);
> > + /*entry = pte_mkclean(entry);*/
> > entry = pte_wrprotect(entry);
> > ptep_establish(vma, address, pte, entry);
> > lazy_mmu_prot_update(entry);
>
> The above patch is bad. It's always going to hide the bug, but it hides it
> by just not doing anything at all. So any patch combination that contains
> that patch will probably _always_ fix your problem, but it won't be an
> interesting patch..
>
> So can you remove that small fragment? Also, it would be nice if you added
> the WARN_ON() to this sequence in mm/page-writeback.c:
>
> + if (!must_clean_ptes && cleaned)
> + set_page_dirty(page);
>
> just make it do a WARN_ON() if this ever triggers.
>
> Then, IF the corruption is gone, we'd love to see the WARN_ON results..
>
> Linus
I dropped that patch and added WARN_ON(1); the unified patch is
attached.
I got corruption: "Hash check on download completion found bad chunks,
consider using "safe_sync"."
There is no message from WARN_ON(1) in dmesg. My .config is attached.
diff --git a/fs/buffer.c b/fs/buffer.c
index d1f1b54..263f88e 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *pag
int ret = 0;
BUG_ON(!PageLocked(page));
- if (PageWriteback(page))
+ if (PageDirty(page) || PageWriteback(page))
return 0;
if (mapping == NULL) { /* can this still happen? */
@@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *pag
spin_lock(&mapping->private_lock);
ret = drop_buffers(page, &buffers_to_free);
spin_unlock(&mapping->private_lock);
- if (ret) {
- /*
- * If the filesystem writes its buffers by hand (eg ext3)
- * then we can have clean buffers against a dirty page. We
- * clean the page here; otherwise later reattachment of buffers
- * could encounter a non-uptodate page, which is unresolvable.
- * This only applies in the rare case where try_to_free_buffers
- * succeeds but the page is not freed.
- *
- * Also, during truncate, discard_buffer will have marked all
- * the page's buffers clean. We discover that here and clean
- * the page also.
- */
- if (test_clear_page_dirty(page))
- task_io_account_cancelled_write(PAGE_CACHE_SIZE);
- }
out:
if (buffers_to_free) {
struct buffer_head *bh = buffers_to_free;
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 0f05cab..760442f 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -1245,7 +1245,7 @@ retry:
wait_on_page_writeback(page);
if (PageWriteback(page) ||
- !test_clear_page_dirty(page)) {
+ !test_clear_page_dirty(page, 1)) {
unlock_page(page);
break;
}
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 1387749..da2bdb1 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -484,7 +484,7 @@ static int fuse_commit_write(struct file
spin_unlock(&fc->lock);
if (offset == 0 && to == PAGE_CACHE_SIZE) {
- clear_page_dirty(page);
+ clear_page_dirty(page, 0);
SetPageUptodate(page);
}
}
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index ed2c223..7b87875 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct
static void truncate_huge_page(struct page *page)
{
- clear_page_dirty(page);
+ clear_page_dirty(page, 1);
ClearPageUptodate(page);
remove_from_page_cache(page);
put_page(page);
diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c
index b1a1c72..47a6b62 100644
--- a/fs/jfs/jfs_metapage.c
+++ b/fs/jfs/jfs_metapage.c
@@ -773,7 +773,7 @@ #if MPS_PER_PAGE == 1
/* Retest mp->count since we may have released page lock */
if (test_bit(META_discard, &mp->flag) && !mp->count) {
- clear_page_dirty(page);
+ clear_page_dirty(page, 1);
ClearPageUptodate(page);
}
#else
diff --git a/fs/reiserfs/stree.c b/fs/reiserfs/stree.c
index 47e7027..a97e198 100644
--- a/fs/reiserfs/stree.c
+++ b/fs/reiserfs/stree.c
@@ -1459,7 +1459,7 @@ static void unmap_buffers(struct page *p
bh = next;
} while (bh != head);
if (PAGE_SIZE == bh->b_size) {
- clear_page_dirty(page);
+ clear_page_dirty(page, 0);
}
}
}
diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index b56eb75..d65ba84 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -343,7 +343,7 @@ xfs_start_page_writeback(
ASSERT(!PageWriteback(page));
set_page_writeback(page);
if (clear_dirty)
- clear_page_dirty(page);
+ clear_page_dirty(page, 1);
unlock_page(page);
if (!buffers) {
end_page_writeback(page);
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 4830a3b..175ab3c 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -253,13 +253,13 @@ #define ClearPageUncached(page) clear_bi
struct page; /* forward declaration */
-int test_clear_page_dirty(struct page *page);
+int test_clear_page_dirty(struct page *page, int must_clean_ptes);
int test_clear_page_writeback(struct page *page);
int test_set_page_writeback(struct page *page);
-static inline void clear_page_dirty(struct page *page)
+static inline void clear_page_dirty(struct page *page, int must_clean_ptes)
{
- test_clear_page_dirty(page);
+ test_clear_page_dirty(page, must_clean_ptes);
}
static inline void set_page_writeback(struct page *page)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 237107c..f7e0cc8 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -848,7 +848,7 @@ EXPORT_SYMBOL(set_page_dirty_lock);
* Clear a page's dirty flag, while caring for dirty memory accounting.
* Returns true if the page was previously dirty.
*/
-int test_clear_page_dirty(struct page *page)
+int test_clear_page_dirty(struct page *page, int must_clean_ptes)
{
struct address_space *mapping = page_mapping(page);
unsigned long flags;
@@ -866,7 +866,12 @@ int test_clear_page_dirty(struct page *p
* page is locked, which pins the address_space
*/
if (mapping_cap_account_dirty(mapping)) {
- page_mkclean(page);
+ int cleaned = page_mkclean(page);
+ if (!must_clean_ptes && cleaned){
+ WARN_ON(1);
+ set_page_dirty(page);
+ }
+
dec_zone_page_state(page, NR_FILE_DIRTY);
}
return 1;
diff --git a/mm/rmap.c b/mm/rmap.c
diff --git a/mm/truncate.c b/mm/truncate.c
index 9bfb8e8..cafa843 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -70,7 +70,7 @@ truncate_complete_page(struct address_sp
if (PagePrivate(page))
do_invalidatepage(page, 0);
- if (test_clear_page_dirty(page))
+ if (test_clear_page_dirty(page, 1))
task_io_account_cancelled_write(PAGE_CACHE_SIZE);
ClearPageUptodate(page);
ClearPageMappedToDisk(page);
@@ -386,7 +386,7 @@ int invalidate_inode_pages2_range(struct
PAGE_CACHE_SIZE, 0);
}
}
- was_dirty = test_clear_page_dirty(page);
+ was_dirty = test_clear_page_dirty(page, 0);
if (!invalidate_complete_page2(mapping, page)) {
if (was_dirty)
set_page_dirty(page);
#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.20-rc1
# Sun Dec 17 01:52:12 2006
#
CONFIG_X86_32=y
CONFIG_GENERIC_TIME=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_X86=y
CONFIG_MMU=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_DMI=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
#
# Code maturity level options
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32
#
# General setup
#
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
# CONFIG_IPC_NS is not set
# CONFIG_POSIX_MQUEUE is not set
# CONFIG_BSD_PROCESS_ACCT is not set
# CONFIG_TASKSTATS is not set
# CONFIG_UTS_NS is not set
# CONFIG_AUDIT is not set
CONFIG_IKCONFIG=y
# CONFIG_IKCONFIG_PROC is not set
# CONFIG_CPUSETS is not set
# CONFIG_SYSFS_DEPRECATED is not set
# CONFIG_RELAY is not set
CONFIG_INITRAMFS_SOURCE=""
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_SYSCTL=y
# CONFIG_EMBEDDED is not set
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
# CONFIG_KALLSYMS_ALL is not set
# CONFIG_KALLSYMS_EXTRA_PASS is not set
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_EPOLL=y
CONFIG_SHMEM=y
CONFIG_SLAB=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_RT_MUTEXES=y
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0
# CONFIG_SLOB is not set
#
# Loadable module support
#
# CONFIG_MODULES is not set
CONFIG_STOP_MACHINE=y
#
# Block layer
#
CONFIG_BLOCK=y
# CONFIG_LBD is not set
# CONFIG_BLK_DEV_IO_TRACE is not set
# CONFIG_LSF is not set
#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
# CONFIG_DEFAULT_AS is not set
# CONFIG_DEFAULT_DEADLINE is not set
CONFIG_DEFAULT_CFQ=y
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="cfq"
#
# Processor type and features
#
CONFIG_SMP=y
CONFIG_X86_PC=y
# CONFIG_X86_ELAN is not set
# CONFIG_X86_VOYAGER is not set
# CONFIG_X86_NUMAQ is not set
# CONFIG_X86_SUMMIT is not set
# CONFIG_X86_BIGSMP is not set
# CONFIG_X86_VISWS is not set
# CONFIG_X86_GENERICARCH is not set
# CONFIG_X86_ES7000 is not set
# CONFIG_PARAVIRT is not set
# CONFIG_M386 is not set
# CONFIG_M486 is not set
# CONFIG_M586 is not set
# CONFIG_M586TSC is not set
# CONFIG_M586MMX is not set
# CONFIG_M686 is not set
# CONFIG_MPENTIUMII is not set
# CONFIG_MPENTIUMIII is not set
CONFIG_MPENTIUMM=y
# CONFIG_MCORE2 is not set
# CONFIG_MPENTIUM4 is not set
# CONFIG_MK6 is not set
# CONFIG_MK7 is not set
# CONFIG_MK8 is not set
# CONFIG_MCRUSOE is not set
# CONFIG_MEFFICEON is not set
# CONFIG_MWINCHIPC6 is not set
# CONFIG_MWINCHIP2 is not set
# CONFIG_MWINCHIP3D is not set
# CONFIG_MGEODEGX1 is not set
# CONFIG_MGEODE_LX is not set
# CONFIG_MCYRIXIII is not set
# CONFIG_MVIAC3_2 is not set
# CONFIG_X86_GENERIC is not set
CONFIG_X86_CMPXCHG=y
CONFIG_X86_XADD=y
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
# CONFIG_ARCH_HAS_ILOG2_U32 is not set
# CONFIG_ARCH_HAS_ILOG2_U64 is not set
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_INVLPG=y
CONFIG_X86_BSWAP=y
CONFIG_X86_POPAD_OK=y
CONFIG_X86_CMPXCHG64=y
CONFIG_X86_GOOD_APIC=y
CONFIG_X86_INTEL_USERCOPY=y
CONFIG_X86_USE_PPRO_CHECKSUM=y
CONFIG_X86_TSC=y
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_NR_CPUS=8
# CONFIG_SCHED_SMT is not set
CONFIG_SCHED_MC=y
# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y
CONFIG_PREEMPT_BKL=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_MCE=y
CONFIG_X86_MCE_NONFATAL=y
CONFIG_X86_MCE_P4THERMAL=y
CONFIG_VM86=y
# CONFIG_TOSHIBA is not set
# CONFIG_I8K is not set
# CONFIG_X86_REBOOTFIXUPS is not set
# CONFIG_MICROCODE is not set
# CONFIG_X86_MSR is not set
# CONFIG_X86_CPUID is not set
#
# Firmware Drivers
#
# CONFIG_EDD is not set
# CONFIG_DELL_RBU is not set
# CONFIG_DCDBAS is not set
# CONFIG_NOHIGHMEM is not set
CONFIG_HIGHMEM4G=y
# CONFIG_HIGHMEM64G is not set
CONFIG_PAGE_OFFSET=0xC0000000
CONFIG_HIGHMEM=y
CONFIG_ARCH_FLATMEM_ENABLE=y
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_SELECT_MEMORY_MODEL=y
CONFIG_ARCH_POPULATES_NODE_MAP=y
CONFIG_SELECT_MEMORY_MODEL=y
CONFIG_FLATMEM_MANUAL=y
# CONFIG_DISCONTIGMEM_MANUAL is not set
# CONFIG_SPARSEMEM_MANUAL is not set
CONFIG_FLATMEM=y
CONFIG_FLAT_NODE_MEM_MAP=y
CONFIG_SPARSEMEM_STATIC=y
CONFIG_SPLIT_PTLOCK_CPUS=4
# CONFIG_RESOURCES_64BIT is not set
# CONFIG_HIGHPTE is not set
# CONFIG_MATH_EMULATION is not set
CONFIG_MTRR=y
# CONFIG_EFI is not set
CONFIG_IRQBALANCE=y
# CONFIG_SECCOMP is not set
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
# CONFIG_KEXEC is not set
# CONFIG_CRASH_DUMP is not set
# CONFIG_RELOCATABLE is not set
CONFIG_PHYSICAL_ALIGN=0x100000
CONFIG_HOTPLUG_CPU=y
# CONFIG_COMPAT_VDSO is not set
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
#
# Power management options (ACPI, APM)
#
CONFIG_PM=y
# CONFIG_PM_LEGACY is not set
# CONFIG_PM_DEBUG is not set
# CONFIG_PM_SYSFS_DEPRECATED is not set
CONFIG_SOFTWARE_SUSPEND=y
CONFIG_PM_STD_PARTITION=""
CONFIG_SUSPEND_SMP=y
#
# ACPI (Advanced Configuration and Power Interface) Support
#
CONFIG_ACPI=y
CONFIG_ACPI_SLEEP=y
CONFIG_ACPI_SLEEP_PROC_FS=y
# CONFIG_ACPI_SLEEP_PROC_SLEEP is not set
CONFIG_ACPI_AC=y
CONFIG_ACPI_BATTERY=y
CONFIG_ACPI_BUTTON=y
CONFIG_ACPI_VIDEO=y
CONFIG_ACPI_HOTKEY=y
CONFIG_ACPI_FAN=y
# CONFIG_ACPI_DOCK is not set
CONFIG_ACPI_PROCESSOR=y
CONFIG_ACPI_HOTPLUG_CPU=y
CONFIG_ACPI_THERMAL=y
# CONFIG_ACPI_ASUS is not set
# CONFIG_ACPI_IBM is not set
# CONFIG_ACPI_TOSHIBA is not set
# CONFIG_ACPI_CUSTOM_DSDT is not set
CONFIG_ACPI_BLACKLIST_YEAR=0
# CONFIG_ACPI_DEBUG is not set
CONFIG_ACPI_EC=y
CONFIG_ACPI_POWER=y
CONFIG_ACPI_SYSTEM=y
CONFIG_X86_PM_TIMER=y
CONFIG_ACPI_CONTAINER=y
#
# APM (Advanced Power Management) BIOS Support
#
# CONFIG_APM is not set
#
# CPU Frequency scaling
#
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_TABLE=y
# CONFIG_CPU_FREQ_DEBUG is not set
CONFIG_CPU_FREQ_STAT=y
# CONFIG_CPU_FREQ_STAT_DETAILS is not set
CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=y
CONFIG_CPU_FREQ_GOV_USERSPACE=y
CONFIG_CPU_FREQ_GOV_ONDEMAND=y
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=y
#
# CPUFreq processor drivers
#
CONFIG_X86_ACPI_CPUFREQ=y
# CONFIG_X86_POWERNOW_K6 is not set
# CONFIG_X86_POWERNOW_K7 is not set
# CONFIG_X86_POWERNOW_K8 is not set
# CONFIG_X86_GX_SUSPMOD is not set
CONFIG_X86_SPEEDSTEP_CENTRINO=y
CONFIG_X86_SPEEDSTEP_CENTRINO_ACPI=y
# CONFIG_X86_SPEEDSTEP_CENTRINO_TABLE is not set
CONFIG_X86_SPEEDSTEP_ICH=y
# CONFIG_X86_SPEEDSTEP_SMI is not set
# CONFIG_X86_P4_CLOCKMOD is not set
# CONFIG_X86_CPUFREQ_NFORCE2 is not set
# CONFIG_X86_LONGRUN is not set
# CONFIG_X86_LONGHAUL is not set
#
# shared options
#
# CONFIG_X86_ACPI_CPUFREQ_PROC_INTF is not set
CONFIG_X86_SPEEDSTEP_LIB=y
# CONFIG_X86_SPEEDSTEP_RELAXED_CAP_CHECK is not set
#
# Bus options (PCI, PCMCIA, EISA, MCA, ISA)
#
CONFIG_PCI=y
# CONFIG_PCI_GOBIOS is not set
# CONFIG_PCI_GOMMCONFIG is not set
# CONFIG_PCI_GODIRECT is not set
CONFIG_PCI_GOANY=y
CONFIG_PCI_BIOS=y
CONFIG_PCI_DIRECT=y
CONFIG_PCI_MMCONFIG=y
# CONFIG_PCIEPORTBUS is not set
CONFIG_PCI_MSI=y
# CONFIG_PCI_MULTITHREAD_PROBE is not set
# CONFIG_PCI_DEBUG is not set
# CONFIG_HT_IRQ is not set
CONFIG_ISA_DMA_API=y
# CONFIG_ISA is not set
# CONFIG_MCA is not set
# CONFIG_SCx200 is not set
#
# PCCARD (PCMCIA/CardBus) support
#
# CONFIG_PCCARD is not set
#
# PCI Hotplug Support
#
# CONFIG_HOTPLUG_PCI is not set
#
# Executable file formats
#
CONFIG_BINFMT_ELF=y
CONFIG_BINFMT_AOUT=y
CONFIG_BINFMT_MISC=y
#
# Networking
#
CONFIG_NET=y
#
# Networking options
#
# CONFIG_NETDEBUG is not set
CONFIG_PACKET=y
CONFIG_PACKET_MMAP=y
CONFIG_UNIX=y
# CONFIG_NET_KEY is not set
CONFIG_INET=y
# CONFIG_IP_MULTICAST is not set
# CONFIG_IP_ADVANCED_ROUTER is not set
CONFIG_IP_FIB_HASH=y
# CONFIG_IP_PNP is not set
# CONFIG_NET_IPIP is not set
# CONFIG_NET_IPGRE is not set
# CONFIG_ARPD is not set
# CONFIG_SYN_COOKIES is not set
# CONFIG_INET_AH is not set
# CONFIG_INET_ESP is not set
# CONFIG_INET_IPCOMP is not set
# CONFIG_INET_XFRM_TUNNEL is not set
# CONFIG_INET_TUNNEL is not set
# CONFIG_INET_XFRM_MODE_TRANSPORT is not set
# CONFIG_INET_XFRM_MODE_TUNNEL is not set
# CONFIG_INET_XFRM_MODE_BEET is not set
# CONFIG_INET_DIAG is not set
# CONFIG_TCP_CONG_ADVANCED is not set
CONFIG_TCP_CONG_CUBIC=y
CONFIG_DEFAULT_TCP_CONG="cubic"
# CONFIG_TCP_MD5SIG is not set
# CONFIG_IPV6 is not set
# CONFIG_INET6_XFRM_TUNNEL is not set
# CONFIG_INET6_TUNNEL is not set
# CONFIG_NETWORK_SECMARK is not set
# CONFIG_NETFILTER is not set
#
# DCCP Configuration (EXPERIMENTAL)
#
# CONFIG_IP_DCCP is not set
#
# SCTP Configuration (EXPERIMENTAL)
#
# CONFIG_IP_SCTP is not set
#
# TIPC Configuration (EXPERIMENTAL)
#
# CONFIG_TIPC is not set
# CONFIG_ATM is not set
# CONFIG_BRIDGE is not set
# CONFIG_VLAN_8021Q is not set
# CONFIG_DECNET is not set
# CONFIG_LLC2 is not set
# CONFIG_IPX is not set
# CONFIG_ATALK is not set
# CONFIG_X25 is not set
# CONFIG_LAPB is not set
# CONFIG_ECONET is not set
# CONFIG_WAN_ROUTER is not set
#
# QoS and/or fair queueing
#
# CONFIG_NET_SCHED is not set
#
# Network testing
#
# CONFIG_NET_PKTGEN is not set
# CONFIG_HAMRADIO is not set
# CONFIG_IRDA is not set
CONFIG_BT=y
CONFIG_BT_L2CAP=y
CONFIG_BT_SCO=y
CONFIG_BT_RFCOMM=y
CONFIG_BT_RFCOMM_TTY=y
CONFIG_BT_BNEP=y
# CONFIG_BT_BNEP_MC_FILTER is not set
# CONFIG_BT_BNEP_PROTO_FILTER is not set
CONFIG_BT_HIDP=y
#
# Bluetooth device drivers
#
CONFIG_BT_HCIUSB=y
# CONFIG_BT_HCIUSB_SCO is not set
# CONFIG_BT_HCIUART is not set
# CONFIG_BT_HCIBCM203X is not set
# CONFIG_BT_HCIBPA10X is not set
# CONFIG_BT_HCIBFUSB is not set
# CONFIG_BT_HCIVHCI is not set
# CONFIG_IEEE80211 is not set
CONFIG_WIRELESS_EXT=y
#
# Device Drivers
#
#
# Generic Driver Options
#
# CONFIG_STANDALONE is not set
# CONFIG_PREVENT_FIRMWARE_BUILD is not set
CONFIG_FW_LOADER=y
# CONFIG_DEBUG_DRIVER is not set
# CONFIG_SYS_HYPERVISOR is not set
#
# Connector - unified userspace <-> kernelspace linker
#
# CONFIG_CONNECTOR is not set
#
# Memory Technology Devices (MTD)
#
# CONFIG_MTD is not set
#
# Parallel port support
#
# CONFIG_PARPORT is not set
#
# Plug and Play support
#
CONFIG_PNP=y
# CONFIG_PNP_DEBUG is not set
#
# Protocols
#
CONFIG_PNPACPI=y
#
# Block devices
#
CONFIG_BLK_DEV_FD=y
# CONFIG_BLK_CPQ_DA is not set
# CONFIG_BLK_CPQ_CISS_DA is not set
# CONFIG_BLK_DEV_DAC960 is not set
# CONFIG_BLK_DEV_UMEM is not set
# CONFIG_BLK_DEV_COW_COMMON is not set
CONFIG_BLK_DEV_LOOP=y
# CONFIG_BLK_DEV_CRYPTOLOOP is not set
# CONFIG_BLK_DEV_NBD is not set
# CONFIG_BLK_DEV_SX8 is not set
# CONFIG_BLK_DEV_UB is not set
# CONFIG_BLK_DEV_RAM is not set
# CONFIG_BLK_DEV_INITRD is not set
# CONFIG_CDROM_PKTCDVD is not set
# CONFIG_ATA_OVER_ETH is not set
#
# Misc devices
#
# CONFIG_IBM_ASM is not set
# CONFIG_SGI_IOC4 is not set
# CONFIG_TIFM_CORE is not set
# CONFIG_MSI_LAPTOP is not set
#
# ATA/ATAPI/MFM/RLL support
#
CONFIG_IDE=y
CONFIG_BLK_DEV_IDE=y
#
# Please see Documentation/ide.txt for help/info on IDE drives
#
# CONFIG_BLK_DEV_IDE_SATA is not set
# CONFIG_BLK_DEV_HD_IDE is not set
CONFIG_BLK_DEV_IDEDISK=y
CONFIG_IDEDISK_MULTI_MODE=y
CONFIG_BLK_DEV_IDECD=y
# CONFIG_BLK_DEV_IDETAPE is not set
# CONFIG_BLK_DEV_IDEFLOPPY is not set
CONFIG_BLK_DEV_IDESCSI=y
# CONFIG_IDE_TASK_IOCTL is not set
#
# IDE chipset support/bugfixes
#
CONFIG_IDE_GENERIC=y
# CONFIG_BLK_DEV_CMD640 is not set
# CONFIG_BLK_DEV_IDEPNP is not set
CONFIG_BLK_DEV_IDEPCI=y
CONFIG_IDEPCI_SHARE_IRQ=y
# CONFIG_BLK_DEV_OFFBOARD is not set
CONFIG_BLK_DEV_GENERIC=y
# CONFIG_BLK_DEV_OPTI621 is not set
# CONFIG_BLK_DEV_RZ1000 is not set
CONFIG_BLK_DEV_IDEDMA_PCI=y
# CONFIG_BLK_DEV_IDEDMA_FORCED is not set
CONFIG_IDEDMA_PCI_AUTO=y
# CONFIG_IDEDMA_ONLYDISK is not set
# CONFIG_BLK_DEV_AEC62XX is not set
# CONFIG_BLK_DEV_ALI15X3 is not set
# CONFIG_BLK_DEV_AMD74XX is not set
# CONFIG_BLK_DEV_ATIIXP is not set
# CONFIG_BLK_DEV_CMD64X is not set
# CONFIG_BLK_DEV_TRIFLEX is not set
# CONFIG_BLK_DEV_CY82C693 is not set
# CONFIG_BLK_DEV_CS5520 is not set
# CONFIG_BLK_DEV_CS5530 is not set
# CONFIG_BLK_DEV_CS5535 is not set
# CONFIG_BLK_DEV_HPT34X is not set
# CONFIG_BLK_DEV_HPT366 is not set
# CONFIG_BLK_DEV_JMICRON is not set
# CONFIG_BLK_DEV_SC1200 is not set
CONFIG_BLK_DEV_PIIX=y
# CONFIG_BLK_DEV_IT821X is not set
# CONFIG_BLK_DEV_NS87415 is not set
# CONFIG_BLK_DEV_PDC202XX_OLD is not set
# CONFIG_BLK_DEV_PDC202XX_NEW is not set
# CONFIG_BLK_DEV_SVWKS is not set
# CONFIG_BLK_DEV_SIIMAGE is not set
# CONFIG_BLK_DEV_SIS5513 is not set
# CONFIG_BLK_DEV_SLC90E66 is not set
# CONFIG_BLK_DEV_TRM290 is not set
# CONFIG_BLK_DEV_VIA82CXXX is not set
# CONFIG_IDE_ARM is not set
CONFIG_BLK_DEV_IDEDMA=y
# CONFIG_IDEDMA_IVB is not set
CONFIG_IDEDMA_AUTO=y
# CONFIG_BLK_DEV_HD is not set
#
# SCSI device support
#
# CONFIG_RAID_ATTRS is not set
CONFIG_SCSI=y
# CONFIG_SCSI_TGT is not set
# CONFIG_SCSI_NETLINK is not set
CONFIG_SCSI_PROC_FS=y
#
# SCSI support type (disk, tape, CD-ROM)
#
CONFIG_BLK_DEV_SD=y
# CONFIG_CHR_DEV_ST is not set
# CONFIG_CHR_DEV_OSST is not set
CONFIG_BLK_DEV_SR=y
# CONFIG_BLK_DEV_SR_VENDOR is not set
CONFIG_CHR_DEV_SG=y
# CONFIG_CHR_DEV_SCH is not set
#
# Some SCSI devices (e.g. CD jukebox) support multiple LUNs
#
CONFIG_SCSI_MULTI_LUN=y
# CONFIG_SCSI_CONSTANTS is not set
# CONFIG_SCSI_LOGGING is not set
# CONFIG_SCSI_SCAN_ASYNC is not set
#
# SCSI Transports
#
# CONFIG_SCSI_SPI_ATTRS is not set
# CONFIG_SCSI_FC_ATTRS is not set
# CONFIG_SCSI_ISCSI_ATTRS is not set
# CONFIG_SCSI_SAS_ATTRS is not set
# CONFIG_SCSI_SAS_LIBSAS is not set
#
# SCSI low-level drivers
#
# CONFIG_ISCSI_TCP is not set
# CONFIG_BLK_DEV_3W_XXXX_RAID is not set
# CONFIG_SCSI_3W_9XXX is not set
# CONFIG_SCSI_ACARD is not set
# CONFIG_SCSI_AACRAID is not set
# CONFIG_SCSI_AIC7XXX is not set
# CONFIG_SCSI_AIC7XXX_OLD is not set
# CONFIG_SCSI_AIC79XX is not set
# CONFIG_SCSI_AIC94XX is not set
# CONFIG_SCSI_DPT_I2O is not set
# CONFIG_SCSI_ADVANSYS is not set
# CONFIG_SCSI_ARCMSR is not set
# CONFIG_MEGARAID_NEWGEN is not set
# CONFIG_MEGARAID_LEGACY is not set
# CONFIG_MEGARAID_SAS is not set
# CONFIG_SCSI_HPTIOP is not set
# CONFIG_SCSI_BUSLOGIC is not set
# CONFIG_SCSI_DMX3191D is not set
# CONFIG_SCSI_EATA is not set
# CONFIG_SCSI_FUTURE_DOMAIN is not set
# CONFIG_SCSI_GDTH is not set
# CONFIG_SCSI_IPS is not set
# CONFIG_SCSI_INITIO is not set
# CONFIG_SCSI_INIA100 is not set
# CONFIG_SCSI_STEX is not set
# CONFIG_SCSI_SYM53C8XX_2 is not set
# CONFIG_SCSI_IPR is not set
# CONFIG_SCSI_QLOGIC_1280 is not set
# CONFIG_SCSI_QLA_FC is not set
# CONFIG_SCSI_QLA_ISCSI is not set
# CONFIG_SCSI_LPFC is not set
# CONFIG_SCSI_DC395x is not set
# CONFIG_SCSI_DC390T is not set
# CONFIG_SCSI_NSP32 is not set
# CONFIG_SCSI_DEBUG is not set
# CONFIG_SCSI_SRP is not set
#
# Serial ATA (prod) and Parallel ATA (experimental) drivers
#
CONFIG_ATA=y
CONFIG_SATA_AHCI=y
# CONFIG_SATA_SVW is not set
CONFIG_ATA_PIIX=y
# CONFIG_SATA_MV is not set
# CONFIG_SATA_NV is not set
# CONFIG_PDC_ADMA is not set
# CONFIG_SATA_QSTOR is not set
# CONFIG_SATA_PROMISE is not set
# CONFIG_SATA_SX4 is not set
# CONFIG_SATA_SIL is not set
# CONFIG_SATA_SIL24 is not set
# CONFIG_SATA_SIS is not set
# CONFIG_SATA_ULI is not set
# CONFIG_SATA_VIA is not set
# CONFIG_SATA_VITESSE is not set
CONFIG_SATA_INTEL_COMBINED=y
# CONFIG_PATA_ALI is not set
# CONFIG_PATA_AMD is not set
# CONFIG_PATA_ARTOP is not set
# CONFIG_PATA_ATIIXP is not set
# CONFIG_PATA_CMD64X is not set
# CONFIG_PATA_CS5520 is not set
# CONFIG_PATA_CS5530 is not set
# CONFIG_PATA_CS5535 is not set
# CONFIG_PATA_CYPRESS is not set
# CONFIG_PATA_EFAR is not set
# CONFIG_ATA_GENERIC is not set
# CONFIG_PATA_HPT366 is not set
# CONFIG_PATA_HPT37X is not set
# CONFIG_PATA_HPT3X2N is not set
# CONFIG_PATA_HPT3X3 is not set
# CONFIG_PATA_IT821X is not set
# CONFIG_PATA_JMICRON is not set
# CONFIG_PATA_TRIFLEX is not set
# CONFIG_PATA_MARVELL is not set
# CONFIG_PATA_MPIIX is not set
# CONFIG_PATA_OLDPIIX is not set
# CONFIG_PATA_NETCELL is not set
# CONFIG_PATA_NS87410 is not set
# CONFIG_PATA_OPTI is not set
# CONFIG_PATA_OPTIDMA is not set
# CONFIG_PATA_PDC_OLD is not set
# CONFIG_PATA_RADISYS is not set
# CONFIG_PATA_RZ1000 is not set
# CONFIG_PATA_SC1200 is not set
# CONFIG_PATA_SERVERWORKS is not set
# CONFIG_PATA_PDC2027X is not set
# CONFIG_PATA_SIL680 is not set
# CONFIG_PATA_SIS is not set
# CONFIG_PATA_VIA is not set
# CONFIG_PATA_WINBOND is not set
#
# Multi-device support (RAID and LVM)
#
# CONFIG_MD is not set
#
# Fusion MPT device support
#
# CONFIG_FUSION is not set
# CONFIG_FUSION_SPI is not set
# CONFIG_FUSION_FC is not set
# CONFIG_FUSION_SAS is not set
#
# IEEE 1394 (FireWire) support
#
CONFIG_IEEE1394=y
#
# Subsystem Options
#
# CONFIG_IEEE1394_VERBOSEDEBUG is not set
# CONFIG_IEEE1394_OUI_DB is not set
# CONFIG_IEEE1394_EXTRA_CONFIG_ROMS is not set
# CONFIG_IEEE1394_EXPORT_FULL_API is not set
#
# Device Drivers
#
#
# Texas Instruments PCILynx requires I2C
#
CONFIG_IEEE1394_OHCI1394=y
#
# Protocol Drivers
#
# CONFIG_IEEE1394_VIDEO1394 is not set
CONFIG_IEEE1394_SBP2=y
# CONFIG_IEEE1394_SBP2_PHYS_DMA is not set
# CONFIG_IEEE1394_ETH1394 is not set
# CONFIG_IEEE1394_DV1394 is not set
CONFIG_IEEE1394_RAWIO=y
#
# I2O device support
#
# CONFIG_I2O is not set
#
# Network device support
#
CONFIG_NETDEVICES=y
# CONFIG_DUMMY is not set
# CONFIG_BONDING is not set
# CONFIG_EQUALIZER is not set
# CONFIG_TUN is not set
# CONFIG_NET_SB1000 is not set
#
# ARCnet devices
#
# CONFIG_ARCNET is not set
#
# PHY device support
#
# CONFIG_PHYLIB is not set
#
# Ethernet (10 or 100Mbit)
#
CONFIG_NET_ETHERNET=y
CONFIG_MII=y
# CONFIG_HAPPYMEAL is not set
# CONFIG_SUNGEM is not set
# CONFIG_CASSINI is not set
# CONFIG_NET_VENDOR_3COM is not set
#
# Tulip family network device support
#
# CONFIG_NET_TULIP is not set
# CONFIG_HP100 is not set
CONFIG_NET_PCI=y
# CONFIG_PCNET32 is not set
# CONFIG_AMD8111_ETH is not set
# CONFIG_ADAPTEC_STARFIRE is not set
# CONFIG_B44 is not set
# CONFIG_FORCEDETH is not set
# CONFIG_DGRS is not set
# CONFIG_EEPRO100 is not set
CONFIG_E100=y
# CONFIG_FEALNX is not set
# CONFIG_NATSEMI is not set
# CONFIG_NE2K_PCI is not set
# CONFIG_8139CP is not set
# CONFIG_8139TOO is not set
# CONFIG_SIS900 is not set
# CONFIG_EPIC100 is not set
# CONFIG_SUNDANCE is not set
# CONFIG_TLAN is not set
# CONFIG_VIA_RHINE is not set
#
# Ethernet (1000 Mbit)
#
# CONFIG_ACENIC is not set
# CONFIG_DL2K is not set
# CONFIG_E1000 is not set
# CONFIG_NS83820 is not set
# CONFIG_HAMACHI is not set
# CONFIG_YELLOWFIN is not set
# CONFIG_R8169 is not set
# CONFIG_SIS190 is not set
# CONFIG_SKGE is not set
# CONFIG_SKY2 is not set
# CONFIG_SK98LIN is not set
# CONFIG_VIA_VELOCITY is not set
# CONFIG_TIGON3 is not set
# CONFIG_BNX2 is not set
# CONFIG_QLA3XXX is not set
#
# Ethernet (10000 Mbit)
#
# CONFIG_CHELSIO_T1 is not set
# CONFIG_IXGB is not set
# CONFIG_S2IO is not set
# CONFIG_MYRI10GE is not set
# CONFIG_NETXEN_NIC is not set
#
# Token Ring devices
#
# CONFIG_TR is not set
#
# Wireless LAN (non-hamradio)
#
CONFIG_NET_RADIO=y
# CONFIG_NET_WIRELESS_RTNETLINK is not set
#
# Obsolete Wireless cards support (pre-802.11)
#
# CONFIG_STRIP is not set
#
# Wireless 802.11b ISA/PCI cards support
#
# CONFIG_IPW2100 is not set
# CONFIG_IPW2200 is not set
# CONFIG_AIRO is not set
# CONFIG_HERMES is not set
# CONFIG_ATMEL is not set
#
# Prism GT/Duette 802.11(a/b/g) PCI/Cardbus support
#
# CONFIG_PRISM54 is not set
# CONFIG_USB_ZD1201 is not set
# CONFIG_HOSTAP is not set
CONFIG_NET_WIRELESS=y
#
# Wan interfaces
#
# CONFIG_WAN is not set
# CONFIG_FDDI is not set
# CONFIG_HIPPI is not set
# CONFIG_PPP is not set
# CONFIG_SLIP is not set
# CONFIG_NET_FC is not set
# CONFIG_SHAPER is not set
# CONFIG_NETCONSOLE is not set
# CONFIG_NETPOLL is not set
# CONFIG_NET_POLL_CONTROLLER is not set
#
# ISDN subsystem
#
# CONFIG_ISDN is not set
#
# Telephony Support
#
# CONFIG_PHONE is not set
#
# Input device support
#
CONFIG_INPUT=y
# CONFIG_INPUT_FF_MEMLESS is not set
#
# Userland interfaces
#
CONFIG_INPUT_MOUSEDEV=y
CONFIG_INPUT_MOUSEDEV_PSAUX=y
CONFIG_INPUT_MOUSEDEV_SCREEN_X=1280
CONFIG_INPUT_MOUSEDEV_SCREEN_Y=800
# CONFIG_INPUT_JOYDEV is not set
# CONFIG_INPUT_TSDEV is not set
# CONFIG_INPUT_EVDEV is not set
# CONFIG_INPUT_EVBUG is not set
#
# Input Device Drivers
#
CONFIG_INPUT_KEYBOARD=y
CONFIG_KEYBOARD_ATKBD=y
# CONFIG_KEYBOARD_SUNKBD is not set
# CONFIG_KEYBOARD_LKKBD is not set
# CONFIG_KEYBOARD_XTKBD is not set
# CONFIG_KEYBOARD_NEWTON is not set
# CONFIG_KEYBOARD_STOWAWAY is not set
CONFIG_INPUT_MOUSE=y
CONFIG_MOUSE_PS2=y
# CONFIG_MOUSE_SERIAL is not set
# CONFIG_MOUSE_VSXXXAA is not set
# CONFIG_INPUT_JOYSTICK is not set
# CONFIG_INPUT_TOUCHSCREEN is not set
CONFIG_INPUT_MISC=y
# CONFIG_INPUT_PCSPKR is not set
CONFIG_INPUT_WISTRON_BTNS=y
# CONFIG_INPUT_UINPUT is not set
#
# Hardware I/O ports
#
CONFIG_SERIO=y
CONFIG_SERIO_I8042=y
# CONFIG_SERIO_SERPORT is not set
# CONFIG_SERIO_CT82C710 is not set
# CONFIG_SERIO_PCIPS2 is not set
CONFIG_SERIO_LIBPS2=y
# CONFIG_SERIO_RAW is not set
# CONFIG_GAMEPORT is not set
#
# Character devices
#
CONFIG_VT=y
CONFIG_VT_CONSOLE=y
CONFIG_HW_CONSOLE=y
# CONFIG_VT_HW_CONSOLE_BINDING is not set
# CONFIG_SERIAL_NONSTANDARD is not set
#
# Serial drivers
#
# CONFIG_SERIAL_8250 is not set
#
# Non-8250 serial port support
#
# CONFIG_SERIAL_JSM is not set
CONFIG_UNIX98_PTYS=y
CONFIG_LEGACY_PTYS=y
CONFIG_LEGACY_PTY_COUNT=256
#
# IPMI
#
# CONFIG_IPMI_HANDLER is not set
#
# Watchdog Cards
#
# CONFIG_WATCHDOG is not set
CONFIG_HW_RANDOM=y
CONFIG_HW_RANDOM_INTEL=y
# CONFIG_HW_RANDOM_AMD is not set
# CONFIG_HW_RANDOM_GEODE is not set
# CONFIG_HW_RANDOM_VIA is not set
CONFIG_NVRAM=y
CONFIG_RTC=y
# CONFIG_DTLK is not set
# CONFIG_R3964 is not set
# CONFIG_APPLICOM is not set
# CONFIG_SONYPI is not set
CONFIG_AGP=y
# CONFIG_AGP_ALI is not set
# CONFIG_AGP_ATI is not set
# CONFIG_AGP_AMD is not set
# CONFIG_AGP_AMD64 is not set
CONFIG_AGP_INTEL=y
# CONFIG_AGP_NVIDIA is not set
# CONFIG_AGP_SIS is not set
# CONFIG_AGP_SWORKS is not set
# CONFIG_AGP_VIA is not set
# CONFIG_AGP_EFFICEON is not set
CONFIG_DRM=y
# CONFIG_DRM_TDFX is not set
# CONFIG_DRM_R128 is not set
# CONFIG_DRM_RADEON is not set
# CONFIG_DRM_I810 is not set
# CONFIG_DRM_I830 is not set
CONFIG_DRM_I915=y
# CONFIG_DRM_MGA is not set
# CONFIG_DRM_SIS is not set
# CONFIG_DRM_VIA is not set
# CONFIG_DRM_SAVAGE is not set
# CONFIG_MWAVE is not set
# CONFIG_PC8736x_GPIO is not set
# CONFIG_NSC_GPIO is not set
# CONFIG_CS5535_GPIO is not set
# CONFIG_RAW_DRIVER is not set
# CONFIG_HPET is not set
# CONFIG_HANGCHECK_TIMER is not set
#
# TPM devices
#
# CONFIG_TCG_TPM is not set
# CONFIG_TELCLOCK is not set
#
# I2C support
#
# CONFIG_I2C is not set
#
# SPI support
#
# CONFIG_SPI is not set
# CONFIG_SPI_MASTER is not set
#
# Dallas's 1-wire bus
#
# CONFIG_W1 is not set
#
# Hardware Monitoring support
#
# CONFIG_HWMON is not set
# CONFIG_HWMON_VID is not set
#
# Multimedia devices
#
# CONFIG_VIDEO_DEV is not set
#
# Digital Video Broadcasting Devices
#
# CONFIG_DVB is not set
# CONFIG_USB_DABUSB is not set
#
# Graphics support
#
# CONFIG_FIRMWARE_EDID is not set
CONFIG_FB=y
CONFIG_FB_CFB_FILLRECT=y
CONFIG_FB_CFB_COPYAREA=y
CONFIG_FB_CFB_IMAGEBLIT=y
# CONFIG_FB_MACMODES is not set
# CONFIG_FB_BACKLIGHT is not set
CONFIG_FB_MODE_HELPERS=y
# CONFIG_FB_TILEBLITTING is not set
# CONFIG_FB_CIRRUS is not set
# CONFIG_FB_PM2 is not set
# CONFIG_FB_CYBER2000 is not set
# CONFIG_FB_ARC is not set
# CONFIG_FB_ASILIANT is not set
# CONFIG_FB_IMSTT is not set
# CONFIG_FB_VGA16 is not set
CONFIG_FB_VESA=y
# CONFIG_FB_HGA is not set
# CONFIG_FB_S1D13XXX is not set
# CONFIG_FB_NVIDIA is not set
# CONFIG_FB_RIVA is not set
CONFIG_FB_I810=y
CONFIG_FB_I810_GTF=y
# CONFIG_FB_I810_I2C is not set
CONFIG_FB_INTEL=y
# CONFIG_FB_INTEL_DEBUG is not set
# CONFIG_FB_INTEL_I2C is not set
# CONFIG_FB_MATROX is not set
# CONFIG_FB_RADEON is not set
# CONFIG_FB_ATY128 is not set
# CONFIG_FB_ATY is not set
# CONFIG_FB_SAVAGE is not set
# CONFIG_FB_SIS is not set
# CONFIG_FB_NEOMAGIC is not set
# CONFIG_FB_KYRO is not set
# CONFIG_FB_3DFX is not set
# CONFIG_FB_VOODOO1 is not set
# CONFIG_FB_CYBLA is not set
# CONFIG_FB_TRIDENT is not set
# CONFIG_FB_GEODE is not set
# CONFIG_FB_VIRTUAL is not set
#
# Console display driver support
#
CONFIG_VGA_CONSOLE=y
# CONFIG_VGACON_SOFT_SCROLLBACK is not set
CONFIG_VIDEO_SELECT=y
CONFIG_DUMMY_CONSOLE=y
CONFIG_FRAMEBUFFER_CONSOLE=y
# CONFIG_FRAMEBUFFER_CONSOLE_ROTATION is not set
# CONFIG_FONTS is not set
CONFIG_FONT_8x8=y
CONFIG_FONT_8x16=y
#
# Logo configuration
#
# CONFIG_LOGO is not set
CONFIG_BACKLIGHT_LCD_SUPPORT=y
CONFIG_BACKLIGHT_CLASS_DEVICE=y
CONFIG_BACKLIGHT_DEVICE=y
CONFIG_LCD_CLASS_DEVICE=y
CONFIG_LCD_DEVICE=y
#
# Sound
#
CONFIG_SOUND=y
#
# Advanced Linux Sound Architecture
#
CONFIG_SND=y
CONFIG_SND_TIMER=y
CONFIG_SND_PCM=y
CONFIG_SND_SEQUENCER=y
# CONFIG_SND_SEQ_DUMMY is not set
# CONFIG_SND_MIXER_OSS is not set
# CONFIG_SND_PCM_OSS is not set
# CONFIG_SND_SEQUENCER_OSS is not set
CONFIG_SND_RTCTIMER=y
CONFIG_SND_SEQ_RTCTIMER_DEFAULT=y
# CONFIG_SND_DYNAMIC_MINORS is not set
CONFIG_SND_SUPPORT_OLD_API=y
CONFIG_SND_VERBOSE_PROCFS=y
# CONFIG_SND_VERBOSE_PRINTK is not set
# CONFIG_SND_DEBUG is not set
#
# Generic devices
#
CONFIG_SND_AC97_CODEC=y
# CONFIG_SND_DUMMY is not set
# CONFIG_SND_VIRMIDI is not set
# CONFIG_SND_MTPAV is not set
# CONFIG_SND_SERIAL_U16550 is not set
# CONFIG_SND_MPU401 is not set
#
# PCI devices
#
# CONFIG_SND_AD1889 is not set
# CONFIG_SND_ALS300 is not set
# CONFIG_SND_ALS4000 is not set
# CONFIG_SND_ALI5451 is not set
# CONFIG_SND_ATIIXP is not set
# CONFIG_SND_ATIIXP_MODEM is not set
# CONFIG_SND_AU8810 is not set
# CONFIG_SND_AU8820 is not set
# CONFIG_SND_AU8830 is not set
# CONFIG_SND_AZT3328 is not set
# CONFIG_SND_BT87X is not set
# CONFIG_SND_CA0106 is not set
# CONFIG_SND_CMIPCI is not set
# CONFIG_SND_CS4281 is not set
# CONFIG_SND_CS46XX is not set
# CONFIG_SND_CS5535AUDIO is not set
# CONFIG_SND_DARLA20 is not set
# CONFIG_SND_GINA20 is not set
# CONFIG_SND_LAYLA20 is not set
# CONFIG_SND_DARLA24 is not set
# CONFIG_SND_GINA24 is not set
# CONFIG_SND_LAYLA24 is not set
# CONFIG_SND_MONA is not set
# CONFIG_SND_MIA is not set
# CONFIG_SND_ECHO3G is not set
# CONFIG_SND_INDIGO is not set
# CONFIG_SND_INDIGOIO is not set
# CONFIG_SND_INDIGODJ is not set
# CONFIG_SND_EMU10K1 is not set
# CONFIG_SND_EMU10K1X is not set
# CONFIG_SND_ENS1370 is not set
# CONFIG_SND_ENS1371 is not set
# CONFIG_SND_ES1938 is not set
# CONFIG_SND_ES1968 is not set
# CONFIG_SND_FM801 is not set
CONFIG_SND_HDA_INTEL=y
# CONFIG_SND_HDSP is not set
# CONFIG_SND_HDSPM is not set
# CONFIG_SND_ICE1712 is not set
# CONFIG_SND_ICE1724 is not set
CONFIG_SND_INTEL8X0=y
CONFIG_SND_INTEL8X0M=y
# CONFIG_SND_KORG1212 is not set
# CONFIG_SND_MAESTRO3 is not set
# CONFIG_SND_MIXART is not set
# CONFIG_SND_NM256 is not set
# CONFIG_SND_PCXHR is not set
# CONFIG_SND_RIPTIDE is not set
# CONFIG_SND_RME32 is not set
# CONFIG_SND_RME96 is not set
# CONFIG_SND_RME9652 is not set
# CONFIG_SND_SONICVIBES is not set
# CONFIG_SND_TRIDENT is not set
# CONFIG_SND_VIA82XX is not set
# CONFIG_SND_VIA82XX_MODEM is not set
# CONFIG_SND_VX222 is not set
# CONFIG_SND_YMFPCI is not set
# CONFIG_SND_AC97_POWER_SAVE is not set
#
# USB devices
#
# CONFIG_SND_USB_AUDIO is not set
# CONFIG_SND_USB_USX2Y is not set
#
# Open Sound System
#
# CONFIG_SOUND_PRIME is not set
CONFIG_AC97_BUS=y
#
# HID Devices
#
# CONFIG_HID is not set
#
# USB support
#
CONFIG_USB_ARCH_HAS_HCD=y
CONFIG_USB_ARCH_HAS_OHCI=y
CONFIG_USB_ARCH_HAS_EHCI=y
CONFIG_USB=y
# CONFIG_USB_DEBUG is not set
#
# Miscellaneous USB options
#
# CONFIG_USB_DEVICEFS is not set
# CONFIG_USB_BANDWIDTH is not set
# CONFIG_USB_DYNAMIC_MINORS is not set
# CONFIG_USB_SUSPEND is not set
# CONFIG_USB_MULTITHREAD_PROBE is not set
# CONFIG_USB_OTG is not set
#
# USB Host Controller Drivers
#
CONFIG_USB_EHCI_HCD=y
# CONFIG_USB_EHCI_SPLIT_ISO is not set
# CONFIG_USB_EHCI_ROOT_HUB_TT is not set
# CONFIG_USB_EHCI_TT_NEWSCHED is not set
# CONFIG_USB_ISP116X_HCD is not set
# CONFIG_USB_OHCI_HCD is not set
CONFIG_USB_UHCI_HCD=y
# CONFIG_USB_SL811_HCD is not set
#
# USB Device Class drivers
#
# CONFIG_USB_ACM is not set
# CONFIG_USB_PRINTER is not set
#
# NOTE: USB_STORAGE enables SCSI, and 'SCSI disk support'
#
#
# may also be needed; see USB_STORAGE Help for more information
#
CONFIG_USB_STORAGE=y
# CONFIG_USB_STORAGE_DEBUG is not set
# CONFIG_USB_STORAGE_DATAFAB is not set
# CONFIG_USB_STORAGE_FREECOM is not set
# CONFIG_USB_STORAGE_ISD200 is not set
# CONFIG_USB_STORAGE_DPCM is not set
# CONFIG_USB_STORAGE_USBAT is not set
# CONFIG_USB_STORAGE_SDDR09 is not set
# CONFIG_USB_STORAGE_SDDR55 is not set
# CONFIG_USB_STORAGE_JUMPSHOT is not set
# CONFIG_USB_STORAGE_ALAUDA is not set
# CONFIG_USB_STORAGE_KARMA is not set
# CONFIG_USB_LIBUSUAL is not set
#
# USB Input Devices
#
#
# USB HID Boot Protocol drivers
#
# CONFIG_USB_KBD is not set
# CONFIG_USB_MOUSE is not set
# CONFIG_USB_AIPTEK is not set
# CONFIG_USB_WACOM is not set
# CONFIG_USB_ACECAD is not set
# CONFIG_USB_KBTAB is not set
# CONFIG_USB_POWERMATE is not set
# CONFIG_USB_TOUCHSCREEN is not set
# CONFIG_USB_YEALINK is not set
# CONFIG_USB_XPAD is not set
# CONFIG_USB_ATI_REMOTE is not set
# CONFIG_USB_ATI_REMOTE2 is not set
# CONFIG_USB_KEYSPAN_REMOTE is not set
# CONFIG_USB_APPLETOUCH is not set
#
# USB Imaging devices
#
# CONFIG_USB_MDC800 is not set
# CONFIG_USB_MICROTEK is not set
#
# USB Network Adapters
#
# CONFIG_USB_CATC is not set
# CONFIG_USB_KAWETH is not set
# CONFIG_USB_PEGASUS is not set
# CONFIG_USB_RTL8150 is not set
# CONFIG_USB_USBNET_MII is not set
# CONFIG_USB_USBNET is not set
# CONFIG_USB_MON is not set
#
# USB port drivers
#
#
# USB Serial Converter support
#
# CONFIG_USB_SERIAL is not set
#
# USB Miscellaneous drivers
#
# CONFIG_USB_EMI62 is not set
# CONFIG_USB_EMI26 is not set
# CONFIG_USB_ADUTUX is not set
# CONFIG_USB_AUERSWALD is not set
# CONFIG_USB_RIO500 is not set
# CONFIG_USB_LEGOTOWER is not set
# CONFIG_USB_LCD is not set
# CONFIG_USB_LED is not set
# CONFIG_USB_CYPRESS_CY7C63 is not set
# CONFIG_USB_CYTHERM is not set
# CONFIG_USB_PHIDGET is not set
# CONFIG_USB_IDMOUSE is not set
# CONFIG_USB_FTDI_ELAN is not set
# CONFIG_USB_APPLEDISPLAY is not set
# CONFIG_USB_SISUSBVGA is not set
# CONFIG_USB_LD is not set
# CONFIG_USB_TRANCEVIBRATOR is not set
#
# USB DSL modem support
#
#
# USB Gadget Support
#
# CONFIG_USB_GADGET is not set
#
# MMC/SD Card support
#
# CONFIG_MMC is not set
#
# LED devices
#
# CONFIG_NEW_LEDS is not set
#
# LED drivers
#
#
# LED Triggers
#
#
# InfiniBand support
#
# CONFIG_INFINIBAND is not set
#
# EDAC - error detection and reporting (RAS) (EXPERIMENTAL)
#
# CONFIG_EDAC is not set
#
# Real Time Clock
#
# CONFIG_RTC_CLASS is not set
#
# DMA Engine support
#
# CONFIG_DMA_ENGINE is not set
#
# DMA Clients
#
#
# DMA Devices
#
#
# Virtualization
#
# CONFIG_KVM is not set
#
# File systems
#
CONFIG_EXT2_FS=y
# CONFIG_EXT2_FS_XATTR is not set
# CONFIG_EXT2_FS_XIP is not set
CONFIG_EXT3_FS=y
CONFIG_EXT3_FS_XATTR=y
# CONFIG_EXT3_FS_POSIX_ACL is not set
# CONFIG_EXT3_FS_SECURITY is not set
# CONFIG_EXT4DEV_FS is not set
CONFIG_JBD=y
# CONFIG_JBD_DEBUG is not set
CONFIG_FS_MBCACHE=y
# CONFIG_REISERFS_FS is not set
# CONFIG_JFS_FS is not set
# CONFIG_FS_POSIX_ACL is not set
# CONFIG_XFS_FS is not set
# CONFIG_GFS2_FS is not set
# CONFIG_OCFS2_FS is not set
# CONFIG_MINIX_FS is not set
# CONFIG_ROMFS_FS is not set
# CONFIG_INOTIFY is not set
# CONFIG_QUOTA is not set
CONFIG_DNOTIFY=y
# CONFIG_AUTOFS_FS is not set
CONFIG_AUTOFS4_FS=y
# CONFIG_FUSE_FS is not set
#
# CD-ROM/DVD Filesystems
#
CONFIG_ISO9660_FS=y
CONFIG_JOLIET=y
CONFIG_ZISOFS=y
CONFIG_ZISOFS_FS=y
CONFIG_UDF_FS=y
CONFIG_UDF_NLS=y
#
# DOS/FAT/NT Filesystems
#
CONFIG_FAT_FS=y
CONFIG_MSDOS_FS=y
CONFIG_VFAT_FS=y
CONFIG_FAT_DEFAULT_CODEPAGE=437
CONFIG_FAT_DEFAULT_IOCHARSET="iso8859-1"
CONFIG_NTFS_FS=y
# CONFIG_NTFS_DEBUG is not set
# CONFIG_NTFS_RW is not set
#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
CONFIG_PROC_SYSCTL=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
# CONFIG_TMPFS_POSIX_ACL is not set
# CONFIG_HUGETLBFS is not set
# CONFIG_HUGETLB_PAGE is not set
CONFIG_RAMFS=y
# CONFIG_CONFIGFS_FS is not set
#
# Miscellaneous filesystems
#
# CONFIG_ADFS_FS is not set
# CONFIG_AFFS_FS is not set
# CONFIG_HFS_FS is not set
# CONFIG_HFSPLUS_FS is not set
# CONFIG_BEFS_FS is not set
# CONFIG_BFS_FS is not set
# CONFIG_EFS_FS is not set
# CONFIG_CRAMFS is not set
# CONFIG_VXFS_FS is not set
# CONFIG_HPFS_FS is not set
# CONFIG_QNX4FS_FS is not set
# CONFIG_SYSV_FS is not set
CONFIG_UFS_FS=y
# CONFIG_UFS_FS_WRITE is not set
# CONFIG_UFS_DEBUG is not set
#
# Network File Systems
#
# CONFIG_NFS_FS is not set
# CONFIG_NFSD is not set
# CONFIG_SMB_FS is not set
CONFIG_CIFS=y
# CONFIG_CIFS_STATS is not set
# CONFIG_CIFS_WEAK_PW_HASH is not set
# CONFIG_CIFS_XATTR is not set
# CONFIG_CIFS_DEBUG2 is not set
# CONFIG_CIFS_EXPERIMENTAL is not set
# CONFIG_NCP_FS is not set
# CONFIG_CODA_FS is not set
# CONFIG_AFS_FS is not set
# CONFIG_9P_FS is not set
#
# Partition Types
#
CONFIG_PARTITION_ADVANCED=y
# CONFIG_ACORN_PARTITION is not set
# CONFIG_OSF_PARTITION is not set
# CONFIG_AMIGA_PARTITION is not set
# CONFIG_ATARI_PARTITION is not set
# CONFIG_MAC_PARTITION is not set
CONFIG_MSDOS_PARTITION=y
CONFIG_BSD_DISKLABEL=y
# CONFIG_MINIX_SUBPARTITION is not set
# CONFIG_SOLARIS_X86_PARTITION is not set
# CONFIG_UNIXWARE_DISKLABEL is not set
# CONFIG_LDM_PARTITION is not set
# CONFIG_SGI_PARTITION is not set
# CONFIG_ULTRIX_PARTITION is not set
# CONFIG_SUN_PARTITION is not set
# CONFIG_KARMA_PARTITION is not set
# CONFIG_EFI_PARTITION is not set
#
# Native Language Support
#
CONFIG_NLS=y
CONFIG_NLS_DEFAULT="iso8859-1"
CONFIG_NLS_CODEPAGE_437=y
# CONFIG_NLS_CODEPAGE_737 is not set
# CONFIG_NLS_CODEPAGE_775 is not set
# CONFIG_NLS_CODEPAGE_850 is not set
# CONFIG_NLS_CODEPAGE_852 is not set
# CONFIG_NLS_CODEPAGE_855 is not set
# CONFIG_NLS_CODEPAGE_857 is not set
# CONFIG_NLS_CODEPAGE_860 is not set
# CONFIG_NLS_CODEPAGE_861 is not set
# CONFIG_NLS_CODEPAGE_862 is not set
# CONFIG_NLS_CODEPAGE_863 is not set
# CONFIG_NLS_CODEPAGE_864 is not set
# CONFIG_NLS_CODEPAGE_865 is not set
# CONFIG_NLS_CODEPAGE_866 is not set
# CONFIG_NLS_CODEPAGE_869 is not set
# CONFIG_NLS_CODEPAGE_936 is not set
# CONFIG_NLS_CODEPAGE_950 is not set
# CONFIG_NLS_CODEPAGE_932 is not set
# CONFIG_NLS_CODEPAGE_949 is not set
# CONFIG_NLS_CODEPAGE_874 is not set
# CONFIG_NLS_ISO8859_8 is not set
# CONFIG_NLS_CODEPAGE_1250 is not set
# CONFIG_NLS_CODEPAGE_1251 is not set
# CONFIG_NLS_ASCII is not set
CONFIG_NLS_ISO8859_1=y
# CONFIG_NLS_ISO8859_2 is not set
# CONFIG_NLS_ISO8859_3 is not set
# CONFIG_NLS_ISO8859_4 is not set
# CONFIG_NLS_ISO8859_5 is not set
# CONFIG_NLS_ISO8859_6 is not set
# CONFIG_NLS_ISO8859_7 is not set
# CONFIG_NLS_ISO8859_9 is not set
# CONFIG_NLS_ISO8859_13 is not set
# CONFIG_NLS_ISO8859_14 is not set
# CONFIG_NLS_ISO8859_15 is not set
# CONFIG_NLS_KOI8_R is not set
# CONFIG_NLS_KOI8_U is not set
# CONFIG_NLS_UTF8 is not set
#
# Distributed Lock Manager
#
# CONFIG_DLM is not set
#
# Instrumentation Support
#
# CONFIG_PROFILING is not set
#
# Kernel hacking
#
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
# CONFIG_PRINTK_TIME is not set
# CONFIG_ENABLE_MUST_CHECK is not set
CONFIG_MAGIC_SYSRQ=y
# CONFIG_UNUSED_SYMBOLS is not set
# CONFIG_DEBUG_FS is not set
# CONFIG_HEADERS_CHECK is not set
CONFIG_DEBUG_KERNEL=y
CONFIG_LOG_BUF_SHIFT=14
# CONFIG_DETECT_SOFTLOCKUP is not set
# CONFIG_SCHEDSTATS is not set
# CONFIG_DEBUG_SLAB is not set
# CONFIG_DEBUG_PREEMPT is not set
# CONFIG_DEBUG_RT_MUTEXES is not set
# CONFIG_RT_MUTEX_TESTER is not set
# CONFIG_DEBUG_SPINLOCK is not set
# CONFIG_DEBUG_MUTEXES is not set
# CONFIG_DEBUG_RWSEMS is not set
# CONFIG_DEBUG_LOCK_ALLOC is not set
# CONFIG_PROVE_LOCKING is not set
# CONFIG_DEBUG_SPINLOCK_SLEEP is not set
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
# CONFIG_DEBUG_KOBJECT is not set
# CONFIG_DEBUG_HIGHMEM is not set
CONFIG_DEBUG_BUGVERBOSE=y
# CONFIG_DEBUG_INFO is not set
# CONFIG_DEBUG_VM is not set
# CONFIG_DEBUG_LIST is not set
# CONFIG_FRAME_POINTER is not set
# CONFIG_FORCED_INLINING is not set
# CONFIG_RCU_TORTURE_TEST is not set
CONFIG_EARLY_PRINTK=y
# CONFIG_DEBUG_STACKOVERFLOW is not set
# CONFIG_DEBUG_STACK_USAGE is not set
#
# Page alloc debug is incompatible with Software Suspend on i386
#
# CONFIG_DEBUG_RODATA is not set
CONFIG_4KSTACKS=y
CONFIG_X86_FIND_SMP_CONFIG=y
CONFIG_X86_MPPARSE=y
CONFIG_DOUBLEFAULT=y
#
# Security options
#
# CONFIG_KEYS is not set
# CONFIG_SECURITY is not set
#
# Cryptographic options
#
CONFIG_CRYPTO=y
CONFIG_CRYPTO_ALGAPI=y
CONFIG_CRYPTO_MANAGER=y
# CONFIG_CRYPTO_HMAC is not set
# CONFIG_CRYPTO_XCBC is not set
# CONFIG_CRYPTO_NULL is not set
# CONFIG_CRYPTO_MD4 is not set
CONFIG_CRYPTO_MD5=y
# CONFIG_CRYPTO_SHA1 is not set
# CONFIG_CRYPTO_SHA256 is not set
# CONFIG_CRYPTO_SHA512 is not set
# CONFIG_CRYPTO_WP512 is not set
# CONFIG_CRYPTO_TGR192 is not set
# CONFIG_CRYPTO_GF128MUL is not set
# CONFIG_CRYPTO_ECB is not set
# CONFIG_CRYPTO_CBC is not set
# CONFIG_CRYPTO_LRW is not set
# CONFIG_CRYPTO_DES is not set
# CONFIG_CRYPTO_BLOWFISH is not set
# CONFIG_CRYPTO_TWOFISH is not set
# CONFIG_CRYPTO_TWOFISH_586 is not set
# CONFIG_CRYPTO_SERPENT is not set
CONFIG_CRYPTO_AES=y
CONFIG_CRYPTO_AES_586=y
# CONFIG_CRYPTO_CAST5 is not set
# CONFIG_CRYPTO_CAST6 is not set
# CONFIG_CRYPTO_TEA is not set
CONFIG_CRYPTO_ARC4=y
# CONFIG_CRYPTO_KHAZAD is not set
# CONFIG_CRYPTO_ANUBIS is not set
CONFIG_CRYPTO_DEFLATE=y
CONFIG_CRYPTO_MICHAEL_MIC=y
# CONFIG_CRYPTO_CRC32C is not set
#
# Hardware crypto devices
#
# CONFIG_CRYPTO_DEV_PADLOCK is not set
# CONFIG_CRYPTO_DEV_GEODE is not set
#
# Library routines
#
CONFIG_BITREVERSE=y
CONFIG_CRC_CCITT=y
CONFIG_CRC16=y
CONFIG_CRC32=y
CONFIG_LIBCRC32C=y
CONFIG_ZLIB_INFLATE=y
CONFIG_ZLIB_DEFLATE=y
CONFIG_PLIST=y
CONFIG_IOMAP_COPY=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_X86_SMP=y
CONFIG_X86_HT=y
CONFIG_X86_BIOS_REBOOT=y
CONFIG_X86_TRAMPOLINE=y
CONFIG_KTIME_SCALAR=y
On Mon, 18 Dec 2006, Andrei Popa wrote:
>
> I dropped that patch and added WARN_ON(1), the unified patch is
> attached.
>
> I got corruption: "Hash check on download completion found bad chunks,
> consider using "safe_sync"."
Ok. That is actually _very_ interesting.
It's interesting because (a) the corruption obviously goes away with the
one-liner that effectively disables "page_mkclean_one()".
So that tells us that yes, it's a PTE dirty bit that matters.
But at the same time, it's interesting that it still happens when we try
to re-add the dirty bit. That would tell me that it's one of two cases:
- there is another caller of page cleaning that should have done the same
thing (we could check that by just doing this all _inside_ the
page_mkclean() thing)
OR:
- page_mkclean_one() is simply buggy.
And I'm starting to wonder about the second case. But it all LOOKS really
fine - I can't see anything wrong there (it uses the extremely
conservative "ptep_get_and_clear()", and seems to flush everything right
too, through "ptep_establish()").
Linus
On Mon, 18 Dec 2006, Linus Torvalds wrote:
>
> But at the same time, it's interesting that it still happens when we try
> to re-add the dirty bit. That would tell me that it's one of two cases:
Forget that. There's a third case, which is much more likely:
- Andrew's patch had a ", 1" where it _should_ have had a ", 0".
This should be fairly easy to test: just change every single ", 1" case in
the patch to ", 0".
The only case that _definitely_ would want ",1" is actually the case that
already calls page_mkclean() directly: clear_page_dirty_for_io(). So no
other ", 1" is valid, and that one that needed it already avoided even
calling the "test_clear_page_dirty()" function, because it did it all by
hand.
What happens for you in that case?
Linus
On Mon, 2006-12-18 at 12:41 -0800, Linus Torvalds wrote:
>
> On Mon, 18 Dec 2006, Linus Torvalds wrote:
> >
> > But at the same time, it's interesting that it still happens when we try
> > to re-add the dirty bit. That would tell me that it's one of two cases:
>
> Forget that. There's a third case, which is much more likely:
>
> - Andrew's patch had a ", 1" where it _should_ have had a ", 0".
>
> This should be fairly easy to test: just change every single ", 1" case in
> the patch to ", 0".
>
> The only case that _definitely_ would want ",1" is actually the case that
> already calls page_mkclean() directly: clear_page_dirty_for_io(). So no
> other ", 1" is valid, and that one that needed it already avoided even
> calling the "test_clear_page_dirty()" function, because it did it all by
> hand.
>
> What happens for you in that case?
>
> Linus
I have file corruption.
diff --git a/fs/buffer.c b/fs/buffer.c
index d1f1b54..263f88e 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *pag
int ret = 0;
BUG_ON(!PageLocked(page));
- if (PageWriteback(page))
+ if (PageDirty(page) || PageWriteback(page))
return 0;
if (mapping == NULL) { /* can this still happen? */
@@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *pag
spin_lock(&mapping->private_lock);
ret = drop_buffers(page, &buffers_to_free);
spin_unlock(&mapping->private_lock);
- if (ret) {
- /*
- * If the filesystem writes its buffers by hand (eg ext3)
- * then we can have clean buffers against a dirty page. We
- * clean the page here; otherwise later reattachment of buffers
- * could encounter a non-uptodate page, which is unresolvable.
- * This only applies in the rare case where try_to_free_buffers
- * succeeds but the page is not freed.
- *
- * Also, during truncate, discard_buffer will have marked all
- * the page's buffers clean. We discover that here and clean
- * the page also.
- */
- if (test_clear_page_dirty(page))
- task_io_account_cancelled_write(PAGE_CACHE_SIZE);
- }
out:
if (buffers_to_free) {
struct buffer_head *bh = buffers_to_free;
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 0f05cab..760442f 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -1245,7 +1245,7 @@ retry:
wait_on_page_writeback(page);
if (PageWriteback(page) ||
- !test_clear_page_dirty(page)) {
+ !test_clear_page_dirty(page, 0)) {
unlock_page(page);
break;
}
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 1387749..da2bdb1 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -484,7 +484,7 @@ static int fuse_commit_write(struct file
spin_unlock(&fc->lock);
if (offset == 0 && to == PAGE_CACHE_SIZE) {
- clear_page_dirty(page);
+ clear_page_dirty(page, 0);
SetPageUptodate(page);
}
}
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index ed2c223..7b87875 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct
static void truncate_huge_page(struct page *page)
{
- clear_page_dirty(page);
+ clear_page_dirty(page, 0);
ClearPageUptodate(page);
remove_from_page_cache(page);
put_page(page);
diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c
index b1a1c72..47a6b62 100644
--- a/fs/jfs/jfs_metapage.c
+++ b/fs/jfs/jfs_metapage.c
@@ -773,7 +773,7 @@ #if MPS_PER_PAGE == 1
/* Retest mp->count since we may have released page lock */
if (test_bit(META_discard, &mp->flag) && !mp->count) {
- clear_page_dirty(page);
+ clear_page_dirty(page, 0);
ClearPageUptodate(page);
}
#else
diff --git a/fs/reiserfs/stree.c b/fs/reiserfs/stree.c
index 47e7027..a97e198 100644
--- a/fs/reiserfs/stree.c
+++ b/fs/reiserfs/stree.c
@@ -1459,7 +1459,7 @@ static void unmap_buffers(struct page *p
bh = next;
} while (bh != head);
if (PAGE_SIZE == bh->b_size) {
- clear_page_dirty(page);
+ clear_page_dirty(page, 0);
}
}
}
diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index b56eb75..d65ba84 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -343,7 +343,7 @@ xfs_start_page_writeback(
ASSERT(!PageWriteback(page));
set_page_writeback(page);
if (clear_dirty)
- clear_page_dirty(page);
+ clear_page_dirty(page, 0);
unlock_page(page);
if (!buffers) {
end_page_writeback(page);
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 4830a3b..175ab3c 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -253,13 +253,13 @@ #define ClearPageUncached(page) clear_bi
struct page; /* forward declaration */
-int test_clear_page_dirty(struct page *page);
+int test_clear_page_dirty(struct page *page, int must_clean_ptes);
int test_clear_page_writeback(struct page *page);
int test_set_page_writeback(struct page *page);
-static inline void clear_page_dirty(struct page *page)
+static inline void clear_page_dirty(struct page *page, int must_clean_ptes)
{
- test_clear_page_dirty(page);
+ test_clear_page_dirty(page, must_clean_ptes);
}
static inline void set_page_writeback(struct page *page)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 237107c..f7e0cc8 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -848,7 +848,7 @@ EXPORT_SYMBOL(set_page_dirty_lock);
* Clear a page's dirty flag, while caring for dirty memory accounting.
* Returns true if the page was previously dirty.
*/
-int test_clear_page_dirty(struct page *page)
+int test_clear_page_dirty(struct page *page, int must_clean_ptes)
{
struct address_space *mapping = page_mapping(page);
unsigned long flags;
@@ -866,7 +866,12 @@ int test_clear_page_dirty(struct page *p
* page is locked, which pins the address_space
*/
if (mapping_cap_account_dirty(mapping)) {
- page_mkclean(page);
+ int cleaned = page_mkclean(page);
+ if (!must_clean_ptes && cleaned){
+ WARN_ON(1);
+ set_page_dirty(page);
+ }
+
dec_zone_page_state(page, NR_FILE_DIRTY);
}
return 1;
diff --git a/mm/rmap.c b/mm/rmap.c
diff --git a/mm/truncate.c b/mm/truncate.c
index 9bfb8e8..cafa843 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -70,7 +70,7 @@ truncate_complete_page(struct address_sp
if (PagePrivate(page))
do_invalidatepage(page, 0);
- if (test_clear_page_dirty(page))
+ if (test_clear_page_dirty(page, 0))
task_io_account_cancelled_write(PAGE_CACHE_SIZE);
ClearPageUptodate(page);
ClearPageMappedToDisk(page);
@@ -386,7 +386,7 @@ int invalidate_inode_pages2_range(struct
PAGE_CACHE_SIZE, 0);
}
}
- was_dirty = test_clear_page_dirty(page);
+ was_dirty = test_clear_page_dirty(page, 0);
if (!invalidate_complete_page2(mapping, page)) {
if (was_dirty)
set_page_dirty(page);
On Mon, 18 Dec 2006 12:14:35 -0800 (PST)
Linus Torvalds <[email protected]> wrote:
> OR:
>
> - page_mkclean_one() is simply buggy.
>
> And I'm starting to wonder about the second case. But it all LOOKS really
> fine - I can't see anything wrong there (it uses the extremely
> conservative "ptep_get_and_clear()", and seems to flush everything right
> too, through "ptep_establish()").
What does the call to page_check_address() in there do?
It'd be good to have a printk in there to see if it's triggering.
Is this all correct for non-linear VMAs? (rtorrent doesn't use
MAP_NONLINEAR though).
On Mon, 2006-12-18 at 12:14 -0800, Linus Torvalds wrote:
>
> On Mon, 18 Dec 2006, Andrei Popa wrote:
> >
> > I dropped that patch and added WARN_ON(1), the unified patch is
> > attached.
> >
> > I got corruption: "Hash check on download completion found bad chunks,
> > consider using "safe_sync"."
>
> Ok. That is actually _very_ interesting.
>
> It's interesting because (a) the corruption obviously goes away with the
> one-liner that effectively disables "page_mkclean_one()".
>
> So that tells us that yes, it's a PTE dirty bit that matters.
>
> But at the same time, it's interesting that it still happens when we try
> to re-add the dirty bit. That would tell me that it's one of two cases:
>
> - there is another caller of page cleaning that should have done the same
> thing (we could check that by just doing this all _inside_ the
> page_mkclean() thing)
>
> OR:
>
> - page_mkclean_one() is simply buggy.
>
> And I'm starting to wonder about the second case. But it all LOOKS really
> fine - I can't see anything wrong there (it uses the extremely
> conservative "ptep_get_and_clear()", and seems to flush everything right
> too, through "ptep_establish()").
How about this:
we get confused on what PG_dirty tells us, we fall back to pte_dirty,
transfer pte_dirty to PG_dirty and clear pte_dirty. Now it happens
again, however we don't have pte_dirty to fall back to anymore.
This would explain why disabling pte_mkclean() does make it go away and
none of the other approaches tried so far works.
We really need a way to sort out PG_dirty, independent of pte_dirty.
On 12/18/06, Andrei Popa <[email protected]> wrote:
> On Mon, 2006-12-18 at 12:41 -0800, Linus Torvalds wrote:
> >
> > On Mon, 18 Dec 2006, Linus Torvalds wrote:
> > >
> > > But at the same time, it's interesting that it still happens when we try
> > > to re-add the dirty bit. That would tell me that it's one of two cases:
> >
> > Forget that. There's a third case, which is much more likely:
> >
> > - Andrew's patch had a ", 1" where it _should_ have had a ", 0".
> >
> > This should be fairly easy to test: just change every single ", 1" case in
> > the patch to ", 0".
> >
> > The only case that _definitely_ would want ",1" is actually the case that
> > already calls page_mkclean() directly: clear_page_dirty_for_io(). So no
> > other ", 1" is valid, and that one that needed it already avoided even
> > calling the "test_clear_page_dirty()" function, because it did it all by
> > hand.
> >
> > What happens for you in that case?
> >
> > Linus
>
> I have file corruption.
No idea whether this can be a data point or not, but
here it goes... my P2P box is about to turn 5 days old
while running nonstop one or both of aMule 2.1.3 and
BitTorrent 4.4.0 on ext3 mounted w/default options
on both IDE and USB disks. Zero corruption.
AMD K7-800, 512MB RAM, PREEMPT/UP kernel,
2.6.19-git20 on top of up-to-date FC6.
--alessandro
"...when I get it, I _get_ it"
(Lara Eidemiller)
On Monday 18 December 2006 15:41, Linus Torvalds wrote:
>On Mon, 18 Dec 2006, Linus Torvalds wrote:
>> But at the same time, it's interesting that it still happens when we
>> try to re-add the dirty bit. That would tell me that it's one of two
>> cases:
>
>Forget that. There's a third case, which is much more likely:
>
> - Andrew's patch had a ", 1" where it _should_ have had a ", 0".
>
>This should be fairly easy to test: just change every single ", 1" case
> in the patch to ", 0".
>
>The only case that _definitely_ would want ",1" is actually the case
> that already calls page_mkclean() directly: clear_page_dirty_for_io().
> So no other ", 1" is valid, and that one that needed it already avoided
> even calling the "test_clear_page_dirty()" function, because it did it
> all by hand.
>
What about the mm/rmap.c one liner, in or out?
Thanks.
>What happens for you in that case?
>
> Linus
--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Yahoo.com and AOL/TW attorneys please note, additions to the above
message by Gene Heskett are:
Copyright 2006 by Maurice Eugene Heskett, all rights reserved.
On Mon, 18 Dec 2006, Andrei Popa wrote:
> >
> > This should be fairly easy to test: just change every single ", 1" case in
> > the patch to ", 0".
> >
> > What happens for you in that case?
>
> I have file corruption.
Magic. And btw, _thanks_ for being such a great tester.
So now I have one more thing for you to try, if you can be bothered:
There's exactly two call sites that call "page_mkclean()" (and that is the
only thing in turn that calls "page_mkclean_one()", which we already
determined will cause the corruption).
Both of them do
if (mapping_cap_account_dirty(mapping)) {
..
things, although they do slightly different things inside that if in your
patched kernel.
Can you just TOTALLY DISABLE that case for the test_clear_page_dirty()
case? Just do an "#if 0 .. #endif" around that whole if-statement, leaving
the _only_ thing that actually calls "page_mkclean()" to be the
"clear_page_dirty_for_io()" call.
Do you still see corruption?
Linus
On Mon, 18 Dec 2006, Alessandro Suardi wrote:
>
> No idea whether this can be a data point or not, but
> here it goes... my P2P box is about to turn 5 days old
> while running nonstop one or both of aMule 2.1.3 and
> BitTorrent 4.4.0 on ext3 mounted w/default options
> on both IDE and USB disks. Zero corruption.
>
> AMD K7-800, 512MB RAM, PREEMPT/UP kernel,
> 2.6.19-git20 on top of up-to-date FC6.
It _looks_ like PREEMPT/SMP is one common configuration.
It might also be that the blocksize of the filesystem matters. 4kB
filesystems are fundamentally simpler than 1kB filesystems, for example.
You can tell at least with "/sbin/dumpe2fs -h /dev/..." or something.
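[Editorial sketch] As an alternative to reading the superblock with dumpe2fs, the block size of a mounted filesystem can also be queried from C through statvfs(). The helper name fs_block_size() is made up for this example, and it is an assumption that f_bsize matches the "Block size" reported by dumpe2fs on a given setup, so it is worth cross-checking once.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/statvfs.h>

/*
 * Returns the block size in bytes of the filesystem that holds `path`,
 * or 0 if statvfs() fails (the error is printed to stderr).
 */
unsigned long fs_block_size(const char *path)
{
    struct statvfs st;

    assert(path != NULL);

    if (statvfs(path, &st) != 0) {
        perror("statvfs");
        return 0;
    }
    return st.f_bsize;
}
```

Calling fs_block_size() on the directory where the downloads are written would show whether the affected filesystem uses 1kB or 4kB blocks.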
Andrei - one thing that might be interesting to see: when corruption
occurs, can you get the corrupted file somehow? And compare it with a
known-good copy to see what the corruption looks like?
Linus
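[Editorial sketch] One possible way to do the comparison asked for above is to walk the corrupted file and a known-good copy block by block and print the offsets that differ. Whether the damage lines up with 1kB filesystem blocks or 4kB pages is exactly the kind of thing this would show. The function name count_differing_blocks() is hypothetical, and read errors are simply treated like end of file.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Compares two files in chunks of block_size bytes and prints the offset
 * of every chunk that differs in content or length.
 * Returns the number of differing chunks, or -1 if a file cannot be
 * opened or memory cannot be allocated.
 */
long count_differing_blocks(const char *good_path, const char *bad_path,
                            size_t block_size)
{
    FILE *good, *bad;
    unsigned char *a, *b;
    unsigned long long offset = 0;
    long differing = 0;

    assert(block_size > 0);

    good = fopen(good_path, "rb");
    bad = fopen(bad_path, "rb");
    a = malloc(block_size);
    b = malloc(block_size);
    if (!good || !bad || !a || !b) {
        differing = -1;
        goto out;
    }

    for (;;) {
        size_t na = fread(a, 1, block_size, good);
        size_t nb = fread(b, 1, block_size, bad);

        if (na == 0 && nb == 0)
            break;              /* both files are exhausted */
        if (na != nb || memcmp(a, b, na) != 0) {
            printf("block at offset %llu differs\n", offset);
            differing++;
        }
        offset += block_size;
    }

out:
    if (good)
        fclose(good);
    if (bad)
        fclose(bad);
    free(a);
    free(b);
    return differing;
}
```

Running it once with block_size 4096 and once with 1024 on the same pair of files gives a first idea of the granularity of the corruption.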
On Mon, 2006-12-18 at 14:32 -0800, Linus Torvalds wrote:
>
> On Mon, 18 Dec 2006, Andrei Popa wrote:
> > >
> > > This should be fairly easy to test: just change every single ", 1" case in
> > > the patch to ", 0".
> > >
> > > What happens for you in that case?
> >
> > I have file corruption.
>
> Magic. And btw, _thanks_ for being such a great tester.
>
> So now I have one more thing for you to try, if you can be bothered:
>
> There's exactly two call sites that call "page_mkclean()" (and that is the
> only thing in turn that calls "page_mkclean_one()", which we already
> determined will cause the corruption).
>
> Both of them do
>
> if (mapping_cap_account_dirty(mapping)) {
> ..
>
> things, although they do slightly different things inside that if in your
> patched kernel.
>
> Can you just TOTALLY DISABLE that case for the test_clear_page_dirty()
> case? Just do an "#if 0 .. #endif" around that whole if-statement, leaving
> the _only_ thing that actually calls "page_mkclean()" to be the
> "clear_page_dirty_for_io()" call.
>
> Do you still see corruption?
nope, no file corruption at all.
diff --git a/fs/buffer.c b/fs/buffer.c
index d1f1b54..263f88e 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *pag
int ret = 0;
BUG_ON(!PageLocked(page));
- if (PageWriteback(page))
+ if (PageDirty(page) || PageWriteback(page))
return 0;
if (mapping == NULL) { /* can this still happen? */
@@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *pag
spin_lock(&mapping->private_lock);
ret = drop_buffers(page, &buffers_to_free);
spin_unlock(&mapping->private_lock);
- if (ret) {
- /*
- * If the filesystem writes its buffers by hand (eg ext3)
- * then we can have clean buffers against a dirty page. We
- * clean the page here; otherwise later reattachment of buffers
- * could encounter a non-uptodate page, which is unresolvable.
- * This only applies in the rare case where try_to_free_buffers
- * succeeds but the page is not freed.
- *
- * Also, during truncate, discard_buffer will have marked all
- * the page's buffers clean. We discover that here and clean
- * the page also.
- */
- if (test_clear_page_dirty(page))
- task_io_account_cancelled_write(PAGE_CACHE_SIZE);
- }
out:
if (buffers_to_free) {
struct buffer_head *bh = buffers_to_free;
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 0f05cab..2d8bbbb 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -1245,7 +1245,7 @@ retry:
wait_on_page_writeback(page);
if (PageWriteback(page) ||
- !test_clear_page_dirty(page)) {
+ !test_clear_page_dirty(page, 0)) {
unlock_page(page);
break;
}
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 1387749..da2bdb1 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -484,7 +484,7 @@ static int fuse_commit_write(struct file
spin_unlock(&fc->lock);
if (offset == 0 && to == PAGE_CACHE_SIZE) {
- clear_page_dirty(page);
+ clear_page_dirty(page, 0);
SetPageUptodate(page);
}
}
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index ed2c223..9f82cd0 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct
static void truncate_huge_page(struct page *page)
{
- clear_page_dirty(page);
+ clear_page_dirty(page, 0);
ClearPageUptodate(page);
remove_from_page_cache(page);
put_page(page);
diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c
index b1a1c72..5e29b37 100644
--- a/fs/jfs/jfs_metapage.c
+++ b/fs/jfs/jfs_metapage.c
@@ -773,7 +773,7 @@ #if MPS_PER_PAGE == 1
/* Retest mp->count since we may have released page lock */
if (test_bit(META_discard, &mp->flag) && !mp->count) {
- clear_page_dirty(page);
+ clear_page_dirty(page, 0);
ClearPageUptodate(page);
}
#else
diff --git a/fs/reiserfs/stree.c b/fs/reiserfs/stree.c
index 47e7027..a97e198 100644
--- a/fs/reiserfs/stree.c
+++ b/fs/reiserfs/stree.c
@@ -1459,7 +1459,7 @@ static void unmap_buffers(struct page *p
bh = next;
} while (bh != head);
if (PAGE_SIZE == bh->b_size) {
- clear_page_dirty(page);
+ clear_page_dirty(page, 0);
}
}
}
diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index b56eb75..44ac434 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -343,7 +343,7 @@ xfs_start_page_writeback(
ASSERT(!PageWriteback(page));
set_page_writeback(page);
if (clear_dirty)
- clear_page_dirty(page);
+ clear_page_dirty(page, 0);
unlock_page(page);
if (!buffers) {
end_page_writeback(page);
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 4830a3b..175ab3c 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -253,13 +253,13 @@ #define ClearPageUncached(page) clear_bi
struct page; /* forward declaration */
-int test_clear_page_dirty(struct page *page);
+int test_clear_page_dirty(struct page *page, int must_clean_ptes);
int test_clear_page_writeback(struct page *page);
int test_set_page_writeback(struct page *page);
-static inline void clear_page_dirty(struct page *page)
+static inline void clear_page_dirty(struct page *page, int must_clean_ptes)
{
- test_clear_page_dirty(page);
+ test_clear_page_dirty(page, must_clean_ptes);
}
static inline void set_page_writeback(struct page *page)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 237107c..f2a157d 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -848,7 +848,7 @@ EXPORT_SYMBOL(set_page_dirty_lock);
* Clear a page's dirty flag, while caring for dirty memory accounting.
* Returns true if the page was previously dirty.
*/
-int test_clear_page_dirty(struct page *page)
+int test_clear_page_dirty(struct page *page, int must_clean_ptes)
{
struct address_space *mapping = page_mapping(page);
unsigned long flags;
@@ -857,6 +857,8 @@ int test_clear_page_dirty(struct page *p
return TestClearPageDirty(page);
write_lock_irqsave(&mapping->tree_lock, flags);
+
+#if 0
if (TestClearPageDirty(page)) {
radix_tree_tag_clear(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
@@ -866,11 +868,19 @@ int test_clear_page_dirty(struct page *p
* page is locked, which pins the address_space
*/
if (mapping_cap_account_dirty(mapping)) {
- page_mkclean(page);
+ int cleaned = page_mkclean(page);
+ if (!must_clean_ptes && cleaned){
+ WARN_ON(1);
+ set_page_dirty(page);
+ }
+
dec_zone_page_state(page, NR_FILE_DIRTY);
}
return 1;
}
+
+#endif
+
write_unlock_irqrestore(&mapping->tree_lock, flags);
return 0;
}
diff --git a/mm/rmap.c b/mm/rmap.c
diff --git a/mm/truncate.c b/mm/truncate.c
index 9bfb8e8..9a01d9e 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -70,7 +70,7 @@ truncate_complete_page(struct address_sp
if (PagePrivate(page))
do_invalidatepage(page, 0);
- if (test_clear_page_dirty(page))
+ if (test_clear_page_dirty(page, 0))
task_io_account_cancelled_write(PAGE_CACHE_SIZE);
ClearPageUptodate(page);
ClearPageMappedToDisk(page);
@@ -386,7 +386,7 @@ int invalidate_inode_pages2_range(struct
PAGE_CACHE_SIZE, 0);
}
}
- was_dirty = test_clear_page_dirty(page);
+ was_dirty = test_clear_page_dirty(page, 0);
if (!invalidate_complete_page2(mapping, page)) {
if (was_dirty)
set_page_dirty(page);
On Tue, 19 Dec 2006, Andrei Popa wrote:
> >
> > There's exactly two call sites that call "page_mkclean()" (and that is the
> > only thing in turn that calls "page_mkclean_one()", which we already
> > determined will cause the corruption).
> >
> > Can you just TOTALLY DISABLE that case for the test_clear_page_dirty()
> > case? Just do an "#if 0 .. #endif" around that whole if-statement, leaving
> > the _only_ thing that actually calls "page_mkclean()" to be the
> > "clear_page_dirty_for_io()" call.
> >
> > Do you still see corruption?
>
> nope, no file corruption at all.
Ok. That's interesting, but I think you actually #ifdef'ed out too
much:
> +
> +#if 0
> if (TestClearPageDirty(page)) {
> radix_tree_tag_clear(&mapping->page_tree,
> page_index(page), PAGECACHE_TAG_DIRTY);
> @@ -866,11 +868,19 @@ int test_clear_page_dirty(struct page *p
> * page is locked, which pins the address_space
> */
> if (mapping_cap_account_dirty(mapping)) {
> - page_mkclean(page);
> + int cleaned = page_mkclean(page);
> + if (!must_clean_ptes && cleaned){
> + WARN_ON(1);
> + set_page_dirty(page);
> + }
> +
> dec_zone_page_state(page, NR_FILE_DIRTY);
> }
> return 1;
> }
> +
> +#endif
> +
It was really just the _inner_ "if (mapping_cap_account_dirty(.."
statement that I meant you should remove.
Can you try that too?
Linus
On Mon, 2006-12-18 at 14:45 -0800, Linus Torvalds wrote:
>
> On Mon, 18 Dec 2006, Alessandro Suardi wrote:
> >
> > No idea whether this can be a data point or not, but
> > here it goes... my P2P box is about to turn 5 days old
> > while running nonstop one or both of aMule 2.1.3 and
> > BitTorrent 4.4.0 on ext3 mounted w/default options
> > on both IDE and USB disks. Zero corruption.
> >
> > AMD K7-800, 512MB RAM, PREEMPT/UP kernel,
> > 2.6.19-git20 on top of up-to-date FC6.
>
> It _looks_ like PREEMPT/SMP is one common configuration.
>
> It might also be that the blocksize of the filesystem matters. 4kB
> filesystems are fundamentally simpler than 1kB filesystems, for example.
> You can tell at least with "/sbin/dumpe2fs -h /dev/..." or something.
>
> Andrei - one thing that might be interesting to see: when corruption
> occurs, can you get the corrupted file somehow? And compare it with a
> known-good copy to see what the corruption looks like?
The corrupted file has a chunk filled with zeros:
http://193.226.119.62/corruption0.jpg
http://193.226.119.62/corruption1.jpg
On Tue, 19 Dec 2006, Andrei Popa wrote:
>
> the corrupted file has a chunk filled with zeros
>
> http://193.226.119.62/corruption0.jpg
> http://193.226.119.62/corruption1.jpg
Thanks. Yup, filled with zeroes, and the corruption stops (but does _not_
start) at a page boundary.
That _does_ look very much like it was filled in linearly, then written
out to disk when it was in the middle of the page, and then we simply lost
the further writes that should also have gone on to that page. All
consistent with dropping a dirty bit somewhere in the middle of the page
updates.
Which we kind of knew must be the issue anyway, but it's good to know that
the corruption pattern is consistent with what we're trying to figure out.
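The pattern described above (a zero-filled run that ends, but does not start, on a page boundary) can be checked mechanically rather than by eye. What follows is only a sketch under stated assumptions: `find_zero_runs`, `struct zero_run` and the 4096-byte `PAGE_BYTES` are illustrative names and values, not anything taken from the kernel or from the tools used in this thread, and both copies are assumed to be fully loaded into memory and of equal length.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_BYTES 4096u /* assumption: 4kB pages; adjust for other setups */

/* One damaged run: the bad copy reads zero where the good copy does not. */
struct zero_run {
    size_t start; /* first damaged offset */
    size_t end;   /* one past the last damaged offset */
};

/*
 * Compare a known-good buffer with a suspect one of the same length and
 * record up to max_runs zero-filled runs. Returns the number of runs found.
 */
static size_t find_zero_runs(const unsigned char *good, const unsigned char *bad,
                             size_t len, struct zero_run *runs, size_t max_runs)
{
    size_t n = 0, i = 0;

    assert(good && bad && runs);
    while (i < len && n < max_runs) {
        if (bad[i] == 0 && good[i] != 0) {
            size_t start = i, last = i;

            /*
             * Stay in the run while the bad copy reads zero, but remember
             * the last byte that really differs, so genuine zeroes after
             * the damage are not counted as part of it.
             */
            while (i < len && bad[i] == 0) {
                if (good[i] != 0)
                    last = i;
                i++;
            }
            runs[n].start = start;
            runs[n].end = last + 1;
            n++;
        } else {
            i++;
        }
    }
    return n;
}

/* Nonzero if the damage begins exactly at the start of a page. */
static int starts_on_page_boundary(const struct zero_run *r)
{
    return r->start % PAGE_BYTES == 0;
}

/* Nonzero if the damage stops exactly at the end of a page. */
static int ends_on_page_boundary(const struct zero_run *r)
{
    return r->end % PAGE_BYTES == 0;
}
```

A run for which `ends_on_page_boundary()` is true and `starts_on_page_boundary()` is false would match the "lost the later writes to that page" reading; a run aligned at both ends would point more towards whole pages going missing.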
Linus
On Mon, 2006-12-18 at 16:04 -0800, Linus Torvalds wrote:
>
> On Tue, 19 Dec 2006, Andrei Popa wrote:
> > >
> > > There's exactly two call sites that call "page_mkclean()" (and that is the
> > > only thing in turn that calls "page_mkclean_one()", which we already
> > > determined will cause the corruption).
> > >
> > > Can you just TOTALLY DISABLE that case for the test_clear_page_dirty()
> > > case? Just do an "#if 0 .. #endif" around that whole if-statement, leaving
> > > the _only_ thing that actually calls "page_mkclean()" to be the
> > > "clear_page_dirty_for_io()" call.
> > >
> > > Do you still see corruption?
> >
> > nope, no file corruption at all.
>
> Ok. That's interesting, but I think you actually #ifdef'ed out too
> much:
>
> > +
> > +#if 0
> > if (TestClearPageDirty(page)) {
> > radix_tree_tag_clear(&mapping->page_tree,
> > page_index(page), PAGECACHE_TAG_DIRTY);
> > @@ -866,11 +868,19 @@ int test_clear_page_dirty(struct page *p
> > * page is locked, which pins the address_space
> > */
> > if (mapping_cap_account_dirty(mapping)) {
> > - page_mkclean(page);
> > + int cleaned = page_mkclean(page);
> > + if (!must_clean_ptes && cleaned){
> > + WARN_ON(1);
> > + set_page_dirty(page);
> > + }
> > +
> > dec_zone_page_state(page, NR_FILE_DIRTY);
> > }
> > return 1;
> > }
> > +
> > +#endif
> > +
>
> It was really just the _inner_ "if (mapping_cap_account_dirty(.."
> statement that I meant you should remove.
>
> Can you try that too?
I have file corruption: "Hash check on download completion found bad
chunks, consider using "safe_sync"."
diff --git a/fs/buffer.c b/fs/buffer.c
index d1f1b54..263f88e 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *pag
int ret = 0;
BUG_ON(!PageLocked(page));
- if (PageWriteback(page))
+ if (PageDirty(page) || PageWriteback(page))
return 0;
if (mapping == NULL) { /* can this still happen? */
@@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *pag
spin_lock(&mapping->private_lock);
ret = drop_buffers(page, &buffers_to_free);
spin_unlock(&mapping->private_lock);
- if (ret) {
- /*
- * If the filesystem writes its buffers by hand (eg ext3)
- * then we can have clean buffers against a dirty page. We
- * clean the page here; otherwise later reattachment of buffers
- * could encounter a non-uptodate page, which is unresolvable.
- * This only applies in the rare case where try_to_free_buffers
- * succeeds but the page is not freed.
- *
- * Also, during truncate, discard_buffer will have marked all
- * the page's buffers clean. We discover that here and clean
- * the page also.
- */
- if (test_clear_page_dirty(page))
- task_io_account_cancelled_write(PAGE_CACHE_SIZE);
- }
out:
if (buffers_to_free) {
struct buffer_head *bh = buffers_to_free;
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 0f05cab..2d8bbbb 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -1245,7 +1245,7 @@ retry:
wait_on_page_writeback(page);
if (PageWriteback(page) ||
- !test_clear_page_dirty(page)) {
+ !test_clear_page_dirty(page, 0)) {
unlock_page(page);
break;
}
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 1387749..da2bdb1 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -484,7 +484,7 @@ static int fuse_commit_write(struct file
spin_unlock(&fc->lock);
if (offset == 0 && to == PAGE_CACHE_SIZE) {
- clear_page_dirty(page);
+ clear_page_dirty(page, 0);
SetPageUptodate(page);
}
}
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index ed2c223..9f82cd0 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct
static void truncate_huge_page(struct page *page)
{
- clear_page_dirty(page);
+ clear_page_dirty(page, 0);
ClearPageUptodate(page);
remove_from_page_cache(page);
put_page(page);
diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c
index b1a1c72..5e29b37 100644
--- a/fs/jfs/jfs_metapage.c
+++ b/fs/jfs/jfs_metapage.c
@@ -773,7 +773,7 @@ #if MPS_PER_PAGE == 1
/* Retest mp->count since we may have released page lock */
if (test_bit(META_discard, &mp->flag) && !mp->count) {
- clear_page_dirty(page);
+ clear_page_dirty(page, 0);
ClearPageUptodate(page);
}
#else
diff --git a/fs/reiserfs/stree.c b/fs/reiserfs/stree.c
index 47e7027..a97e198 100644
--- a/fs/reiserfs/stree.c
+++ b/fs/reiserfs/stree.c
@@ -1459,7 +1459,7 @@ static void unmap_buffers(struct page *p
bh = next;
} while (bh != head);
if (PAGE_SIZE == bh->b_size) {
- clear_page_dirty(page);
+ clear_page_dirty(page, 0);
}
}
}
diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index b56eb75..44ac434 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -343,7 +343,7 @@ xfs_start_page_writeback(
ASSERT(!PageWriteback(page));
set_page_writeback(page);
if (clear_dirty)
- clear_page_dirty(page);
+ clear_page_dirty(page, 0);
unlock_page(page);
if (!buffers) {
end_page_writeback(page);
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 4830a3b..175ab3c 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -253,13 +253,13 @@ #define ClearPageUncached(page) clear_bi
struct page; /* forward declaration */
-int test_clear_page_dirty(struct page *page);
+int test_clear_page_dirty(struct page *page, int must_clean_ptes);
int test_clear_page_writeback(struct page *page);
int test_set_page_writeback(struct page *page);
-static inline void clear_page_dirty(struct page *page)
+static inline void clear_page_dirty(struct page *page, int
must_clean_ptes)
{
- test_clear_page_dirty(page);
+ test_clear_page_dirty(page, must_clean_ptes);
}
static inline void set_page_writeback(struct page *page)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 237107c..4ff7f90 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -848,7 +848,7 @@ EXPORT_SYMBOL(set_page_dirty_lock);
* Clear a page's dirty flag, while caring for dirty memory
accounting.
* Returns true if the page was previously dirty.
*/
-int test_clear_page_dirty(struct page *page)
+int test_clear_page_dirty(struct page *page, int must_clean_ptes)
{
struct address_space *mapping = page_mapping(page);
unsigned long flags;
@@ -857,6 +857,7 @@ int test_clear_page_dirty(struct page *p
return TestClearPageDirty(page);
write_lock_irqsave(&mapping->tree_lock, flags);
+
if (TestClearPageDirty(page)) {
radix_tree_tag_clear(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
@@ -865,12 +866,23 @@ int test_clear_page_dirty(struct page *p
* We can continue to use `mapping' here because the
* page is locked, which pins the address_space
*/
+
+#if 0
+
if (mapping_cap_account_dirty(mapping)) {
- page_mkclean(page);
+ int cleaned = page_mkclean(page);
+ if (!must_clean_ptes && cleaned){
+ WARN_ON(1);
+ set_page_dirty(page);
+ }
+
dec_zone_page_state(page, NR_FILE_DIRTY);
}
+#endif
+
return 1;
}
+
write_unlock_irqrestore(&mapping->tree_lock, flags);
return 0;
}
diff --git a/mm/rmap.c b/mm/rmap.c
diff --git a/mm/truncate.c b/mm/truncate.c
index 9bfb8e8..9a01d9e 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -70,7 +70,7 @@ truncate_complete_page(struct address_sp
if (PagePrivate(page))
do_invalidatepage(page, 0);
- if (test_clear_page_dirty(page))
+ if (test_clear_page_dirty(page, 0))
task_io_account_cancelled_write(PAGE_CACHE_SIZE);
ClearPageUptodate(page);
ClearPageMappedToDisk(page);
@@ -386,7 +386,7 @@ int invalidate_inode_pages2_range(struct
PAGE_CACHE_SIZE, 0);
}
}
- was_dirty = test_clear_page_dirty(page);
+ was_dirty = test_clear_page_dirty(page, 0);
if (!invalidate_complete_page2(mapping, page)) {
if (was_dirty)
set_page_dirty(page);
On Tue, 19 Dec 2006, Andrei Popa wrote:
> > >
> > > nope, no file corruption at all.
> >
> > Ok. That's interesting, but I think you actually #ifdef'ed out too
> > much:
> >
> > It was really just the _inner_ "if (mapping_cap_account_dirty(.."
> > statement that I meant you should remove.
> >
> > Can you try that too?
>
> I have file corruption: "Hash check on download completion found bad
> chunks, consider using "safe_sync"."
Ok, that's interesting.
So it doesn't seem to be the call to page_mkclean() itself that causes
corruption. It looks like Peter's hunch that maybe there's some bug in
PG_dirty handling _itself_ might be an idea..
And the reason it only started happening now is that it may just have been
_hidden_ by the fact that while we kept the dirty bits in the page tables,
we'd end up writing the dirty page _despite_ having lost the PG_dirty bit.
So if it's some bad interaction between writable mappings and some other
part of the system, we just didn't see it earlier, exactly because we had
_lots_ of dirty bits, and it was enough that _one_ of them was right.
If you didn't see corruption when you #ifdef'ed out too much of the
"test_clear_page_dirty()" function (the _whole_ TestClearPageDirty()
if-statement), but you get it when you just comment out the stuff that
does the page_mkclean(), that's interesting.
I'm left looking at the "radix_tree_tag_clear()" in
test_clear_page_dirty().
What happens if you only ifdef out that single thing?
The actual page-cleaning functions make sure to only clear the TAG_DIRTY
bit _after_ the page has been marked for writeback. Is there some ordering
constraint there, perhaps?
I'm really reaching here. I'm trying to see the pattern, and I'm not
seeing it. I'm asking you to test things just to get more of a feel for
what triggers the failure, than because I actually have any kind of idea
of what the heck is going on.
Andrew, Nick, Hugh - any ideas?
Linus
On Monday 18 December 2006 18:48, Andrei Popa wrote:
>On Mon, 2006-12-18 at 14:32 -0800, Linus Torvalds wrote:
>> On Mon, 18 Dec 2006, Andrei Popa wrote:
>> > > This should be fairly easy to test: just change every single ", 1"
>> > > case in the patch to ", 0".
>> > >
>> > > What happens for you in that case?
>> >
>> > I have file corruption.
>>
>> Magic. And btw, _thanks_ for being such a great tester.
>>
>> So now I have one more thing for you to try, if you can bother:
>>
>> There's exactly two call sites that call "page_mkclean()" (and that is
>> the only thing in turn that calls "page_mkclean_one()", which we
>> already determined will cause the corruption).
>>
>> Both of them do
>>
>> if (mapping_cap_account_dirty(mapping)) {
>> ..
>>
>> things, although they do slightly different things inside that if in
>> your patched kernel.
>>
>> Can you just TOTALLY DISABLE that case for the test_clear_page_dirty()
>> case? Just do an "#if 0 .. #endif" around that whole if-statement,
>> leaving the _only_ thing that actually calls "page_mkclean()" to be
>> the "clear_page_dirty_for_io()" call.
>>
>> Do you still see corruption?
>
>nope, no file corruption at all.
>
Goody I says to nobody in particular, I'll go build this...
>
>diff --git a/fs/buffer.c b/fs/buffer.c
>index d1f1b54..263f88e 100644
>--- a/fs/buffer.c
>+++ b/fs/buffer.c
>@@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *pag
> int ret = 0;
>
> BUG_ON(!PageLocked(page));
>- if (PageWriteback(page))
>+ if (PageDirty(page) || PageWriteback(page))
> return 0;
>
> if (mapping == NULL) { /* can this still happen? */
>@@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *pag
> spin_lock(&mapping->private_lock);
> ret = drop_buffers(page, &buffers_to_free);
> spin_unlock(&mapping->private_lock);
>- if (ret) {
>- /*
>- * If the filesystem writes its buffers by hand (eg ext3)
>- * then we can have clean buffers against a dirty page. We
>- * clean the page here; otherwise later reattachment of buffers
>- * could encounter a non-uptodate page, which is unresolvable.
>- * This only applies in the rare case where try_to_free_buffers
>- * succeeds but the page is not freed.
>- *
>- * Also, during truncate, discard_buffer will have marked all
>- * the page's buffers clean. We discover that here and clean
>- * the page also.
>- */
>- if (test_clear_page_dirty(page))
>- task_io_account_cancelled_write(PAGE_CACHE_SIZE);
>- }
> out:
> if (buffers_to_free) {
> struct buffer_head *bh = buffers_to_free;
>diff --git a/fs/cifs/file.c b/fs/cifs/file.c
>index 0f05cab..2d8bbbb 100644
>--- a/fs/cifs/file.c
>+++ b/fs/cifs/file.c
>@@ -1245,7 +1245,7 @@ retry:
> wait_on_page_writeback(page);
>
> if (PageWriteback(page) ||
>- !test_clear_page_dirty(page)) {
>+ !test_clear_page_dirty(page, 0)) {
> unlock_page(page);
> break;
> }
>diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>index 1387749..da2bdb1 100644
>--- a/fs/fuse/file.c
>+++ b/fs/fuse/file.c
>@@ -484,7 +484,7 @@ static int fuse_commit_write(struct file
> spin_unlock(&fc->lock);
>
> if (offset == 0 && to == PAGE_CACHE_SIZE) {
>- clear_page_dirty(page);
>+ clear_page_dirty(page, 0);
> SetPageUptodate(page);
> }
> }
>diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>index ed2c223..9f82cd0 100644
>--- a/fs/hugetlbfs/inode.c
>+++ b/fs/hugetlbfs/inode.c
>@@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct
>
> static void truncate_huge_page(struct page *page)
> {
>- clear_page_dirty(page);
>+ clear_page_dirty(page, 0);
> ClearPageUptodate(page);
> remove_from_page_cache(page);
> put_page(page);
>diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c
>index b1a1c72..5e29b37 100644
>--- a/fs/jfs/jfs_metapage.c
>+++ b/fs/jfs/jfs_metapage.c
>@@ -773,7 +773,7 @@ #if MPS_PER_PAGE == 1
>
> /* Retest mp->count since we may have released page lock */
> if (test_bit(META_discard, &mp->flag) && !mp->count) {
>- clear_page_dirty(page);
>+ clear_page_dirty(page, 0);
> ClearPageUptodate(page);
> }
> #else
>diff --git a/fs/reiserfs/stree.c b/fs/reiserfs/stree.c
>index 47e7027..a97e198 100644
>--- a/fs/reiserfs/stree.c
>+++ b/fs/reiserfs/stree.c
>@@ -1459,7 +1459,7 @@ static void unmap_buffers(struct page *p
> bh = next;
> } while (bh != head);
> if (PAGE_SIZE == bh->b_size) {
>- clear_page_dirty(page);
>+ clear_page_dirty(page, 0);
> }
> }
> }
>diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
>index b56eb75..44ac434 100644
>--- a/fs/xfs/linux-2.6/xfs_aops.c
>+++ b/fs/xfs/linux-2.6/xfs_aops.c
>@@ -343,7 +343,7 @@ xfs_start_page_writeback(
> ASSERT(!PageWriteback(page));
> set_page_writeback(page);
> if (clear_dirty)
>- clear_page_dirty(page);
>+ clear_page_dirty(page, 0);
> unlock_page(page);
> if (!buffers) {
> end_page_writeback(page);
>diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
>index 4830a3b..175ab3c 100644
>--- a/include/linux/page-flags.h
>+++ b/include/linux/page-flags.h
>@@ -253,13 +253,13 @@ #define ClearPageUncached(page) clear_bi
>
> struct page; /* forward declaration */
>
>-int test_clear_page_dirty(struct page *page);
>+int test_clear_page_dirty(struct page *page, int must_clean_ptes);
> int test_clear_page_writeback(struct page *page);
> int test_set_page_writeback(struct page *page);
>
>-static inline void clear_page_dirty(struct page *page)
>+static inline void clear_page_dirty(struct page *page, int
>must_clean_ptes)
above looks wrapped to me so I fixed it to one line
> {
>- test_clear_page_dirty(page);
>+ test_clear_page_dirty(page, must_clean_ptes);
> }
>
> static inline void set_page_writeback(struct page *page)
>diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>index 237107c..f2a157d 100644
>--- a/mm/page-writeback.c
>+++ b/mm/page-writeback.c
>@@ -848,7 +848,7 @@ EXPORT_SYMBOL(set_page_dirty_lock);
> * Clear a page's dirty flag, while caring for dirty memory
>accounting.
Likewise here, malformed patch otherwise
> * Returns true if the page was previously dirty.
> */
>-int test_clear_page_dirty(struct page *page)
>+int test_clear_page_dirty(struct page *page, int must_clean_ptes)
> {
> struct address_space *mapping = page_mapping(page);
> unsigned long flags;
>@@ -857,6 +857,8 @@ int test_clear_page_dirty(struct page *p
> return TestClearPageDirty(page);
>
> write_lock_irqsave(&mapping->tree_lock, flags);
>+
>+#if 0
> if (TestClearPageDirty(page)) {
> radix_tree_tag_clear(&mapping->page_tree,
> page_index(page), PAGECACHE_TAG_DIRTY);
>@@ -866,11 +868,19 @@ int test_clear_page_dirty(struct page *p
> * page is locked, which pins the address_space
> */
> if (mapping_cap_account_dirty(mapping)) {
>- page_mkclean(page);
>+ int cleaned = page_mkclean(page);
>+ if (!must_clean_ptes && cleaned){
>+ WARN_ON(1);
>+ set_page_dirty(page);
>+ }
>+
> dec_zone_page_state(page, NR_FILE_DIRTY);
> }
> return 1;
> }
>+
>+#endif
>+
> write_unlock_irqrestore(&mapping->tree_lock, flags);
> return 0;
> }
>diff --git a/mm/rmap.c b/mm/rmap.c
>diff --git a/mm/truncate.c b/mm/truncate.c
>index 9bfb8e8..9a01d9e 100644
>--- a/mm/truncate.c
>+++ b/mm/truncate.c
>@@ -70,7 +70,7 @@ truncate_complete_page(struct address_sp
> if (PagePrivate(page))
> do_invalidatepage(page, 0);
>
>- if (test_clear_page_dirty(page))
>+ if (test_clear_page_dirty(page, 0))
> task_io_account_cancelled_write(PAGE_CACHE_SIZE);
> ClearPageUptodate(page);
> ClearPageMappedToDisk(page);
>@@ -386,7 +386,7 @@ int invalidate_inode_pages2_range(struct
> PAGE_CACHE_SIZE, 0);
> }
> }
>- was_dirty = test_clear_page_dirty(page);
>+ was_dirty = test_clear_page_dirty(page, 0);
> if (!invalidate_complete_page2(mapping, page)) {
> if (was_dirty)
> set_page_dirty(page);
>
I think I must have screwed the moose. Following along in this thread,
I'd patched things back and forth till I figured I'd better do a fresh
tree, so starting with the full 2.6.19 tarball, I applied the 2.6.20-rc1
patch, then the above patch, which should be the only thing different
from what I'm running right now, which is the commented line in rmap.c,
otherwise as it unpacked.
But:
In file included from include/linux/mm.h:230,
from include/linux/rmap.h:10,
from init/main.c:47:
include/linux/page-flags.h:260: error: expected declaration specifiers
or ‘...’ before ‘in’
include/linux/page-flags.h: In function ‘clear_page_dirty’:
include/linux/page-flags.h:262: error: ‘must_clean_ptes’ undeclared (first
use in this function)
include/linux/page-flags.h:262: error: (Each undeclared identifier is
reported only once
include/linux/page-flags.h:262: error: for each function it appears in.)
make[1]: *** [init/main.o] Error 1
make: *** [init] Error 2
There were 2 places where this patch is word wrapped, and this was one of
them:
-static inline void clear_page_dirty(struct page *page)
+static inline void clear_page_dirty(struct page *page, int
must_clean_ptes)
The other one was in a comment, which screwed the patch and needed fixed
too. Is it fubared someplace else I missed? Or am I in fact being
bitten by this bug?
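A mechanical check for this kind of mail-client damage is easy to sketch. The helper below is purely illustrative (`count_wrapped_lines` is a made-up name, not part of patch(1) or any kernel script): inside a unified-diff hunk every line should begin with a space, '+', '-' or a backslash, so a line that begins with anything else is probably the tail of a wrapped line. It assumes the whole patch is held in one NUL-terminated string, and it cannot catch a continuation that happens to start with one of the legal characters.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Count lines inside unified-diff hunks that do not start with one of the
 * legal hunk prefixes (' ', '+', '-', '\'). Empty lines are tolerated,
 * since many mailers strip the lone space of an empty context line.
 */
static int count_wrapped_lines(const char *patch)
{
    int in_hunk = 0, suspects = 0;
    const char *p = patch;

    assert(patch != NULL);
    while (*p) {
        const char *eol = strchr(p, '\n');
        size_t len = eol ? (size_t)(eol - p) : strlen(p);

        if (len >= 5 && strncmp(p, "diff ", 5) == 0) {
            in_hunk = 0;            /* next file header begins */
        } else if (len >= 2 && strncmp(p, "@@", 2) == 0) {
            in_hunk = 1;            /* hunk header: body lines follow */
        } else if (in_hunk && len > 0 && !strchr(" +-\\", p[0])) {
            suspects++;             /* illegal first character in a hunk */
        }

        p += len;
        if (*p == '\n')
            p++;
    }
    return suspects;
}
```

Fed the page-flags.h hunk quoted above, this would flag the stray "must_clean_ptes)" line; a result of zero does not prove the patch is intact, only that this particular symptom is absent.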
--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Yahoo.com and AOL/TW attorneys please note, additions to the above
message by Gene Heskett are:
Copyright 2006 by Maurice Eugene Heskett, all rights reserved.
On Mon, 18 Dec 2006 16:57:30 -0800 (PST)
Linus Torvalds wrote:
> What happens if you only ifdef out that single thing?
>
> The actual page-cleaning functions make sure to only clear the TAG_DIRTY
> bit _after_ the page has been marked for writeback. Is there some ordering
> constraint there, perhaps?
>
> I'm really reaching here. I'm trying to see the pattern, and I'm not
> seeing it. I'm asking you to test things just to get more of a feel for
> what triggers the failure, than because I actually have any kind of idea
> of what the heck is going on.
>
> Andrew, Nick, Hugh - any ideas?
If all of test_clear_page_dirty() has been commented out then the page will
never become clean hence will never fall out of pagecache, so unless Andrei
is doing a reboot before checking for corruption, perhaps the underlying
data on-disk is incorrect, but we can't see it.
Andrei, how _are_ you running this test? What's the exact sequence of steps?
In particular, are you doing anything which would cause the corrupted file
to be evicted from memory, thus forcing a read from disk? Such as
unmounting and then remounting the filesystem?
The point of my question is to check that the data is really incorrect
on-disk, or whether it is incorrect in pagecache.
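One way to approach that distinction for a single file, short of unmounting, is to ask the kernel to write the file back and drop its cached pages before hashing it again. The sketch below is an assumption-laden illustration, not a tested procedure from this thread: `hash_after_cache_drop` is a hypothetical helper, the FNV-1a hash is just a convenient stand-in for a real checksum, and `posix_fadvise(POSIX_FADV_DONTNEED)` is advisory only, so the kernel is free to keep pages (in particular ones that are still dirty or mapped). A remount remains the more conclusive test.

```c
#define _XOPEN_SOURCE 600
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/*
 * Flush `path`, hint that its cached pages can be dropped, then read it
 * back and store a 64-bit FNV-1a hash of the contents in *out.
 * Returns 0 on success, -1 on any error.
 */
static int hash_after_cache_drop(const char *path, uint64_t *out)
{
    unsigned char buf[4096];
    uint64_t h = 14695981039346656037ULL;   /* FNV-1a offset basis */
    ssize_t n, i;
    int fd;

    assert(path && out);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* Linux accepts fsync() on a read-only descriptor; other systems may not. */
    if (fsync(fd) != 0) {
        close(fd);
        return -1;
    }

    /* Advisory: a failure or a no-op here is not treated as fatal. */
    (void)posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

    while ((n = read(fd, buf, sizeof buf)) > 0) {
        for (i = 0; i < n; i++) {
            h ^= buf[i];
            h *= 1099511628211ULL;          /* FNV-1a prime */
        }
    }
    close(fd);
    if (n < 0)
        return -1;

    *out = h;
    return 0;
}
```

Comparing the value returned here with a hash taken before the cache drop (or with a hash of a known-good copy) would give a hint as to whether the page cache and the disk disagree.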
Also, it'd be useful if you could determine whether the bug appears with
the ext2 filesystem: do s/ext3/ext2/ in /etc/fstab, or boot with
rootfstype=ext2 if it's the root filesystem.
Thanks.
On Mon, 2006-12-18 at 17:21 -0800, Andrew Morton wrote:
> On Mon, 18 Dec 2006 16:57:30 -0800 (PST)
> Linus Torvalds wrote:
>
> > What happens if you only ifdef out that single thing?
> >
> > The actual page-cleaning functions make sure to only clear the TAG_DIRTY
> > bit _after_ the page has been marked for writeback. Is there some ordering
> > constraint there, perhaps?
> >
> > I'm really reaching here. I'm trying to see the pattern, and I'm not
> > seeing it. I'm asking you to test things just to get more of a feel for
> > what triggers the failure, than because I actually have any kind of idea
> > of what the heck is going on.
> >
> > Andrew, Nick, Hugh - any ideas?
>
> If all of test_clear_page_dirty() has been commented out then the page will
> never become clean hence will never fall out of pagecache, so unless Andrei
> is doing a reboot before checking for corruption, perhaps the underlying
> data on-disk is incorrect, but we can't see it.
If I do a sync and echo 1 > /proc/sys/vm/drop_caches, is the reboot
still necessary?
>
> Andrei, how _are_ you running this test? What's the exact sequence of steps?
>
> In particular, are you doing anything which would cause the corrupted file
> to be evicted from memory, thus forcing a read from disk? Such as
> unmounting and then remounting the filesystem?
I boot Linux, start rtorrent and start the download. While it's
downloading I start Evolution and check my mail (my mbox is very large,
several hundred megabytes). I close Evolution (I use it just to have
another application which uses the filesystem and the memory), then
start Evolution again. I start Firefox. The download completes and
rtorrent says whether the hash is good or not. I then do an
"unrar t qwe.rar" to test that all 84 downloaded rar files are OK and
see the result.
>
> The point of my question is to check that the data is really incorrect
> on-disk, or whether it is incorrect in pagecache.
>
> Also, it'd be useful if you could determine whether the bug appears with
> the ext2 filesystem: do s/ext3/ext2/ in /etc/fstab, or boot with
> rootfstype=ext2 if it's the root filesystem.
I will test.
>
> Thanks.
On Mon, 2006-12-18 at 16:57 -0800, Linus Torvalds wrote:
>
> On Tue, 19 Dec 2006, Andrei Popa wrote:
> > > >
> > > > nope, no file corruption at all.
> > >
> > > Ok. That's interesting, but I think you actually #ifdef'ed out too
> > > much:
> > >
> > > It was really just the _inner_ "if (mapping_cap_account_dirty(.."
> > > statement that I meant you should remove.
> > >
> > > Can you try that too?
> >
> > I have file corruption: "Hash check on download completion found bad
> > chunks, consider using "safe_sync"."
>
> Ok, that's interesting.
>
> So it doesn't seem to be the call to page_mkclean() itself that causes
> corruption. It looks like Peter's hunch that maybe there's some bug in
> PG_dirty handling _itself_ might be an idea..
>
> And the reason it only started happening now is that it may just have been
> _hidden_ by the fact that while we kept the dirty bits in the page tables,
> we'd end up writing the dirty page _despite_ having lost the PG_dirty bit.
> So if it's some bad interaction between writable mappings and some other
> part of the system, we just didn't see it earlier, exactly because we had
> _lots_ of dirty bits, and it was enough that _one_ of them was right.
>
> If you didn't see corruption when you #ifdef'ed out too much of the
> "test_clear_page_dirty()" function (the _whole_ TestClearPageDirty()
> if-statement), but you get it when you just comment out the stuff that
> does the page_mkclean(), that's interesting.
>
> I'm left looking at the "radix_tree_tag_clear()" in
> test_clear_page_dirty().
>
> What happens if you only ifdef out that single thing?
I have file corruption.
>
> The actual page-cleaning functions make sure to only clear the TAG_DIRTY
> bit _after_ the page has been marked for writeback. Is there some ordering
> constraint there, perhaps?
>
> I'm really reaching here. I'm trying to see the pattern, and I'm not
> seeing it. I'm asking you to test things just to get more of a feel for
> what triggers the failure, than because I actually have any kind of idea
> of what the heck is going on.
>
> Andrew, Nick, Hugh - any ideas?
>
> Linus
diff --git a/fs/buffer.c b/fs/buffer.c
index d1f1b54..263f88e 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *pag
int ret = 0;
BUG_ON(!PageLocked(page));
- if (PageWriteback(page))
+ if (PageDirty(page) || PageWriteback(page))
return 0;
if (mapping == NULL) { /* can this still happen? */
@@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *pag
spin_lock(&mapping->private_lock);
ret = drop_buffers(page, &buffers_to_free);
spin_unlock(&mapping->private_lock);
- if (ret) {
- /*
- * If the filesystem writes its buffers by hand (eg ext3)
- * then we can have clean buffers against a dirty page. We
- * clean the page here; otherwise later reattachment of buffers
- * could encounter a non-uptodate page, which is unresolvable.
- * This only applies in the rare case where try_to_free_buffers
- * succeeds but the page is not freed.
- *
- * Also, during truncate, discard_buffer will have marked all
- * the page's buffers clean. We discover that here and clean
- * the page also.
- */
- if (test_clear_page_dirty(page))
- task_io_account_cancelled_write(PAGE_CACHE_SIZE);
- }
out:
if (buffers_to_free) {
struct buffer_head *bh = buffers_to_free;
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 0f05cab..2d8bbbb 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -1245,7 +1245,7 @@ retry:
wait_on_page_writeback(page);
if (PageWriteback(page) ||
- !test_clear_page_dirty(page)) {
+ !test_clear_page_dirty(page, 0)) {
unlock_page(page);
break;
}
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 1387749..da2bdb1 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -484,7 +484,7 @@ static int fuse_commit_write(struct file
spin_unlock(&fc->lock);
if (offset == 0 && to == PAGE_CACHE_SIZE) {
- clear_page_dirty(page);
+ clear_page_dirty(page, 0);
SetPageUptodate(page);
}
}
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index ed2c223..9f82cd0 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct
static void truncate_huge_page(struct page *page)
{
- clear_page_dirty(page);
+ clear_page_dirty(page, 0);
ClearPageUptodate(page);
remove_from_page_cache(page);
put_page(page);
diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c
index b1a1c72..5e29b37 100644
--- a/fs/jfs/jfs_metapage.c
+++ b/fs/jfs/jfs_metapage.c
@@ -773,7 +773,7 @@ #if MPS_PER_PAGE == 1
/* Retest mp->count since we may have released page lock */
if (test_bit(META_discard, &mp->flag) && !mp->count) {
- clear_page_dirty(page);
+ clear_page_dirty(page, 0);
ClearPageUptodate(page);
}
#else
diff --git a/fs/reiserfs/stree.c b/fs/reiserfs/stree.c
index 47e7027..a97e198 100644
--- a/fs/reiserfs/stree.c
+++ b/fs/reiserfs/stree.c
@@ -1459,7 +1459,7 @@ static void unmap_buffers(struct page *p
bh = next;
} while (bh != head);
if (PAGE_SIZE == bh->b_size) {
- clear_page_dirty(page);
+ clear_page_dirty(page, 0);
}
}
}
diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index b56eb75..44ac434 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -343,7 +343,7 @@ xfs_start_page_writeback(
ASSERT(!PageWriteback(page));
set_page_writeback(page);
if (clear_dirty)
- clear_page_dirty(page);
+ clear_page_dirty(page, 0);
unlock_page(page);
if (!buffers) {
end_page_writeback(page);
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 4830a3b..175ab3c 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -253,13 +253,13 @@ #define ClearPageUncached(page) clear_bi
struct page; /* forward declaration */
-int test_clear_page_dirty(struct page *page);
+int test_clear_page_dirty(struct page *page, int must_clean_ptes);
int test_clear_page_writeback(struct page *page);
int test_set_page_writeback(struct page *page);
-static inline void clear_page_dirty(struct page *page)
+static inline void clear_page_dirty(struct page *page, int must_clean_ptes)
{
- test_clear_page_dirty(page);
+ test_clear_page_dirty(page, must_clean_ptes);
}
static inline void set_page_writeback(struct page *page)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 237107c..4ff7f90 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -848,7 +848,7 @@ EXPORT_SYMBOL(set_page_dirty_lock);
* Clear a page's dirty flag, while caring for dirty memory accounting.
* Returns true if the page was previously dirty.
*/
-int test_clear_page_dirty(struct page *page)
+int test_clear_page_dirty(struct page *page, int must_clean_ptes)
{
struct address_space *mapping = page_mapping(page);
unsigned long flags;
@@ -857,6 +857,7 @@ int test_clear_page_dirty(struct page *p
return TestClearPageDirty(page);
write_lock_irqsave(&mapping->tree_lock, flags);
+
if (TestClearPageDirty(page)) {
radix_tree_tag_clear(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
@@ -865,12 +866,23 @@ int test_clear_page_dirty(struct page *p
* We can continue to use `mapping' here because the
* page is locked, which pins the address_space
*/
+
+#if 0
+
if (mapping_cap_account_dirty(mapping)) {
- page_mkclean(page);
+ int cleaned = page_mkclean(page);
+ if (!must_clean_ptes && cleaned){
+ WARN_ON(1);
+ set_page_dirty(page);
+ }
+
dec_zone_page_state(page, NR_FILE_DIRTY);
}
+#endif
+
return 1;
}
+
write_unlock_irqrestore(&mapping->tree_lock, flags);
return 0;
}
diff --git a/mm/rmap.c b/mm/rmap.c
diff --git a/mm/truncate.c b/mm/truncate.c
index 9bfb8e8..9a01d9e 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -70,7 +70,7 @@ truncate_complete_page(struct address_sp
if (PagePrivate(page))
do_invalidatepage(page, 0);
- if (test_clear_page_dirty(page))
+ if (test_clear_page_dirty(page, 0))
task_io_account_cancelled_write(PAGE_CACHE_SIZE);
ClearPageUptodate(page);
ClearPageMappedToDisk(page);
@@ -386,7 +386,7 @@ int invalidate_inode_pages2_range(struct
PAGE_CACHE_SIZE, 0);
}
}
- was_dirty = test_clear_page_dirty(page);
+ was_dirty = test_clear_page_dirty(page, 0);
if (!invalidate_complete_page2(mapping, page)) {
if (was_dirty)
set_page_dirty(page);
diff --git a/fs/buffer.c b/fs/buffer.c
index d1f1b54..263f88e 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *pag
int ret = 0;
BUG_ON(!PageLocked(page));
- if (PageWriteback(page))
+ if (PageDirty(page) || PageWriteback(page))
return 0;
if (mapping == NULL) { /* can this still happen? */
@@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *pag
spin_lock(&mapping->private_lock);
ret = drop_buffers(page, &buffers_to_free);
spin_unlock(&mapping->private_lock);
- if (ret) {
- /*
- * If the filesystem writes its buffers by hand (eg ext3)
- * then we can have clean buffers against a dirty page. We
- * clean the page here; otherwise later reattachment of buffers
- * could encounter a non-uptodate page, which is unresolvable.
- * This only applies in the rare case where try_to_free_buffers
- * succeeds but the page is not freed.
- *
- * Also, during truncate, discard_buffer will have marked all
- * the page's buffers clean. We discover that here and clean
- * the page also.
- */
- if (test_clear_page_dirty(page))
- task_io_account_cancelled_write(PAGE_CACHE_SIZE);
- }
out:
if (buffers_to_free) {
struct buffer_head *bh = buffers_to_free;
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 0f05cab..2d8bbbb 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -1245,7 +1245,7 @@ retry:
wait_on_page_writeback(page);
if (PageWriteback(page) ||
- !test_clear_page_dirty(page)) {
+ !test_clear_page_dirty(page, 0)) {
unlock_page(page);
break;
}
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 1387749..da2bdb1 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -484,7 +484,7 @@ static int fuse_commit_write(struct file
spin_unlock(&fc->lock);
if (offset == 0 && to == PAGE_CACHE_SIZE) {
- clear_page_dirty(page);
+ clear_page_dirty(page, 0);
SetPageUptodate(page);
}
}
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index ed2c223..9f82cd0 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct
static void truncate_huge_page(struct page *page)
{
- clear_page_dirty(page);
+ clear_page_dirty(page, 0);
ClearPageUptodate(page);
remove_from_page_cache(page);
put_page(page);
diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c
index b1a1c72..5e29b37 100644
--- a/fs/jfs/jfs_metapage.c
+++ b/fs/jfs/jfs_metapage.c
@@ -773,7 +773,7 @@ #if MPS_PER_PAGE == 1
/* Retest mp->count since we may have released page lock */
if (test_bit(META_discard, &mp->flag) && !mp->count) {
- clear_page_dirty(page);
+ clear_page_dirty(page, 0);
ClearPageUptodate(page);
}
#else
diff --git a/fs/reiserfs/stree.c b/fs/reiserfs/stree.c
index 47e7027..a97e198 100644
--- a/fs/reiserfs/stree.c
+++ b/fs/reiserfs/stree.c
@@ -1459,7 +1459,7 @@ static void unmap_buffers(struct page *p
bh = next;
} while (bh != head);
if (PAGE_SIZE == bh->b_size) {
- clear_page_dirty(page);
+ clear_page_dirty(page, 0);
}
}
}
diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index b56eb75..44ac434 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -343,7 +343,7 @@ xfs_start_page_writeback(
ASSERT(!PageWriteback(page));
set_page_writeback(page);
if (clear_dirty)
- clear_page_dirty(page);
+ clear_page_dirty(page, 0);
unlock_page(page);
if (!buffers) {
end_page_writeback(page);
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 4830a3b..175ab3c 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -253,13 +253,13 @@ #define ClearPageUncached(page) clear_bi
struct page; /* forward declaration */
-int test_clear_page_dirty(struct page *page);
+int test_clear_page_dirty(struct page *page, int must_clean_ptes);
int test_clear_page_writeback(struct page *page);
int test_set_page_writeback(struct page *page);
-static inline void clear_page_dirty(struct page *page)
+static inline void clear_page_dirty(struct page *page, int must_clean_ptes)
{
- test_clear_page_dirty(page);
+ test_clear_page_dirty(page, must_clean_ptes);
}
static inline void set_page_writeback(struct page *page)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 237107c..e6524a6 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -848,7 +848,7 @@ EXPORT_SYMBOL(set_page_dirty_lock);
* Clear a page's dirty flag, while caring for dirty memory accounting.
* Returns true if the page was previously dirty.
*/
-int test_clear_page_dirty(struct page *page)
+int test_clear_page_dirty(struct page *page, int must_clean_ptes)
{
struct address_space *mapping = page_mapping(page);
unsigned long flags;
@@ -857,20 +857,35 @@ int test_clear_page_dirty(struct page *p
return TestClearPageDirty(page);
write_lock_irqsave(&mapping->tree_lock, flags);
+
if (TestClearPageDirty(page)) {
+
+#if 0
+
radix_tree_tag_clear(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
+
+#endif
+
write_unlock_irqrestore(&mapping->tree_lock, flags);
/*
* We can continue to use `mapping' here because the
* page is locked, which pins the address_space
*/
+
+
if (mapping_cap_account_dirty(mapping)) {
- page_mkclean(page);
+ int cleaned = page_mkclean(page);
+ if (!must_clean_ptes && cleaned){
+ WARN_ON(1);
+ set_page_dirty(page);
+ }
+
dec_zone_page_state(page, NR_FILE_DIRTY);
}
return 1;
}
+
write_unlock_irqrestore(&mapping->tree_lock, flags);
return 0;
}
diff --git a/mm/rmap.c b/mm/rmap.c
diff --git a/mm/truncate.c b/mm/truncate.c
index 9bfb8e8..9a01d9e 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -70,7 +70,7 @@ truncate_complete_page(struct address_sp
if (PagePrivate(page))
do_invalidatepage(page, 0);
- if (test_clear_page_dirty(page))
+ if (test_clear_page_dirty(page, 0))
task_io_account_cancelled_write(PAGE_CACHE_SIZE);
ClearPageUptodate(page);
ClearPageMappedToDisk(page);
@@ -386,7 +386,7 @@ int invalidate_inode_pages2_range(struct
PAGE_CACHE_SIZE, 0);
}
}
- was_dirty = test_clear_page_dirty(page);
+ was_dirty = test_clear_page_dirty(page, 0);
if (!invalidate_complete_page2(mapping, page)) {
if (was_dirty)
set_page_dirty(page);
On Tue, 19 Dec 2006 03:44:51 +0200
Andrei Popa wrote:
> On Mon, 2006-12-18 at 17:21 -0800, Andrew Morton wrote:
> > On Mon, 18 Dec 2006 16:57:30 -0800 (PST)
> > Linus Torvalds wrote:
> >
> > > What happens if you only ifdef out that single thing?
> > >
> > > The actual page-cleaning functions make sure to only clear the TAG_DIRTY
> > > bit _after_ the page has been marked for writeback. Is there some ordering
> > > constraint there, perhaps?
> > >
> > > I'm really reaching here. I'm trying to see the pattern, and I'm not
> > > seeing it. I'm asking you to test things just to get more of a feel for
> > > what triggers the failure, than because I actually have any kind of idea
> > > of what the heck is going on.
> > >
> > > Andrew, Nick, Hugh - any ideas?
> >
> > If all of test_clear_page_dirty() has been commented out then the page will
> > never become clean hence will never fall out of pagecache, so unless Andrei
> > is doing a reboot before checking for corruption, perhaps the underlying
> > data on-disk is incorrect, but we can't see it.
>
> if I do a sync and echo 1 > /proc/sys/vm/drop_caches
OK, that works.
> is the reboot
> still necessary?
It might be necessary to reboot in this case - if we're leaving the
pagecache dirty, writing to drop_caches won't remove it. And you probably
won't be able to get a clean reboot either.
> >
> > Andrei, how _are_ you running this test? What's the exact sequence of steps?
> >
> > In particular, are you doing anything which would cause the corrupted file
> > to be evicted from memory, thus forcing a read from disk? Such as
> > unmounting and then remounting the filesystem?
>
> I boot Linux, I start rtorrent and start the download. While it's
> downloading I start evolution and check my mail (my mbox is very large,
> several hundred megabytes). I close evolution (I use evolution just to
> have another application which uses the filesystem and the memory) and
> start it again. I start firefox. The download completes and
> rtorrent says whether the hash is good or not. I run "unrar t qwe.rar" to
> test that all 84 downloaded rar files are OK and look at the result.
>
> >
> > The point of my question is to check that the data is really incorrect
> > on-disk, or whether it is incorrect in pagecache.
> >
> > Also, it'd be useful if you could determine whether the bug appears with
> > the ext2 filesystem: do s/ext3/ext2/ in /etc/fstab, or boot with
> > rootfstype=ext2 if it's the root filesystem.
>
> I will test.
ok, thanks.
> > > If all of test_clear_page_dirty() has been commented out then the page will
> > > never become clean hence will never fall out of pagecache, so unless Andrei
> > > is doing a reboot before checking for corruption, perhaps the underlying
> > > data on-disk is incorrect, but we can't see it.
> >
> > if I do a sync and echo 1 > /proc/sys/vm/drop_caches
>
> OK, that works.
>
> > is the reboot
> > still necessary?
>
> It might be necessary to reboot in this case - if we're leaving the
> pagecache dirty, writing to drop_caches won't remove it. And you probably
> won't be able to get a clean reboot either.
>
> > >
> > > Andrei, how _are_ you running this test? What's the exact sequence of steps?
> > >
> > > In particular, are you doing anything which would cause the corrupted file
> > > to be evicted from memory, thus forcing a read from disk? Such as
> > > unmounting and then remounting the filesystem?
> >
> > I boot Linux, I start rtorrent and start the download. While it's
> > downloading I start evolution and check my mail (my mbox is very large,
> > several hundred megabytes). I close evolution (I use evolution just to
> > have another application which uses the filesystem and the memory) and
> > start it again. I start firefox. The download completes and
> > rtorrent says whether the hash is good or not. I run "unrar t qwe.rar" to
> > test that all 84 downloaded rar files are OK and look at the result.
> >
> > >
> > > The point of my question is to check that the data is really incorrect
> > > on-disk, or whether it is incorrect in pagecache.
I rebooted and the files are still broken after the reboot (tested twice), so
the data is incorrect on disk.
> > >
> > > Also, it'd be useful if you could determine whether the bug appears with
> > > the ext2 filesystem: do s/ext3/ext2/ in /etc/fstab, or boot with
> > > rootfstype=ext2 if it's the root filesystem.
> >
> > I will test.
I will test in a couple of hours; I have some work to do first.
>
> ok, thanks.
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c 2006-12-19 15:15:46.000000000 +1100
+++ linux-2.6/fs/buffer.c 2006-12-19 15:36:01.000000000 +1100
@@ -2852,7 +2852,17 @@ int try_to_free_buffers(struct page *pag
* This only applies in the rare case where try_to_free_buffers
* succeeds but the page is not freed.
*/
- clear_page_dirty(page);
+
+ /*
+ * If the page has been dirtied via the user mappings, then
+ * clean buffers does not indicate the page data is actually
+ * clean! Only clear the page dirty bit if there are no dirty
+ * ptes either.
+ *
+ * If there are dirty ptes, then the page must be uptodate, so
+ * the above concern does not apply.
+ */
+ clear_page_dirty_sync_ptes(page);
}
out:
if (buffers_to_free) {
Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h 2006-12-19 15:17:18.000000000 +1100
+++ linux-2.6/include/linux/page-flags.h 2006-12-19 15:34:24.000000000 +1100
@@ -254,6 +254,7 @@ static inline void SetPageUptodate(struc
struct page; /* forward declaration */
int test_clear_page_dirty(struct page *page);
+int test_clear_page_dirty_sync_ptes(struct page *page);
int test_clear_page_writeback(struct page *page);
int test_set_page_writeback(struct page *page);
@@ -262,6 +263,11 @@ static inline void clear_page_dirty(stru
test_clear_page_dirty(page);
}
+static inline void clear_page_dirty_sync_ptes(struct page *page)
+{
+ test_clear_page_dirty_sync_ptes(page);
+}
+
static inline void set_page_writeback(struct page *page)
{
test_set_page_writeback(page);
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c 2006-12-19 15:17:53.000000000 +1100
+++ linux-2.6/mm/page-writeback.c 2006-12-19 15:33:29.000000000 +1100
@@ -844,9 +844,10 @@ EXPORT_SYMBOL(set_page_dirty_lock);
/*
* Clear a page's dirty flag, while caring for dirty memory accounting.
+ * Does not clear pte dirty bits.
* Returns true if the page was previously dirty.
*/
-int test_clear_page_dirty(struct page *page)
+static int test_clear_page_dirty_leave_ptes(struct page *page)
{
struct address_space *mapping = page_mapping(page);
unsigned long flags;
@@ -862,10 +863,8 @@ int test_clear_page_dirty(struct page *p
* We can continue to use `mapping' here because the
* page is locked, which pins the address_space
*/
- if (mapping_cap_account_dirty(mapping)) {
- page_mkclean(page);
+ if (mapping_cap_account_dirty(mapping))
dec_zone_page_state(page, NR_FILE_DIRTY);
- }
return 1;
}
write_unlock_irqrestore(&mapping->tree_lock, flags);
@@ -873,9 +872,43 @@ int test_clear_page_dirty(struct page *p
}
return TestClearPageDirty(page);
}
+
+/*
+ * As above, but does clear dirty bits from ptes
+ */
+int test_clear_page_dirty(struct page *page)
+{
+ struct address_space *mapping = page_mapping(page);
+
+ if (test_clear_page_dirty_leave_ptes(page)) {
+ if (mapping_cap_account_dirty(mapping))
+ page_mkclean(page);
+ return 1;
+ }
+ return 0;
+}
EXPORT_SYMBOL(test_clear_page_dirty);
/*
+ * As above, but redirties page if any dirty ptes are found (and then only
+ * if the mapping accounts dirty pages, otherwise dirty ptes are left dirty
+ * but the page is cleaned).
+ */
+int test_clear_page_dirty_sync_ptes(struct page *page)
+{
+ struct address_space *mapping = page_mapping(page);
+
+ if (test_clear_page_dirty_leave_ptes(page)) {
+ if (mapping_cap_account_dirty(mapping)) {
+ if (page_mkclean(page))
+ set_page_dirty(page);
+ }
+ return 1;
+ }
+ return 0;
+}
+
+/*
* Clear a page's dirty flag, while caring for dirty memory accounting.
* Returns true if the page was previously dirty.
*
On Tue, 19 Dec 2006, Nick Piggin wrote:
>
> We never want to drop dirty data! (ignoring the truncate case, which is
> handled privately by truncate anyway)
Bzzt.
SURE we do.
We absolutely do want to drop dirty data in the writeout path.
How do you think dirty data ever _becomes_ clean data?
In other words, yes, we _do_ want to test-and-clear all the pgtable bits
_and_ the PG_dirty bit. We want to do it for:
- writeout
- truncate
- possibly a "drop" event (which could be a case for a journal entry that
becomes stale due to being replaced or something - kind of "truncate"
on metadata)
because both of those events _literally_ turn dirty state into clean
state.
In no other circumstance do we ever want to clear a dirty bit, as far as I
can tell.
Linus
Linus Torvalds wrote:
>
> On Tue, 19 Dec 2006, Nick Piggin wrote:
>
>>We never want to drop dirty data! (ignoring the truncate case, which is
>>handled privately by truncate anyway)
>
>
> Bzzt.
>
> SURE we do.
>
> We absolutely do want to drop dirty data in the writeout path.
>
> How do you think dirty data ever _becomes_ clean data?
I wouldn't have thought it becomes clean by dropping it ;) Is this a
trick question? My answer is that we clean a page by taking some
action such that the underlying data matches the data in RAM...
We don't "drop" any data until it has been cleaned (again, ignoring
things like truncate for a minute). That's a bug! And
try_to_free_buffers() is called from places outside the writeout path.
This is our bug (or at least, one of our bugs that appears to have the
same triggers and symptoms as people are reporting).
[...]
> In no other circumstance do we ever want to clear a dirty bit, as far as I
> can tell.
Exactly. And that is exactly what try_to_free_buffers is doing now.
I still think you should have a look at the patch.
--
SUSE Labs, Novell Inc.
On Tue, 2006-12-19 at 15:36 +1100, Nick Piggin wrote:
> plain text document attachment (fs-fix.patch)
> Index: linux-2.6/fs/buffer.c
> ===================================================================
> --- linux-2.6.orig/fs/buffer.c 2006-12-19 15:15:46.000000000 +1100
> +++ linux-2.6/fs/buffer.c 2006-12-19 15:36:01.000000000 +1100
> @@ -2852,7 +2852,17 @@ int try_to_free_buffers(struct page *pag
> * This only applies in the rare case where try_to_free_buffers
> * succeeds but the page is not freed.
> */
> - clear_page_dirty(page);
> +
> + /*
> + * If the page has been dirtied via the user mappings, then
> + * clean buffers does not indicate the page data is actually
> + * clean! Only clear the page dirty bit if there are no dirty
> + * ptes either.
> + *
> + * If there are dirty ptes, then the page must be uptodate, so
> + * the above concern does not apply.
> + */
> + clear_page_dirty_sync_ptes(page);
> }
> out:
> if (buffers_to_free) {
> Index: linux-2.6/include/linux/page-flags.h
> ===================================================================
> --- linux-2.6.orig/include/linux/page-flags.h 2006-12-19 15:17:18.000000000 +1100
> +++ linux-2.6/include/linux/page-flags.h 2006-12-19 15:34:24.000000000 +1100
> @@ -254,6 +254,7 @@ static inline void SetPageUptodate(struc
> struct page; /* forward declaration */
>
> int test_clear_page_dirty(struct page *page);
> +int test_clear_page_dirty_sync_ptes(struct page *page);
> int test_clear_page_writeback(struct page *page);
> int test_set_page_writeback(struct page *page);
>
> @@ -262,6 +263,11 @@ static inline void clear_page_dirty(stru
> test_clear_page_dirty(page);
> }
>
> +static inline void clear_page_dirty_sync_ptes(struct page *page)
> +{
> + test_clear_page_dirty_sync_ptes(page);
> +}
> +
> static inline void set_page_writeback(struct page *page)
> {
> test_set_page_writeback(page);
> Index: linux-2.6/mm/page-writeback.c
> ===================================================================
> --- linux-2.6.orig/mm/page-writeback.c 2006-12-19 15:17:53.000000000 +1100
> +++ linux-2.6/mm/page-writeback.c 2006-12-19 15:33:29.000000000 +1100
> @@ -844,9 +844,10 @@ EXPORT_SYMBOL(set_page_dirty_lock);
>
> /*
> * Clear a page's dirty flag, while caring for dirty memory accounting.
> + * Does not clear pte dirty bits.
> * Returns true if the page was previously dirty.
> */
> -int test_clear_page_dirty(struct page *page)
> +static int test_clear_page_dirty_leave_ptes(struct page *page)
> {
> struct address_space *mapping = page_mapping(page);
> unsigned long flags;
> @@ -862,10 +863,8 @@ int test_clear_page_dirty(struct page *p
> * We can continue to use `mapping' here because the
> * page is locked, which pins the address_space
> */
> - if (mapping_cap_account_dirty(mapping)) {
> - page_mkclean(page);
> + if (mapping_cap_account_dirty(mapping))
> dec_zone_page_state(page, NR_FILE_DIRTY);
> - }
> return 1;
> }
> write_unlock_irqrestore(&mapping->tree_lock, flags);
> @@ -873,9 +872,43 @@ int test_clear_page_dirty(struct page *p
> }
> return TestClearPageDirty(page);
> }
> +
> +/*
> + * As above, but does clear dirty bits from ptes
> + */
> +int test_clear_page_dirty(struct page *page)
> +{
> + struct address_space *mapping = page_mapping(page);
> +
> + if (test_clear_page_dirty_leave_ptes(page)) {
> + if (mapping_cap_account_dirty(mapping))
> + page_mkclean(page);
> + return 1;
> + }
> + return 0;
> +}
> EXPORT_SYMBOL(test_clear_page_dirty);
>
> /*
> + * As above, but redirties page if any dirty ptes are found (and then only
> + * if the mapping accounts dirty pages, otherwise dirty ptes are left dirty
> + * but the page is cleaned).
> + */
> +int test_clear_page_dirty_sync_ptes(struct page *page)
> +{
> + struct address_space *mapping = page_mapping(page);
> +
> + if (test_clear_page_dirty_leave_ptes(page)) {
> + if (mapping_cap_account_dirty(mapping)) {
> + if (page_mkclean(page))
> + set_page_dirty(page);
> + }
> + return 1;
> + }
> + return 0;
> +}
> +
> +/*
> * Clear a page's dirty flag, while caring for dirty memory accounting.
> * Returns true if the page was previously dirty.
> *
Hmm, not quite. It certainly looks better than the extra ,[01] tagged onto
test_clear_page_dirty(), though. I would have expected it the
other way around: test_clear_page_dirty_sync_ptes() as the default
case and test_clear_page_dirty_clean_ptes() used in
clear_page_dirty_for_io().
Anyway, it has the same issue as the others. See what happens when you
run test_clear_page_dirty_sync_ptes() twice in a row: you still lose
PG_dirty even though the page might actually be dirty.
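The double-call problem can be reproduced in a small userspace model. This is only a sketch, not kernel code: `struct model_page`, `model_page_mkclean()` and `model_clear_dirty_sync_ptes()` are hypothetical stand-ins, and the model assumes a single mapping PTE, a mapping that accounts dirty pages, and a page_mkclean() that skips PTEs which are neither dirty nor writable.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one page cache page mapped by exactly one PTE. */
struct model_page {
	bool pg_dirty;   /* stands in for PG_dirty */
	bool pte_dirty;  /* stands in for the dirty bit of the mapping PTE */
	bool pte_write;  /* stands in for the write permission of that PTE */
};

/*
 * Stand-in for page_mkclean(): clean and write-protect the PTE.
 * Returns the number of PTEs that were changed (0 or 1 here).
 */
static int model_page_mkclean(struct model_page *p)
{
	if (!p->pte_dirty && !p->pte_write)
		return 0;
	p->pte_dirty = false;
	p->pte_write = false;
	return 1;
}

/* Stand-in for test_clear_page_dirty_sync_ptes() from the patch. */
static int model_clear_dirty_sync_ptes(struct model_page *p)
{
	if (!p->pg_dirty)
		return 0;
	p->pg_dirty = false;           /* TestClearPageDirty() */
	if (model_page_mkclean(p))
		p->pg_dirty = true;    /* set_page_dirty() */
	return 1;
}

/* PG_dirty state after two back-to-back calls with no writeout between. */
bool dirty_after_two_calls(void)
{
	struct model_page p = {
		.pg_dirty = true, .pte_dirty = true, .pte_write = true,
	};

	model_clear_dirty_sync_ptes(&p);
	assert(p.pg_dirty);            /* first call: page was redirtied */
	model_clear_dirty_sync_ptes(&p);
	return p.pg_dirty;             /* second call: nothing redirties it */
}
```

In this model the second call finds a clean, write-protected PTE, so nothing sets PG_dirty again even though no writeout has happened in between.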
On Tue, 19 Dec 2006, Nick Piggin wrote:
>
> I wouldn't have thought it becomes clean by dropping it ;) Is this a
> trick question? My answer is that we clean a page by taking some
> action such that the underlying data matches the data in RAM...
Sure.
> We don't "drop" any data until it has been cleaned (again, ignoring
> things like truncate for a minute). That's a bug!
Actually, it's the other way around. We have to drop the dirty bits BEFORE
cleaning. If we clean first, and _then_ drop the dirty bits, THAT is a
bug, because the dirty bits can now refer to _new_ dirty data that didn't
get written out.
So the proper sequence is _literally_ to mark the page clean FIRST. Drop
all the dirty bits, but not the _data_ obviously (ie you have a reference
to the page). And _then_ you do the writeout to actually clean the data
itself.
So you actually state it exactly the wrong way around.
We MUST clear the dirty bits before we do the IO that actually cleans the
data. Exactly because if new writes keep on happening, if we do it in the
other order, we'll drop dirty data on the floor.
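The ordering argument can be illustrated with a toy userspace model. It is a sketch under stated assumptions, not kernel code: `struct io_page` and the functions below are hypothetical, the single `dirty` flag stands in for PG_dirty and the pte dirty bits together, and the racing write is forced to land in the middle of the IO.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: one page with its RAM contents, disk contents, dirty state. */
struct io_page {
	int  mem;    /* contents in RAM */
	int  disk;   /* contents on disk */
	bool dirty;  /* combined PG_dirty / pte dirty state */
};

/* A write through any path: changes RAM and marks the page dirty. */
static void model_write(struct io_page *p, int value)
{
	p->mem = value;
	p->dirty = true;
}

/* Clear the dirty state first, then do the IO. */
void writeout_clear_first(struct io_page *p, int racing_value)
{
	int snapshot;

	p->dirty = false;
	snapshot = p->mem;             /* IO starts with the current data */
	model_write(p, racing_value);  /* racing write during the IO */
	p->disk = snapshot;            /* IO completes */
}

/* Do the IO first, then clear the dirty state. */
void writeout_clear_last(struct io_page *p, int racing_value)
{
	int snapshot = p->mem;

	model_write(p, racing_value);
	p->disk = snapshot;
	p->dirty = false;              /* wipes the racing write's dirty state */
}

/* A page is safe if it is marked dirty or RAM and disk agree. */
bool page_is_safe(const struct io_page *p)
{
	return p->dirty || p->mem == p->disk;
}

/* Runs both orderings; returns 0 when the model behaves as described. */
int ordering_demo(void)
{
	struct io_page a = { .mem = 1, .disk = 0, .dirty = true };
	struct io_page b = { .mem = 1, .disk = 0, .dirty = true };

	writeout_clear_first(&a, 2);
	writeout_clear_last(&b, 2);
	assert(page_is_safe(&a));      /* still dirty, will be written again */
	assert(!page_is_safe(&b));     /* new data in RAM, page claims clean */
	return 0;
}
```

With the dirty state cleared first, the racing write leaves the page dirty and it gets written again later. With the dirty state cleared last, the page looks clean while RAM and disk differ.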
> > In no other circumstance do we ever want to clear a dirty bit, as far as I
> > can tell.
>
> Exactly. And that is exactly what try_to_free_buffers is doing now.
>
> I still think you should have a look at the patch.
I claim that dropping dirty bits AFTER the IO is always wrong.
Try_to_free_buffers() must never touch the dirty bits at all, because by
definition that thing happens after the IO has actually been done.
And yes, I looked at your patch. And it looks a million times cleaner
than Andrew's patch. However, it's already been tested multiple times, and
totally REMOVING the "clear_page_dirty()" from try_to_free_buffers() still
resulted in the corruption.
That said, I think your patch is worth it just as a cleanup. Much nicer
than Andrew's code, also from a naming standpoint. So I'm not actually
disagreeing about the patch itself, but I _am_ saying that I don't
actually see the point of ever moving the dirty bits around.
So I repeat: we have the case where we really want to _remove_ the dirty
bits (because we're going to write the current state of the page to disk,
and we need to clear the dirty bits BEFORE we do that). That's the one
that makes sense, and that's the code we want to run before doing IO. It's
the "clear_dirty_bits_for_io()" case.
The code that doesn't make sense is the "shuffle the dirty bits around". In
other words: when does it actually make sense to call your
(well-implemented, don't get me wrong) "test_clear_page_dirty_sync_ptes()"
function? It doesn't _fix_ anything. It just shuffles dirty bits from one
place to another. What was the point again?
If the point is "try_to_free_buffers()", then my argument was that I had a
much simpler solution: "Just don't do that then". My simple patch sadly
didn't fix the data corruption, so the data corruption comes from
something ELSE than try_to_free_buffers().
Linus
On Mon, 2006-12-18 at 11:18 -0800, Linus Torvalds wrote:
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index d8a842a..3f9061e 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -448,7 +448,7 @@ static int page_mkclean_one(struct page
> > goto unlock;
> >
> > entry = ptep_get_and_clear(mm, address, pte);
> > - entry = pte_mkclean(entry);
> > + /*entry = pte_mkclean(entry);*/
> > entry = pte_wrprotect(entry);
> > ptep_establish(vma, address, pte, entry);
> > lazy_mmu_prot_update(entry);
>
> The above patch is bad. It's always going to hide the bug, but it hides it
> by just not doing anything at all.
Not quite, it does wrprotect still, so further updates will trigger the
do_wp_page() path and call set_page_dirty().
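A toy userspace model of that behaviour, as a sketch only: `struct wp_page`, `model_mkcw()` and `model_store()` are hypothetical names, the model assumes a single mapping PTE, and the write-fault handling is reduced to the two effects that matter here (marking the page dirty and restoring write permission).

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one page mapped by exactly one PTE. */
struct wp_page {
	bool pg_dirty;   /* stands in for PG_dirty */
	bool pte_dirty;  /* dirty bit of the mapping PTE */
	bool pte_write;  /* write permission of the mapping PTE */
};

/*
 * Write-protect the PTE and, only if make_clean is set, also clear its
 * dirty bit. Returns the number of PTEs changed (0 or 1 in this model).
 */
int model_mkcw(struct wp_page *p, bool make_clean)
{
	if (!p->pte_dirty && !p->pte_write)
		return 0;
	if (make_clean)
		p->pte_dirty = false;
	p->pte_write = false;
	return 1;
}

/*
 * A userspace store through the mapping. Returns true if the store had to
 * take a write fault because the PTE was write-protected.
 */
bool model_store(struct wp_page *p)
{
	bool faulted = !p->pte_write;

	if (faulted) {
		/* simplified stand-in for the do_wp_page() path */
		p->pg_dirty = true;    /* set_page_dirty() */
		p->pte_write = true;
	}
	p->pte_dirty = true;           /* hardware marks the PTE dirty */
	return faulted;
}

/* Returns 0 when write-protecting alone keeps dirty tracking working. */
int wrprotect_demo(void)
{
	struct wp_page p = {
		.pg_dirty = false, .pte_dirty = true, .pte_write = true,
	};

	assert(model_mkcw(&p, false) == 1);
	assert(p.pte_dirty && !p.pte_write);  /* dirty bit kept, write lost */
	assert(model_store(&p));              /* next store faults */
	assert(p.pg_dirty);                   /* and redirties the page */
	return 0;
}
```

In the model, leaving the pte dirty bit alone while write-protecting still forces the next store through the fault path, which is what keeps PG_dirty tracking alive.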
So we could make 'something' that would keep the tracking working and
not create corruption, say something like this:
However I'll try and figure out how we get so terribly confused on the
PG_dirty state that we have to clean it and fall back to pte_dirty. That
is the real issue we have.
---
include/linux/rmap.h | 6 ++++++
mm/page-writeback.c | 3 ++-
mm/rmap.c | 23 ++++++++++++++++++-----
3 files changed, 26 insertions(+), 6 deletions(-)
Index: linux-2.6-git/mm/rmap.c
===================================================================
--- linux-2.6-git.orig/mm/rmap.c 2006-12-18 11:06:29.000000000 +0100
+++ linux-2.6-git/mm/rmap.c 2006-12-19 08:33:57.000000000 +0100
@@ -428,7 +428,8 @@ int page_referenced(struct page *page, i
return referenced;
}
-static int page_mkclean_one(struct page *page, struct vm_area_struct *vma)
+static int page_mkcw_one(struct page *page,
+ struct vm_area_struct *vma, int make_clean)
{
struct mm_struct *mm = vma->vm_mm;
unsigned long address;
@@ -448,7 +449,8 @@ static int page_mkclean_one(struct page
goto unlock;
entry = ptep_get_and_clear(mm, address, pte);
- entry = pte_mkclean(entry);
+ if (make_clean)
+ entry = pte_mkclean(entry);
entry = pte_wrprotect(entry);
ptep_establish(vma, address, pte, entry);
lazy_mmu_prot_update(entry);
@@ -460,7 +462,8 @@ out:
return ret;
}
-static int page_mkclean_file(struct address_space *mapping, struct page *page)
+static int page_mkcw_file(struct address_space *mapping,
+ struct page *page, int make_clean)
{
pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
struct vm_area_struct *vma;
@@ -478,7 +481,7 @@ static int page_mkclean_file(struct addr
return ret;
}
-int page_mkclean(struct page *page)
+static int page_mkcw(struct page *page, int make_clean)
{
int ret = 0;
@@ -487,12 +490,22 @@ int page_mkclean(struct page *page)
if (page_mapped(page)) {
struct address_space *mapping = page_mapping(page);
if (mapping)
- ret = page_mkclean_file(mapping, page);
+ ret = page_mkcw_file(mapping, page, make_clean);
}
return ret;
}
+int page_mkclean(struct page *page)
+{
+ return page_mkcw(page, 1);
+}
+
+int page_wrprotect(struct page *page)
+{
+ return page_mkcw(page, 0);
+}
+
/**
* page_set_anon_rmap - setup new anonymous rmap
* @page: the page to add the mapping to
Index: linux-2.6-git/include/linux/rmap.h
===================================================================
--- linux-2.6-git.orig/include/linux/rmap.h 2006-12-19 08:31:59.000000000 +0100
+++ linux-2.6-git/include/linux/rmap.h 2006-12-19 08:32:28.000000000 +0100
@@ -110,6 +110,7 @@ unsigned long page_address_in_vma(struct
* returns the number of cleaned PTEs.
*/
int page_mkclean(struct page *);
+int page_wrprotect(struct page *);
#else /* !CONFIG_MMU */
@@ -125,6 +126,11 @@ static inline int page_mkclean(struct pa
return 0;
}
+static inline int page_wrprotect(struct page *page)
+{
+ return 0;
+}
+
#endif /* CONFIG_MMU */
Index: linux-2.6-git/mm/page-writeback.c
===================================================================
--- linux-2.6-git.orig/mm/page-writeback.c 2006-12-19 08:24:48.000000000 +0100
+++ linux-2.6-git/mm/page-writeback.c 2006-12-19 08:31:43.000000000 +0100
@@ -872,7 +872,8 @@ int test_clear_page_dirty(struct page *p
* page is locked, which pins the address_space
*/
if (mapping_cap_account_dirty(mapping)) {
- page_mkclean(page);
+ if (page_wrprotect(page))
+ set_page_dirty(page);
dec_zone_page_state(page, NR_FILE_DIRTY);
}
return 1;
Peter Zijlstra wrote:
> On Tue, 2006-12-19 at 15:36 +1100, Nick Piggin wrote:
>
>
>>plain text document attachment (fs-fix.patch)
>>Index: linux-2.6/fs/buffer.c
>>===================================================================
>>--- linux-2.6.orig/fs/buffer.c 2006-12-19 15:15:46.000000000 +1100
>>+++ linux-2.6/fs/buffer.c 2006-12-19 15:36:01.000000000 +1100
>>@@ -2852,7 +2852,17 @@ int try_to_free_buffers(struct page *pag
>> * This only applies in the rare case where try_to_free_buffers
>> * succeeds but the page is not freed.
>> */
>>- clear_page_dirty(page);
>>+
>>+ /*
>>+ * If the page has been dirtied via the user mappings, then
>>+ * clean buffers does not indicate the page data is actually
>>+ * clean! Only clear the page dirty bit if there are no dirty
>>+ * ptes either.
>>+ *
>>+ * If there are dirty ptes, then the page must be uptodate, so
>>+ * the above concern does not apply.
>>+ */
>>+ clear_page_dirty_sync_ptes(page);
>> }
>> out:
>> if (buffers_to_free) {
>>Index: linux-2.6/include/linux/page-flags.h
>>===================================================================
>>--- linux-2.6.orig/include/linux/page-flags.h 2006-12-19 15:17:18.000000000 +1100
>>+++ linux-2.6/include/linux/page-flags.h 2006-12-19 15:34:24.000000000 +1100
>>@@ -254,6 +254,7 @@ static inline void SetPageUptodate(struc
>> struct page; /* forward declaration */
>>
>> int test_clear_page_dirty(struct page *page);
>>+int test_clear_page_dirty_sync_ptes(struct page *page);
>> int test_clear_page_writeback(struct page *page);
>> int test_set_page_writeback(struct page *page);
>>
>>@@ -262,6 +263,11 @@ static inline void clear_page_dirty(stru
>> test_clear_page_dirty(page);
>> }
>>
>>+static inline void clear_page_dirty_sync_ptes(struct page *page)
>>+{
>>+ test_clear_page_dirty_sync_ptes(page);
>>+}
>>+
>> static inline void set_page_writeback(struct page *page)
>> {
>> test_set_page_writeback(page);
>>Index: linux-2.6/mm/page-writeback.c
>>===================================================================
>>--- linux-2.6.orig/mm/page-writeback.c 2006-12-19 15:17:53.000000000 +1100
>>+++ linux-2.6/mm/page-writeback.c 2006-12-19 15:33:29.000000000 +1100
>>@@ -844,9 +844,10 @@ EXPORT_SYMBOL(set_page_dirty_lock);
>>
>> /*
>> * Clear a page's dirty flag, while caring for dirty memory accounting.
>>+ * Does not clear pte dirty bits.
>> * Returns true if the page was previously dirty.
>> */
>>-int test_clear_page_dirty(struct page *page)
>>+static int test_clear_page_dirty_leave_ptes(struct page *page)
>> {
>> struct address_space *mapping = page_mapping(page);
>> unsigned long flags;
>>@@ -862,10 +863,8 @@ int test_clear_page_dirty(struct page *p
>> * We can continue to use `mapping' here because the
>> * page is locked, which pins the address_space
>> */
>>- if (mapping_cap_account_dirty(mapping)) {
>>- page_mkclean(page);
>>+ if (mapping_cap_account_dirty(mapping))
>> dec_zone_page_state(page, NR_FILE_DIRTY);
>>- }
>> return 1;
>> }
>> write_unlock_irqrestore(&mapping->tree_lock, flags);
>>@@ -873,9 +872,43 @@ int test_clear_page_dirty(struct page *p
>> }
>> return TestClearPageDirty(page);
>> }
>>+
>>+/*
>>+ * As above, but does clear dirty bits from ptes
>>+ */
>>+int test_clear_page_dirty(struct page *page)
>>+{
>>+ struct address_space *mapping = page_mapping(page);
>>+
>>+ if (test_clear_page_dirty_leave_ptes(page)) {
>>+ if (mapping_cap_account_dirty(mapping))
>>+ page_mkclean(page);
>>+ return 1;
>>+ }
>>+ return 0;
>>+}
>> EXPORT_SYMBOL(test_clear_page_dirty);
>>
>> /*
>>+ * As above, but redirties page if any dirty ptes are found (and then only
>>+ * if the mapping accounts dirty pages, otherwise dirty ptes are left dirty
>>+ * but the page is cleaned).
>>+ */
>>+int test_clear_page_dirty_sync_ptes(struct page *page)
>>+{
>>+ struct address_space *mapping = page_mapping(page);
>>+
>>+ if (test_clear_page_dirty_leave_ptes(page)) {
>>+ if (mapping_cap_account_dirty(mapping)) {
>>+ if (page_mkclean(page))
>>+ set_page_dirty(page);
>>+ }
>>+ return 1;
>>+ }
>>+ return 0;
>>+}
>>+
>>+/*
>> * Clear a page's dirty flag, while caring for dirty memory accounting.
>> * Returns true if the page was previously dirty.
>> *
>
>
> Hmm, not quite; it certainly looks better than the extra ,[01] tagged to
> test_clear_page_dirty() though. Although I would have expected it the
> other way around - test_clear_pages_dirty_sync_ptes to be the default
> case and test_clear_pages_dirty_clean_ptes to be used in
> clear_page_dirty_for_io().
>
> Anyway it has the same issues as the others. See what happens when you
> run two test_clear_page_dirty_sync_ptes() consecutively, you still lose
> PG_dirty even though the page might actually be dirty.
How can this happen? We'll only test_clear_page_dirty_sync_ptes again
after buffers have been reattached, and subsequently cleaned. And in
that case if the ptes are still clean at this point then the page really
is clean.
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
> > > Also, it'd be useful if you could determine whether the bug appears with
> > > the ext2 filesystem: do s/ext3/ext2/ in /etc/fstab, or boot with
> > > rootfstype=ext2 if it's the root filesystem.
> >
I have file corruption.
On Mon, 18 Dec 2006, Linus Torvalds wrote:
>
> The code that doesn't make sense is the "shuffle the dirty bits around". In
> other words: when does it actually make sense to call your
> (well-implemented, don't get me wrong) "test_clear_page_dirty_sync_ptes()"
> function? It doesn't _fix_ anything. It just shuffles dirty bits from one
> place to another. What was the point again?
Let me try to phrase that another way, in terms that you defined.
In other words, look at your test_clear_page_dirty_sync_ptes() function.
First, start out from the _inner_ part, the:
	if (mapping_cap_account_dirty(mapping)) {
		if (page_mkclean(page))
			set_page_dirty(page);
	}
part.
This is the one that both you and I agree is a "working" situation: we are
moving the dirty bits from the pte into the "struct page", and we both
agree that this is fine. No dirty bits get lost. You even make a BIG DEAL
about the fact that no dirty bits get lost.
So begin by just explaining:
- why do it?
Why shuffle the dirty bits around? Why not just _leave_ the PG_dirty bit
on the "struct page", and simply leave it all at that? I agree that the
above doesn't lose any dirty bits, but what I'm asking for is WHAT IS THE
POINT?
So that is the code that we both agree "works", but I personally don't see
the _point_ of. However, that's actually not even important, because I
don't even care about the point. I wanted to bring that up just in order
to then ignore it, and look at the stuff _around_ it, namely the other
part in "test_clear_page_dirty_sync_ptes()":
	int test_clear_page_dirty_sync_ptes(struct page *page)
	{
		if (test_clear_page_dirty_leave_ptes(page)) {
			.. do the inner part ..
			return 1;
		}
		return 0;
	}
Now, the above is the OUTER part. Please realize that this DOES actually
drop the PG_dirty bit. So ignore the inner part entirely (which is a no-op
for the case where the page isn't mapped), and explain to me why it's ok
to DROP the dirty bit in the outer part, when you tried to say that it was
NOT ok to drop it in the inner part?
NOTICE? First you make a BIG DEAL about how dirty bits should never get
lost, but THE VERY SAME FUNCTION actually very much on purpose DOES drop
the dirty bit for when it's not in the page tables.
In fact, if you just call that function twice, the first time it will
MOVE the dirty bits from the PTE to the "struct page *", and the _second_
time it will just clear the dirty bit from the "struct page *". You end up
with a clean page. It returned the same return value BOTH TIMES, even
though it did two very different things (once just moving dirty bits
around, and the second time actually _removing_ the dirty bit entirely).
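The double call described above can be replayed in user space. What follows is only a toy sketch, not kernel code: every toy_* name is made up for illustration, the two booleans stand in for PG_dirty and a single pte dirty bit, and mapping_cap_account_dirty() is assumed to be true throughout.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: one page-level dirty flag and one pte-level dirty flag. */
struct toy_page {
	bool pg_dirty;	/* stands in for PG_dirty in "struct page" */
	bool pte_dirty;	/* stands in for the dirty bit in one mapping pte */
};

/* Models the outer part: clear the page flag, return its old value. */
int toy_test_clear_page_dirty_leave_ptes(struct toy_page *p)
{
	bool was_dirty = p->pg_dirty;

	p->pg_dirty = false;
	return was_dirty;
}

/* Models page_mkclean(): clear the pte dirty bit, report whether it was set. */
int toy_page_mkclean(struct toy_page *p)
{
	bool was_dirty = p->pte_dirty;

	p->pte_dirty = false;
	return was_dirty;
}

/* Same shape as the function under discussion. */
int toy_test_clear_page_dirty_sync_ptes(struct toy_page *p)
{
	if (toy_test_clear_page_dirty_leave_ptes(p)) {
		if (toy_page_mkclean(p))
			p->pg_dirty = true;	/* set_page_dirty() */
		return 1;
	}
	return 0;
}

/*
 * Calls the function twice on a page that is dirty in both places.
 * Returns 1 if no dirty state is left anywhere afterwards.
 */
int toy_double_call_loses_dirty(void)
{
	struct toy_page p = { .pg_dirty = true, .pte_dirty = true };
	int first = toy_test_clear_page_dirty_sync_ptes(&p);
	int second = toy_test_clear_page_dirty_sync_ptes(&p);

	/* Both calls report "was dirty", although they did different things. */
	assert(first == 1 && second == 1);
	return !p.pg_dirty && !p.pte_dirty;
}
```

In this model the first call only moves the dirty state from the pte to the page flag, and the second call drops it for good, so toy_double_call_loses_dirty() returns 1.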
Again, I have a very simple claim: I claim that NONE of the
"test_clear_page_dirty()" functions make any sense what-so-ever. They're
all wrong.
The "funny" part is, that the only thing that Andrei reports actually
fixed his corruption (apart from the patch that just stops removing the
dirty bits from the PTE's _entirely_) is actually the part where he had an
"#if 0 .. #endif" around basically _all_ of the "test_clear_page_dirty()"
function (ie he had mis-understood what I asked for, and put it outside
the _outer_ if(), rather than putting it around the inner one).
So I claim:
- there is ONE and only ONE place where you can really drop the dirty
bits: it's when you're going to immediately afterwards do a writeout.
This is the "clear_page_dirty_for_io()"
- all the other "[test_and_]clear_dirty*()" functions seem to be outright
buggy and bogus. Shuffling dirty bits around from the page tables to
the "struct page *" (after having _cleared_ that "very important"
PG_dirty bit just before - apparently it wasn't that important after
all, was it?) is insane.
Nobody has actually ever explained why "test_clear_page_dirty()" is good
at all.
- Why is it ever used instead of "clear_page_dirty_for_io()"?
- What is the difference?
- Why would you EVER want to clear bits just in the "struct page *" or
just in the PTE's?
- Why is it EVER correct to clear dirty bits except JUST BEFORE THE IO?
In other words, I have a theory:
"A lot of this is actually historical cruft. Some of it may even be code
that was never supposed to work, but because we maintained _other_ dirty
bits in the PTE's, and never touched them before, we never even realized
that the code that played with PG_dirty was totally insane"
Now, that's just a theory. And yeah, it may be stated a bit provocatively.
It may not be entirely correct. I'm just saying.. maybe it is?
And yes, we actually really _do_ have a data-point from Andrei that says
that if you just make "test_clear_page_dirty()" a no-op, the corruption
goes away. It was unintentional, but hey, it's a real datapoint.
See the email from Andrei:
Subject: Re: 2.6.19 file content corruption on ext3
From: Andrei Popa <[email protected]>
Date: Tue, 19 Dec 2006 01:48:11 +0200
Message-Id: <1166485691.6977.6.camel@localhost>
and look at what remains of his "test_clear_page_dirty()".
Scary, isn't it? And a big hint that "test_clear_page_dirty()" is just
totally BOGUS.
And the thing is, I think it's bogus just because I don't understand why
it would EVER be ok to drop those dirty bits _except_ very much just
before doing the IO that makes it non-dirty (where "truncate()" is really
a special case where the IO ends up being not done, but it's the same kind
of situation).
Linus
On Tue, 19 Dec 2006, Nick Piggin wrote:
> >
> > Anyway it has the same issues as the others. See what happens when you
> > run two test_clear_page_dirty_sync_ptes() consecutively, you still lose
> > PG_dirty even though the page might actually be dirty.
>
> How can this happen? We'll only test_clear_page_dirty_sync_ptes again
> after buffers have been reattached, and subsequently cleaned. And in
> that case if the ptes are still clean at this point then the page really
> is clean.
Why do you talk about buffers being reattached? Are you still in some
world where "try_to_free_buffers()" matters? Have you not followed the
discussion? Why do you ignore my MUCH SIMPLER patch that just removed all
this crap ENTIRELY from "try_to_free_buffers()", and the exact same
corruption happened?
Forget about "try_to_free_buffers()". Please apply this patch to your tree
first. That gets rid of _one_ copy of totally insane code that did all the
wrong things.
Only after you have applied this patch should you look at the code again.
Realizing that the corruption still happens.
So forget about buffers already. That piece of code was crap.
Linus
---
diff --git a/fs/buffer.c b/fs/buffer.c
index d1f1b54..263f88e 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *page)
int ret = 0;
BUG_ON(!PageLocked(page));
- if (PageWriteback(page))
+ if (PageDirty(page) || PageWriteback(page))
return 0;
if (mapping == NULL) { /* can this still happen? */
@@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *page)
spin_lock(&mapping->private_lock);
ret = drop_buffers(page, &buffers_to_free);
spin_unlock(&mapping->private_lock);
- if (ret) {
- /*
- * If the filesystem writes its buffers by hand (eg ext3)
- * then we can have clean buffers against a dirty page. We
- * clean the page here; otherwise later reattachment of buffers
- * could encounter a non-uptodate page, which is unresolvable.
- * This only applies in the rare case where try_to_free_buffers
- * succeeds but the page is not freed.
- *
- * Also, during truncate, discard_buffer will have marked all
- * the page's buffers clean. We discover that here and clean
- * the page also.
- */
- if (test_clear_page_dirty(page))
- task_io_account_cancelled_write(PAGE_CACHE_SIZE);
- }
out:
if (buffers_to_free) {
struct buffer_head *bh = buffers_to_free;
On Tue, 19 Dec 2006 10:05:03 +0200
Andrei Popa <[email protected]> wrote:
> > > > Also, it'd be useful if you could determine whether the bug appears with
> > > > the ext2 filesystem: do s/ext3/ext2/ in /etc/fstab, or boot with
> > > > rootfstype=ext2 if it's the root filesystem.
> > >
> I have file corruption.
Wow. I didn't expect that, because Marc Haber reported that ext3's data=writeback
fixed it. Maybe he didn't run it for long enough?
On 12/19/06, Andrew Morton <[email protected]> wrote:
> Wow. I didn't expect that, because Marc Haber reported that ext3's data=writeback
> fixed it. Maybe he didn't run it for long enough?
I don't think it did fix it for Marc:
http://marc.theaimsgroup.com/?l=linux-kernel&m=116625777306843&w=2
On Sun, Dec 17, 2006 at 09:43:08PM -0800, Andrew Morton wrote:
> Six hours here of fsx-linux plus high memory pressure on SMP on 1k
> blocksize ext3, mainline. Zero failures. It's unlikely that this testing
> would pass, yet people running normal workloads are able to easily trigger
> failures. I suspect we're looking in the wrong place.
I do not have a clue about memory management at all, but is it
possible that you're testing on a box with too much memory? My box has
only 256 MB, and I used to use mutt with a _huge_ inbox, with mutt
taking roughly 150 MB. Add spamassassin and a reasonably busy mail
server, and the box used to be about 150 MB into swap.
I have tidied my inbox in the meantime, and mutt's memory requirement
has dropped to roughly 30 MB, which might be why I
don't see the issue that often any more.
Greetings
Marc, just trying to give input
--
-----------------------------------------------------------------------------
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Mannheim, Germany | lose things." Winona Ryder | Fon: *49 621 72739834
Nordisch by Nature | How to make an American Quilt | Fax: *49 621 72739835
On Tue, 2006-12-19 at 00:04 -0800, Linus Torvalds wrote:
> Nobody has actually ever explained why "test_clear_page_dirty()" is good
> at all.
>
> - Why is it ever used instead of "clear_page_dirty_for_io()"?
>
> - What is the difference?
>
> - Why would you EVER want to clear bits just in the "struct page *" or
> just in the PTE's?
>
> - Why is it EVER correct to clear dirty bits except JUST BEFORE THE IO?
>
> In other words, I have a theory:
>
> "A lot of this is actually historical cruft. Some of it may even be code
> that was never supposed to work, but because we maintained _other_ dirty
> bits in the PTE's, and never touched them before, we never even realized
> that the code that played with PG_dirty was totally insane"
>
> Now, that's just a theory. And yeah, it may be stated a bit provocatively.
> It may not be entirely correct. I'm just saying.. maybe it is?
On Sun, 2006-12-17 at 15:40 -0800, Andrew Morton wrote:
> try_to_free_buffers() clears the page's dirty state if it successfully removed
> the page's buffers.
>
> Background for this:
>
> - a process does a one-byte-write to a file on a 64k pagesize, 4k
> blocksize ext3 filesystem. The page is now PageDirty, !PageUptodate and
> has one dirty buffer and 15 not uptodate buffers.
>
> - kjournald writes the dirty buffer. The page is now PageDirty,
> !PageUptodate and has a mix of clean and not uptodate buffers.
>
> - try_to_free_buffers() removes the page's buffers. It MUST now clear
> PageDirty. If we were to leave the page dirty then we'd have a dirty, not
> uptodate page with no buffer_heads.
>
> We're screwed: we cannot write the page because we don't know which
> sections of it contain garbage. We cannot read the page because we don't
> know which sections of it contain modified data. We cannot free the page
> because it is dirty.
However!! this is not true for mapped pages because mapped pages must
have the whole (16k in akpm's example) page loaded. Hence I suspect that
what Andrei did by accident - remove the if (mapping) case in
test_clear_page_dirty() - is actually totally correct.
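To make the quoted scenario concrete, here is a small user-space sketch of the buffer states involved. It is an illustration under assumptions, not kernel code: all toy_* names are hypothetical, a 64k page is modelled as 16 buffers of 4k, and "mapped" is approximated by "the whole page was read in first".

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_BLOCK_SIZE       4096u
#define TOY_BUFFERS_PER_PAGE 16u	/* 64k page / 4k blocks */

struct toy_buffer {
	bool dirty;
	bool uptodate;
};

struct toy_bpage {
	bool page_dirty;
	struct toy_buffer bh[TOY_BUFFERS_PER_PAGE];
};

/* A one-byte write touches only the buffer that covers the offset. */
void toy_one_byte_write(struct toy_bpage *p, unsigned int offset)
{
	unsigned int idx = offset / TOY_BLOCK_SIZE;

	assert(idx < TOY_BUFFERS_PER_PAGE);
	p->bh[idx].uptodate = true;
	p->bh[idx].dirty = true;
	p->page_dirty = true;
}

/* kjournald writes buffers by hand: buffers get clean, the page flag stays. */
void toy_kjournald_writeout(struct toy_bpage *p)
{
	for (unsigned int i = 0; i < TOY_BUFFERS_PER_PAGE; i++)
		p->bh[i].dirty = false;
}

/* Stand-in for a mapped page: all of it has been loaded. */
void toy_read_whole_page(struct toy_bpage *p)
{
	for (unsigned int i = 0; i < TOY_BUFFERS_PER_PAGE; i++)
		p->bh[i].uptodate = true;
}

unsigned int toy_count_not_uptodate(const struct toy_bpage *p)
{
	unsigned int n = 0;

	for (unsigned int i = 0; i < TOY_BUFFERS_PER_PAGE; i++)
		if (!p->bh[i].uptodate)
			n++;
	return n;
}

/*
 * The "we're screwed" state: dropping the buffers now would leave a
 * dirty page with sections nobody can tell apart from garbage.
 */
bool toy_dirty_unresolvable_without_buffers(const struct toy_bpage *p)
{
	return p->page_dirty && toy_count_not_uptodate(p) > 0;
}
```

With a zeroed page, a one-byte write followed by the kjournald step leaves 15 not uptodate buffers and a dirty page flag, so the predicate is true. If toy_read_whole_page() runs first, the predicate stays false, which is the distinction drawn above for mapped pages.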
On Tue, 2006-12-19 at 10:00 +0100, Peter Zijlstra wrote:
> On Tue, 2006-12-19 at 00:04 -0800, Linus Torvalds wrote:
>
> > Nobody has actually ever explained why "test_clear_page_dirty()" is good
> > at all.
> >
> > - Why is it ever used instead of "clear_page_dirty_for_io()"?
> >
> > - What is the difference?
> >
> > - Why would you EVER want to clear bits just in the "struct page *" or
> > just in the PTE's?
> >
> > - Why is it EVER correct to clear dirty bits except JUST BEFORE THE IO?
> >
> > In other words, I have a theory:
> >
> > "A lot of this is actually historical cruft. Some of it may even be code
> > that was never supposed to work, but because we maintained _other_ dirty
> > bits in the PTE's, and never touched them before, we never even realized
> > that the code that played with PG_dirty was totally insane"
> >
> > Now, that's just a theory. And yeah, it may be stated a bit provocatively.
> > It may not be entirely correct. I'm just saying.. maybe it is?
>
> On Sun, 2006-12-17 at 15:40 -0800, Andrew Morton wrote:
>
> > try_to_free_buffers() clears the page's dirty state if it successfully removed
> > the page's buffers.
> >
> > Background for this:
> >
> > - a process does a one-byte-write to a file on a 64k pagesize, 4k
> > blocksize ext3 filesystem. The page is now PageDirty, !PageUptodate and
> > has one dirty buffer and 15 not uptodate buffers.
> >
> > - kjournald writes the dirty buffer. The page is now PageDirty,
> > !PageUptodate and has a mix of clean and not uptodate buffers.
> >
> > - try_to_free_buffers() removes the page's buffers. It MUST now clear
> > PageDirty. If we were to leave the page dirty then we'd have a dirty, not
> > uptodate page with no buffer_heads.
> >
> > We're screwed: we cannot write the page because we don't know which
> > sections of it contain garbage. We cannot read the page because we don't
> > know which sections of it contain modified data. We cannot free the page
> > because it is dirty.
>
> However!! this is not true for mapped pages because mapped pages must
> have the whole (16k in akpm's example) page loaded. Hence I suspect that
> what Andrei did by accident - remove the if (mapping) case in
> test_clear_page_dirty() - is actually totally correct.
Obviously I need my morning shot, 64k of course.
On Tue, Dec 19, 2006 at 12:24:16AM -0800, Andrew Morton wrote:
> Wow. I didn't expect that, because Marc Haber reported that ext3's data=writeback
> fixed it. Maybe he didn't run it for long enough?
My test case is Debian's "aptitude update" running once an hour, and
it was always the same file getting corrupted. With 2.6.19, I had this
corruption about every third hour (but -only- when run from cron; running
from a shell was always fine). data=writeback made the issue disappear
for about two days. I then booted into 2.6.19.1 without
data=writeback (using the defaults), and since then the issue only
shows up about every other day.
So I feel somewhat out of the loop, since rtorrent seems much better at
reproducing this.
I notice, though, that both aptitude and rtorrent do downloads from
the net, so there might be a relation to tcp/ip and/or the network
driver. My box has a Linksys NC100 network card running with the tulip
driver.
Greetings
Marc
--
-----------------------------------------------------------------------------
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Mannheim, Germany | lose things." Winona Ryder | Fon: *49 621 72739834
Nordisch by Nature | How to make an American Quilt | Fax: *49 621 72739835
* Marc Haber <[email protected]> [2006-12-19 09:51]:
> I do not have a clue about memory management at all, but is it
> possible that you're testing on a box with too much memory? My box has
> only 256 MB, and I used to use mutt with a _huge_ inbox with mutt
> taking somewhat 150 MB. Add spamassassin and a reasonably busy mail
> server, and the box used to be like 150 MB in swap.
FWIW, the ARM box I see this on has only 32 MB memory (and a 133 or
266 MHz CPU). I don't see it on another ARM box (different ARM
sub-arch) with 128 MB memory and a 600 MHz CPU.
--
Martin Michlmayr
http://www.cyrius.com/
Linus Torvalds wrote:
>
> On Tue, 19 Dec 2006, Nick Piggin wrote:
>
>>>Anyway it has the same issues as the others. See what happens when you
>>>run two test_clear_page_dirty_sync_ptes() consecutively, you still lose
>>>PG_dirty even though the page might actually be dirty.
>>
>>How can this happen? We'll only test_clear_page_dirty_sync_ptes again
>>after buffers have been reattached, and subsequently cleaned. And in
>>that case if the ptes are still clean at this point then the page really
>>is clean.
>
>
> Why do you talk about buffers being reattached? Are you still in some
> world where "try_to_free_buffers()" matters? Have you not followed the
I'm talking about fixing just the race Andrew noticed via inspection. No
it doesn't appear to fix Andrei's problem, unfortunately. But it needs
to be fixed all the same, doesn't it?
> discussion? Why do you ignore my MUCH SIMPLER patch that just removed all
> this crap ENTIRELY from "try_to_free_buffers()", and the exact same
> corruption happened?
>
> Forget about "try_to_free_buffers()". Please apply this patch to your tree
> first. That gets rid of _one_ copy of totally insane code that did all the
> wrong things.
>
> Only after you have applied this patch should you look at the code again.
> Realizing that the corruption still happens.
>
> So forget about buffers already. That piece of code was crap.
Now I'm not exactly sure how ext3 (or any other) filesystems make use
of this particular feature of try_to_free_buffers(), but it is clear
from the comments what it is for. So your patch isn't really a minimal
fix (ie. it would require an OK from all filesystems, wouldn't it?)
Or did I miss a mail where you reasoned that it is safe to make this
change (/me goes to reread the thread)...
>
> Linus
>
> ---
> diff --git a/fs/buffer.c b/fs/buffer.c
> index d1f1b54..263f88e 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *page)
> int ret = 0;
>
> BUG_ON(!PageLocked(page));
> - if (PageWriteback(page))
> + if (PageDirty(page) || PageWriteback(page))
> return 0;
>
> if (mapping == NULL) { /* can this still happen? */
> @@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *page)
> spin_lock(&mapping->private_lock);
> ret = drop_buffers(page, &buffers_to_free);
> spin_unlock(&mapping->private_lock);
> - if (ret) {
> - /*
> - * If the filesystem writes its buffers by hand (eg ext3)
> - * then we can have clean buffers against a dirty page. We
> - * clean the page here; otherwise later reattachment of buffers
> - * could encounter a non-uptodate page, which is unresolvable.
> - * This only applies in the rare case where try_to_free_buffers
> - * succeeds but the page is not freed.
> - *
> - * Also, during truncate, discard_buffer will have marked all
> - * the page's buffers clean. We discover that here and clean
> - * the page also.
> - */
> - if (test_clear_page_dirty(page))
> - task_io_account_cancelled_write(PAGE_CACHE_SIZE);
> - }
> out:
> if (buffers_to_free) {
> struct buffer_head *bh = buffers_to_free;
>
--
SUSE Labs, Novell Inc.
On Tue, 19 Dec 2006 20:56:50 +1100
Nick Piggin <[email protected]> wrote:
> Linus Torvalds wrote:
>
> > NOTICE? First you make a BIG DEAL about how dirty bits should never get
> > lost, but THE VERY SAME FUNCTION actually very much on purpose DOES drop
> > the dirty bit for when it's not in the page tables.
>
> try_to_free_buffers is quite a special case, where we're transferring
> the page dirty metadata from the buffers to the page. I think Andrew
> would have a better grasp of it so he could correct me, but what it
> does is legitimate.
Well it used to be. After 2.6.19 it can do the wrong thing for mapped
pages. But it turns out that we don't feed it mapped pages, apart from
pagevec_strip() and possibly races against pagefaults.
> I think it could be very likely that indeed the bug is a latent one in
> a clear_page_dirty caller, rather than dirty-tracking itself.
The only callers are try_to_free_buffers(), truncate and a few scruffy
possibly-wrong-for-fsync filesystems which aren't being used here.
<spots a race in do_no_page()>
If a write-fault races with a read-fault and the write-fault loses, we forget
to mark the page dirty.
Something like this, but it's probably wrong - I didn't try very hard (am
feeling ill, and vaguely grumpy)
From: Andrew Morton <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
mm/memory.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff -puN mm/memory.c~a mm/memory.c
--- a/mm/memory.c~a
+++ a/mm/memory.c
@@ -2264,10 +2264,22 @@ retry:
}
} else {
/* One of our sibling threads was faster, back out. */
+ if (write_access) {
+ /*
+ * We might have raced against a read-fault. We still
+ * need to dirty the page.
+ */
+ dirty_page = vm_normal_page(vma, address, *page_table);
+ if (dirty_page) {
+ get_page(dirty_page);
+ goto dirty_it;
+ }
+ }
page_cache_release(new_page);
goto unlock;
}
+dirty_it:
/* no need to invalidate: a not-present page shouldn't be cached */
update_mmu_cache(vma, address, entry);
lazy_mmu_prot_update(entry);
_
Andrew Morton wrote:
> On Tue, 19 Dec 2006 20:56:50 +1100
> Nick Piggin <[email protected]> wrote:
>
>
>>Linus Torvalds wrote:
>>
>>
>>>NOTICE? First you make a BIG DEAL about how dirty bits should never get
>>>lost, but THE VERY SAME FUNCTION actually very much on purpose DOES drop
>>>the dirty bit for when it's not in the page tables.
>>
>>try_to_free_buffers is quite a special case, where we're transferring
>>the page dirty metadata from the buffers to the page. I think Andrew
>>would have a better grasp of it so he could correct me, but what it
>>does is legitimate.
>
>
> Well it used to be. After 2.6.19 it can do the wrong thing for mapped
> pages.
Yes, that is what I was trying to get at.
> But it turns out that we don't feed it mapped pages, apart from
> pagevec_strip() and possibly races against pagefaults.
True, and I think we have pretty well established that this isn't the
cause of Andrei's problem, but I think we all agree it is *a* bug?
And surely Andrei's data corruption will be of the same flavour in
that test_clear_page_dirty somewhere is now stripping pte dirty bits
where it shouldn't? (because it went away after Peter nooped that
behaviour)
>>I think it could be very likely that indeed the bug is a latent one in
>>a clear_page_dirty caller, rather than dirty-tracking itself.
>
>
> The only callers are try_to_free_buffers(), truncate and a few scruffy
> possibly-wrong-for-fsync filesystems which aren't being used here.
>
>
> <spots a race in do_no_page()>
>
> If a write-fault races with a read-fault and the write-fault loses, we forget
> to mark the page dirty.
Hmm.. in that case will the pte still be readonly, and thus the write
faulter will have to try again I think?
>
> Something like this, but it's probably wrong - I didn't try very hard (am
> feeling ill, and vaguely grumpy)
>
>
> From: Andrew Morton <[email protected]>
>
> Signed-off-by: Andrew Morton <[email protected]>
> ---
>
> mm/memory.c | 12 ++++++++++++
> 1 file changed, 12 insertions(+)
>
> diff -puN mm/memory.c~a mm/memory.c
> --- a/mm/memory.c~a
> +++ a/mm/memory.c
> @@ -2264,10 +2264,22 @@ retry:
> }
> } else {
> /* One of our sibling threads was faster, back out. */
> + if (write_access) {
> + /*
> + * We might have raced against a read-fault. We still
> + * need to dirty the page.
> + */
> + dirty_page = vm_normal_page(vma, address, *page_table);
> + if (dirty_page) {
> + get_page(dirty_page);
> + goto dirty_it;
> + }
> + }
> page_cache_release(new_page);
> goto unlock;
> }
>
> +dirty_it:
> /* no need to invalidate: a not-present page shouldn't be cached */
> update_mmu_cache(vma, address, entry);
> lazy_mmu_prot_update(entry);
> _
>
>
--
SUSE Labs, Novell Inc.
On Tue, 19 Dec 2006 02:32:55 -0800
Andrew Morton <[email protected]> wrote:
> <spots a race in do_no_page()>
>
> If a write-fault races with a read-fault and the write-fault loses, we forget
> to mark the page dirty.
No that isn't right, is it. The writer just retakes the fault and
all the right things happen. Ho hum.
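The retry argument above can be walked through with a toy pte model. This is a hedged sketch only: the toy_* names are invented, the three booleans are a simplification of a real pte, and the write-protect step is an assumption about how the retried store is handled rather than a copy of the kernel's fault path.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_pte {
	bool present;
	bool writable;
	bool dirty;
};

/*
 * Not-present fault. Returns true if this fault installed the pte,
 * false if a sibling was faster and this fault backed out.
 */
bool toy_no_page_fault(struct toy_pte *pte, bool write_access)
{
	if (pte->present)
		return false;	/* lost the race, back out */
	pte->present = true;
	pte->writable = write_access;
	pte->dirty = write_access;
	return true;
}

/* Fault on a store to a present but read-only pte (assumed handling). */
void toy_write_protect_fault(struct toy_pte *pte)
{
	assert(pte->present && !pte->writable);
	pte->writable = true;
	pte->dirty = true;
}

/* A store instruction: it is retried until the pte allows the write. */
void toy_store(struct toy_pte *pte)
{
	while (!(pte->present && pte->writable)) {
		if (!pte->present)
			toy_no_page_fault(pte, true);
		else
			toy_write_protect_fault(pte);
	}
	pte->dirty = true;	/* the completed store dirties the pte */
}
```

In this model, if the read fault installs the pte first and the racing write fault backs out, the pte is left read-only and clean, but the store has not completed yet. Running toy_store() afterwards takes the write-protect path and ends with a dirty pte, so no dirty state is forgotten.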
On Tue, 2006-12-19 at 02:32 -0800, Andrew Morton wrote:
> On Tue, 19 Dec 2006 20:56:50 +1100
> Nick Piggin <[email protected]> wrote:
>
> > Linus Torvalds wrote:
> >
> > > NOTICE? First you make a BIG DEAL about how dirty bits should never get
> > > lost, but THE VERY SAME FUNCTION actually very much on purpose DOES drop
> > > the dirty bit for when it's not in the page tables.
> >
> > try_to_free_buffers is quite a special case, where we're transferring
> > the page dirty metadata from the buffers to the page. I think Andrew
> > would have a better grasp of it so he could correct me, but what it
> > does is legitimate.
>
> Well it used to be. After 2.6.19 it can do the wrong thing for mapped
> pages. But it turns out that we don't feed it mapped pages, apart from
> pagevec_strip() and possibly races against pagefaults.
So how about this:
Index: linux-2.6-git/mm/page-writeback.c
===================================================================
--- linux-2.6-git.orig/mm/page-writeback.c 2006-12-19 08:24:48.000000000 +0100
+++ linux-2.6-git/mm/page-writeback.c 2006-12-19 11:43:31.000000000 +0100
@@ -859,6 +859,9 @@ int test_clear_page_dirty(struct page *p
struct address_space *mapping = page_mapping(page);
unsigned long flags;
+ if (page_mapped(page))
+ return 0;
+
if (!mapping)
return TestClearPageDirty(page);
Andrew Morton wrote:
> On Tue, 19 Dec 2006 20:56:50 +1100
> Nick Piggin <[email protected]> wrote:
>>I think it could be very likely that indeed the bug is a latent one in
>>a clear_page_dirty caller, rather than dirty-tracking itself.
>
>
> The only callers are try_to_free_buffers(), truncate and a few scruffy
> possibly-wrong-for-fsync filesystems which aren't being used here.
Well truncate/invalidate will not operate on mapped pages (barring the
very-unlikely truncate/invalidate vs fault races). We can ignore those
filesystems as they don't include ext3. Which brings us back to
try_to_free_buffers().
Maybe it is something else entirely, but did try_to_free_buffers ever
get completely cleared? Or was some of Andrei's corruption possibly
leftover on-disk corruption from a previous kernel?
--
SUSE Labs, Novell Inc.
Peter Zijlstra wrote:
> On Tue, 2006-12-19 at 02:32 -0800, Andrew Morton wrote:
>>Well it used to be. After 2.6.19 it can do the wrong thing for mapped
>>pages. But it turns out that we don't feed it mapped pages, apart from
>>pagevec_strip() and possibly races against pagefaults.
>
>
> So how about this:
Well that's still racy. Anyway several earlier patches (including
the one I posted) closed this race. Some were still reported to
trigger corruption IIRC.
> Index: linux-2.6-git/mm/page-writeback.c
> ===================================================================
> --- linux-2.6-git.orig/mm/page-writeback.c 2006-12-19 08:24:48.000000000 +0100
> +++ linux-2.6-git/mm/page-writeback.c 2006-12-19 11:43:31.000000000 +0100
> @@ -859,6 +859,9 @@ int test_clear_page_dirty(struct page *p
> struct address_space *mapping = page_mapping(page);
> unsigned long flags;
>
> + if (page_mapped(page))
> + return 0;
> +
> if (!mapping)
> return TestClearPageDirty(page);
>
>
>
--
SUSE Labs, Novell Inc.
On Tue, 2006-12-19 at 21:58 +1100, Nick Piggin wrote:
> Peter Zijlstra wrote:
> > On Tue, 2006-12-19 at 02:32 -0800, Andrew Morton wrote:
>
> >>Well it used to be. After 2.6.19 it can do the wrong thing for mapped
> >>pages. But it turns out that we don't feed it mapped pages, apart from
> >>pagevec_strip() and possibly races against pagefaults.
> >
> >
> > So how about this:
>
> Well that's still racy. Anyway several earlier patches (including
> the one I posted) closed this race. Some were still reported to
> trigger corruption IIRC.
I can't remember a patch that removes mapped pages from this code path,
however I could have missed it. Outright removing the mapping branch in
ttfb() also fixed the problem - and that is a superset of page_mapped().
I'm now building a kernel with this patch, and will submit that to
rtorrent with mem=256M on a 1k ext3 filesystem on x86_64 smp preempt.
---
fs/buffer.c | 32 +++++++++++++++++++++++++++++++-
1 file changed, 31 insertions(+), 1 deletion(-)
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -2798,11 +2798,38 @@ static inline int buffer_busy(struct buf
(bh->b_state & ((1 << BH_Dirty) | (1 << BH_Lock)));
}
+/*
+ * AKPM sayeth:
+ *
+ * - a process does a one-byte-write to a file on a 64k pagesize, 4k
+ * blocksize ext3 filesystem. The page is now PageDirty, !PageUptodate and
+ * has one dirty buffer and 15 not uptodate buffers.
+ *
+ * - kjournald writes the dirty buffer. The page is now PageDirty,
+ * !PageUptodate and has a mix of clean and not uptodate buffers.
+ *
+ * - try_to_free_buffers() removes the page's buffers. It MUST now clear
+ * PageDirty. If we were to leave the page dirty then we'd have a dirty, not
+ * uptodate page with no buffer_heads.
+ *
+ * We're screwed: we cannot write the page because we don't know which
+ * sections of it contain garbage. We cannot read the page because we don't
+ * know which sections of it contain modified data. We cannot free the page
+ * because it is dirty.
+ *
+ * However for mapped pages this is not true; mapped pages will be fully
+ * loaded and thus cannot have not uptodate buffers.
+ *
+ * Hence allow the PG_dirty bit to stay for pages that had no not uptodate
+ * buffers (and assert that mapped pages never have those).
+ */
+
static int
drop_buffers(struct page *page, struct buffer_head **buffers_to_free)
{
struct buffer_head *head = page_buffers(page);
struct buffer_head *bh;
+ int uptodate = 1;
bh = head;
do {
@@ -2818,11 +2845,14 @@ drop_buffers(struct page *page, struct b
if (!list_empty(&bh->b_assoc_buffers))
__remove_assoc_queue(bh);
+ if (!buffer_uptodate(bh))
+ uptodate = 0;
bh = next;
} while (bh != head);
*buffers_to_free = head;
__clear_page_buffers(page);
- return 1;
+ VM_BUG_ON(page_mapped(page) && !uptodate);
+ return !uptodate;
failed:
return 0;
}
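The intent of the changed return value can be sketched in user space. This is only a model under stated assumptions: `model_buffer`, `model_page`, `model_drop_buffers` and `model_try_to_free_buffers` are hypothetical names, and the busy-buffer failure path of the real drop_buffers() is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a buffer_head; only one flag matters here. */
struct model_buffer {
    bool uptodate;
};

/* Hypothetical stand-in for struct page. */
struct model_page {
    bool mapped;   /* page is mapped into some address space */
    bool dirty;    /* models PG_dirty */
};

/*
 * Models the return value of the patched drop_buffers(): nonzero only when
 * at least one buffer was not uptodate, i.e. when the caller has to clear
 * PG_dirty because the page can neither be written nor re-read safely.
 */
int model_drop_buffers(const struct model_page *page,
                       const struct model_buffer *bufs, size_t n)
{
    bool uptodate = true;

    for (size_t i = 0; i < n; i++)
        if (!bufs[i].uptodate)
            uptodate = false;

    /* Mirrors VM_BUG_ON(page_mapped(page) && !uptodate). */
    assert(!(page->mapped && !uptodate));
    return !uptodate;
}

/* Models the caller: the dirty flag is cleared only when told to. */
void model_try_to_free_buffers(struct model_page *page,
                               const struct model_buffer *bufs, size_t n)
{
    if (model_drop_buffers(page, bufs, n))
        page->dirty = false;
}
```

In this model a mapped page whose buffers are all uptodate keeps its dirty flag, while an unmapped page with a not-uptodate buffer loses it.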
On Tue, 19 Dec 2006, Nick Piggin wrote:
>
> Now I'm not exactly sure how ext3 (or any other) filesystems make use
> of this particular feature of try_to_free_buffers(), but it is clear
> from the comments what it is for. So your patch isn't really a minimal
> fix (ie. it would require an OK from all filesystems, wouldn't it?)
>
> Or did I miss a mail where you reasoned that it is safe to make this
> change (/me goes to reread the thread)...
I'm saying it had _better_ be safe, and no, low-level filesystems don't
actually matter.
The page has to be cleanable _some_ way. So if we test for "page_dirty()"
at the top, and just refuse to do it in try_to_free_pages(), we still know
that the _proper_ page cleaning had better clean it. Because ttfp() is
never going to clean the page in the general case _anyway_.
So I'm really saying:
- the page WILL be cleaned by the real page cleaning action (ie memory
pressure or sync or something else causing us to go through the
bog-standard page-based writeout).
Does anybody dispute this?
- the "ttfp()" hack was a HACK. It was an ugly and nasty hack even when
it was first introduced. It gets doubly worse now that we know we have
something wrong with page cleaning, and it has distracted from the real
problem.
- I removed that ugly and disgusting hack entirely at first, but Andrew
points out that he really wants to keep the buffers there, because the
buffers being clean actually say something. That, together with the
fact that as long as the page is dirty, the buffers really do end up
having a job to do, made me add a much smaller hack to replace the big
ugly one ("don't even try, if the page is marked dirty").
- so with that thing in place, there isn't even any change in behaviour
wrt the buffers and low-level filesystems. It's just that we make them
a bit harder to get rid of. But arguably that shouldn't actually ever
really _happen_ anyway (because I think it's a BUG if the page is
marked dirty but none of the buffers are), so I think that part is a
non-issue.
In other words, ttfp() _never_ had anything to do with "page cleaning".
Not originally, not with the horrible hack, and not with my patch.
Trying to mix it in just caused a bug that _everybody_ agrees is a bug.
It's not the bug we're chasing, but we've got three different patches to
fix it (Andrew's, mine and yours), and mine is the simplest one by far
especially in the long run, because it just REMOVES the ugly dependency.
And yes, I probably care more about "in the long run" than most. To me, a
bug is a bug even if it's _just_ a maintenance headache. Andrew's patch
made things _worse_ ("magic insane flag"), and while yours didn't make the
code worse, it still introduced a totally insane "clean the page but if
the PTE's are dirty, do something else" notion.
IF THE PAGE TRULY IS CLEAN (and both you and Andrew claim it is, if all
buffers are clean - since you mark it clean in the non-mapped case) THEN
YOU SHOULD BE ABLE TO CLEAN THE PAGE TABLE BITS TOO.
And by claiming that the page table bits are different from PG_dirty,
you're just making the issues worse. They shouldn't be. That's what the
whole point of Peter's patch was: PG_dirty fundamentally _means_ that the
page tables might be dirty too. That was the whole _point_ of doing all
this in 2.6.19 in the first place.
So if you cannot accept that page table bits should be on "equal footing"
with PG_dirty, then you should just say "Let's remove Peter's patch
entirely".
Linus
On Tue, 19 Dec 2006, Nick Piggin wrote:
>
> Counterexample? Well AFAIKS, the clearing of PG_dirty in ttfb() in
> response to finding all buffers clean is perfectly valid. What makes
> you think otherwise?
If the page really is clean, then why the heck can't we just clean the
page table bits too?
Either it's clean or it isn't. If all the buffers being clean means that
the page is clean, then it's clean. WE SHOULD NOT THINK THAT PTE'S ARE ANY
DIFFERENT.
I really don't see your point. Is it clean? If it is, then clear the damn
dirty bits from the page tables too. Don't go pussyfooting around the
issue and confuse yourself and everybody but me by saying "but if it's
dirty in the page tables, it's magically dirty". NO.
It really is that simple. Is it clean or not?
If it's clean, you can remove ALL the dirty bits. Not just some.
Linus
Btw,
here's a totally new tangent on this: it's possible that user code is
simply BUGGY.
There is one case where the kernel actually forcibly writes zeroes into a
file: when we're writing a page that straddles the "inode->i_size"
boundary. See the various writepages in fs/buffer.c, they all contain
variations on that theme (although most of them aren't as well commented
as this snippet):
/*
* The page straddles i_size. It must be zeroed out on each and every
* writepage invocation because it may be mmapped. "A file is mapped
* in multiples of the page size. For a file that is not a multiple of
* the page size, the remaining memory is zeroed when mapped, and
* writes to that region are not written out to the file."
*/
kaddr = kmap_atomic(page, KM_USER0);
memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
flush_dcache_page(page);
kunmap_atomic(kaddr, KM_USER0);
Now, this should _matter_ only for user processes that are buggy, and that
have written to the page _before_ extending it with ftruncate(). That's
definitely a serious bug, but it's one that can go totally undetected
depending on when the actual write-out happens.
So what I'm saying is that if we end up writing things earlier thanks to
the more aggressive dirty-page-management thing in 2.6.19, we might
actually just expose a long-time userspace bug that was just a LOT harder
to trigger before..
I'm not saying this is the cause of all this, but we've been tearing our
hair out, and it might be worthwhile trying this really really really
stupid patch that will notice when that happens at truncate() time, and
tell the user that he's a total idiot. Or something to that effect.
Maybe the reason this is so easy to trigger with rtorrent is not because
rtorrent does some magic pattern that triggers a kernel bug, but simply
because rtorrent itself might have a bug.
Ok, so it's a long shot, but it's still worth testing, I suspect. The
patch is very simple: whenever we do an _expanding_ truncate, we check the
last page of the _old_ size, and if there were non-zero contents past the
old size, we complain.
Attached is a test-program that _should_ trigger a
kernel message like
a.out: BADNESS: truncate check 17000
for good measure, just so that you can verify that the patch works and
actually catches this case.
(The 17000 number is just the one-hundred _invalid_ 0xaa bytes - out of
the 200 we wrote - that were summed up: 100*0xaa == 17000. Anything
non-zero is always a bug).
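The arithmetic can be replayed in user space. This is a sketch, not the attached test program (which is not shown here): it assumes a 4096-byte page and a file that was 100 bytes long when 200 bytes of 0xaa were written through the mapping. `tail_check` and `badness_example` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096u  /* assumed page size */

/* Sum every byte of the last page that lies past old_size. */
unsigned int tail_check(const unsigned char *page, size_t old_size)
{
    unsigned int check = 0;
    size_t offset = old_size & (MODEL_PAGE_SIZE - 1);

    assert(page != NULL);
    if (!offset)
        return 0;  /* size is page aligned: no partial last page */
    for (; offset < MODEL_PAGE_SIZE; offset++)
        check += page[offset];
    return check;
}

/* Assumed scenario: 100-byte file, 200 bytes of 0xaa written via mmap. */
unsigned int badness_example(void)
{
    unsigned char page[MODEL_PAGE_SIZE];

    memset(page, 0, sizeof(page));
    memset(page, 0xaa, 200);
    return tail_check(page, 100);
}
```

Bytes 100..199 are the invalid ones, so the sum is 100 * 170 = 17000, matching the number in the message above.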
I doubt this is really it, but it's worth trying. If you fill out a page,
and only do "ftruncate()" in response to SIGBUS messages (and don't
truncate to whole pages), you could potentially see zeroes at the end of
the page exactly because _writeout_ cleared the page for you! So it
_could_ explain the symptoms, but only if user-space was horribly horribly
broken.
Linus
----
diff --git a/mm/memory.c b/mm/memory.c
index c00bac6..79cecab 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1842,6 +1842,33 @@ void unmap_mapping_range(struct address_space *mapping,
}
EXPORT_SYMBOL(unmap_mapping_range);
+static void check_last_page(struct address_space *mapping, loff_t size)
+{
+ pgoff_t index;
+ unsigned int offset;
+ struct page *page;
+
+ if (!mapping)
+ return;
+ offset = size & ~PAGE_MASK;
+ if (!offset)
+ return;
+ index = size >> PAGE_SHIFT;
+ page = find_lock_page(mapping, index);
+ if (page) {
+ unsigned int check = 0;
+ unsigned char *kaddr = kmap_atomic(page, KM_USER0);
+ do {
+ check += kaddr[offset++];
+ } while (offset < PAGE_SIZE);
+ kunmap_atomic(kaddr,KM_USER0);
+ unlock_page(page);
+ page_cache_release(page);
+ if (check)
+ printk("%s: BADNESS: truncate check %u\n", current->comm, check);
+ }
+}
+
/**
* vmtruncate - unmap mappings "freed" by truncate() syscall
* @inode: inode of the file used
@@ -1875,6 +1902,7 @@ do_expand:
goto out_sig;
if (offset > inode->i_sb->s_maxbytes)
goto out_big;
+ check_last_page(mapping, inode->i_size);
i_size_write(inode, offset);
out_truncate:
On Tue, 19 Dec 2006, Linus Torvalds wrote:
>
> here's a totally new tangent on this: it's possible that user code is
> simply BUGGY.
Btw, here's a simpler test-program that actually shows the difference
between 2.6.18 and 2.6.19 in action, and why it could explain why a
program like rtorrent might show corruption behaviour that it didn't show
before.
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>

int main(int argc, char **argv)
{
        char *mapping;
        int fd;

        fd = open("mapfile", O_RDWR | O_TRUNC | O_CREAT, 0666);
        if (fd < 0)
                return -1;
        /* The file is only ten bytes long at this point */
        if (ftruncate(fd, 10) < 0)
                return -1;
        mapping = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (mapping == MAP_FAILED)
                return -1;
        /* BUG: this writes ten bytes past the end of the file */
        memset(mapping, 0xaa, 20);
        sync();
        if (ftruncate(fd, 40) < 0)
                return -1;
        memset(mapping + 20, 0x55, 20);
        write(1, mapping, 40);
        return 0;
}
Notice the "sync()" in between the "memset()" and the "ftruncate()". In
2.6.18, that would normally do absolutely _nothing_ to the shared memory
mapping, because we simply couldn't track pages that were dirty in the
page tables.
So in 2.6.18, if you try this, with
./a.out | od -x
you should see something like
0000000 aaaa aaaa aaaa aaaa aaaa aaaa aaaa aaaa
0000020 aaaa aaaa 5555 5555 5555 5555 5555 5555
0000040 5555 5555 5555 5555
0000050
which matches your memset() patterns: 20 bytes of 0xaa, and 20 bytes of
0x55.
HOWEVER.
In 2.6.19, because we actually track dirty data so much better, "sync()"
will actually be smart enough to write out the dirty mmap'ed data too. But
since the user program has only allocated ten bytes for it in the file,
when it is written out, the rest of the page is cleared. When you then
write the last 20 bytes (after _properly_ allocating memory for them), you
should now see a pattern like
0000000 aaaa aaaa aaaa aaaa aaaa 0000 0000 0000
0000020 0000 0000 5555 5555 5555 5555 5555 5555
0000040 5555 5555 5555 5555
0000050
instead: with ten bytes of zero in between, because the data that couldn't
be written out was cleared.
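To make the two outcomes concrete without booting two kernels, here is a user-space model of the same sequence. It is only a sketch: `model_writeout` and `model_run` are hypothetical names, a 4096-byte page is assumed, and the "writeout" is just the zeroing rule applied to a plain buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096u  /* assumed page size */

/* Model of writeout: bytes past i_size in the last page get zeroed. */
void model_writeout(unsigned char *page, size_t i_size)
{
    size_t offset = i_size & (MODEL_PAGE_SIZE - 1);

    if (offset)
        memset(page + offset, 0, MODEL_PAGE_SIZE - offset);
}

/*
 * Replays the test program against the model. With sync_writes_mmap set,
 * the sync() step writes out the mmap'ed page (the 2.6.19 behaviour);
 * otherwise the page is left alone (the usual 2.6.18 outcome).
 */
void model_run(unsigned char *page, int sync_writes_mmap)
{
    const size_t first_size = 10;        /* ftruncate(fd, 10) */

    assert(page != NULL);
    memset(page, 0, MODEL_PAGE_SIZE);
    memset(page, 0xaa, 20);              /* ten of these are past EOF */
    if (sync_writes_mmap)
        model_writeout(page, first_size);  /* sync() */
    /* ftruncate(fd, 40) only changes the size, not the page contents */
    memset(page + 20, 0x55, 20);
}
```

Running the model with the flag clear leaves 20 bytes of 0xaa followed by 20 bytes of 0x55; with the flag set, bytes 10..19 come back as zero.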
So 2.6.19 is strictly _better_, but exactly because it's tracking dirty
status much more precisely, you'll see certain user-level bugs much more
easily.
NOTE NOTE NOTE! The code really _was_ buggy in 2.6.18 too, and you _can_
get the zeroes in the middle of the file with an older kernel. But in
older kernels, you need to be really really unlucky, and have the page
cleaned by strong memory pressure. In 2.6.19, any "sync()" activity
(including from the outside) will clean the page, so a user program with
this bug can just be made to trigger the bug much more easily.
Linus
On Mon, 18 Dec 2006, Linus Torvalds wrote:
> On Tue, 19 Dec 2006, Nick Piggin wrote:
> >
> > We never want to drop dirty data! (ignoring the truncate case, which is
> > handled privately by truncate anyway)
>
> Bzzt.
>
> SURE we do.
>
> We absolutely do want to drop dirty data in the writeout path.
>
> How do you think dirty data ever _becomes_ clean data?
>
> In other words, yes, we _do_ want to test-and-clear all the pgtable bits
> _and_ the PG_dirty bit. We want to do it for:
> - writeout
> - truncate
> - possibly a "drop" event (which could be a case for a journal entry that
> becomes stale due to being replaced or something - kind of "truncate"
> on metadata)
>
> because both of those events _literally_ turn dirty state into clean
> state.
>
> In no other circumstance do we ever want to clear a dirty bit, as far as I
> can tell.
i admit this may not be entirely relevant, but it seems like a good place
to bring up an old problem: when a disk dies with lots of queued writes
it can totally bring a system to its knees... even after the disk is
removed. i wrote up something about this a while ago:
http://lkml.org/lkml/2005/8/18/243
so there's another reason to "clear a dirty bit"... well, in fact -- drop
the pages entirely.
-dean
On Tue, 2006-12-19 at 10:59 -0800, Linus Torvalds wrote:
>
> On Tue, 19 Dec 2006, Linus Torvalds wrote:
> >
> > here's a totally new tangent on this: it's possible that user code is
> > simply BUGGY.
I'm sad to say this doesn't trigger :-(
* Linus Torvalds:
> Now, this should _matter_ only for user processes that are buggy,
> and that have written to the page _before_ extending it with
> ftruncate().
APT seems to properly extend the file before mapping it, by writing a
zero byte at the desired position (creating a hole).
24986 open("/var/cache/apt/pkgcache.bin", O_RDWR|O_CREAT|O_TRUNC, 0666) = 6
24986 lseek(6, 12582911, SEEK_SET) = 12582911
24986 write(6, "\0", 1) = 1
24986 mmap(NULL, 12582912, PROT_READ|PROT_WRITE, MAP_SHARED, 6, 0) = 0x2b6578636000
24986 msync(0x2b6578636000, 7464112, MS_SYNC) = 0
24986 msync(0x2b6578636000, 8656, MS_SYNC) = 0
24986 munmap(0x2b6578636000, 12582912) = 0
24986 ftruncate(6, 7464112) = 0
24986 fstat(6, {st_mode=S_IFREG|0644, st_size=7464112, ...}) = 0
24986 mmap(NULL, 7464112, PROT_READ, MAP_SHARED, 6, 0) = 0x2b6578636000
APT's code is pretty convoluted, though, and there might be some code
path in it that gets it wrong. 8-P
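For reference, the syscall sequence in that trace corresponds roughly to the following C. This is a sketch of the pattern only, not APT's code: `build_cache` is a hypothetical name, the fill byte is arbitrary, and the second msync() and the final read-only mmap() from the trace are left out.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700  /* expose POSIX declarations under strict -std=c11 */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Reserve map_size bytes by writing one zero byte at the last offset,
 * fill the first `used` bytes through a shared mapping, flush, unmap,
 * then trim the file to the used size.
 * Returns the final file size, or -1 on error.
 */
off_t build_cache(const char *path, size_t map_size, size_t used)
{
    off_t result = -1;
    int fd;

    assert(used > 0 && used <= map_size);

    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0666);
    if (fd < 0)
        return -1;

    /* Extend the file *before* mapping it; this leaves a hole. */
    if (lseek(fd, (off_t)map_size - 1, SEEK_SET) >= 0 &&
        write(fd, "", 1) == 1) {
        unsigned char *map = mmap(NULL, map_size, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, 0);
        if (map != MAP_FAILED) {
            struct stat st;
            int ok;

            memset(map, 0x5a, used);  /* every store is inside the file */
            ok = msync(map, used, MS_SYNC) == 0;
            ok = (munmap(map, map_size) == 0) && ok;
            /* Shrinking truncate, done only after the mapping is gone. */
            if (ok && ftruncate(fd, (off_t)used) == 0 &&
                fstat(fd, &st) == 0)
                result = st.st_size;
        }
    }
    close(fd);
    return result;
}
```

The point of the ordering is that no store through the mapping ever lands past the current end of file, so the zero-the-tail rule at writeout has nothing to destroy.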
On Tue, 19 Dec 2006, Peter Zijlstra wrote:
> On Tue, 2006-12-19 at 10:59 -0800, Linus Torvalds wrote:
> >
> > On Tue, 19 Dec 2006, Linus Torvalds wrote:
> > >
> > > here's a totally new tangent on this: it's possible that user code is
> > > simply BUGGY.
>
> I'm sad to say this doesn't trigger :-(
Oh, well. It was a theory.
Linus
On Tue, 19 Dec 2006 14:51:55 -0800 (PST)
Linus Torvalds <[email protected]> wrote:
>
>
> On Tue, 19 Dec 2006, Peter Zijlstra wrote:
>
> > On Tue, 2006-12-19 at 10:59 -0800, Linus Torvalds wrote:
> > >
> > > On Tue, 19 Dec 2006, Linus Torvalds wrote:
> > > >
> > > > here's a totally new tangent on this: it's possible that user code is
> > > > simply BUGGY.
> >
> > I'm sad to say this doesn't trigger :-(
>
> Oh, well. It was a theory.
>
Well... we'd need to see (corruption && this-not-triggering) to be sure.
Peter, have you been able to trigger the corruption?
On Wed, 2006-12-20 at 00:06 +0100, Peter Zijlstra wrote:
> On Tue, 2006-12-19 at 14:58 -0800, Andrew Morton wrote:
>
> > Well... we'd need to see (corruption && this-not-triggering) to be sure.
> >
> > Peter, have you been able to trigger the corruption?
>
> Yes; however the mail I sent describing that seems to be lost in space.
>
> /me quotes from the sent folder:
>
> > The bad news is, that doesn't help either. The good news is I can
> > reproduce it.
> >
> > What I did to achieve that:
> >
> > - get a sizable torrent from legaltorrents.com / or create a torrent
> > yourself that is around ~600M and has multiple files.
> >
> > - start a tracker, and multiple seeds (I used three machines here)
> >
> > - pull the torrent on a fourth machine
> >
> > the seeding machines don't much matter of course.
> >
> > the fourth machine was a dual core x86-64 with an SMP kernel and
> > PREEMPT, mem=256M (so that the torrent is quite a bit larger and does
> > require writeout) and I used an ext3 partition with 1k blocks.
PS. this was a reply to:
http://lkml.org/lkml/2006/12/19/121
On Tue, 2006-12-19 at 14:58 -0800, Andrew Morton wrote:
> Well... we'd need to see (corruption && this-not-triggering) to be sure.
>
> Peter, have you been able to trigger the corruption?
Yes; however the mail I sent describing that seems to be lost in space.
/me quotes from the sent folder:
> The bad news is, that doesn't help either. The good news is I can
> reproduce it.
>
> What I did to achieve that:
>
> - get a sizable torrent from legaltorrents.com / or create a torrent
> yourself that is around ~600M and has multiple files.
>
> - start a tracker, and multiple seeds (I used three machines here)
>
> - pull the torrent on a fourth machine
>
> the seeding machines don't much matter of course.
>
> the fourth machine was a dual core x86-64 with an SMP kernel and
> PREEMPT, mem=256M (so that the torrent is quite a bit larger and does
> require writeout) and I used an ext3 partition with 1k blocks.
On Mon, 2006-12-18 at 12:14 -0800, Linus Torvalds wrote:
> OR:
>
> - page_mkclean_one() is simply buggy.
GOLD!
it seems to work with all this (full diff against current git).
/me rebuilds full kernel to make sure...
reboot...
test... pff the tension...
yay, still good!
Andrei; would you please verify.
The magic seems to be in the extra tlb flush after clearing the dirty
bit. Just too bad ptep_clear_flush_dirty() needs ptep not entry.
diff --git a/drivers/connector/connector.c b/drivers/connector/connector.c
index 5e7cd45..2b8893b 100644
--- a/drivers/connector/connector.c
+++ b/drivers/connector/connector.c
@@ -135,8 +135,7 @@ static int cn_call_callback(struct cn_msg *msg, void (*destruct_data)(void *), v
spin_lock_bh(&dev->cbdev->queue_lock);
list_for_each_entry(__cbq, &dev->cbdev->queue_list, callback_entry) {
if (cn_cb_equal(&__cbq->id.id, &msg->id)) {
- if (likely(!test_bit(WORK_STRUCT_PENDING,
- &__cbq->work.work.management) &&
+ if (likely(!delayed_work_pending(&__cbq->work) &&
__cbq->data.ddata == NULL)) {
__cbq->data.callback_priv = msg;
diff --git a/fs/buffer.c b/fs/buffer.c
index d1f1b54..263f88e 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *page)
int ret = 0;
BUG_ON(!PageLocked(page));
- if (PageWriteback(page))
+ if (PageDirty(page) || PageWriteback(page))
return 0;
if (mapping == NULL) { /* can this still happen? */
@@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *page)
spin_lock(&mapping->private_lock);
ret = drop_buffers(page, &buffers_to_free);
spin_unlock(&mapping->private_lock);
- if (ret) {
- /*
- * If the filesystem writes its buffers by hand (eg ext3)
- * then we can have clean buffers against a dirty page. We
- * clean the page here; otherwise later reattachment of buffers
- * could encounter a non-uptodate page, which is unresolvable.
- * This only applies in the rare case where try_to_free_buffers
- * succeeds but the page is not freed.
- *
- * Also, during truncate, discard_buffer will have marked all
- * the page's buffers clean. We discover that here and clean
- * the page also.
- */
- if (test_clear_page_dirty(page))
- task_io_account_cancelled_write(PAGE_CACHE_SIZE);
- }
out:
if (buffers_to_free) {
struct buffer_head *bh = buffers_to_free;
diff --git a/mm/memory.c b/mm/memory.c
index c00bac6..60e0945 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1842,6 +1842,33 @@ void unmap_mapping_range(struct address_space *mapping,
}
EXPORT_SYMBOL(unmap_mapping_range);
+static void check_last_page(struct address_space *mapping, loff_t size)
+{
+ pgoff_t index;
+ unsigned int offset;
+ struct page *page;
+
+ if (!mapping)
+ return;
+ offset = size & ~PAGE_MASK;
+ if (!offset)
+ return;
+ index = size >> PAGE_SHIFT;
+ page = find_lock_page(mapping, index);
+ if (page) {
+ unsigned int check = 0;
+ unsigned char *kaddr = kmap_atomic(page, KM_USER0);
+ do {
+ check += kaddr[offset++];
+ } while (offset < PAGE_SIZE);
+ kunmap_atomic(kaddr, KM_USER0);
+ unlock_page(page);
+ page_cache_release(page);
+ if (check)
+ printk(KERN_ERR "%s: BADNESS: truncate check %u\n", current->comm, check);
+ }
+}
+
/**
* vmtruncate - unmap mappings "freed" by truncate() syscall
* @inode: inode of the file used
@@ -1875,6 +1902,7 @@ do_expand:
goto out_sig;
if (offset > inode->i_sb->s_maxbytes)
goto out_big;
+ check_last_page(mapping, inode->i_size);
i_size_write(inode, offset);
out_truncate:
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 237107c..f561e72 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -957,7 +957,7 @@ int test_set_page_writeback(struct page *page)
EXPORT_SYMBOL(test_set_page_writeback);
/*
- * Return true if any of the pages in the mapping are marged with the
+ * Return true if any of the pages in the mapping are marked with the
* passed tag.
*/
int mapping_tagged(struct address_space *mapping, int tag)
diff --git a/mm/rmap.c b/mm/rmap.c
index d8a842a..900229a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -432,7 +432,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma)
{
struct mm_struct *mm = vma->vm_mm;
unsigned long address;
- pte_t *pte, entry;
+ pte_t *ptep, entry;
spinlock_t *ptl;
int ret = 0;
@@ -440,22 +440,23 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma)
if (address == -EFAULT)
goto out;
- pte = page_check_address(page, mm, address, &ptl);
- if (!pte)
+ ptep = page_check_address(page, mm, address, &ptl);
+ if (!ptep)
goto out;
- if (!pte_dirty(*pte) && !pte_write(*pte))
+ if (!pte_dirty(*ptep) && !pte_write(*ptep))
goto unlock;
- entry = ptep_get_and_clear(mm, address, pte);
- entry = pte_mkclean(entry);
+ entry = ptep_get_and_clear(mm, address, ptep);
entry = pte_wrprotect(entry);
- ptep_establish(vma, address, pte, entry);
+ ptep_establish(vma, address, ptep, entry);
+ ret = ptep_clear_flush_dirty(vma, address, ptep) ||
+ page_test_and_clear_dirty(page);
lazy_mmu_prot_update(entry);
ret = 1;
unlock:
- pte_unmap_unlock(pte, ptl);
+ pte_unmap_unlock(ptep, ptl);
out:
return ret;
}
On Wed, 20 Dec 2006, Peter Zijlstra wrote:
> On Tue, 2006-12-19 at 14:58 -0800, Andrew Morton wrote:
>
> > Well... we'd need to see (corruption && this-not-triggering) to be sure.
> >
> > Peter, have you been able to trigger the corruption?
>
> Yes; however the mail I sent describing that seems to be lost in space.
Btw, can somebody actually explain the mess that is ext3 "dirtying".
Ext3 does NOT use __set_page_dirty_buffers. It does
static int ext3_journalled_set_page_dirty(struct page *page)
{
SetPageChecked(page);
return __set_page_dirty_nobuffers(page);
}
and uses that "Checked" bit as a "whole page is dirty" bit (which it tests
in "writepage()").
You realize what this all means? It means that ANYTHING that actually
clears the _real_ dirty bit won't actually be doing anything at all for
ext3, because the Checked bit will still stay set, and any IO down the
line on that page would totally ignore the dirty bits on the buffer heads
and just write out everything.
That is "The Mess(tm)".
It also basically means that anything that clears the dirty bit without
just calling "writepage()" had _better_ call "invalidatepage()" for the
whole page, because otherwise the PageChecked bit will never be cleared as
far as I can see. Happily, at least ext3 seems to _test_ for that case in
the release_page() function, so it appears that we do do this.
But this seems to just strengthen my argument: you can NEVER clean a page,
unless you (a) do IO on it immediately afterwards (writeback) or (b)
invalidate it entirely (truncate).
I'd really like to see just those two functions exist. Preferably in a
form where you can see easily that we actually follow those rules. Rather
than having a confusing set of "clear_page_dirty()" and
"test_and_clear_page_dirty()" functions that are called from random
places.
IOW, I think the "clear_page_dirty_for_io()" is fine (it's case (a)
above), and then we should probably have a "cancel_dirty_page()" function
that does all the current clear_page_dirty() but also makes sure that we
actually call the invalidate_page() function itself.
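As a rough illustration of the two-entry-point rule, here is a user-space model. Everything in it is hypothetical: `model_page` and both functions are stand-ins, and the IO step that the real caller would issue after clearing the bit is folded into case (a) as a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical page model: PG_dirty plus ext3's PageChecked side channel. */
struct model_page {
    bool dirty;
    bool checked;
    int  writes;  /* how many times the page was handed to IO */
};

/* Case (a): the dirty bit is cleared only as the first step of writeout. */
bool model_clear_dirty_for_io(struct model_page *page)
{
    assert(page != NULL);
    if (!page->dirty)
        return false;
    page->dirty = false;
    page->checked = false;  /* models writepage() consuming the Checked bit */
    page->writes++;
    return true;
}

/* Case (b): drop the dirty state entirely, filesystem-private state too. */
void model_cancel_dirty_page(struct model_page *page)
{
    assert(page != NULL);
    page->dirty = false;
    page->checked = false;  /* models the invalidatepage() call */
}
```

With only these two ways to clear the flag, a page can never end up clean while the Checked bit is still set, which is the failure mode described above.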
Hmm?
Linus
On Tue, 19 Dec 2006 16:03:49 -0800 (PST)
Linus Torvalds <[email protected]> wrote:
>
>
> On Wed, 20 Dec 2006, Peter Zijlstra wrote:
>
> > On Tue, 2006-12-19 at 14:58 -0800, Andrew Morton wrote:
> >
> > > Well... we'd need to see (corruption && this-not-triggering) to be sure.
> > >
> > > Peter, have you been able to trigger the corruption?
> >
> > Yes; however the mail I sent describing that seems to be lost in space.
>
> Btw, can somebody actually explain the mess that is ext3 "dirtying".
>
> Ext3 does NOT use __set_page_dirty_buffers. It does
>
> static int ext3_journalled_set_page_dirty(struct page *page)
> {
> SetPageChecked(page);
> return __set_page_dirty_nobuffers(page);
> }
>
> and uses that "Checked" bit as a "whole page is dirty" bit (which it tests
> in "writepage()").
This is purely for data=journal, which is rarely used.
In journalled-data mode, write(), write-fault, etc are not allowed to dirty
the pages and buffers, because the data has to be written to the journal
first. After the data has been written to the journal we only then mark
buffers (and hence pages) dirty as far as the VFS is concerned. For
checkpointing the data back to its real place on the disk.
For MAP_SHARED pages ext3 cheats madly and doesn't journal the data at all.
In all journalling modes, MAP_SHARED data follows the regular ext2-style
handling. Which is a bit of a wart.
On Wed, 20 Dec 2006, Peter Zijlstra wrote:
> On Mon, 2006-12-18 at 12:14 -0800, Linus Torvalds wrote:
> > OR:
> >
> > - page_mkclean_one() is simply buggy.
>
> GOLD!
Ok. I was looking at that, and I wondered..
However, if that works, then I _think_ the correct sequence is the
following..
The rule should be:
- we flush the tlb _after_ we have cleared it, but _before_ we insert the
new entry.
But I dunno. These things are damn subtle. Does this patch fix it for you?
I actually suspect we should do this as an arch-specific macro, and
totally replace the current "ptep_clear_flush_dirty()" with one that does
"ptep_clear_flush_dirty_and_set_wp()".
Because what I'd _really_ prefer to do on x86 (and probably on most other
sane architectures) is to do
- atomically replace the pte with the EXACT SAME ONE, but one that
has the writable bit clear.
bit_clear(_PAGE_BIT_RW, &(ptep)->pte_low);
- flush the TLB, making sure that all CPU's will no longer write to it:
flush_tlb_page(vma, address);
- finally, just fetch-and-clear the dirty bit (and since it's no longer
writable, nobody should be setting it any more)
ret = bit_clear(__PAGE_BIT_DIRTY, &(ptep)->pte_low);
and now we should be all done.
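The ordering of those three steps can be written out with C11 atomics standing in for the kernel's bit operations. This is a user-space sketch only: the bit positions are illustrative, `model_flush_tlb_page` is an empty stub, and `model_wrprotect_and_clean` is a hypothetical name, not a proposed kernel interface.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PTE_RW    (1ul << 1)  /* illustrative bit positions */
#define MODEL_PTE_DIRTY (1ul << 6)

/* Stub: a real kernel would invalidate the TLB entry on every CPU here. */
void model_flush_tlb_page(void)
{
}

/* Returns true if the entry was dirty; leaves it read-only and clean. */
bool model_wrprotect_and_clean(atomic_ulong *pte)
{
    unsigned long old;

    assert(pte != NULL);
    /* 1. Drop the writable bit, leaving every other bit untouched. */
    atomic_fetch_and(pte, ~MODEL_PTE_RW);
    /* 2. After the flush nobody can write through a stale entry. */
    model_flush_tlb_page();
    /* 3. Fetch-and-clear the dirty bit; nothing can set it again now. */
    old = atomic_fetch_and(pte, ~MODEL_PTE_DIRTY);
    return (old & MODEL_PTE_DIRTY) != 0;
}
```

The entry is never absent at any point in this sequence, which is the difference from the get-and-clear variant.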
But the "ptep_get_and_clear() + flush_tlb_page()" sequence should
hopefully also work.
Pls test.
Linus
----
diff --git a/mm/rmap.c b/mm/rmap.c
index d8a842a..eec8706 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -448,9 +448,10 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma)
goto unlock;
entry = ptep_get_and_clear(mm, address, pte);
+ flush_tlb_page(vma, address);
entry = pte_mkclean(entry);
entry = pte_wrprotect(entry);
- ptep_establish(vma, address, pte, entry);
+ set_pte_at(mm, address, pte, entry);
lazy_mmu_prot_update(entry);
ret = 1;
On 12/20/06, Linus Torvalds <[email protected]> wrote:
> On Tue, 19 Dec 2006, Linus Torvalds wrote:
> >
> > here's a totally new tangent on this: it's possible that user code is
> > simply BUGGY.
>
> Btw, here's a simpler test-program that actually shows the difference
> between 2.6.18 and 2.6.19 in action, and why it could explain why a
> program like rtorrent might show corruption behaviour that it didn't show
> before.
Kinda late to the discussion, but I guess I could summarize what
rtorrent actually does, or should be doing.
When downloading a new torrent, it will create the files and truncate
them to the final size. It will never call truncate after this and the
files will remain sparse until data is downloaded. A 'piece' is mapped
to memory using MAP_SHARED, which will be page aligned on single file
torrents but unlikely to be so on multi-file torrents.
So on multi-file torrents it'll often end up with two mappings
overlapping on one page, each of which only writes to its own part of
the page. These will then be sync'ed with MS_ASYNC, or MS_SYNC if low
on disk space. After that it might be unmapped, then mapped as
read-only.
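The overlapping-mapping case can be reproduced in isolation with something like the following. This is a sketch of the access pattern as described, not rtorrent's code: `piece_map`, `map_piece` and `write_two_pieces` are hypothetical names, and the piece length is whatever the caller picks.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700  /* expose POSIX declarations under strict -std=c11 */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

struct piece_map {
    void          *base;    /* page-aligned start of the mapping */
    size_t         length;  /* length of the whole mapping */
    unsigned char *data;    /* first byte that belongs to the piece */
};

/* Map one piece; mmap() needs a page-aligned offset, so round down. */
int map_piece(int fd, off_t offset, size_t len, struct piece_map *out)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t slack;
    void *base;

    assert(page > 0 && offset >= 0 && len > 0);
    slack = (size_t)(offset % page);
    base = mmap(NULL, len + slack, PROT_READ | PROT_WRITE, MAP_SHARED,
                fd, offset - (off_t)slack);
    if (base == MAP_FAILED)
        return -1;
    out->base = base;
    out->length = len + slack;
    out->data = (unsigned char *)base + slack;
    return 0;
}

/* Two adjacent pieces, each written through its own mapping. */
int write_two_pieces(const char *path, size_t piece_len)
{
    struct piece_map a, b;
    int rc = -1;
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0666);

    if (fd < 0)
        return -1;
    /* Truncate to the final size once, before anything is mapped. */
    if (ftruncate(fd, (off_t)(2 * piece_len)) == 0 &&
        map_piece(fd, 0, piece_len, &a) == 0) {
        if (map_piece(fd, (off_t)piece_len, piece_len, &b) == 0) {
            memset(a.data, 'A', piece_len);
            memset(b.data, 'B', piece_len);
            rc = (msync(a.base, a.length, MS_ASYNC) == 0 &&
                  msync(b.base, b.length, MS_ASYNC) == 0) ? 0 : -1;
            munmap(b.base, b.length);
        }
        munmap(a.base, a.length);
    }
    close(fd);
    return rc;
}
```

With a piece length that is not a multiple of the page size (6000, say), the page holding the piece boundary is mapped twice and dirtied through both mappings, yet no store goes past the end of the file.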
I haven't thought of asking if single file torrents are ok.
Rakshasa
On Tue, 2006-12-19 at 16:23 -0800, Linus Torvalds wrote:
>
> On Wed, 20 Dec 2006, Peter Zijlstra wrote:
> > On Mon, 2006-12-18 at 12:14 -0800, Linus Torvalds wrote:
> > > OR:
> > >
> > > - page_mkclean_one() is simply buggy.
> >
> > GOLD!
>
> Ok. I was looking at that, and I wondered..
>
> However, if that works, then I _think_ the correct sequence is the
> following..
>
> The rule should be:
> - we flush the tlb _after_ we have cleared it, but _before_ we insert the
> new entry.
>
> But I dunno. These things are damn subtle. Does this patch fix it for you?
I will try, but I had a look around the different architectures'
implementations of ptep_clear_flush_dirty() and saw that not all do the
actual flush. So if we go down this road perhaps we should introduce
another per-arch function that does the potential flush, like
flush_tlb_on_clear_dirty() or something like that.
Then we could write:
entry = ptep_get_and_clear(mm, address, ptep);
flush_tlb_on_clear_dirty(vma, address);
entry = pte_mkclean(entry);
entry = pte_wrprotect(entry);
set_pte_at(mm, address, ptep, entry);
> I actually suspect we should do this as an arch-specific macro, and
> totally replace the current "ptep_clear_flush_dirty()" with one that does
> "ptep_clear_flush_dirty_and_set_wp()".
>
> Because what I'd _really_ prefer to do on x86 (and probably on most other
> sane architectures) is to do
>
> - atomically replace the pte with the EXACT SAME ONE, but one that
> has the writable bit clear.
>
> bit_clear(_PAGE_BIT_RW, &(ptep)->pte_low);
>
> - flush the TLB, making sure that all CPU's will no longer write to it:
>
> flush_tlb_page(vma, address);
>
> - finally, just fetch-and-clear the dirty bit (and since it's no longer
> writable, nobody should be setting it any more)
>
> ret = bit_clear(__PAGE_BIT_DIRTY, &(ptep)->pte_low);
>
> and now we should be all done.
Hmm, should we not flush after clearing the dirty bit? That is, why does
ptep_clear_flush_dirty() need a flush after clearing that bit? Does it
leak through in the TLB copy?
Also, what is this page_test_and_clear_dirty() business, that seems to
be exclusively s390 btw. However they do seem to need this.
> But the "ptep_get_and_clear() + flush_tlb_page()" sequence should
> hopefully also work.
Yeah, probably, not optimally so on some archs that don't actually need
the flush though. And as above, I wonder about s390.
(added our s390 friends to the CC list)
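For readers following along, the x86 sequence Linus prefers in the quoted
text (clear the writable bit, flush, then fetch-and-clear the dirty bit)
can be sketched as a small userspace model. This is only an illustration
under stated assumptions: the MODEL_* bit values, struct model_mapping and
both helpers are invented for the sketch, the "TLB" is just a cached copy
of the entry, and none of this is kernel code.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Model-only bit values (hypothetical names, not taken from kernel headers). */
#define MODEL_PTE_PRESENT (1u << 0)
#define MODEL_PTE_RW      (1u << 1)
#define MODEL_PTE_DIRTY   (1u << 6)

struct model_mapping {
    _Atomic uint32_t pte; /* the in-memory page table entry */
    uint32_t tlb;         /* cached copy, standing in for one CPU's TLB entry */
};

/* Stand-in for flush_tlb_page(): drop the cached copy and reload it. */
static void model_flush_tlb(struct model_mapping *m)
{
    m->tlb = atomic_load(&m->pte);
}

/* Returns true if the entry was dirty before the call. */
static bool model_wrprotect_then_clean(struct model_mapping *m)
{
    uint32_t old;

    /* Step 1: atomically clear only the writable bit; the rest is untouched. */
    atomic_fetch_and(&m->pte, ~MODEL_PTE_RW);

    /* Step 2: flush, so no stale writable translation survives. */
    model_flush_tlb(m);

    /* Step 3: fetch-and-clear the dirty bit. */
    old = atomic_fetch_and(&m->pte, ~MODEL_PTE_DIRTY);
    return (old & MODEL_PTE_DIRTY) != 0;
}
```

Note that in this model the cached copy still carries the dirty bit after
step 3, because the only flush happens before the bit is cleared. That is
the situation the question about flushing after clearing the dirty bit is
asking about.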
On Wed, 2006-12-20 at 10:01 +0100, Peter Zijlstra wrote:
> I will try, but I had a look around the different architectures'
> implementations of ptep_clear_flush_dirty() and saw that not all of them
> do the actual flush. So if we go down this road, perhaps we should
> introduce another per-arch function that does the potential flush, like
> flush_tlb_on_clear_dirty() or something like that.
never mind, we do need an unconditional flush for changing the
protection too.
On Tue, 2006-12-19 at 16:23 -0800, Linus Torvalds wrote:
> Pls test.
Is good. Only s390 remains a question.
Another point, change_protection() also does a cache flush, should we
too?
> ----
> diff --git a/mm/rmap.c b/mm/rmap.c
> index d8a842a..eec8706 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -448,9 +448,10 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma)
> goto unlock;
>
> entry = ptep_get_and_clear(mm, address, pte);
flush_cache_page(vma, address, pte_pfn(entry));
> + flush_tlb_page(vma, address);
> entry = pte_mkclean(entry);
> entry = pte_wrprotect(entry);
> - ptep_establish(vma, address, pte, entry);
> + set_pte_at(mm, address, pte, entry);
> lazy_mmu_prot_update(entry);
> ret = 1;
>
>
> Hmm, should we not flush after clearing the dirty bit? That is, why does
> ptep_clear_flush_dirty() need a flush after clearing that bit? Does it
> leak through in the TLB copy?
afaics you need to
1) clear
2) flush
3) check and go to 1) if needed
to be race free.
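The clear / flush / re-check loop described above can be sketched in
userspace roughly as follows. Everything here is a hypothetical model:
struct model_pte, the racing_writes counter (which simulates another CPU
re-dirtying the entry through a stale translation while the flush is in
progress) and both functions are assumptions made for illustration only.

```c
#include <stdint.h>

#define MODEL_DIRTY (1u << 6) /* model-only bit value */

struct model_pte {
    uint32_t val;
    int racing_writes; /* how many times a simulated other CPU re-dirties */
    int flushes;       /* number of flushes performed so far */
};

/* Stand-in for a TLB flush; a simulated racing writer may slip in here. */
static void model_flush(struct model_pte *p)
{
    p->flushes++;
    if (p->racing_writes > 0) {
        p->racing_writes--;
        p->val |= MODEL_DIRTY;
    }
}

/* Returns the number of clear+flush rounds that were needed. */
static int model_clear_dirty_racefree(struct model_pte *p)
{
    int rounds = 0;

    while (p->val & MODEL_DIRTY) { /* 3) check, and go to 1) if needed */
        p->val &= ~MODEL_DIRTY;    /* 1) clear */
        model_flush(p);            /* 2) flush */
        rounds++;
    }
    return rounds;
}
```

With one simulated racing write the loop runs twice; without a race it
runs once, and for an entry that is already clean it does nothing.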
fix page_mkclean_one()
it had several issues:
- it failed to flush the cache
- it failed to flush the tlb
- it failed to do s390 (s390 guys, please verify this is now correct)
Also, clear in a loop to ensure SMP safeness as suggested by Arjan.
Signed-off-by: Peter Zijlstra <[email protected]>
---
mm/rmap.c | 29 +++++++++++++++--------------
1 file changed, 15 insertions(+), 14 deletions(-)
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -432,7 +432,7 @@ static int page_mkclean_one(struct page
{
struct mm_struct *mm = vma->vm_mm;
unsigned long address;
- pte_t *pte, entry;
+ pte_t *ptep;
spinlock_t *ptl;
int ret = 0;
@@ -440,22 +440,23 @@ static int page_mkclean_one(struct page
if (address == -EFAULT)
goto out;
- pte = page_check_address(page, mm, address, &ptl);
- if (!pte)
+ ptep = page_check_address(page, mm, address, &ptl);
+ if (!ptep)
goto out;
- if (!pte_dirty(*pte) && !pte_write(*pte))
- goto unlock;
-
- entry = ptep_get_and_clear(mm, address, pte);
- entry = pte_mkclean(entry);
- entry = pte_wrprotect(entry);
- ptep_establish(vma, address, pte, entry);
- lazy_mmu_prot_update(entry);
- ret = 1;
+ while (pte_dirty(*ptep) || pte_write(*ptep)) {
+ pte_t entry = ptep_get_and_clear(mm, address, ptep);
+ flush_cache_page(vma, address, pte_pfn(entry));
+ flush_tlb_page(vma, address);
+ (void)page_test_and_clear_dirty(page); /* do the s390 thing */
+ entry = pte_wrprotect(entry);
+ entry = pte_mkclean(entry);
+ set_pte_at(mm, address, ptep, entry);
+ lazy_mmu_prot_update(entry);
+ ret = 1;
+ }
-unlock:
- pte_unmap_unlock(pte, ptl);
+ pte_unmap_unlock(ptep, ptl);
out:
return ret;
}
On 20/12/06, Peter Zijlstra <[email protected]> wrote:
>
> fix page_mkclean_one()
>
> it had several issues:
> - it failed to flush the cache
> - it failed to flush the tlb
> - it failed to do s390 (s390 guys, please verify this is now correct)
>
> Also, clear in a loop to ensure SMP safeness as suggested by Arjan.
>
> Signed-off-by: Peter Zijlstra <[email protected]>
> ---
> mm/rmap.c | 29 +++++++++++++++--------------
> 1 file changed, 15 insertions(+), 14 deletions(-)
>
> Index: linux-2.6/mm/rmap.c
> ===================================================================
> --- linux-2.6.orig/mm/rmap.c
> +++ linux-2.6/mm/rmap.c
> @@ -432,7 +432,7 @@ static int page_mkclean_one(struct page
> {
> struct mm_struct *mm = vma->vm_mm;
> unsigned long address;
> - pte_t *pte, entry;
> + pte_t *ptep;
> spinlock_t *ptl;
> int ret = 0;
>
> @@ -440,22 +440,23 @@ static int page_mkclean_one(struct page
> if (address == -EFAULT)
> goto out;
>
> - pte = page_check_address(page, mm, address, &ptl);
> - if (!pte)
> + ptep = page_check_address(page, mm, address, &ptl);
> + if (!ptep)
> goto out;
>
> - if (!pte_dirty(*pte) && !pte_write(*pte))
> - goto unlock;
> -
> - entry = ptep_get_and_clear(mm, address, pte);
> - entry = pte_mkclean(entry);
> - entry = pte_wrprotect(entry);
> - ptep_establish(vma, address, pte, entry);
> - lazy_mmu_prot_update(entry);
> - ret = 1;
> + while (pte_dirty(*ptep) || pte_write(*ptep)) {
> + pte_t entry = ptep_get_and_clear(mm, address, ptep);
> + flush_cache_page(vma, address, pte_pfn(entry));
> + flush_tlb_page(vma, address);
> + (void)page_test_and_clear_dirty(page); /* do the s390 thing */
> + entry = pte_wrprotect(entry);
> + entry = pte_mkclean(entry);
> + set_pte_at(mm, address, ptep, entry);
> + lazy_mmu_prot_update(entry);
> + ret = 1;
> + }
>
Having the assignment of "ret = 1;" inside the loop seems a little
pointless. Perhaps gcc can optimize it, but still, that assignment
really only needs to happen once outside the loop.
> -unlock:
> - pte_unmap_unlock(pte, ptl);
> + pte_unmap_unlock(ptep, ptl);
> out:
> return ret;
> }
>
--
Jesper Juhl <[email protected]>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html
On Wed, 2006-12-20 at 12:39 +0100, Jesper Juhl wrote:
> On 20/12/06, Peter Zijlstra <[email protected]> wrote:
> >
> > fix page_mkclean_one()
> >
> > it had several issues:
> > - it failed to flush the cache
> > - it failed to flush the tlb
> > - it failed to do s390 (s390 guys, please verify this is now correct)
> >
> > Also, clear in a loop to ensure SMP safeness as suggested by Arjan.
> >
> > Signed-off-by: Peter Zijlstra <[email protected]>
> > ---
> > mm/rmap.c | 29 +++++++++++++++--------------
> > 1 file changed, 15 insertions(+), 14 deletions(-)
> >
> > Index: linux-2.6/mm/rmap.c
> > ===================================================================
> > --- linux-2.6.orig/mm/rmap.c
> > +++ linux-2.6/mm/rmap.c
> > @@ -432,7 +432,7 @@ static int page_mkclean_one(struct page
> > {
> > struct mm_struct *mm = vma->vm_mm;
> > unsigned long address;
> > - pte_t *pte, entry;
> > + pte_t *ptep;
> > spinlock_t *ptl;
> > int ret = 0;
> >
> > @@ -440,22 +440,23 @@ static int page_mkclean_one(struct page
> > if (address == -EFAULT)
> > goto out;
> >
> > - pte = page_check_address(page, mm, address, &ptl);
> > - if (!pte)
> > + ptep = page_check_address(page, mm, address, &ptl);
> > + if (!ptep)
> > goto out;
> >
> > - if (!pte_dirty(*pte) && !pte_write(*pte))
> > - goto unlock;
> > -
> > - entry = ptep_get_and_clear(mm, address, pte);
> > - entry = pte_mkclean(entry);
> > - entry = pte_wrprotect(entry);
> > - ptep_establish(vma, address, pte, entry);
> > - lazy_mmu_prot_update(entry);
> > - ret = 1;
> > + while (pte_dirty(*ptep) || pte_write(*ptep)) {
> > + pte_t entry = ptep_get_and_clear(mm, address, ptep);
> > + flush_cache_page(vma, address, pte_pfn(entry));
> > + flush_tlb_page(vma, address);
> > + (void)page_test_and_clear_dirty(page); /* do the s390 thing */
> > + entry = pte_wrprotect(entry);
> > + entry = pte_mkclean(entry);
> > + set_pte_at(mm, address, ptep, entry);
> > + lazy_mmu_prot_update(entry);
> > + ret = 1;
> > + }
> >
> Having the assignment of "ret = 1;" inside the loop seems a little
> pointless. Perhaps gcc can optimize it, but still, that assignment
> really only needs to happen once outside the loop.
Sure, but I was hoping gcc was smart enough. Placing it outside the loop
would require an extra if stmt. Also the chance this loop will actually
be traversed more than once is _very_ small.
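To make the trade-off concrete, here is a tiny userspace sketch of the two
placements of the assignment. Both functions and the pending counter are
hypothetical stand-ins (pending models how many passes the loop takes);
the point is only that hoisting the assignment costs an extra branch while
the result stays the same.

```c
/* Variant as posted: assign inside the loop body. */
static int model_ret_inside(int pending)
{
    int ret = 0;

    while (pending > 0) {
        pending--; /* stands in for one clear+flush pass */
        ret = 1;
    }
    return ret;
}

/* Hoisted variant: the assignment happens once, guarded by an extra if. */
static int model_ret_outside(int pending)
{
    int ret = 0;

    if (pending > 0)
        ret = 1;
    while (pending > 0)
        pending--; /* stands in for one clear+flush pass */
    return ret;
}
```

Both variants return 0 when the loop body never runs and 1 otherwise.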
On 20/12/06, Peter Zijlstra <[email protected]> wrote:
> On Wed, 2006-12-20 at 12:39 +0100, Jesper Juhl wrote:
> > Having the assignment of "ret = 1;" inside the loop seems a little
> > pointless. Perhaps gcc can optimize it, but still, that assignment
> > really only needs to happen once outside the loop.
>
> Sure, but I was hoping gcc was smart enough. Placing it outside the loop
> would require an extra if stmt. Also the chance this loop will actually
> be traversed more than once is _very_ small.
>
Alright - I just spotted it and thought I'd point it out :-)
--
Jesper Juhl <[email protected]>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html
On Wed, 20 Dec 2006, Peter Zijlstra wrote:
>
> fix page_mkclean_one()
Congratulations on getting to the bottom of it, Peter (if you have:
I haven't digested enough of the thread to tell). I'm mostly offline at
present, no time for dialogue, I'll throw out a few remarks and run...
>
> it had several issues:
> - it failed to flush the cache
It's unclear to me why it should need to flush the cache, but I don't
know much about that, and mprotect does flush the cache in advance -
I think others will tell you that if it does need to be flushed, it must
be flushed while there's still a valid pte (on some arches at least).
> - it failed to flush the tlb
Eh? It flushed the TLB inside ptep_establish, didn't it?
I guess you mean you've found a race before it flushed the TLB.
> - it failed to do s390 (s390 guys, please verify this is now correct)
Hmm, I thought we cleared it with them back at the time.
>
> Also, clear in a loop to ensure SMP safeness as suggested by Arjan.
Yikes. Well, please compare with mprotect's change_pte_range. I think
I took that as the relevant standard when checking your implementation,
and back then satisfied myself that what you were doing was equivalent.
If page_mkclean_one is now agreed to be significantly defective, then
I suspect change_pte_range is also; perhaps others too.
(But I haven't found time to do more than skim through the thread,
I've not thought through the issues at all: I am surprised that it's
now found defective, we looked at it long and hard back then.)
And trivial point: please undo those distracting "pte" to "ptep" mods:
if you want to call pte pointers ptep, throughout rmap.c and throughout
mm, that's another patch entirely (which I won't welcome, but others may).
Hugh
On Wed, 2006-12-20 at 13:00 +0000, Hugh Dickins wrote:
> On Wed, 20 Dec 2006, Peter Zijlstra wrote:
> >
> > fix page_mkclean_one()
>
> Congratulations on getting to the bottom of it, Peter (if you have:
> I haven't digested enough of the thread to tell).
Well, I thought I understood, you just shattered that.
> I'm mostly offline at
> present, no time for dialogue, I'll throw out a few remarks and run...
I wondered where you were ;-) Enjoy your time away from the computer.
> >
> > it had several issues:
> > - it failed to flush the cache
>
> It's unclear to me why it should need to flush the cache, but I don't
> know much about that, and mprotect does flush the cache in advance -
> I think others will tell you that if it does need to be flushed,
I was still thinking about why exactly, but indeed since mprotect does I
thought it prudent to also do it.
> it must
> be flushed while there's still a valid pte (on some arches at least).
Ah, good point, makes sense I guess.
> > - it failed to flush the tlb
>
> Eh? It flushed the TLB inside ptep_establish, didn't it?
> I guess you mean you've found a race before it flushed the TLB.
Hmm, quite right indeed. I missed that. So moving the flush inside the
pte cleared section closed a race. It seems I must have a long hard look
at these architecture manuals...
> > - it failed to do s390 (s390 guys, please verify this is now correct)
>
> Hmm, I thought we cleared it with them back at the time.
/me queries mail folder...
can't seem to find it.
> >
> > Also, clear in a loop to ensure SMP safeness as suggested by Arjan.
>
> Yikes. Well, please compare with mprotect's change_pte_range. I think
> I took that as the relevant standard when checking your implementation,
> and back then satisfied myself that what you were doing was equivalent.
> If page_mkclean_one is now agreed to be significantly defective, then
> I suspect change_pte_range is also; perhaps others too.
Arjan argued that mprotect and msync would mostly race with themselves
in userspace.
> (But I haven't found time to do more than skim through the thread,
> I've not thought through the issues at all: I am surprised that it's
> now found defective, we looked at it long and hard back then.)
---
page_mkclean_one() fix
it had several issues:
- it failed to flush the cache
- a race wrt tlb flushing
- it failed to do s390 (s390 guys, please verify this is now correct)
Also, clear in a loop to ensure SMP safeness as suggested by Arjan.
Signed-off-by: Peter Zijlstra <[email protected]>
---
mm/rmap.c | 23 +++++++++++++----------
1 file changed, 13 insertions(+), 10 deletions(-)
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -432,7 +432,7 @@ static int page_mkclean_one(struct page
{
struct mm_struct *mm = vma->vm_mm;
unsigned long address;
- pte_t *pte, entry;
+ pte_t *pte;
spinlock_t *ptl;
int ret = 0;
@@ -444,17 +444,20 @@ static int page_mkclean_one(struct page
if (!pte)
goto out;
- if (!pte_dirty(*pte) && !pte_write(*pte))
- goto unlock;
+ while (pte_dirty(*pte) || pte_write(*pte)) {
+ pte_t entry;
- entry = ptep_get_and_clear(mm, address, pte);
- entry = pte_mkclean(entry);
- entry = pte_wrprotect(entry);
- ptep_establish(vma, address, pte, entry);
- lazy_mmu_prot_update(entry);
- ret = 1;
+ flush_cache_page(vma, address, pte_pfn(*pte));
+ entry = ptep_get_and_clear(mm, address, pte);
+ flush_tlb_page(vma, address);
+ (void)page_test_and_clear_dirty(page); /* do the s390 thing */
+ entry = pte_wrprotect(entry);
+ entry = pte_mkclean(entry);
+ set_pte_at(mm, address, pte, entry);
+ lazy_mmu_prot_update(entry);
+ ret = 1;
+ }
-unlock:
pte_unmap_unlock(pte, ptl);
out:
return ret;
On Wed, 2006-12-20 at 00:42 +0100, Peter Zijlstra wrote:
> On Mon, 2006-12-18 at 12:14 -0800, Linus Torvalds wrote:
>
> > OR:
> >
> > - page_mkclean_one() is simply buggy.
>
> GOLD!
>
> it seems to work with all this (full diff against current git).
>
> /me rebuilds full kernel to make sure...
> reboot...
> test... pff the tension...
> yay, still good!
>
> Andrei; would you please verify.
I have corrupted files.
> The magic seems to be in the extra tlb flush after clearing the dirty
> bit. Just too bad ptep_clear_flush_dirty() needs ptep not entry.
>
> diff --git a/drivers/connector/connector.c b/drivers/connector/connector.c
> index 5e7cd45..2b8893b 100644
> --- a/drivers/connector/connector.c
> +++ b/drivers/connector/connector.c
> @@ -135,8 +135,7 @@ static int cn_call_callback(struct cn_msg *msg, void (*destruct_data)(void *), v
> spin_lock_bh(&dev->cbdev->queue_lock);
> list_for_each_entry(__cbq, &dev->cbdev->queue_list, callback_entry) {
> if (cn_cb_equal(&__cbq->id.id, &msg->id)) {
> - if (likely(!test_bit(WORK_STRUCT_PENDING,
> - &__cbq->work.work.management) &&
> + if (likely(!delayed_work_pending(&__cbq->work) &&
> __cbq->data.ddata == NULL)) {
> __cbq->data.callback_priv = msg;
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index d1f1b54..263f88e 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *page)
> int ret = 0;
>
> BUG_ON(!PageLocked(page));
> - if (PageWriteback(page))
> + if (PageDirty(page) || PageWriteback(page))
> return 0;
>
> if (mapping == NULL) { /* can this still happen? */
> @@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *page)
> spin_lock(&mapping->private_lock);
> ret = drop_buffers(page, &buffers_to_free);
> spin_unlock(&mapping->private_lock);
> - if (ret) {
> - /*
> - * If the filesystem writes its buffers by hand (eg ext3)
> - * then we can have clean buffers against a dirty page. We
> - * clean the page here; otherwise later reattachment of buffers
> - * could encounter a non-uptodate page, which is unresolvable.
> - * This only applies in the rare case where try_to_free_buffers
> - * succeeds but the page is not freed.
> - *
> - * Also, during truncate, discard_buffer will have marked all
> - * the page's buffers clean. We discover that here and clean
> - * the page also.
> - */
> - if (test_clear_page_dirty(page))
> - task_io_account_cancelled_write(PAGE_CACHE_SIZE);
> - }
> out:
> if (buffers_to_free) {
> struct buffer_head *bh = buffers_to_free;
> diff --git a/mm/memory.c b/mm/memory.c
> index c00bac6..60e0945 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1842,6 +1842,33 @@ void unmap_mapping_range(struct address_space *mapping,
> }
> EXPORT_SYMBOL(unmap_mapping_range);
>
> +static void check_last_page(struct address_space *mapping, loff_t size)
> +{
> + pgoff_t index;
> + unsigned int offset;
> + struct page *page;
> +
> + if (!mapping)
> + return;
> + offset = size & ~PAGE_MASK;
> + if (!offset)
> + return;
> + index = size >> PAGE_SHIFT;
> + page = find_lock_page(mapping, index);
> + if (page) {
> + unsigned int check = 0;
> + unsigned char *kaddr = kmap_atomic(page, KM_USER0);
> + do {
> + check += kaddr[offset++];
> + } while (offset < PAGE_SIZE);
> + kunmap_atomic(kaddr, KM_USER0);
> + unlock_page(page);
> + page_cache_release(page);
> + if (check)
> + printk(KERN_ERR "%s: BADNESS: truncate check %u\n", current->comm, check);
> + }
> +}
> +
> /**
> * vmtruncate - unmap mappings "freed" by truncate() syscall
> * @inode: inode of the file used
> @@ -1875,6 +1902,7 @@ do_expand:
> goto out_sig;
> if (offset > inode->i_sb->s_maxbytes)
> goto out_big;
> + check_last_page(mapping, inode->i_size);
> i_size_write(inode, offset);
>
> out_truncate:
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 237107c..f561e72 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -957,7 +957,7 @@ int test_set_page_writeback(struct page *page)
> EXPORT_SYMBOL(test_set_page_writeback);
>
> /*
> - * Return true if any of the pages in the mapping are marged with the
> + * Return true if any of the pages in the mapping are marked with the
> * passed tag.
> */
> int mapping_tagged(struct address_space *mapping, int tag)
> diff --git a/mm/rmap.c b/mm/rmap.c
> index d8a842a..900229a 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -432,7 +432,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma)
> {
> struct mm_struct *mm = vma->vm_mm;
> unsigned long address;
> - pte_t *pte, entry;
> + pte_t *ptep, entry;
> spinlock_t *ptl;
> int ret = 0;
>
> @@ -440,22 +440,23 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma)
> if (address == -EFAULT)
> goto out;
>
> - pte = page_check_address(page, mm, address, &ptl);
> - if (!pte)
> + ptep = page_check_address(page, mm, address, &ptl);
> + if (!ptep)
> goto out;
>
> - if (!pte_dirty(*pte) && !pte_write(*pte))
> + if (!pte_dirty(*ptep) && !pte_write(*ptep))
> goto unlock;
>
> - entry = ptep_get_and_clear(mm, address, pte);
> - entry = pte_mkclean(entry);
> + entry = ptep_get_and_clear(mm, address, ptep);
> entry = pte_wrprotect(entry);
> - ptep_establish(vma, address, pte, entry);
> + ptep_establish(vma, address, ptep, entry);
> + ret = ptep_clear_flush_dirty(vma, address, ptep) ||
> + page_test_and_clear_dirty(page);
> lazy_mmu_prot_update(entry);
> ret = 1;
>
> unlock:
> - pte_unmap_unlock(pte, ptl);
> + pte_unmap_unlock(ptep, ptl);
> out:
> return ret;
> }
>
>
On Wed, 2006-12-20 at 16:15 +0200, Andrei Popa wrote:
> On Wed, 2006-12-20 at 00:42 +0100, Peter Zijlstra wrote:
> > On Mon, 2006-12-18 at 12:14 -0800, Linus Torvalds wrote:
> >
> > > OR:
> > >
> > > - page_mkclean_one() is simply buggy.
> >
> > GOLD!
> >
> > it seems to work with all this (full diff against current git).
> >
> > /me rebuilds full kernel to make sure...
> > reboot...
> > test... pff the tension...
> > yay, still good!
> >
> > Andrei; would you please verify.
>
> I have corrupted files.
drat; and with this patch:
http://lkml.org/lkml/2006/12/20/112
/me goes rebuild his kernel and try more than 3 times
On Wed, 2006-12-20 at 10:01 +0100, Peter Zijlstra wrote:
> Also, what is this page_test_and_clear_dirty() business? It seems to be
> exclusively s390, btw. However, they do seem to need it.
>
> > But the "ptep_get_and_clear() + flush_tlb_page()" sequence should
> > hopefully also work.
>
> Yeah, probably, not optimally so on some archs that don't actually need
> the flush though. And as above, I wonder about s390.
Simple, the s390 architecture does not keep the dirty bit in the pte but
in something called the storage key. For each physical page there is one
associated storage key. It is accessed with special instructions like
"iske", "sske" or "rrbe". To clear the dirty bit the storage key of a
page is read with iske, the bit is cleared and the storage key is stored
back with sske. That means that clearing the dirty bit is not an atomic
operation. rrbe is used to test and clear the referenced bit (young/old
information) and is atomic with regard to other storage key operations. If
you think about it, the storage keys are quite nice for the operating
system, page_referenced() can be implemented with a single test
"page_test_and_clear_young()". No need to read all the ptes pointing to
the page. The downside is that the storage keys have a cost on the
hardware side.
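As a rough illustration of the description above, a per-page storage key
can be modelled in userspace like this. It is a sketch under assumptions:
the array, the MODEL_* bit positions and all function names are invented
for the example, and the real iske/sske/rrbe instructions are only
mimicked by plain C statements.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_NR_PAGES       4
#define MODEL_KEY_REFERENCED 0x04u /* placeholder bit position */
#define MODEL_KEY_CHANGED    0x02u /* placeholder bit position */

/* One key per physical page, no matter how many ptes map that page. */
static uint8_t model_storage_key[MODEL_NR_PAGES];

/* A store through any mapping marks the page referenced and changed. */
static void model_store_to_page(size_t pfn)
{
    model_storage_key[pfn] |= MODEL_KEY_REFERENCED | MODEL_KEY_CHANGED;
}

/* Dirty bit: read the key, clear the bit, store it back (not atomic). */
static bool model_page_test_and_clear_dirty(size_t pfn)
{
    uint8_t key = model_storage_key[pfn]; /* like "iske" */

    if (!(key & MODEL_KEY_CHANGED))
        return false;
    model_storage_key[pfn] = (uint8_t)(key & ~MODEL_KEY_CHANGED); /* like "sske" */
    return true;
}

/* Referenced bit: modelled as a single test-and-reset, like "rrbe". */
static bool model_page_test_and_clear_young(size_t pfn)
{
    bool young = (model_storage_key[pfn] & MODEL_KEY_REFERENCED) != 0;

    model_storage_key[pfn] =
        (uint8_t)(model_storage_key[pfn] & ~MODEL_KEY_REFERENCED);
    return young;
}
```

The young/old test needs only the page number, which is why no walk over
the ptes mapping the page is required in this scheme.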
--
blue skies,
Martin.
Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH
"Reality continues to ruin my life." - Calvin.
On Wed, 2006-12-20 at 12:26 +0100, Peter Zijlstra wrote:
> fix page_mkclean_one()
>
> it had several issues:
> - it failed to flush the cache
> - it failed to flush the tlb
> - it failed to do s390 (s390 guys, please verify this is now correct)
Sorry, page_mkclean is broken for s390. But it has already been broken
before your change. It is only more broken now.
> @@ -440,22 +440,23 @@ static int page_mkclean_one(struct page
> if (address == -EFAULT)
> goto out;
>
> - pte = page_check_address(page, mm, address, &ptl);
> - if (!pte)
> + ptep = page_check_address(page, mm, address, &ptl);
> + if (!ptep)
> goto out;
>
> - if (!pte_dirty(*pte) && !pte_write(*pte))
> - goto unlock;
> -
> - entry = ptep_get_and_clear(mm, address, pte);
> - entry = pte_mkclean(entry);
> - entry = pte_wrprotect(entry);
> - ptep_establish(vma, address, pte, entry);
> - lazy_mmu_prot_update(entry);
> - ret = 1;
> + while (pte_dirty(*ptep) || pte_write(*ptep)) {
> + pte_t entry = ptep_get_and_clear(mm, address, ptep);
> + flush_cache_page(vma, address, pte_pfn(entry));
> + flush_tlb_page(vma, address);
> + (void)page_test_and_clear_dirty(page); /* do the s390 thing */
> + entry = pte_wrprotect(entry);
> + entry = pte_mkclean(entry);
> + set_pte_at(mm, address, ptep, entry);
> + lazy_mmu_prot_update(entry);
> + ret = 1;
> + }
>
> -unlock:
> - pte_unmap_unlock(pte, ptl);
> + pte_unmap_unlock(ptep, ptl);
> out:
> return ret;
> }
1) pte_dirty() is always false. The reason is that s390 keeps the dirty
bit information in the storage key and not in the pte. If pte_write is
false as well, nothing is done. There really should be an
if (page_test_and_clear_dirty(page))
ret = 1;
at the end of page_mkclean.
2) Please use ptep_clear_flush instead of ptep_get_and_clear +
flush_tlb_page. The former uses an optimization on s390 that flushes
just one TLB, the latter flushes every TLB of the current mm.
My attempt to fix this up is attached. It moves the flush_cache_page after
the flush_tlb_page (see asm-generic/pgtable.h for the generic definition
of ptep_clear_flush that is used for i386). I hope this doesn't break
anything else.
--
blue skies,
Martin.
Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH
"Reality continues to ruin my life." - Calvin.
---
mm/rmap.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff -urpN linux-2.6/mm/rmap.c linux-2.6-mkclean/mm/rmap.c
--- linux-2.6/mm/rmap.c 2006-12-20 15:49:01.000000000 +0100
+++ linux-2.6-mkclean/mm/rmap.c 2006-12-20 15:51:14.000000000 +0100
@@ -445,10 +445,8 @@ static int page_mkclean_one(struct page
goto out;
while (pte_dirty(*ptep) || pte_write(*ptep)) {
- pte_t entry = ptep_get_and_clear(mm, address, ptep);
+ pte_t entry = ptep_clear_flush(vma, address, ptep);
flush_cache_page(vma, address, pte_pfn(entry));
- flush_tlb_page(vma, address);
- (void)page_test_and_clear_dirty(page); /* do the s390 thing */
entry = pte_wrprotect(entry);
entry = pte_mkclean(entry);
set_pte_at(mm, address, ptep, entry);
@@ -490,6 +488,8 @@ int page_mkclean(struct page *page)
if (mapping)
ret = page_mkclean_file(mapping, page);
}
+ if (page_test_and_clear_dirty(page))
+ ret = 1;
return ret;
}
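Why the extra check at the end of page_mkclean() matters on s390 can be
shown with a small userspace model. This is a sketch only: the structs and
function names are hypothetical, the pte carries a dirty flag that is
never set (as described in point 1), and the per-page flag stands in for
the storage key.

```c
#include <stdbool.h>
#include <stddef.h>

struct model_pte  { bool write; bool dirty; };
struct model_page { bool key_dirty; }; /* dirty state kept per page */

/* Per-mapping pass: only acts if the pte is writable or dirty. */
static int model_mkclean_one(struct model_pte *pte)
{
    int ret = 0;

    while (pte->dirty || pte->write) {
        pte->write = false;
        pte->dirty = false;
        ret = 1;
    }
    return ret;
}

/* Whole-page pass, with the suggested final check on the per-page state. */
static int model_page_mkclean(struct model_page *page,
                              struct model_pte *ptes, size_t n)
{
    int ret = 0;

    for (size_t i = 0; i < n; i++)
        if (model_mkclean_one(&ptes[i]))
            ret = 1;

    /* Without this, a page that is dirty only in its key is missed. */
    if (page->key_dirty) {
        page->key_dirty = false;
        ret = 1;
    }
    return ret;
}
```

In the model, a page whose ptes are all read-only and clean still reports
1 when its per-page dirty flag is set, which the per-mapping loop alone
would not detect.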
On Wed, 2006-12-20 at 15:23 +0100, Peter Zijlstra wrote:
> On Wed, 2006-12-20 at 16:15 +0200, Andrei Popa wrote:
> > On Wed, 2006-12-20 at 00:42 +0100, Peter Zijlstra wrote:
> > > On Mon, 2006-12-18 at 12:14 -0800, Linus Torvalds wrote:
> > >
> > > > OR:
> > > >
> > > > - page_mkclean_one() is simply buggy.
> > >
> > > GOLD!
> > >
> > > it seems to work with all this (full diff against current git).
> > >
> > > /me rebuilds full kernel to make sure...
> > > reboot...
> > > test... pff the tension...
> > > yay, still good!
> > >
> > > Andrei; would you please verify.
> >
> > I have corrupted files.
>
> drat; and with this patch:
> http://lkml.org/lkml/2006/12/20/112
Hash check on download completion found bad chunks, consider using
"safe_sync".
>
> /me goes rebuild his kernel and try more than 3 times
>
On Wed, 2006-12-20 at 18:30 +0200, Andrei Popa wrote:
> On Wed, 2006-12-20 at 15:23 +0100, Peter Zijlstra wrote:
> > On Wed, 2006-12-20 at 16:15 +0200, Andrei Popa wrote:
> > > On Wed, 2006-12-20 at 00:42 +0100, Peter Zijlstra wrote:
> > > > On Mon, 2006-12-18 at 12:14 -0800, Linus Torvalds wrote:
> > > >
> > > > > OR:
> > > > >
> > > > > - page_mkclean_one() is simply buggy.
> > > >
> > > > GOLD!
> > > >
> > > > it seems to work with all this (full diff against current git).
> > > >
> > > > /me rebuilds full kernel to make sure...
> > > > reboot...
> > > > test... pff the tension...
> > > > yay, still good!
> > > >
> > > > Andrei; would you please verify.
> > >
> > > I have corrupted files.
> >
> > drat; and with this patch:
> > http://lkml.org/lkml/2006/12/20/112
>
> Hash check on download completion found bad chunks, consider using
> "safe_sync".
*sigh* back to square 1.
and I need to look at my reproduction case ;-(
Thanks for testing.
* Peter Zijlstra <[email protected]> [2006-12-20 14:56]:
> page_mkclean_one() fix
This patch doesn't fix my problem (apt segfaults on ARM because its
database is corrupted).
--
Martin Michlmayr
http://www.cyrius.com/
On Wed, 20 Dec 2006, Martin Michlmayr wrote:
> * Peter Zijlstra <[email protected]> [2006-12-20 14:56]:
> > page_mkclean_one() fix
>
> This patch doesn't fix my problem (apt segfaults on ARM because its
> database is corrupted).
Can you remind us:
- your ARM is UP, right? Do you have PREEMPT on?
- This is probably a stupid question, but you did make sure that the
database was ok (with some rebuild command) and that you didn't have
preexisting corruption?
Anyway, the page_mkclean_one() fixes (along with _most_ things we've
looked at) shouldn't matter on UP, at least certainly not without PREEMPT.
Linus
* Linus Torvalds <[email protected]> [2006-12-20 09:35]:
> Can you remind us:
> - your ARM is UP, right? Do you have PREEMPT on?
It's UP and PREEMPT is not set. I used 2.6.19 plus the patch that has
been posted.
> - This is probably a stupid question, but you did make sure that the
> database was ok (with some rebuild command) and that you didn't have
> preexisting corruption?
Yes, my test case is to install Debian on the ARM machine so the
database is created fresh. While the corruption always triggers
during a fresh installation, it's much harder to see in a running
system. Some people see it on their system but I haven't found a 100%
working recipe to reproduce it yet given a working system; doing a new
installation seems to trigger it all the time though.
> Anyway, the page_mkclean_one() fixes (along with _most_ things we've
> looked at) shouldn't matter on UP, at least certainly not without
> PREEMPT.
Hmm. So what about UP without PREEMPT then...
Maybe the following information is helpful in some way: remember how I
said that we have applied 6 mm patches to 2.6.18 in Debian? According
to Gordon Farquharson, who's helping me a great deal with testing
installation on this ARM machine (Linksys NSLU2), the corruption
doesn't always show up when you only apply
mm-tracking-shared-dirty-pages.patch to 2.6.18 but it shows up all the
time with all six patches applied. As a reminder, the 6 patches we
apply are:
mm-tracking-shared-dirty-pages.patch
mm-balance-dirty-pages.patch
mm-optimize-mprotect.patch
mm-install_page-cleanup.patch
mm-do_wp_page-fixup.patch
mm-msync-cleanup.patch
--
Martin Michlmayr
http://www.cyrius.com/
Peter Zijlstra wrote:
> On Tue, 2006-12-19 at 10:59 -0800, Linus Torvalds wrote:
> > On Tue, 19 Dec 2006, Linus Torvalds wrote:
> > > here's a totally new tangent on this: it's possible that user code is
> > > simply BUGGY.
>
> I'm sad to say this doesn't trigger :-(
Hi all,
I ran it a number of times on 2.6.16-1.2115_FC4 and always got
./a.out | od -x
0000000 aaaa aaaa aaaa aaaa aaaa aaaa aaaa aaaa
0000020 aaaa aaaa 5555 5555 5555 5555 5555 5555
0000040 5555 5555 5555 5555
but running it on 2.6.19-rc5 I always get zeros in the middle.
Steve
--
"They that give up essential liberty to obtain temporary safety,
deserve neither liberty nor safety." (Ben Franklin)
"The course of history shows that as a government grows, liberty
decreases." (Thomas Jefferson)
On Wed, 20 Dec 2006, Martin Michlmayr wrote:
>
> > Anyway, the page_mkclean_one() fixes (along with _most_ things we've
> > looked at) shouldn't matter on UP, at least certainly not without
> > PREEMPT.
>
> Hmm. So what about UP without PREEMPT then...
So that's why I've been harping on the fact that I think we simply do
really wrong things with PG_dirty at times, and that I find it confusing
that there's
- clear_page_dirty_for_io(): this one makes sense. The name makes sense,
and the implementation makes sense (which is _not_ the same thing as
"works", of course - "makes sense" does not mean "no bugs" ;).
- test_clear_page_dirty: this one makes no sense WHATSOEVER, except as a
buggy way to do the "_for_io()" case. This makes sense neither from a
concept angle _nor_ an implementation angle (the whole "test_" part is
nonsense: why would anybody care? What operation does this? What can it
do if the page is dirty?). It also has no sensible thing it can do to the
page tables.
- clear_page_dirty(): this one makes sense only as a "cancel" operation,
for vmtruncate and friends. It's different from the "_for_io()" case in
several ways:
(a) we should have unmapped such pages forcibly _anyway_, so
looking at the PTEs makes no sense.
(b) because we're not starting IO, we don't have the "mark for
writeback" case, and we need to clear the dirty tags from the
radix trees etc since the writeback logic won't do it for us.
The _implementation_ of "clear_page_dirty()" doesn't make sense, but
the concept does.
I've repeated that theory a few times, but neither Andrew nor Nick seem to
really believe in it. So I'll just repeat it once more, only to be shot
down. I think we have three operations, one of which is totally idiotic
and senseless, and one of which is just badly implemented.
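To make the distinction between the "_for_io()" case and the "cancel" case
concrete, here is a minimal userspace sketch, not kernel code. The
struct model_page type, its fields and both model_* functions are made up
for illustration; they only mirror the rules described above (the cancel
case must not see a mapped page and has to clear the dirty tag itself,
while the for-IO case leaves the tag to the writeback logic).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page; these are not kernel fields. */
struct model_page {
    bool dirty;        /* models PG_dirty */
    bool tag_dirty;    /* models the dirty tag in the mapping's radix tree */
    bool pte_writable; /* models a writable/dirty PTE somewhere */
    int  mapcount;     /* number of user mappings of the page */
};

/*
 * "_for_io" case: the caller is about to write the page out.
 * Clear the dirty flag and write-protect the mapping so that a later
 * store re-dirties the page. The dirty tag is left alone because the
 * writeback logic is expected to deal with it.
 * Returns true if the caller has to start IO.
 */
bool model_clear_dirty_for_io(struct model_page *p)
{
    assert(p != NULL);
    if (!p->dirty)
        return false;
    p->dirty = false;
    p->pte_writable = false;
    return true;
}

/*
 * "cancel" case (truncate and friends): no IO will be started.
 * The page must already be unmapped, so PTEs are never looked at,
 * and the dirty tag has to be cleared here.
 * Returns -1 if the page is still mapped, 1 if it was dirty, 0 otherwise.
 */
int model_cancel_dirty(struct model_page *p)
{
    assert(p != NULL);
    if (p->mapcount > 0)
        return -1;
    bool was_dirty = p->dirty;
    p->dirty = false;
    p->tag_dirty = false;
    return was_dirty ? 1 : 0;
}
```

In this model there is no third operation: a caller either starts IO after
model_clear_dirty_for_io() or drops the page after model_cancel_dirty().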
> Maybe the following information is helpful in some way: remember how I
> said that we have applied 6 mm patches to 2.6.18 in Debian? According
> to Gordon Farquharson, who's helping me a great deal with testing
> installation on this ARM machine (Linksys NSLU2), the corruption
> doesn't always show up when you only apply
> mm-tracking-shared-dirty-pages.patch to 2.6.18 but it shows up all the
> time with all six patches applied.
I think the "it happens occasionally with just the first patch" is the
really important part. The other patches really are likely to just change
writeback timing behaviour (_especially_ the "tracking-shared-dirty-pages"
patch), but if it happens occasionally even with the first one, that's the
one that almost certainly introduced the real problem.
And my argument above is actually that the "real problem" goes a hell of a
lot further back in time, but it didn't use to be a problem because we
just considered dirty bits in the page tables to be something _completely_
independent of the "page dirty" status, so historically, it just didn't
matter that we had insane implementations and senseless operations.
Linus
On Wed, 20 Dec 2006, Linus Torvalds wrote:
>
> So that's why I've been harping on the fact that I think we simply do
> really wrong things with PG_dirty at times [ ... ]
Ok, I'll just put my money where my mouth is, and suggest a patch like
THIS instead.
This one clears up all the issues I find irritating:
- "test_clear_page_dirty()" is insane, both conceptually and as an
implementation. "Give me a 'C', give me an 'R', give me an 'A', give me
a 'P'".
So rip out that mindfart entirely.
- "clear_page_dirty()" is badly named, and should be about CANCELLING the
dirty bit, and must never be called with pages mapped anyway. So throw
that out too, and replace it with a new function:
void cancel_dirty_page(struct page *page, unsigned int accounting_size);
- "clear_page_dirty_for_io()" is fine.
And with that, I then either rip out any old users of
"test_clear_page_dirty()" or "clear_page_dirty()", and if appropriate (and
it's really only appropriate for "truncate()"), I replace them with the new
"cancel_dirty_page()". Most of the time, they should just be deleted
entirely.
NOTE NOTE NOTE! I _only_ did enough to make things compile for my
particular configuration. That means that right now the following
filesystems are broken with this patch (because they use the totally
broken old crap):
CIFS, FUSE, JFS, ReiserFS, XFS
and I don't know exactly what they need to be fixed. But most likely their
usage was insane and pointless anyway (looking at the ReiserFS case, for
example, that was DEFINITELY the case. I can't even imagine what the heck
it thinks it is doing).
Anyway, I'm not at all guaranteeing that this solves anything at all. I
_do_ guarantee that this is a h*ll of a lot saner than what we had before.
[ This also includes a few of my older patches, I didn't bother to sort
them out, and the fs/buffer.c patch is required because it got rid of
one of the insane uses of test_clear_page_dirty().
So this goes directly on top of current -git, with no other changes in
the tree. ]
Nick, Hugh, Peter, Andrew? Comments?
Martin, Andrei, does this make any difference for your corruption cases?
Linus
---
diff --git a/fs/buffer.c b/fs/buffer.c
index d1f1b54..263f88e 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *page)
int ret = 0;
BUG_ON(!PageLocked(page));
- if (PageWriteback(page))
+ if (PageDirty(page) || PageWriteback(page))
return 0;
if (mapping == NULL) { /* can this still happen? */
@@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *page)
spin_lock(&mapping->private_lock);
ret = drop_buffers(page, &buffers_to_free);
spin_unlock(&mapping->private_lock);
- if (ret) {
- /*
- * If the filesystem writes its buffers by hand (eg ext3)
- * then we can have clean buffers against a dirty page. We
- * clean the page here; otherwise later reattachment of buffers
- * could encounter a non-uptodate page, which is unresolvable.
- * This only applies in the rare case where try_to_free_buffers
- * succeeds but the page is not freed.
- *
- * Also, during truncate, discard_buffer will have marked all
- * the page's buffers clean. We discover that here and clean
- * the page also.
- */
- if (test_clear_page_dirty(page))
- task_io_account_cancelled_write(PAGE_CACHE_SIZE);
- }
out:
if (buffers_to_free) {
struct buffer_head *bh = buffers_to_free;
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index ed2c223..4f4cd13 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct file *file,
static void truncate_huge_page(struct page *page)
{
- clear_page_dirty(page);
+ cancel_dirty_page(page, /* No IO accounting for huge pages? */0);
ClearPageUptodate(page);
remove_from_page_cache(page);
put_page(page);
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 4830a3b..350878a 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -253,15 +253,11 @@ static inline void SetPageUptodate(struct page *page)
struct page; /* forward declaration */
-int test_clear_page_dirty(struct page *page);
+extern void cancel_dirty_page(struct page *page, unsigned int account_size);
+
int test_clear_page_writeback(struct page *page);
int test_set_page_writeback(struct page *page);
-static inline void clear_page_dirty(struct page *page)
-{
- test_clear_page_dirty(page);
-}
-
static inline void set_page_writeback(struct page *page)
{
test_set_page_writeback(page);
diff --git a/mm/memory.c b/mm/memory.c
index c00bac6..79cecab 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1842,6 +1842,33 @@ void unmap_mapping_range(struct address_space *mapping,
}
EXPORT_SYMBOL(unmap_mapping_range);
+static void check_last_page(struct address_space *mapping, loff_t size)
+{
+ pgoff_t index;
+ unsigned int offset;
+ struct page *page;
+
+ if (!mapping)
+ return;
+ offset = size & ~PAGE_MASK;
+ if (!offset)
+ return;
+ index = size >> PAGE_SHIFT;
+ page = find_lock_page(mapping, index);
+ if (page) {
+ unsigned int check = 0;
+ unsigned char *kaddr = kmap_atomic(page, KM_USER0);
+ do {
+ check += kaddr[offset++];
+ } while (offset < PAGE_SIZE);
+ kunmap_atomic(kaddr,KM_USER0);
+ unlock_page(page);
+ page_cache_release(page);
+ if (check)
+ printk("%s: BADNESS: truncate check %u\n", current->comm, check);
+ }
+}
+
/**
* vmtruncate - unmap mappings "freed" by truncate() syscall
* @inode: inode of the file used
@@ -1875,6 +1902,7 @@ do_expand:
goto out_sig;
if (offset > inode->i_sb->s_maxbytes)
goto out_big;
+ check_last_page(mapping, inode->i_size);
i_size_write(inode, offset);
out_truncate:
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 237107c..b3a198c 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -845,38 +845,6 @@ int set_page_dirty_lock(struct page *page)
EXPORT_SYMBOL(set_page_dirty_lock);
/*
- * Clear a page's dirty flag, while caring for dirty memory accounting.
- * Returns true if the page was previously dirty.
- */
-int test_clear_page_dirty(struct page *page)
-{
- struct address_space *mapping = page_mapping(page);
- unsigned long flags;
-
- if (!mapping)
- return TestClearPageDirty(page);
-
- write_lock_irqsave(&mapping->tree_lock, flags);
- if (TestClearPageDirty(page)) {
- radix_tree_tag_clear(&mapping->page_tree,
- page_index(page), PAGECACHE_TAG_DIRTY);
- write_unlock_irqrestore(&mapping->tree_lock, flags);
- /*
- * We can continue to use `mapping' here because the
- * page is locked, which pins the address_space
- */
- if (mapping_cap_account_dirty(mapping)) {
- page_mkclean(page);
- dec_zone_page_state(page, NR_FILE_DIRTY);
- }
- return 1;
- }
- write_unlock_irqrestore(&mapping->tree_lock, flags);
- return 0;
-}
-EXPORT_SYMBOL(test_clear_page_dirty);
-
-/*
* Clear a page's dirty flag, while caring for dirty memory accounting.
* Returns true if the page was previously dirty.
*
diff --git a/mm/truncate.c b/mm/truncate.c
index 9bfb8e8..bf9e296 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -51,6 +51,20 @@ static inline void truncate_partial_page(struct page *page, unsigned partial)
do_invalidatepage(page, partial);
}
+void cancel_dirty_page(struct page *page, unsigned int account_size)
+{
+ /* If we're cancelling the page, it had better not be mapped any more */
+ if (page_mapped(page)) {
+ static unsigned int warncount;
+
+ WARN_ON(++warncount < 5);
+ }
+
+ if (TestClearPageDirty(page) && account_size)
+ task_io_account_cancelled_write(account_size);
+}
+
+
/*
* If truncate cannot remove the fs-private metadata from the page, the page
* becomes anonymous. It will be left on the LRU and may even be mapped into
@@ -70,8 +84,8 @@ truncate_complete_page(struct address_space *mapping, struct page *page)
if (PagePrivate(page))
do_invalidatepage(page, 0);
- if (test_clear_page_dirty(page))
- task_io_account_cancelled_write(PAGE_CACHE_SIZE);
+ cancel_dirty_page(page, PAGE_CACHE_SIZE);
+
ClearPageUptodate(page);
ClearPageMappedToDisk(page);
remove_from_page_cache(page);
@@ -350,7 +364,6 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
for (i = 0; !ret && i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];
pgoff_t page_index;
- int was_dirty;
lock_page(page);
if (page->mapping != mapping) {
@@ -386,12 +399,8 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
PAGE_CACHE_SIZE, 0);
}
}
- was_dirty = test_clear_page_dirty(page);
- if (!invalidate_complete_page2(mapping, page)) {
- if (was_dirty)
- set_page_dirty(page);
+ if (!invalidate_complete_page2(mapping, page))
ret = -EIO;
- }
unlock_page(page);
}
pagevec_release(&pvec);
On Wed, 2006-12-20 at 11:50 -0800, Linus Torvalds wrote:
> Nick, Hugh, Peter, Andrew? Comments?
Hooray! I'm all for this cleanup. Let us see where this road leads..
On Wed, 2006-12-20 at 11:50 -0800, Linus Torvalds wrote:
> NOTE NOTE NOTE! I _only_ did enough to make things compile for my
> particular configuration. That means that right now the following
> filesystems are broken with this patch (because they use the totally
> broken old crap):
>
> CIFS, FUSE, JFS, ReiserFS, XFS
>
> and I don't know exactly what they need to be fixed. But most likely their
> usage was insane and pointless anyway (looking at the ReiserFS case, for
> example, that was DEFINITELY the case. I can't even imagine what the heck
> it thinks it is doing).
Here's a patch to get rid of clear_page_dirty() from jfs. I'm not
convinced it was totally broken, but I'm not convinced it wasn't.
Either way, I don't think that bit of code was particularly beneficial.
Feel free to apply this patch independent of your patch if you really
think that jfs's use of clear_page_dirty is crap, or I can push it
through -mm first.
This patch removes some questionable code that attempted to make a
no-longer-used page easier to reclaim.
Calling metapage_writepage against such a page will not result in any
I/O being performed, so removing this code shouldn't be a big deal.
Signed-off-by: Dave Kleikamp <[email protected]>
diff -Nurp linux-orig/fs/jfs/jfs_metapage.c linux/fs/jfs/jfs_metapage.c
--- linux-orig/fs/jfs/jfs_metapage.c 2006-12-07 17:12:58.000000000 -0600
+++ linux/fs/jfs/jfs_metapage.c 2006-12-20 15:19:48.000000000 -0600
@@ -764,22 +764,9 @@ void release_metapage(struct metapage *
} else if (mp->lsn) /* discard_metapage doesn't remove it */
remove_from_logsync(mp);
-#if MPS_PER_PAGE == 1
- /*
- * If we know this is the only thing in the page, we can throw
- * the page out of the page cache. If pages are larger, we
- * don't want to do this.
- */
-
- /* Retest mp->count since we may have released page lock */
- if (test_bit(META_discard, &mp->flag) && !mp->count) {
- clear_page_dirty(page);
- ClearPageUptodate(page);
- }
-#else
/* Try to keep metapages from using up too much memory */
drop_metapage(page, mp);
-#endif
+
unlock_page(page);
page_cache_release(page);
}
On Wed, Dec 20, 2006 at 06:03:23PM +0100, Martin Michlmayr wrote:
> * Peter Zijlstra <[email protected]> [2006-12-20 14:56]:
> > page_mkclean_one() fix
>
> This patch doesn't fix my problem (apt segfaults on ARM because its
> database is corrupted).
Are you using IDE in PIO mode? If so, the bug probably lies there.
As I've said repeatedly when asked by IDE folk to test their PIO-based
cache coherency fixes, I am unable to reproduce the bug, ergo I am
unable to test the fix.
(Some people, such as Jeff Garzik to name names, took that as me being
entirely unreasonable and un-cooperative. But consider carefully - how
can _anyone_ test something that they can't reproduce? I consider Jeff's
comments extremely childish in that respect.)
Hence, as far as I'm aware, Linux on PIO-based IDE ARM hardware remains
utterly *unsafe*.
Sorry.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:
I think this is also needed:
---
mm/truncate.c | 7 +------
1 file changed, 1 insertion(+), 6 deletions(-)
Index: linux-2.6/mm/truncate.c
===================================================================
--- linux-2.6.orig/mm/truncate.c
+++ linux-2.6/mm/truncate.c
@@ -320,19 +320,14 @@ invalidate_complete_page2(struct address
if (PagePrivate(page) && !try_to_release_page(page, GFP_KERNEL))
return 0;
+ cancel_dirty_page(page, PAGE_CACHE_SIZE);
lock_page_ref_irq(page);
- if (PageDirty(page))
- goto failed;
-
BUG_ON(PagePrivate(page));
__remove_from_page_cache(page);
unlock_page_ref_irq(page);
ClearPageUptodate(page);
page_cache_release(page); /* pagecache ref */
return 1;
-failed:
- unlock_page_ref_irq(page);
- return 0;
}
/**
On Wed, 2006-12-20 at 23:15 +0100, Peter Zijlstra wrote:
> I think this is also needed:
See also:
http://marc.theaimsgroup.com/?l=linux-kernel&m=116603599904278&w=2
> ---
> mm/truncate.c | 7 +------
> 1 file changed, 1 insertion(+), 6 deletions(-)
>
> Index: linux-2.6/mm/truncate.c
> ===================================================================
> --- linux-2.6.orig/mm/truncate.c
> +++ linux-2.6/mm/truncate.c
> @@ -320,19 +320,14 @@ invalidate_complete_page2(struct address
> if (PagePrivate(page) && !try_to_release_page(page, GFP_KERNEL))
> return 0;
>
> + cancel_dirty_page(page, PAGE_CACHE_SIZE);
> lock_page_ref_irq(page);
> - if (PageDirty(page))
> - goto failed;
> -
> BUG_ON(PagePrivate(page));
> __remove_from_page_cache(page);
> unlock_page_ref_irq(page);
> ClearPageUptodate(page);
> page_cache_release(page); /* pagecache ref */
> return 1;
> -failed:
> - unlock_page_ref_irq(page);
> - return 0;
> }
>
> /**
On Wed, 20 Dec 2006, Dave Kleikamp wrote:
>
> This patch removes some questionable code that attempted to make a
> no-longer-used page easier to reclaim.
If so, "cancel_dirty_page()" may actually be the right thing to use, but
only if you can guarantee that the page isn't mapped anywhere (and from
the name of the function I guess it's not something that you'll ever map?)
So the JFS code _looks_ like you could just replace the
clear_page_dirty(page);
with
cancel_dirty_page(page, PAGE_CACHE_SIZE);
(where that second parameter is just used for statistics - it updates the
"cancelled IO" byte-counts if CONFIG_TASK_IO_ACCOUNTING is set - so the
number doesn't really matter, you could make it zero if you never want the
thing to show up in the IO accounting).
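As a rough illustration of how that second parameter behaves, here is a
small userspace sketch. The acct_page and acct_task types and the
acct_cancel_dirty_page() function are hypothetical stand-ins, not kernel
interfaces; the only thing taken from the patch is the rule that bytes are
accounted when the page really was dirty and the size is non-zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real kernel structures look nothing like this. */
struct acct_page { bool dirty; };
struct acct_task { unsigned long cancelled_write_bytes; };

/* Clear the dirty flag and report whether it was set. */
bool acct_test_clear_dirty(struct acct_page *p)
{
    assert(p != NULL);
    bool was_dirty = p->dirty;
    p->dirty = false;
    return was_dirty;
}

/*
 * Cancel a dirty page. The byte count is only added to the task's
 * "cancelled IO" statistic when the page was dirty and the caller
 * passed a non-zero account_size.
 */
void acct_cancel_dirty_page(struct acct_task *t, struct acct_page *p,
                            unsigned int account_size)
{
    assert(t != NULL);
    if (acct_test_clear_dirty(p) && account_size)
        t->cancelled_write_bytes += account_size;
}
```

Passing zero still clears the dirty flag; it just keeps the page out of
the statistics, and cancelling an already clean page accounts nothing.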
Linus
On Wed, 20 Dec 2006, Peter Zijlstra wrote:
>
> I think this is also needed:
Yeah, that looks about right. Although I think it should go above the
"try_to_release_page()", because right now we do that "ttrp()" with the
dirty bit set, and we should let the low-level filesystem just know that
it's simply not interesting any more (and, indeed, "try_to_free_buffers()"
too, for that matter).
Anyway, I think that's a detail. I'd rather know whether this all actually
makes any difference what-so-ever to the corruption behaviour of Andrei
&co.
Maybe the UP ARM case is some strange dcache alias issue with PIO IDE, and
the only reason that started showing up at the same time is the different
IO loads. Who knows.
[ Although I think you may have been on the right track with that dcache
flushing stuff in "page_mkclean()".. It might not have been quite
all there, but I think we should go back and look very closely at
page_mkclean() regardless of any other issues! ]
So far, my whole "cancel_dirty_page/clear_page_dirty_for_io" patch has
really been just a "try to make the code _look_ sane". I don't think we
have a single report that the patch actually makes any difference yet.
Linus
On Wed, 2006-12-20 at 14:25 -0800, Linus Torvalds wrote:
>
> On Wed, 20 Dec 2006, Dave Kleikamp wrote:
> >
> > This patch removes some questionable code that attempted to make a
> > no-longer-used page easier to reclaim.
>
> If so, "cancel_dirty_page()" may actually be the right thing to use, but
> only if you can guarantee that the page isn't mapped anywhere (and from
> the name of the function I guess it's not something that you'll ever map?)
That's correct. It can't be mapped. It's a private mapping only used
for metadata.
I'm really not sure the code in question is having the intended effect.
Maybe one of the gurus on cc: can take a look at the code and tell me if
it's worth keeping. I apologize in advance if it makes anyone lose
their lunch.
> So the JFS code _looks_ like you could just replace the
>
> clear_page_dirty(page);
>
> with
>
> cancel_dirty_page(page, PAGE_CACHE_SIZE);
>
> (where that second parameter is just used for statistics - it updates the
> "cancelled IO" byte-counts if CONFIG_TASK_IO_ACCOUNTING is set - so the
> number doesn't really matter, you could make it zero if you never want the
> thing to show up in the IO accounting).
I'm not sure whether zero or PAGE_CACHE_SIZE would be better. The
situation is where some page of metadata is no longer used, say
shrinking a directory tree or truncating a file and throwing out the
extent tree.
Thanks,
Shaggy
--
David Kleikamp
IBM Linux Technology Center
On Wed, 2006-12-20 at 14:49 -0800, Linus Torvalds wrote:
>
> On Wed, 20 Dec 2006, Peter Zijlstra wrote:
> >
> > I think this is also needed:
>
> Yeah, that looks about right. Although I think it should go above the
> "try_to_release_page()", because right now we do that "ttrp()" with the
> dirty bit set, and we should let the low-level filesystem just know that
> it's simply not interesting any more (and, indeed, "try_to_free_buffers()"
> too, for that matter).
That makes NFS unhappy, see nfs_release_page().
> Anyway, I think that's a detail. I'd rather know whether this all actually
> makes any difference what-so-ever to the corruption behaviour of Andrei
> &co.
Yeah, I have to tinker with my test setup to make it fail again. Maybe I
have to add more seeds, that seemed to make a difference, it was
impossible to trigger with a single seed.
FWIW I also added some scribble past i_size checks in nobh_writepage()
and block_write_full_page().
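For reference, the kind of "scribble past i_size" check meant here can be
sketched in userspace as below. This is only an illustration modelled on
the check_last_page() debugging helper in the patch earlier in the thread;
the function name, the SKETCH_PAGE_SIZE constant and the flat buffer
argument are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u /* assumed page size for the example */

/*
 * `page` points at the page-sized buffer that holds the last (partial)
 * page of a file whose length is `size` bytes. Everything in that page
 * past the end of the file should be zero. Returns the sum of the bytes
 * past EOF, so 0 means the tail is clean.
 */
unsigned int tail_scribble_check(const unsigned char *page,
                                 unsigned long long size)
{
    unsigned int offset = (unsigned int)(size % SKETCH_PAGE_SIZE);
    unsigned int check = 0;

    assert(page != NULL);
    if (!offset)
        return 0; /* file ends on a page boundary: no partial page */
    while (offset < SKETCH_PAGE_SIZE)
        check += page[offset++];
    return check;
}
```

A non-zero result means something wrote into the last page beyond the
file size, which is the condition the BADNESS printk in the patch reports.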
FWIW2 I straced rtorrent for a bit and it does an awful lot of mmap
calls and relatively few msync(MS_ASYNC);munmap(), and no truncate apart
from creating sparse files at the beginning.
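The access pattern described above can be approximated with a small
userspace sketch like the one below. It is not taken from rtorrent; the
function names and the single-chunk interface are assumptions made for
illustration. It creates a sparse file, stores data through a shared
mapping, calls msync(MS_ASYNC) and munmap(), and offers a helper that
reads the data back through the normal read path.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Make `path` a (sparse) file of `size` bytes, then copy `len` bytes of
 * `data` to offset `off` through a MAP_SHARED mapping, followed by
 * msync(MS_ASYNC) and munmap(). Returns 0 on success, -1 on error.
 */
int write_chunk_via_mmap(const char *path, off_t size, off_t off,
                         const void *data, size_t len)
{
    assert(path != NULL && data != NULL);
    if (size <= 0 || off < 0 ||
        (unsigned long long)off + len > (unsigned long long)size)
        return -1;

    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, size) < 0) { /* extends the file without writing data */
        close(fd);
        return -1;
    }
    void *map = mmap(NULL, (size_t)size, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }
    memcpy((char *)map + off, data, len); /* dirties the page via the PTE */
    int rc = msync(map, (size_t)size, MS_ASYNC);
    if (munmap(map, (size_t)size) < 0)
        rc = -1;
    if (close(fd) < 0)
        rc = -1;
    return rc;
}

/* Read the chunk back with pread(): 1 if it matches, 0 if not, -1 on error. */
int chunk_matches(const char *path, off_t off, const void *data, size_t len)
{
    const char *expect = data;
    char buf[4096];
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    while (len > 0) {
        size_t want = len < sizeof buf ? len : sizeof buf;
        ssize_t got = pread(fd, buf, want, off);
        if (got <= 0) {
            close(fd);
            return got < 0 ? -1 : 0;
        }
        if (memcmp(buf, expect, (size_t)got) != 0) {
            close(fd);
            return 0;
        }
        expect += got;
        off += got;
        len -= (size_t)got;
    }
    close(fd);
    return 1;
}
```

Running many such writers against the same files, with chunks that do not
line up with page boundaries, is roughly the load under which the
corruption was reported; a single call like this is not expected to
trigger it.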
> Maybe the UP ARM case is some strange dcache alias issue with PIO IDE, and
> the only reason that started showing up at the same time is the different
> IO loads. Who knows.
>
> [ Although I think you may have been on the right track with that dcache
> flushing stuff in "page_mkclean()".. It might not have been quite
> all there, but I think we should go back and look very closely at
> page_mkclean() regardless of any other issues! ]
current version
Signed-off-by: Peter Zijlstra <[email protected]>
---
mm/rmap.c | 23 +++++++++++++----------
1 file changed, 13 insertions(+), 10 deletions(-)
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -432,7 +432,7 @@ static int page_mkclean_one(struct page
{
struct mm_struct *mm = vma->vm_mm;
unsigned long address;
- pte_t *pte, entry;
+ pte_t *pte;
spinlock_t *ptl;
int ret = 0;
@@ -444,17 +444,18 @@ static int page_mkclean_one(struct page
if (!pte)
goto out;
- if (!pte_dirty(*pte) && !pte_write(*pte))
- goto unlock;
+ while (pte_dirty(*pte) || pte_write(*pte)) {
+ pte_t entry;
- entry = ptep_get_and_clear(mm, address, pte);
- entry = pte_mkclean(entry);
- entry = pte_wrprotect(entry);
- ptep_establish(vma, address, pte, entry);
- lazy_mmu_prot_update(entry);
- ret = 1;
+ flush_cache_page(vma, address, pte_pfn(*pte));
+ entry = ptep_clear_flush(vma, address, pte);
+ entry = pte_wrprotect(entry);
+ entry = pte_mkclean(entry);
+ ptep_establish(vma, address, pte, entry);
+ lazy_mmu_prot_update(entry);
+ ret = 1;
+ }
-unlock:
pte_unmap_unlock(pte, ptl);
out:
return ret;
@@ -489,6 +490,8 @@ int page_mkclean(struct page *page)
if (mapping)
ret = page_mkclean_file(mapping, page);
}
+ if (page_test_and_clear_dirty(page))
+ ret = 1;
return ret;
}
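The core of the page_mkclean_one() change can be modelled in userspace as
below. This is a toy: the toy_pte_t type, the bit values and the toy_*
helpers are invented for the example, and the cache and TLB flushes from
the patch are not modelled at all. It only shows the "clear, write-protect,
mark clean, re-install, re-check" shape of the loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Invented PTE encoding, not any real architecture's. */
typedef uint32_t toy_pte_t;
#define TOY_PTE_DIRTY   0x1u
#define TOY_PTE_WRITE   0x2u
#define TOY_PTE_PRESENT 0x4u

toy_pte_t toy_pte_mkclean(toy_pte_t e)   { return e & ~TOY_PTE_DIRTY; }
toy_pte_t toy_pte_wrprotect(toy_pte_t e) { return e & ~TOY_PTE_WRITE; }

/*
 * Keep going until the entry is neither dirty nor writable.
 * Returns 1 if the entry had to be changed, 0 if it was already clean
 * and read-only.
 */
int toy_mkclean_one(toy_pte_t *pte)
{
    int ret = 0;

    assert(pte != NULL);
    while ((*pte & TOY_PTE_DIRTY) || (*pte & TOY_PTE_WRITE)) {
        toy_pte_t entry = *pte;
        *pte = 0;                       /* stands in for ptep_clear_flush() */
        entry = toy_pte_wrprotect(entry);
        entry = toy_pte_mkclean(entry);
        *pte = entry;                   /* stands in for ptep_establish() */
        ret = 1;
    }
    return ret;
}
```

In this single-threaded model the loop body runs at most once. The
re-check only matters when something else can set the bits again between
the clear and the re-install, which is the case the patch is guarding
against.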
> So far, my whole "cancel_dirty_page/clear_page_dirty_for_io" patch has
> really been just a "try to make the code _look_ sane". I don't think we
> have a single report that the patch actually makes any difference yet.
I failed to compile a kernel with that patch (100% iowait and a bunch of
processes stuck in D state), but sysrq-t was borked (only numbers, no
symbols). I have yet to retry - I noticed you kicked the unwinder?
On Wed, Dec 20, 2006 at 11:50:50AM -0800, Linus Torvalds wrote:
>
>
> On Wed, 20 Dec 2006, Linus Torvalds wrote:
> >
> > So that's why I've been harping on the fact that I think we simply do
> > really wrong things with PG_dirty at times [ ... ]
>
> Ok, I'll just put my money where my mouth is, and suggest a patch like
> THIS instead.
>
> This one clears up all the issues I find irritating:
>
> - "test_clear_page_dirty()" is insane, both conceptually and as an
> implementation. "Give me a 'C', give me an 'R', give me an 'A', give me
> a 'P'".
>
> So rip out that mindfart entirely.
>
> - "clear_page_dirty()" is badly named, and should be about CANCELLING the
> dirty bit, and must never be called with pages mapped anyway. So throw
> that out too, and replace it with a new function:
>
> void cancel_dirty_page(struct page *page, unsigned int accounting_size);
>
> - "clear_page_dirty_for_io()" is fine.
>
> And with that, I then either rip out any old users of
> "test_clear_page_dirty()" or "clear_page_dirty()", and if appropriate (and
> it's really only appropriate for "truncate()"), I replace them with the new
> "cancel_dirty_page()". Most of the time, they should just be deleted
> entirely.
>
> NOTE NOTE NOTE! I _only_ did enough to make things compile for my
> particular configuration. That means that right now the following
> filesystems are broken with this patch (because they use the totally
> broken old crap):
>
> CIFS, FUSE, JFS, ReiserFS, XFS
XFS appears to call clear_page_dirty to get the mapping tree dirty
tag set correctly at the same time the page dirty flag is cleared. I
note that this can be done by set_page_writeback() if we clear the
dirty flag on the page first when we are writing back the entire page.
Hence it seems to me that the XFS call to clear_page_dirty() could
easily be substituted by clear_page_dirty_for_io() followed by a
call to set_page_writeback() to get the mapping tree tags set
correctly after the page has been marked clean.
Does this make sense (even without the posted patch)?
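The ordering argument can be illustrated with a small userspace model.
Everything here is an assumption made for the example: the wb_page struct
and the wb_* functions are not kernel code, and the model only encodes the
behaviour described above, namely that setting writeback on a page whose
dirty flag is already clear also drops the dirty tag in the mapping tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one page plus its tags in the mapping tree. */
struct wb_page {
    bool dirty;         /* models PG_dirty */
    bool writeback;     /* models PG_writeback */
    bool tag_dirty;     /* models the mapping tree's dirty tag */
    bool tag_writeback; /* models the mapping tree's writeback tag */
};

/* Clear only the page flag; the tag is left for the writeback path. */
bool wb_clear_dirty_for_io(struct wb_page *p)
{
    assert(p != NULL);
    bool was_dirty = p->dirty;
    p->dirty = false;
    return was_dirty;
}

/*
 * Mark the page as under writeback. If the page is no longer dirty at
 * this point, the dirty tag is dropped as well.
 */
void wb_set_page_writeback(struct wb_page *p)
{
    assert(p != NULL);
    p->writeback = true;
    p->tag_writeback = true;
    if (!p->dirty)
        p->tag_dirty = false;
}
```

In this model, calling wb_clear_dirty_for_io() first and
wb_set_page_writeback() second leaves both the flag and the tag clean.
Doing it the other way round leaves a stale dirty tag on a clean page,
which is why the patch below swaps the two calls.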
---
fs/xfs/linux-2.6/xfs_aops.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
Index: 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_aops.c
===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_aops.c 2006-12-19 12:22:47.000000000 +1100
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_aops.c 2006-12-21 10:15:04.545375877 +1100
@@ -340,9 +340,9 @@ xfs_start_page_writeback(
{
ASSERT(PageLocked(page));
ASSERT(!PageWriteback(page));
- set_page_writeback(page);
if (clear_dirty)
- clear_page_dirty(page);
+ clear_page_dirty_for_io(page);
+ set_page_writeback(page);
unlock_page(page);
if (!buffers) {
end_page_writeback(page);
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
On Wed, 20 Dec 2006 11:50:50 -0800 (PST)
Linus Torvalds <[email protected]> wrote:
> Ok, I'll just put my money where my mouth is, and suggest a patch like
> THIS instead.
>
> ...
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index d1f1b54..263f88e 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *page)
> int ret = 0;
>
> BUG_ON(!PageLocked(page));
> - if (PageWriteback(page))
> + if (PageDirty(page) || PageWriteback(page))
> return 0;
>
> if (mapping == NULL) { /* can this still happen? */
> @@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *page)
> spin_lock(&mapping->private_lock);
> ret = drop_buffers(page, &buffers_to_free);
> spin_unlock(&mapping->private_lock);
> - if (ret) {
> - /*
> - * If the filesystem writes its buffers by hand (eg ext3)
> - * then we can have clean buffers against a dirty page. We
> - * clean the page here; otherwise later reattachment of buffers
> - * could encounter a non-uptodate page, which is unresolvable.
> - * This only applies in the rare case where try_to_free_buffers
> - * succeeds but the page is not freed.
> - *
> - * Also, during truncate, discard_buffer will have marked all
> - * the page's buffers clean. We discover that here and clean
> - * the page also.
> - */
> - if (test_clear_page_dirty(page))
> - task_io_account_cancelled_write(PAGE_CACHE_SIZE);
> - }
I think this will be OK, because vmscan has just run ->writepage anyway.
But we will need to make changes to truncate_complete_page() - make it run
cancel_dirty_page() before it runs do_invalidatepage().
I don't think there's anything preventing zap_pte_range() or perhaps a
pagefault from coming in and dirtying this page after we've tested
PageDirty().
That could leave us with a dirty, non-uptodate page with no buffers, which
is very bad. But this situation is hopefully impossible, because if the
page is not uptodate then the first thing a pagefault will do is bring it
uptodate, which involves locking it. And if zap_pte_range() is looking at
this page, it is uptodate.
If the page _was_ uptodate and the zap_pte_range() race happens, we'll end
up with either a dirty page with dirty buffers or a dirty uptodate
page with no buffers, both of which are OK.
> +void cancel_dirty_page(struct page *page, unsigned int account_size)
> +{
> + /* If we're cancelling the page, it had better not be mapped any more */
> + if (page_mapped(page)) {
> + static unsigned int warncount;
> +
> + WARN_ON(++warncount < 5);
> + }
> +
> + if (TestClearPageDirty(page) && account_size)
> + task_io_account_cancelled_write(account_size);
> +}
This doesn't clear the radix-tree dirty tags. I'm not sure what effect
that would have on a truncated mapping. Perhaps just a bit of extra work
in radix-tree lookup during writeback.
If we _know_ that this page is about to be removed from pagecache then
radix_tree_delete() will delete the tags for us anyway, but
invalidate_inode_pages2() can decide to back out.
> @@ -386,12 +399,8 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
> PAGE_CACHE_SIZE, 0);
> }
> }
> - was_dirty = test_clear_page_dirty(page);
> - if (!invalidate_complete_page2(mapping, page)) {
> - if (was_dirty)
> - set_page_dirty(page);
> + if (!invalidate_complete_page2(mapping, page))
> ret = -EIO;
> - }
> unlock_page(page);
Well, it used to.
invalidate_complete_page2() is pretty gruesome. We're handling the case
where someone went and redirtied the page (and hence its buffers) after the
invalidate_inode_pages2() caller (generic_file_direct_IO) synced it to
disk.
I'd prefer to just fail the direct-io if someone did that, but then
people's tests fail and they whine.
It's tempting to just truncate the damn page and discard the user's data -
the app is being silly. But that would permit access to uninitialised disk
blocks.
With your change I think what'll happen is that we'll correctly handle the
case where the page and its buffers are dirty (it gets left in place), but
we'll needlessly fail in the case where the page is dirty but the buffers
are clean. How important that will be in practice I do not know. People
will get -EIOs where they used not to.
A suitable fix for that might be to simply not return -EIO here. So
some thread went and dirtied a pagecache page after
generic_file_direct_IO() synced the data. Big deal, that's your own fault.
Usually the disk will end up getting a copy of the dirtied pagecache page
and rarely it'll get a copy of the direct-io-written page.
On Wed, 20 Dec 2006, Andrew Morton wrote:
>
> > +void cancel_dirty_page(struct page *page, unsigned int account_size)
> > +{
> > + /* If we're cancelling the page, it had better not be mapped any more */
> > + if (page_mapped(page)) {
> > + static unsigned int warncount;
> > +
> > + WARN_ON(++warncount < 5);
> > + }
> > +
> > + if (TestClearPageDirty(page) && account_size)
> > + task_io_account_cancelled_write(account_size);
> > +}
>
> This doesn't clear the radix-tree dirty tags. I'm not sure what effect
> that would have on a truncated mapping. Perhaps just a bit of extra work
> in radix-tree lookup during writeback.
This should _only_ be a valid thing to do when we're removing the page
from a mapping anyway, so I'd most definitely hope that the code
immediately after (or before) will have done a "remove_from_page_cache()"
In which case the tags should not matter.
There is _no_ excuse for cancelling a page and leaving it in the page
cache that I can see. Because your page contents will be _indeterminate_.
> > @@ -386,12 +399,8 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
>
> invalidate_complete_page2() is pretty gruesome. We're handling the case
> where someone went and redirtied the page (and hence its buffers) after the
> invalidate_inode_pages2() caller (generic_file_direct_IO) synced it to
> disk.
>
> I'd prefer to just fail the direct-io if someone did that, but then
> people's tests fail and they whine.
So with my change, afaik, we will just return EIO to the invalidate, and
do the write. Which should be ok. In fact, it appears to be the only
possibly valid thing to do.
It really boils down to that same thing: if you remove the dirty bit,
there is NO CONCEIVABLE GOOD THING YOU CAN DO EXCEPT FOR:
- do the damn IO already ("clear_page_dirty_for_io()")
- truncate the page (unmap and destroy it both from page cache AND from
any user-visible filesystem cases)
Anything else is simply a bug. Always has been. My patch just makes that
very clear.
> With your change I think what'll happen is that we'll correctly handle the
> case where the page and its buffers are dirty (it gets left in place), but
> we'll needlessly fail in the case where the page is dirty but the buffers
> are clean. How important that will be in practice I do not know. People
> will get -EIOs where they used not to.
People will now get -EIO where they used to get an inconsistent system
image. I really think it sounds like an improvement.
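A toy model of the resulting behaviour, for illustration only: the
inv_page struct and the toy_* functions are invented, and the model just
assumes what the patch as posted does, i.e. a page that is still dirty is
never silently cleaned, it is left in place and the caller gets -EIO.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical page cache entry. */
struct inv_page {
    bool cached; /* still present in the page cache */
    bool dirty;  /* models PG_dirty */
};

/* Drop a single page; refuse (return 0) if it is dirty. */
int toy_invalidate_complete(struct inv_page *p)
{
    assert(p != NULL);
    if (p->dirty)
        return 0;
    p->cached = false;
    return 1;
}

/*
 * Invalidate a run of pages. Stops at the first page that cannot be
 * dropped and reports -EIO, leaving that page and its dirty flag alone.
 */
int toy_invalidate_range(struct inv_page *pages, size_t n)
{
    int ret = 0;

    assert(pages != NULL || n == 0);
    for (size_t i = 0; !ret && i < n; i++) {
        if (!pages[i].cached)
            continue;
        if (!toy_invalidate_complete(&pages[i]))
            ret = -EIO;
    }
    return ret;
}
```

The dirty page survives with its contents and its dirty flag intact, so it
can still be written back later; the caller merely learns that the
invalidation was incomplete.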
Linus
On Thu, 21 Dec 2006, David Chinner wrote:
>
> XFS appears to call clear_page_dirty to get the mapping tree dirty
> tag set correctly at the same time the page dirty flag is cleared. I
> note that this can be done by set_page_writeback() if we clear the
> dirty flag on the page first when we are writing back the entire page.
Yes. I think the XFS routine should just use "clear_page_dirty_for_io()",
since that matches what it actually wants to do (surprise surprise, it's
going to write it out).
HOWEVER. Why is it conditional? Can somebody who understands XFS tell me
why "clear_dirty" would ever be 0? I can grep the sources, and I see that
it's an unconditional 1 in one call-site, but then in the other one it
does
xfs_start_page_writeback(page, wbc, !page_dirty, count);
and that part just blows my mind. Why would you do a
xfs_start_page_writeback() and _not_ write the page out? Is this for a
partial-page-only case?
Anyway, your patch looks fine. It seems to be the right thing to do. I'm
just wondering why we're not always cleaning the whole page, and why we'd
not set it unconditionally dirty?
Linus
On Wed, 20 Dec 2006 15:55:14 -0800 (PST)
Linus Torvalds <[email protected]> wrote:
> > > @@ -386,12 +399,8 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
> >
> > invalidate_complete_page2() is pretty gruesome. We're handling the case
> > where someone went and redirtied the page (and hence its buffers) after the
> > invalidate_inode_pages2() caller (generic_file_direct_IO) synced it to
> > disk.
> >
> > I'd prefer to just fail the direct-io if someone did that, but then
> > people's tests fail and they whine.
>
> So with my change, afaik, we will just return EIO to the invalidate, and
> do the write.
The write's already been done by this stage.
> Which should be ok. In fact, it appears to be the only
> possibly valid thing to do.
>
> It really boils down to that same thing: if you remove the dirty bit,
> there is NO CONCEIVABLE GOOD THING YOU CAN DO EXCEPT FOR:
> - do the damn IO already ("clear_page_dirty_for_io()")
> - truncate the page (unmap and destroy it both from page cache AND from
> any user-visible filesystem cases)
There's also redirty_page_for_writepage().
On Wed, 20 Dec 2006, Andrew Morton wrote:
> >
> > So with my change, afaik, we will just return EIO to the invalidate, and
> > do the write.
>
> The write's already been done by this stage.
Ok, but the end result is the same: you MUST NOT just "cancel" a write. It
needs to be done, or the backing store must be actually de-allocated. You
can't just say "get rid of it" and think that it can work. Exactly because
of security issues, and just the simple fact that reading it back gets
random contents.
So I repeat: clearing a dirty bit really only has two valid cases. Not
three, like we used to have. And the "cancel" case cannot be conditional:
either you can cancel it or you cannot. There's no
	if (cancel_dirty_page()) {
		..
sequence that makes sense that I can think of.
> > It really boils down to that same thing: if you remove the dirty bit,
> > there is NO CONCEIVABLE GOOD THING YOU CAN DO EXCEPT FOR:
> > - do the damn IO already ("clear_page_dirty_for_io()")
> > - truncate the page (unmap and destroy it both from page cache AND from
> > any user-visible filesystem cases)
>
> There's also redirty_page_for_writepage().
_dirtying_ a page makes sense in any situation. You can always dirty them.
I'm just saying that you can't just mark them *clean*.
If your point was that the filesystem had better be able to take care of
"redirty_page_for_writepage()", then yes, of course. But since it's the
filesystem itself that does it, it had _better_ be able to take care of
the situation it puts itself into.
Linus
Btw, I'd really love to hear whether the patch I sent out actually _helps_
at all, or whether we're just discussing something that in the end is just
a cleanup..
Martin, Peter, Andrei, pls give it a try. (Martin and Andrei may be
talking about different bugs, so _both_ of your experiences definitely
matter here).
Linus
On Wed, 20 Dec 2006, Linus Torvalds wrote:
> >
> > There's also redirty_page_for_writepage().
>
> _dirtying_ a page makes sense in any situation. You can always dirty them.
> I'm just saying that you can't just mark them *clean*.
>
> If your point was that the filesystem had better be able to take care of
> "redirty_page_for_writepage()", then yes, of course. But since it's the
> filesystem itself that does it, it had _better_ be able to take care of
> the situation it puts itself into.
Btw, as an example of something where this may NOT be ok, look at
migrate_page_copy().
I'm not at all convinced that "migrate_page_copy()" can work at all. It
does:
	...
	if (PageDirty(page)) {
		clear_page_dirty_for_io(page);
		set_page_dirty(newpage);
	}
	...
which is an example of what NOT to do, because it claims to clear the page
for IO, but doesn't actually _do_ any IO.
And this is wrong, for many reasons.
For example, it's very possible that the old page is not actually
up-to-date, and is only partially dirty using some FS-specific dirty data
queues (like NFS does with its dirty data, or buffer-heads can do for
local filesystems). When you do
	if (clear_dirty(page))
		set_page_dirty(page);
in generic VM code, that is a BUG. It's an insane operation. It cannot
work. It's exactly what I'm trying to avoid.
So page migration is probably broken, but it's no less broken than it
always has been. And I don't think many people use it anyway. It might
work "by accident" in a lot of situations, but to actually be solid, it
really would need to do something fundamentally different, like:
- have a per-mapping "migrate()" function that actually knows HOW to
migrate the dirty state from one page to another.
- or, preferably, by just not migrating dirty pages, and just actually
doing the writeback on them first.
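
The concern about partially dirty pages can be sketched with a user-space toy model. Everything here is hypothetical (`toy_page`, `toy_buf` and the `toy_*` functions are invented for the illustration), and it assumes a generic "dirty the page" helper that, lacking finer knowledge, marks every buffer dirty. Whether the real migration helpers actually hit this case is disputed later in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_BUFS 4

struct toy_buf {
	bool uptodate;	/* contents are valid */
	bool dirty;	/* contents must be written back */
};

struct toy_page {
	bool dirty;			/* page-level summary bit */
	struct toy_buf buf[TOY_BUFS];	/* filesystem-private detail */
};

/* Assumed generic helper: with no finer knowledge, dirty every buffer. */
void toy_set_page_dirty(struct toy_page *p)
{
	p->dirty = true;
	for (int i = 0; i < TOY_BUFS; i++)
		p->buf[i].dirty = true;
}

/* Naive migration: only the page-level dirty bit travels. */
void toy_migrate_page_bit_only(struct toy_page *src, struct toy_page *dst)
{
	assert(src != dst);
	for (int i = 0; i < TOY_BUFS; i++) {
		dst->buf[i].uptodate = src->buf[i].uptodate;
		dst->buf[i].dirty = false;
	}
	if (src->dirty) {
		src->dirty = false;
		for (int i = 0; i < TOY_BUFS; i++)
			src->buf[i].dirty = false;
		toy_set_page_dirty(dst);
	}
}

/* Mapping-aware migration: the per-buffer state travels too. */
void toy_migrate_with_buffers(struct toy_page *src, struct toy_page *dst)
{
	assert(src != dst);
	dst->dirty = src->dirty;
	memcpy(dst->buf, src->buf, sizeof(dst->buf));
	src->dirty = false;
	for (int i = 0; i < TOY_BUFS; i++)
		src->buf[i].dirty = false;
}

/* Buffers that writeback would write although they hold no valid data. */
int toy_bogus_writes(const struct toy_page *p)
{
	int n = 0;

	for (int i = 0; i < TOY_BUFS; i++)
		if (p->buf[i].dirty && !p->buf[i].uptodate)
			n++;
	return n;
}
```

In the model, a page with one valid dirty buffer and three invalid ones ends up with three bogus writes after the bit-only migration and none after the mapping-aware one.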
Again, this is an example of just _incorrect_ code, that thinks that it
can "just clear the dirty bit". You can't do that. It's wrong. And it is
not wrong just because I say so, but because the operation itself simply
is FUNDAMENTALLY not a sensible one.
This is why I keep harping on this issue: there are two cases, and two
cases only, when you can clear a page's dirty bit. And no, "migrating the
data to another page" was not one of those two cases. The cases are, and
will _always_ be: (a) full writeback IO of _all_ the dirty data on the
page (and that can only be done by the low-level filesystem, since it's
the only one that knows what rules it has followed for marking things
dirty) and (b) cancelling dirty data that got truncated and literally
removed from the filesystem.
So I don't claim that I fixed all the cases. mm/migrate.c is still broken.
Maybe somebody else also uses "clear_page_dirty_for_io()" even though the
name very clearly says FOR IO. I didn't check, but I think they're mostly
right now.
Linus
On Wed, 20 Dec 2006 16:43:31 -0800 (PST)
Linus Torvalds <[email protected]> wrote:
>
>
> On Wed, 20 Dec 2006, Linus Torvalds wrote:
> > >
> > > There's also redirty_page_for_writepage().
> >
> > _dirtying_ a page makes sense in any situation. You can always dirty them.
> > I'm just saying that you can't just mark them *clean*.
> >
> > If your point was that the filesystem had better be able to take care of
> > "redirty_page_for_writepage()", then yes, of course. But since it's the
> > filesystem itself that does it, it had _better_ be able to take care of
> > the situation it puts itself into.
>
> Btw, as an example of something where this may NOT be ok, look at
> migrate_page_copy().
>
> I'm not at all convinced that "migrate_page_copy()" can work at all. It
> does:
>
> ...
> if (PageDirty(page)) {
> clear_page_dirty_for_io(page);
> set_page_dirty(newpage);
Note that this is referring to different pages.
> }
> ...
>
> which is an example of what NOT to do, because it claims to clear the page
> for IO, but doesn't actually _do_ any IO.
>
> And this is wrong, for many reasons.
>
> For example, it's very possible that the old page is not actually
> up-to-date, and is only partially dirty using some FS-specific dirty data
> queues (like NFS does with its dirty data, or buffer-heads can do for
> local filesystems).
afaict the code copes with those things.
> When you do
>
> if (clear_dirty(page))
> set_page_dirty(page);
>
> in generic VM code, that is a BUG. It's an insane operation. It cannot
> work. It's exactly what I'm trying to avoid.
These are different pages.
We could view the copy_highpage() in migrate_page_copy() as an "io"
operation, only the backing store is a new pagecache page.
It'd be more logical if that copy_highpage() was occurring after the
clear_page_dirty_for_io().
I'm not sure why migrate_page_copy() is playing with
PageWriteback(newpage). Surely newpage is locked, in which case nobody
will be starting writeback on it.
> So page migration is probably broken, but it's no less broken than it
> always has been. And I don't think many people use it anyway. It might
> work "by accident" in a lot of situations, but to actually be solid, it
> really would need to do something fundamentally different, like:
>
> - have a per-mapping "migrate()" function that actually knows HOW to
> migrate the dirty state from one page to another.
That is how it's presently implemented. You're looking at helper functions
which filesystems may point their address_space_operations.migratepage at.
> - or, preferably, by just not migrating dirty pages, and just actually
> doing the writeback on them first.
>
> Again, this is an example of just _incorrect_ code, that thinks that it
> can "just clear the dirty bit". You can't do that. It's wrong. And it is
> not wrong just because I say so, but because the operation itself simply
> is FUNDAMENTALLY not a sensible one.
The dirty state is being transferred to the new page. The tricky part is
handling the cases where these pages are mapped into pagetables. That's
what the special migration ptes are there for. I'll let Christoph explain
that lot ;)
On Wed, Dec 20, 2006 at 03:55:25PM -0800, Linus Torvalds wrote:
> On Thu, 21 Dec 2006, David Chinner wrote:
> >
> > XFS appears to call clear_page_dirty to get the mapping tree dirty
> > tag set correctly at the same time the page dirty flag is cleared. I
> > note that this can be done by set_page_writeback() if we clear the
> > dirty flag on the page first when we are writing back the entire page.
>
> Yes. I think the XFS routine should just use "clear_page_dirty_for_io()",
> since that matches what it actually wants to do (surprise surprise, it's
> going to write it out).
Yup ;)
> HOWEVER. Why is it conditional? Can somebody who understands XFS tell me
> why "clear_dirty" would ever be 0? I can grep the sources, and I see that
> it's an unconditional 1 in one call-site, but then in the other one it
> does
>
> xfs_start_page_writeback(page, wbc, !page_dirty, count);
page_dirty starts at the number of dirty buffers on the page, and as
we map each dirty buffer into the I/O we decrement that count.
Hence if we map all of the buffers into the I/O, we are cleaning
the entire page and hence we can clear the dirty state on the page.
> and that part just blows my mind. Why would you do a
> xfs_start_page_writeback() and _not_ write the page out? Is this for a
> partial-page-only case?
Yes, partial-page-only case when doing speculative write clustering. We'll hit
this when an extent boundary is not page aligned (fs block size < page size
case) and we need to issue at least two separate I/Os to clean the page.
Because we leave the page dirty and we are working ahead of the index in
generic_writepages() we'll get the rest of the page flushed when we return
back to generic_writepages() as the page is still dirty in the mapping
tree....
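
The counting scheme described above can be sketched in user space as follows. This is a toy illustration, not the XFS code: `toy_map_dirty_buffers()` and its parameters are hypothetical, and the extent boundary is modelled as a simple buffer index.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_BUFS_PER_PAGE 4

/*
 * Map every dirty buffer below extent_end (exclusive) into the current
 * I/O. Returns true when no dirty buffer is left on the page, which
 * plays the role of the "!page_dirty" argument quoted above.
 */
bool toy_map_dirty_buffers(bool dirty[TOY_BUFS_PER_PAGE], int extent_end,
			   int *mapped)
{
	int page_dirty = 0;

	assert(extent_end >= 0 && extent_end <= TOY_BUFS_PER_PAGE);

	/* Start from the number of dirty buffers on the page. */
	for (int i = 0; i < TOY_BUFS_PER_PAGE; i++)
		if (dirty[i])
			page_dirty++;

	/* Decrement once for each dirty buffer that goes into this I/O. */
	*mapped = 0;
	for (int i = 0; i < extent_end; i++) {
		if (!dirty[i])
			continue;
		dirty[i] = false;	/* now under I/O */
		page_dirty--;
		(*mapped)++;
	}

	return page_dirty == 0;
}
```

With four dirty buffers and an extent that ends after the second one, the first call maps two buffers and reports that the page must stay dirty; a second call covering the rest maps the remaining two and reports that the page is now clean.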
> Anyway, your patch looks fine. It seems to be the right thing to do.
Ok, thanks, Linus.
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
On Wed, 2006-12-20 at 23:15 +0100, Peter Zijlstra wrote:
> I think this is also needed:
NAK
invalidate_inode_pages2() should _not_ be pretending that dirty pages
are clean. This patch is incorrect both for the NFS usage and for the
directIO usage.
In the latter case, if someone has the page mmapped, resulting in the
page getting marked as dirty _after_ a directIO write, then it would be
wrong to discard that data. Only dirty data from _before_ the directIO
write needs to be discarded (and that is achieved by unmapping,
then cleaning the page prior to the directIO call)...
For the NFS case, the race is a bit more tricky, since you have the
"unstable write" case which means that the page is neither marked as
dirty, nor is entirely clean ('cos we don't know that the server has
committed the data to permanent storage yet).
Cheers
Trond
> ---
> mm/truncate.c | 7 +------
> 1 file changed, 1 insertion(+), 6 deletions(-)
>
> Index: linux-2.6/mm/truncate.c
> ===================================================================
> --- linux-2.6.orig/mm/truncate.c
> +++ linux-2.6/mm/truncate.c
> @@ -320,19 +320,14 @@ invalidate_complete_page2(struct address
> if (PagePrivate(page) && !try_to_release_page(page, GFP_KERNEL))
> return 0;
>
> + cancel_dirty_page(page, PAGE_CACHE_SIZE);
> lock_page_ref_irq(page);
> - if (PageDirty(page))
> - goto failed;
> -
> BUG_ON(PagePrivate(page));
> __remove_from_page_cache(page);
> unlock_page_ref_irq(page);
> ClearPageUptodate(page);
> page_cache_release(page); /* pagecache ref */
> return 1;
> -failed:
> - unlock_page_ref_irq(page);
> - return 0;
> }
>
> /**
On Wed, 2006-12-20 at 15:55 -0800, Linus Torvalds wrote:
> > With your change I think what'll happen is that we'll correctly handle the
> > case where the page and its buffers are dirty (it gets left in place), but
> > we'll needlessly fail in the case where the page is dirty but the buffers
> > are clean. How important that will be in practice I do not know. People
> > will get -EIOs where they used not to.
>
> People will now get -EIO where they used to get an inconsistent system
> image. I really think it sounds like an improvement.
The hell it is. You end up with a corrupted page cache because
invalidate_inode_pages2_range() immediately exits without throwing out
the pages in the rest of the range.
I can't see that it is the business of invalidate_inode_pages2() to
resolve races between ->direct_IO() and pages that are redirtied by
mmap(). All it needs to ensure is that pages that are clean are discarded,
since those are neither consistent with data that the ->direct_IO() call
wrote to the disk nor are they scheduled to be written to disk.
The only case that I can see that is still problematic is NFS because it
may have unstable writes (hence the ->launder_page() patch that I posted
yesterday).
Trond
On 12/20/06, Linus Torvalds <[email protected]> wrote:
> Ok, I'll just put my money where my mouth is, and suggest a patch like
> THIS instead.
> Martin, Andrei, does this make any difference for your corruption cases?
Unfortunately, I cannot get the latest git version of the kernel to
boot on the ARM machine on which Martin and I are experiencing the apt
segfault. After the kernel is finished uncompressing it prints "done,
booting the kernel." as expected, but nothing more happens. I have
tried both with and without the patch. Hopefully either Andrei or
Martin will have better luck at testing this patch than I have had.
Gordon
--
Gordon Farquharson
On Thu, 21 Dec 2006, Gordon Farquharson wrote:
>
> Unfortunately, I cannot get the latest git version of the kernel to
> boot on the ARM machine on which Martin and I are experiencing the apt
> segfault.
Ouch.
> After the kernel is finished uncompressing it prints "done,
> booting the kernel." as expected, but nothing more happens. I have
> tried both with and without the patch. Hopefully either Andrei or
> Martin will have better luck at testing this patch than I have had.
That's obviously a bug worth fixing on its own. Do you know when it
started?
That said, I think the patch I sent out should actually work on top of
plain 2.6.19 too. I don't think things have changed in this area that
much. IOW, you don't _need_ latest -git to test it, you just need a broken
kernel ;)
Linus
On Wed, 2006-12-20 at 21:36 -0500, Trond Myklebust wrote:
> On Wed, 2006-12-20 at 23:15 +0100, Peter Zijlstra wrote:
> > I think this is also needed:
>
> NAK
>
> invalidate_inode_pages2() should _not_ be pretending that dirty pages
> are clean. This patch is incorrect both for the NFS usage and for the
> directIO usage.
>
> In the latter case, if someone has the page mmapped, resulting in the
> page getting marked as dirty _after_ a directIO write, then it would be
> wrong to discard that data. Only dirty data from _before_ the directIO
> write needs to be discarded (and that is achieved by unmapping,
> then cleaning the page prior to the directIO call)...
>
> For the NFS case, the race is a bit more tricky, since you have the
> "unstable write" case which means that the page is neither marked as
> dirty, nor is entirely clean ('cos we don't know that the server has
> committed the data to permanent storage yet).
Then this patch:
http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc1/2.6.20-rc1-mm1/broken-out/nfs-fix-nr_file_dirty-underflow.patch
is equally wrong, right?
* Russell King <[email protected]> [2006-12-20 22:11]:
> > This patch doesn't fix my problem (apt segfaults on ARM because its
> > database is corrupted).
>
> Are you using IDE in PIO mode? If so, the bug probably lies there.
I'm using usb-storage. It's used to access an external IDE drive in
a USB enclosure but I don't think it matters that it's IDE since
we're using the SCSI layer to talk to it, right?
--
Martin Michlmayr
http://www.cyrius.com/
* Linus Torvalds <[email protected]> [2006-12-20 23:53]:
> > Unfortunately, I cannot get the latest git version of the kernel to
> > boot on the ARM machine on which Martin and I are experiencing the apt
> > segfault.
>
> Ouch.
>
> That's obviously a bug worth fixing on its own. Do you know when it
> started?
This is a known issue. The following patch has been proposed
http://www.arm.linux.org.uk/developer/patches/viewpatch.php?id=4030/1
although I just noticed that it has been marked as "discarded".
Apparently Russell King committed a better patch so this should be
fixed in git when he sends his next pull request.
--
Martin Michlmayr
http://www.cyrius.com/
On Thu, 21 Dec 2006, Martin Michlmayr wrote:
>
> This is a known issue. The following patch has been proposed
> http://www.arm.linux.org.uk/developer/patches/viewpatch.php?id=4030/1
> although I just noticed that it has been marked as "discarded".
> Apparently Russell King committed a better patch so this should be
> fixed in git when he sends his next pull request.
Ahh, ok. Then it might even be in the set of merges I did earlier today
(and which should mirror out soon enough, hopefully).
Linus
On Thu, 2006-12-21 at 00:03 +0100, Peter Zijlstra wrote:
> current version
Nitpicking ..
> @@ -444,17 +444,18 @@ static int page_mkclean_one(struct page
> if (!pte)
> goto out;
>
> - if (!pte_dirty(*pte) && !pte_write(*pte))
> - goto unlock;
> + while (pte_dirty(*pte) || pte_write(*pte)) {
> + pte_t entry;
>
> - entry = ptep_get_and_clear(mm, address, pte);
> - entry = pte_mkclean(entry);
> - entry = pte_wrprotect(entry);
> - ptep_establish(vma, address, pte, entry);
> - lazy_mmu_prot_update(entry);
> - ret = 1;
> + flush_cache_page(vma, address, pte_pfn(*pte));
> + entry = ptep_clear_flush(vma, address, pte);
> + entry = pte_wrprotect(entry);
> + entry = pte_mkclean(entry);
> + ptep_establish(vma, address, pte, entry);
Now you are flushing the tlb twice. ptep_clear_flush clears the pte and
flushes the tlb, ptep_establish sets the new pte and flushes the tlb.
Not good. Use set_pte_at instead of the ptep_establish.
> + lazy_mmu_prot_update(entry);
> + ret = 1;
> + }
>
> -unlock:
> pte_unmap_unlock(pte, ptl);
> out:
> return ret;
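
The double flush can be made visible with a toy model that simply counts flushes. This is a sketch only: `toy_mmu` and the `toy_*` helpers are hypothetical stand-ins that mimic the flushing behaviour described above, not the real page table primitives.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_RW		0x002ul
#define TOY_DIRTY	0x040ul

/* Hypothetical toy MMU: one PTE plus a counter of TLB flushes. */
struct toy_mmu {
	unsigned long pte;
	unsigned int tlb_flushes;
};

/* Clears the PTE and flushes the TLB, returning the old entry. */
unsigned long toy_ptep_clear_flush(struct toy_mmu *m)
{
	unsigned long old = m->pte;

	m->pte = 0;
	m->tlb_flushes++;
	return old;
}

/* Sets the new PTE and flushes the TLB again. */
void toy_ptep_establish(struct toy_mmu *m, unsigned long entry)
{
	m->pte = entry;
	m->tlb_flushes++;
}

/* Sets the new PTE without any flush. */
void toy_set_pte_at(struct toy_mmu *m, unsigned long entry)
{
	m->pte = entry;
}

/* The sequence from the patch: two flushes per cleaned PTE. */
unsigned int toy_mkclean_double_flush(struct toy_mmu *m)
{
	assert(m != NULL);
	unsigned long entry = toy_ptep_clear_flush(m);

	toy_ptep_establish(m, entry & ~(TOY_RW | TOY_DIRTY));
	return m->tlb_flushes;
}

/* The suggested sequence: one flush per cleaned PTE. */
unsigned int toy_mkclean_single_flush(struct toy_mmu *m)
{
	assert(m != NULL);
	unsigned long entry = toy_ptep_clear_flush(m);

	toy_set_pte_at(m, entry & ~(TOY_RW | TOY_DIRTY));
	return m->tlb_flushes;
}
```

Both variants leave the same write-protected, clean entry behind; only the flush count differs.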
--
blue skies,
Martin.
Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH
"Reality continues to ruin my life." - Calvin.
On 12/21/06, Linus Torvalds <[email protected]> wrote:
> That said, I think the patch I sent out should actually work on top of
> plain 2.6.19 too. I don't think things have changed in this area that
> much. IOW, you don't _need_ latest -git to test it, you just need a broken
> kernel ;)
I created a version of your patch that applied to 2.6.19, but it
doesn't compile:
mm/built-in.o: In function `cancel_dirty_page':
slab.c:(.text+0x8964): undefined reference to `task_io_account_cancelled_write'
make[3]: *** [.tmp_vmlinux1] Error 1
It looks like task_io_account_cancelled_write() was added in
http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=7c3ab7381e79dfc7db14a67c6f4f3285664e1ec2
Can the call to task_io_account_cancelled_write() simply be removed
from cancel_dirty_page() for testing the patch with 2.6.19 (since
2.6.19 doesn't seem to have the task I/O accounting) ?
Gordon
--
Gordon Farquharson
On Thu, 2006-12-21 at 10:16 +0100, Martin Schwidefsky wrote:
> On Thu, 2006-12-21 at 00:03 +0100, Peter Zijlstra wrote:
> > current version
>
> Nitpicking ..
>
> > @@ -444,17 +444,18 @@ static int page_mkclean_one(struct page
> > if (!pte)
> > goto out;
> >
> > - if (!pte_dirty(*pte) && !pte_write(*pte))
> > - goto unlock;
> > + while (pte_dirty(*pte) || pte_write(*pte)) {
> > + pte_t entry;
> >
> > - entry = ptep_get_and_clear(mm, address, pte);
> > - entry = pte_mkclean(entry);
> > - entry = pte_wrprotect(entry);
> > - ptep_establish(vma, address, pte, entry);
> > - lazy_mmu_prot_update(entry);
> > - ret = 1;
> > + flush_cache_page(vma, address, pte_pfn(*pte));
> > + entry = ptep_clear_flush(vma, address, pte);
> > + entry = pte_wrprotect(entry);
> > + entry = pte_mkclean(entry);
> > + ptep_establish(vma, address, pte, entry);
>
> Now you are flushing the tlb twice. ptep_clear_flush clears the pte and
> flushes the tlb, ptep_establish sets the new pte and flushes the tlb.
> Not good. Use set_pte_at instead of the ptep_establish.
Yeah, sorry, I already noticed and corrected that :-|
Also, I'm dubious about the while thing and stuck a WARN_ON(ret) thing
at the beginning of the loop. flush_tlb_page() does IPI the other cpus
to flush their tlb too, so there should not be a SMP race, Arjan?
> > + lazy_mmu_prot_update(entry);
> > + ret = 1;
> > + }
> >
> > -unlock:
> > pte_unmap_unlock(pte, ptl);
> > out:
> > return ret;
>
On Thu, 2006-12-21 at 10:20 +0100, Peter Zijlstra wrote:
> > Now you are flushing the tlb twice. ptep_clear_flush clears the pte and
> > flushes the tlb, ptep_establish sets the new pte and flushes the tlb.
> > Not good. Use set_pte_at instead of the ptep_establish.
>
> Yeah, sorry, I already noticed and corrected that :-|
>
> Also, I'm dubious about the while thing and stuck a WARN_ON(ret) thing
> at the beginning of the loop. flush_tlb_page() does IPI the other cpus
> to flush their tlb too, so there should not be a SMP race, Arjan?
The while loop is protected by the pte lock and flush_tlb_page has to
remove the tlbs on all cpus. So yes, I think the while loop is not
necessary.
--
blue skies,
Martin.
Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH
"Reality continues to ruin my life." - Calvin.
On Thu, 21 Dec 2006 02:17:05 -0700
"Gordon Farquharson" <[email protected]> wrote:
> Can the call to task_io_account_cancelled_write() simply be removed
> from cancel_dirty_page() for testing the patch with 2.6.19 (since
> 2.6.19 doesn't seem to have the task I/O accounting) ?
Yes.
On Thu, Dec 21, 2006 at 09:18:45AM +0100, Martin Michlmayr wrote:
> * Russell King <[email protected]> [2006-12-20 22:11]:
> > > This patch doesn't fix my problem (apt segfaults on ARM because its
> > > database is corrupted).
> >
> > Are you using IDE in PIO mode? If so, the bug probably lies there.
>
> I'm using usb-storage. It's used to access an external IDE drive in
> a USB enclosure but I don't think it matters that it's IDE since
> we're using the SCSI layer to talk to it, right?
USB generally uses DMA so you're probably safe.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:
* Linus Torvalds <[email protected]> [2006-12-20 11:50]:
> Martin, Andrei, does this make any difference for your corruption
> cases?
Works for me.
--
Martin Michlmayr
http://www.cyrius.com/
On Wed, Dec 20, 2006 at 11:53:25PM -0800, Linus Torvalds wrote:
> That's obviously a bug worth fixing on its own. Do you know when it
> started?
My last merge, just before 2.6.19-rc1.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:
On Thu, Dec 21, 2006 at 12:30:22PM +0000, Russell King wrote:
> On Wed, Dec 20, 2006 at 11:53:25PM -0800, Linus Torvalds wrote:
> > That's obviously a bug worth fixing on its own. Do you know when it
> > started?
>
> My last merge, just before 2.6.19-rc1.
Obviously 2.6.20-rc1.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:
On Tue, 2006-12-19 at 09:43 -0800, Linus Torvalds wrote:
>
> Btw,
> here's a totally new tangent on this: it's possible that user code is
> simply BUGGY.
depmod: BADNESS: written outside isize 22183
---
diff --git a/fs/buffer.c b/fs/buffer.c
index d1f1b54..5db9fd9 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2393,6 +2393,17 @@ int nobh_commit_write(struct file *file, struct page *page,
}
EXPORT_SYMBOL(nobh_commit_write);
+static void __check_tail_zero(char *kaddr, unsigned int offset)
+{
+	unsigned int check = 0;
+	do {
+		check += kaddr[offset++];
+	} while (offset < PAGE_CACHE_SIZE);
+	if (check)
+		printk(KERN_ERR "%s: BADNESS: written outside isize %u\n",
+			current->comm, check);
+}
+
/*
* nobh_writepage() - based on block_full_write_page() except
* that it tries to operate without attaching bufferheads to
@@ -2437,6 +2448,7 @@ int nobh_writepage(struct page *page, get_block_t *get_block,
* writes to that region are not written out to the file."
*/
kaddr = kmap_atomic(page, KM_USER0);
+ __check_tail_zero(kaddr, offset);
memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
flush_dcache_page(page);
kunmap_atomic(kaddr, KM_USER0);
@@ -2604,6 +2616,7 @@ int block_write_full_page(struct page *page, get_block_t *get_block,
* writes to that region are not written out to the file."
*/
kaddr = kmap_atomic(page, KM_USER0);
+ __check_tail_zero(kaddr, offset);
memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
flush_dcache_page(page);
kunmap_atomic(kaddr, KM_USER0);
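
A note on the debugging helper above: it sums the tail bytes, so non-zero bytes could in principle cancel out (for example 1 and -1 where `char` is signed), and the printed value is a sum rather than a position. A user-space sketch of a stricter check might look like the following; `toy_first_nonzero_in_tail()` and `TOY_PAGE_SIZE` are hypothetical names, and 4096 is only an assumed page size.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096	/* assumed page size for the sketch */

/*
 * Return the offset of the first non-zero byte at or after 'offset'
 * within the page, or -1 if the whole tail is zero.
 */
long toy_first_nonzero_in_tail(const unsigned char *kaddr, size_t offset)
{
	assert(offset <= TOY_PAGE_SIZE);

	for (size_t i = offset; i < TOY_PAGE_SIZE; i++)
		if (kaddr[i] != 0)
			return (long)i;
	return -1;
}
```

Reporting the first offending offset instead of a sum also tells you how far past the end of the file the stray write landed.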
On Wed, 2006-12-20 at 16:24 -0800, Linus Torvalds wrote:
>
> Btw, I'd really love to hear whether the patch I sent out actually _helps_
> at all, or whether we're just discussing something that in the end is just
> a cleanup..
>
> Martin, Peter, Andrei, pls give it a try. (Martin and Andrei may be
> talking about different bugs, so _both_ of your experiences definitely
> matter here).
with http://lkml.org/lkml/diff/2006/12/20/204/1
I have corruption: Hash check on download completion found bad chunks,
consider using "safe_sync".
>
> Linus
On Thu, 21 Dec 2006, Andrei Popa wrote:
> On Wed, 2006-12-20 at 16:24 -0800, Linus Torvalds wrote:
> >
> > Martin, Peter, Andrei, pls give it a try. (Martin and Andrei may be
> > talking about different bugs, so _both_ of your experiences definitely
> > matter here).
>
> with http://lkml.org/lkml/diff/2006/12/20/204/1
> I have corruption: Hash check on download completion found bad chunks,
> consider using "safe_sync".
Gaah. Martin Michlmayr reported that it apparently fixes his ARM
corruption.
Now, admittedly I already suspected the issues might be different (if only
because of the UP vs SMP/PREEMPT case), but I really had my hopes up after
Martin's report, because if anything, _his_ issue might have been a
superset of your problem (while obviously any subtle SMP races you might
be seeing are definitely not an issue in his case).
Oh well. I think the ARM case is enough of a reason to apply those patches
(if it hadn't made any difference at all, I'd have waited until after
2.6.20), and we'll just have to continue on the SMP PREEMPT angle.
Linus
On Wed, 20 Dec 2006, Trond Myklebust wrote:
>
> I can't see that it is the business of invalidate_inode_pages2() to
> resolve races between ->direct_IO() and pages that are redirtied by
> mmap(). All it needs to ensure is that pages that are clean are discarded,
> since those are neither consistent with data that the ->direct_IO() call
> wrote to the disk nor are they scheduled to be written to disk.
Sure, we could happily just remove the -EIO. Alternatively, we could still
do all the invalidates over the whole range, and return -EIO at the end of
if any of the pages weren't invalidated because they had to be written back.
I don't personally care whether we should just return success or something
to indicate that there were busy pages, but somebody who _uses_ direct-IO
might want to know that the thing didn't throw away everything. If you
know such users, can you ask them?
(Maybe "-EAGAIN" is better than "-EIO", since it's not really even a fatal
error).
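
The "invalidate the whole range, report at the end" alternative could be sketched like this in user space. It is a toy model under stated assumptions: `toy_page` and `toy_invalidate_range()` are hypothetical, and the choice of -EIO over -EAGAIN is left open exactly as in the text above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy page: cached or not, dirty or not. */
struct toy_page {
	bool present;
	bool dirty;
};

/*
 * Drop every clean page in [start, end). Dirty pages are left in
 * place, but the loop keeps going instead of exiting early.
 * Returns 0 if everything was invalidated, -EIO otherwise.
 */
int toy_invalidate_range(struct toy_page *pages, size_t start, size_t end)
{
	int ret = 0;

	assert(start <= end);

	for (size_t i = start; i < end; i++) {
		if (!pages[i].present)
			continue;
		if (pages[i].dirty) {
			ret = -EIO;	/* remember, but do not stop */
			continue;
		}
		pages[i].present = false;
	}
	return ret;
}
```

The caller still learns that something was left behind, but the clean pages later in the range are no longer skipped.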
Linus
On Thu, 21 Dec 2006, Peter Zijlstra wrote:
>
> Also, I'm dubious about the while thing and stuck a WARN_ON(ret) thing
> at the beginning of the loop. flush_tlb_page() does IPI the other cpus
> to flush their tlb too, so there should not be a SMP race, Arjan?
Now, the reason I think the loop may be needed is:
    CPU#0                           CPU#1
    -----                           -----
                                    load old PTE entry
    clear dirty and WP bits
                                    write to page using old PTE
                                    NOT CHECKING that the new one
                                    is write-protected, and just
                                    setting the dirty bit blindly
                                    (but atomically)
    flush_tlb_page()
                                    TLB flushed, but we now have a
                                    page that is marked dirty and
                                    unwritable in the page tables,
                                    and we will mark it clean in
                                    "struct page *"
Now, the scary thing is, IF a CPU does this, then the way we do all this,
we may actually have the following sequence:
CPU#0 CPU#1
----- -----
load old PTE entry
ptep_clear_flush():
atomic "set dirty bit" sequence
PTEP now contains 0000040 !!!
flush_tlb_page();
TLB flushed, but PTEP is still
"dirty zero"
write the clear/readonly PTE
THE DIRTY BIT WAS LOST!
which might actually explain this bug.
I personally _thought_ that Intel CPUs don't actually do a "set dirty
bit atomically" sequence, but more of a "set dirty bit but trap if the TLB
is nonpresent" thing, but I have absolutely no proof for that.
Anyway, IF this is the case, then the following patch may or may not fix
things. It avoids things by never overwriting a PTE entry, not even the
"cleared" one. It always does an atomic "xchg()" with a valid new entry,
and looks at the old bits.
What do you guys think? Does something like this work out for S/390 too? I
tried to make that "ptep_flush_dirty()" concept work for architectures
that hide the dirty bit somewhere else too, but..
It actually simplifies the architecture-specific code (you just need to
implement a trivial "ptep_exchange()" and "ptep_flush_dirty()" macro), but
I only did x86-64 and i386, and while I've booted with this, I haven't
really given the thing a lot of really _deep_ thought.
But I think this might be safer, as per above.. And it _might_ actually
explain the problem. Exactly because the "ptep_clear() + blindly assign to
ptep" might lose a dirty bit that was written by another CPU.
But this really does depend on what a CPU does when it marks a page dirty.
Does it just blindly write the dirty bit? Or does it actually _validate_
that the old page table entry was still present and writable?
This patch makes no assumptions. It should work even if a CPU just writes
the dirty bit blindly, and the only expectation is that the page tables
can be accessed atomically (which had _better_ be true on any SMP
architecture)
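
The difference between "clear, then blindly assign" and "always exchange" can be replayed deterministically in user space with C11 atomics. This is a sketch, not the patch: the `toy_*` names are hypothetical, the bit values mirror the x86 present/writable/dirty bits only for readability, and the "other CPU" is simulated by a callback run inside the race window, under the assumption that it sets the dirty bit blindly but atomically.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PRESENT	0x001u
#define TOY_RW		0x002u
#define TOY_DIRTY	0x040u

typedef _Atomic uint32_t toy_pte_t;

/* Assumed hardware behaviour: set the dirty bit without checking. */
void toy_hw_set_dirty(toy_pte_t *ptep)
{
	atomic_fetch_or(ptep, TOY_DIRTY);
}

/*
 * Clear the entry, let the other CPU run, then store a cleaned value
 * computed from the old snapshot. Returns whether dirty was seen.
 */
bool toy_clean_clear_then_store(toy_pte_t *ptep,
				void (*other_cpu)(toy_pte_t *))
{
	assert(ptep != NULL);
	uint32_t old = atomic_exchange(ptep, 0);	/* "ptep_clear" */

	if (other_cpu)
		other_cpu(ptep);			/* race window */
	atomic_store(ptep, old & ~(TOY_RW | TOY_DIRTY)); /* blind overwrite */
	return (old & TOY_DIRTY) != 0;
}

/*
 * Never store a "cleared" entry: exchange in a valid new entry and
 * look at the bits that were really there at that instant.
 */
bool toy_clean_exchange(toy_pte_t *ptep, void (*other_cpu)(toy_pte_t *))
{
	assert(ptep != NULL);
	uint32_t snap = atomic_load(ptep);
	uint32_t newval = snap & ~(TOY_RW | TOY_DIRTY);

	if (other_cpu)
		other_cpu(ptep);			/* same race window */
	uint32_t old = atomic_exchange(ptep, newval);
	return (old & TOY_DIRTY) != 0;
}
```

In the first variant the callback turns the zeroed entry into 0x040, which the final store then overwrites, so the dirty bit is lost. In the second variant the dirty bit lands on the live entry and comes back in the value returned by the exchange.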
Arjan, can you please check within Intel, and ask what the "proper"
sequence for doing something like this is?
Linus
----
commit 301d2d53ca0e5d2f61b1c1c259da410c7ee6d6a7
Author: Linus Torvalds <[email protected]>
Date: Thu Dec 21 11:11:05 2006 -0800
Rewrite the page table "clear dirty and writable" accesses
This is much simpler for most architectures, and allows us to do the
dirty and writable clear in a single operation without any races or any
double flushes.
It's also much more careful: we never overwrite the old dirty bits at
any time, and always make sure to do atomic memory ops to exchange and
see the old value.
Signed-off-by: Linus Torvalds <[email protected]>
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 9d774d0..8879f1d 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -61,31 +61,6 @@ do { \
})
#endif
-#ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY
-#define ptep_test_and_clear_dirty(__vma, __address, __ptep) \
-({ \
- pte_t __pte = *__ptep; \
- int r = 1; \
- if (!pte_dirty(__pte)) \
- r = 0; \
- else \
- set_pte_at((__vma)->vm_mm, (__address), (__ptep), \
- pte_mkclean(__pte)); \
- r; \
-})
-#endif
-
-#ifndef __HAVE_ARCH_PTEP_CLEAR_DIRTY_FLUSH
-#define ptep_clear_flush_dirty(__vma, __address, __ptep) \
-({ \
- int __dirty; \
- __dirty = ptep_test_and_clear_dirty(__vma, __address, __ptep); \
- if (__dirty) \
- flush_tlb_page(__vma, __address); \
- __dirty; \
-})
-#endif
-
#ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
#define ptep_get_and_clear(__mm, __address, __ptep) \
({ \
diff --git a/include/asm-i386/pgtable.h b/include/asm-i386/pgtable.h
index e6a4723..b61d6f9 100644
--- a/include/asm-i386/pgtable.h
+++ b/include/asm-i386/pgtable.h
@@ -300,18 +300,20 @@ do { \
flush_tlb_page(vma, address); \
} while (0)
-#define __HAVE_ARCH_PTEP_CLEAR_DIRTY_FLUSH
-#define ptep_clear_flush_dirty(vma, address, ptep) \
-({ \
- int __dirty; \
- __dirty = pte_dirty(*(ptep)); \
- if (__dirty) { \
- clear_bit(_PAGE_BIT_DIRTY, &(ptep)->pte_low); \
- pte_update_defer((vma)->vm_mm, (address), (ptep)); \
- flush_tlb_page(vma, address); \
- } \
- __dirty; \
-})
+/*
+ * "ptep_exchange()" can be used to atomically change a set of
+ * page table protection bits, returning the old ones (the dirty
+ * and accessed bits in particular, since they are set by hw).
+ *
+ * "ptep_flush_dirty()" then returns the dirty status of the
+ * page (on x86-64, we just look at the dirty bit in the returned
+ * pte, but some other architectures have the dirty bits in
+ * other places than the page tables).
+ */
+#define ptep_exchange(vma, address, ptep, old, new) \
+ (old).pte_low = xchg(&(ptep)->pte_low, (new).pte_low);
+#define ptep_flush_dirty(vma, address, ptep, old) \
+ pte_dirty(old)
#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
#define ptep_clear_flush_young(vma, address, ptep) \
diff --git a/include/asm-x86_64/pgtable.h b/include/asm-x86_64/pgtable.h
index 59901c6..07754b5 100644
--- a/include/asm-x86_64/pgtable.h
+++ b/include/asm-x86_64/pgtable.h
@@ -283,12 +283,20 @@ static inline pte_t pte_clrhuge(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) &
struct vm_area_struct;
-static inline int ptep_test_and_clear_dirty(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
-{
- if (!pte_dirty(*ptep))
- return 0;
- return test_and_clear_bit(_PAGE_BIT_DIRTY, &ptep->pte);
-}
+/*
+ * "ptep_exchange()" can be used to atomically change a set of
+ * page table protection bits, returning the old ones (the dirty
+ * and accessed bits in particular, since they are set by hw).
+ *
+ * "ptep_flush_dirty()" then returns the dirty status of the
+ * page (on x86-64, we just look at the dirty bit in the returned
+ * pte, but some other architectures have the dirty bits in
+ * other places than the page tables).
+ */
+#define ptep_exchange(vma, address, ptep, old, new) \
+ (old).pte = xchg(&(ptep)->pte, (new).pte);
+#define ptep_flush_dirty(vma, address, ptep, old) \
+ pte_dirty(old)
static inline int ptep_test_and_clear_young(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
{
diff --git a/mm/rmap.c b/mm/rmap.c
index d8a842a..a028803 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -432,7 +432,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma)
{
struct mm_struct *mm = vma->vm_mm;
unsigned long address;
- pte_t *pte, entry;
+ pte_t *ptep;
spinlock_t *ptl;
int ret = 0;
@@ -440,22 +440,24 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma)
if (address == -EFAULT)
goto out;
- pte = page_check_address(page, mm, address, &ptl);
- if (!pte)
- goto out;
-
- if (!pte_dirty(*pte) && !pte_write(*pte))
- goto unlock;
-
- entry = ptep_get_and_clear(mm, address, pte);
- entry = pte_mkclean(entry);
- entry = pte_wrprotect(entry);
- ptep_establish(vma, address, pte, entry);
- lazy_mmu_prot_update(entry);
- ret = 1;
-
-unlock:
- pte_unmap_unlock(pte, ptl);
+ ptep = page_check_address(page, mm, address, &ptl);
+ if (ptep) {
+ pte_t old, new;
+
+ old = *ptep;
+ new = pte_wrprotect(pte_mkclean(old));
+ if (!pte_same(old, new)) {
+ for (;;) {
+ flush_cache_page(vma, address, page_to_pfn(page));
+ ptep_exchange(vma, address, ptep, old, new);
+ if (pte_same(old, new))
+ break;
+ ret |= ptep_flush_dirty(vma, address, ptep, old);
+ flush_tlb_page(vma, address);
+ }
+ }
+ pte_unmap_unlock(ptep, ptl);
+ }
out:
return ret;
}
On Thu, 21 Dec 2006 14:03:20 +0100
Peter Zijlstra <[email protected]> wrote:
> On Tue, 2006-12-19 at 09:43 -0800, Linus Torvalds wrote:
> >
> > Btw,
> > here's a totally new tangent on this: it's possible that user code is
> > simply BUGGY.
>
> depmod: BADNESS: written outside isize 22183
akpm:/usr/src/module-init-tools-3.3-pre1> grep -r mmap .
./zlibsupport.c: map = mmap(0, *size, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0);
So presumably it's in a library.
akpm:/usr/src/25> ldd /sbin/depmod
linux-gate.so.1 => (0xffffe000)
libc.so.6 => /lib/tls/i686/cmov/libc.so.6 (0x46afa000)
/lib/ld-linux.so.2 (0x4631d000)
worrisome.
On 12/21/06, Andrew Morton <[email protected]> wrote:
> > Can the call to task_io_account_cancelled_write() simply be removed
> > from cancel_dirty_page() for testing the patch with 2.6.19 (since
> > 2.6.19 doesn't seem to have the task I/O accounting) ?
>
> Yes.
I tested 2.6.19 with a version of Linus's patch that applies cleanly
to 2.6.19 (patch appended to the end of this email) on ARM and apt-get
failed. It did not segfault this time, but instead got stuck for about
20 to 30 minutes and was accessing the hard drive frequently.
Here is some background about the problem we see with apt which may
help somebody with knowledge of the apt source code analyse the
problem in the context of the patch. When apt-get is first run, it
generates pkgcache.bin and srcpkgcache.bin in /var/cache/apt. We have
found that these are the files that get corrupted when we apply the
patch "mm: tracking shared dirty pages" [1] to 2.6.18. The corruption
of these files is what causes apt-get to segfault. I have observed
that the normal operation of apt-get is that while apt-get is
generating these files, pkgcache.bin grows to 12582912 bytes, and when
apt-get finishes, pkgcache.bin is 6425533 bytes and srcpkgcache.bin is
64254483 bytes. This time, when apt-get exited, it had only created
pkgcache.bin which was still 12582912 bytes. Also, the patch caused
apt to slow down a lot. I ran apt-get -f install after apt had exited,
and it took so long that I killed it before it had finished.
I did not try 2.6.20-git, but I presume that this version is what
Martin tried earlier. Maybe Linus's patch doesn't work with 2.6.19,
because 2.6.19 is missing some other patch.
Gordon
[1] http://www2.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=d08b3851da41d0ee60851f2c75b118e1f7a5fc89
diff -Naupr linux-2.6.19.orig/fs/buffer.c linux-2.6.19/fs/buffer.c
--- linux-2.6.19.orig/fs/buffer.c 2006-11-29 14:57:37.000000000 -0700
+++ linux-2.6.19/fs/buffer.c 2006-12-21 01:16:31.000000000 -0700
@@ -2832,7 +2832,7 @@ int try_to_free_buffers(struct page *pag
int ret = 0;
BUG_ON(!PageLocked(page));
- if (PageWriteback(page))
+ if (PageDirty(page) || PageWriteback(page))
return 0;
if (mapping == NULL) { /* can this still happen? */
@@ -2843,17 +2843,6 @@ int try_to_free_buffers(struct page *pag
spin_lock(&mapping->private_lock);
ret = drop_buffers(page, &buffers_to_free);
spin_unlock(&mapping->private_lock);
- if (ret) {
- /*
- * If the filesystem writes its buffers by hand (eg ext3)
- * then we can have clean buffers against a dirty page. We
- * clean the page here; otherwise later reattachment of buffers
- * could encounter a non-uptodate page, which is unresolvable.
- * This only applies in the rare case where try_to_free_buffers
- * succeeds but the page is not freed.
- */
- clear_page_dirty(page);
- }
out:
if (buffers_to_free) {
struct buffer_head *bh = buffers_to_free;
diff -Naupr linux-2.6.19.orig/fs/hugetlbfs/inode.c linux-2.6.19/fs/hugetlbfs/inode.c
--- linux-2.6.19.orig/fs/hugetlbfs/inode.c 2006-11-29 14:57:37.000000000 -0700
+++ linux-2.6.19/fs/hugetlbfs/inode.c 2006-12-21 01:15:21.000000000 -0700
@@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct
static void truncate_huge_page(struct page *page)
{
- clear_page_dirty(page);
+ cancel_dirty_page(page, /* No IO accounting for huge pages? */0);
ClearPageUptodate(page);
remove_from_page_cache(page);
put_page(page);
diff -Naupr linux-2.6.19.orig/include/linux/page-flags.h linux-2.6.19/include/linux/page-flags.h
--- linux-2.6.19.orig/include/linux/page-flags.h 2006-11-29 14:57:37.000000000 -0700
+++ linux-2.6.19/include/linux/page-flags.h 2006-12-21 01:15:21.000000000 -0700
@@ -253,15 +253,11 @@ static inline void SetPageUptodate(struc
struct page; /* forward declaration */
-int test_clear_page_dirty(struct page *page);
+extern void cancel_dirty_page(struct page *page, unsigned int account_size);
+
int test_clear_page_writeback(struct page *page);
int test_set_page_writeback(struct page *page);
-static inline void clear_page_dirty(struct page *page)
-{
- test_clear_page_dirty(page);
-}
-
static inline void set_page_writeback(struct page *page)
{
test_set_page_writeback(page);
diff -Naupr linux-2.6.19.orig/mm/memory.c linux-2.6.19/mm/memory.c
--- linux-2.6.19.orig/mm/memory.c 2006-11-29 14:57:37.000000000 -0700
+++ linux-2.6.19/mm/memory.c 2006-12-21 01:15:21.000000000 -0700
@@ -1832,6 +1832,33 @@ void unmap_mapping_range(struct address_
}
EXPORT_SYMBOL(unmap_mapping_range);
+static void check_last_page(struct address_space *mapping, loff_t size)
+{
+ pgoff_t index;
+ unsigned int offset;
+ struct page *page;
+
+ if (!mapping)
+ return;
+ offset = size & ~PAGE_MASK;
+ if (!offset)
+ return;
+ index = size >> PAGE_SHIFT;
+ page = find_lock_page(mapping, index);
+ if (page) {
+ unsigned int check = 0;
+ unsigned char *kaddr = kmap_atomic(page, KM_USER0);
+ do {
+ check += kaddr[offset++];
+ } while (offset < PAGE_SIZE);
+ kunmap_atomic(kaddr,KM_USER0);
+ unlock_page(page);
+ page_cache_release(page);
+ if (check)
+ printk("%s: BADNESS: truncate check %u\n", current->comm, check);
+ }
+}
+
/**
* vmtruncate - unmap mappings "freed" by truncate() syscall
* @inode: inode of the file used
@@ -1865,6 +1892,7 @@ do_expand:
goto out_sig;
if (offset > inode->i_sb->s_maxbytes)
goto out_big;
+ check_last_page(mapping, inode->i_size);
i_size_write(inode, offset);
out_truncate:
diff -Naupr linux-2.6.19.orig/mm/page-writeback.c linux-2.6.19/mm/page-writeback.c
--- linux-2.6.19.orig/mm/page-writeback.c 2006-11-29 14:57:37.000000000 -0700
+++ linux-2.6.19/mm/page-writeback.c 2006-12-21 01:26:53.000000000 -0700
@@ -843,39 +843,6 @@ int set_page_dirty_lock(struct page *pag
EXPORT_SYMBOL(set_page_dirty_lock);
/*
- * Clear a page's dirty flag, while caring for dirty memory accounting.
- * Returns true if the page was previously dirty.
- */
-int test_clear_page_dirty(struct page *page)
-{
- struct address_space *mapping = page_mapping(page);
- unsigned long flags;
-
- if (mapping) {
- write_lock_irqsave(&mapping->tree_lock, flags);
- if (TestClearPageDirty(page)) {
- radix_tree_tag_clear(&mapping->page_tree,
- page_index(page),
- PAGECACHE_TAG_DIRTY);
- write_unlock_irqrestore(&mapping->tree_lock, flags);
- /*
- * We can continue to use `mapping' here because the
- * page is locked, which pins the address_space
- */
- if (mapping_cap_account_dirty(mapping)) {
- page_mkclean(page);
- dec_zone_page_state(page, NR_FILE_DIRTY);
- }
- return 1;
- }
- write_unlock_irqrestore(&mapping->tree_lock, flags);
- return 0;
- }
- return TestClearPageDirty(page);
-}
-EXPORT_SYMBOL(test_clear_page_dirty);
-
-/*
* Clear a page's dirty flag, while caring for dirty memory accounting.
* Returns true if the page was previously dirty.
*
diff -Naupr linux-2.6.19.orig/mm/truncate.c linux-2.6.19/mm/truncate.c
--- linux-2.6.19.orig/mm/truncate.c 2006-11-29 14:57:37.000000000 -0700
+++ linux-2.6.19/mm/truncate.c 2006-12-21 15:58:18.000000000 -0700
@@ -50,6 +50,17 @@ static inline void truncate_partial_page
do_invalidatepage(page, partial);
}
+void cancel_dirty_page(struct page *page, unsigned int account_size)
+{
+ /* If we're cancelling the page, it had better not be mapped any more */
+ if (page_mapped(page)) {
+ static unsigned int warncount;
+
+ WARN_ON(++warncount < 5);
+ }
+}
+
+
/*
* If truncate cannot remove the fs-private metadata from the page, the page
* becomes anonymous. It will be left on the LRU and may even be mapped into
@@ -69,7 +80,8 @@ truncate_complete_page(struct address_sp
if (PagePrivate(page))
do_invalidatepage(page, 0);
- clear_page_dirty(page);
+ cancel_dirty_page(page, PAGE_CACHE_SIZE);
+
ClearPageUptodate(page);
ClearPageMappedToDisk(page);
remove_from_page_cache(page);
@@ -348,7 +360,6 @@ int invalidate_inode_pages2_range(struct
for (i = 0; !ret && i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];
pgoff_t page_index;
- int was_dirty;
lock_page(page);
if (page->mapping != mapping) {
@@ -384,12 +395,8 @@ int invalidate_inode_pages2_range(struct
PAGE_CACHE_SIZE, 0);
}
}
- was_dirty = test_clear_page_dirty(page);
- if (!invalidate_complete_page2(mapping, page)) {
- if (was_dirty)
- set_page_dirty(page);
+ if (!invalidate_complete_page2(mapping, page))
ret = -EIO;
- }
unlock_page(page);
}
pagevec_release(&pvec);
--
Gordon Farquharson
On Thu, 21 Dec 2006, Gordon Farquharson wrote:
>
> I tested 2.6.19 with a version of Linus's patch that applies cleanly
> to 2.6.19 (patch appended to the end of this email) on ARM and apt-get
> failed. It did not segfault this time, but instead got stuck for about
> 20 to 30 minutes and was accessing the hard drive frequently.
Ok, there's definitely something screwy going on.
Andrew located at least one bug: we run cancel_dirty_page() too late in
"truncate_complete_page()", which means that do_invalidatepage() ends up
not clearing the page cache.
His patch is appended.
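
As a rough illustration of why the order matters, here is a toy userspace model, not kernel code. It assumes only what the patches in this thread state: try_to_free_buffers() now returns early for a dirty page, and cancel_dirty_page() clears the dirty flag. All toy_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct page; only the two properties that matter here. */
struct toy_page {
    bool dirty;        /* models PageDirty() */
    bool has_buffers;  /* models attached buffer_heads */
};

/* Like the patched try_to_free_buffers(): refuse to touch a dirty page. */
static bool toy_try_to_free_buffers(struct toy_page *page)
{
    assert(page != NULL);
    if (page->dirty)
        return false;
    page->has_buffers = false;
    return true;
}

/* Models cancel_dirty_page(): just drop the dirty flag. */
static void toy_cancel_dirty_page(struct toy_page *page)
{
    assert(page != NULL);
    page->dirty = false;
}

/* Old order: try to free the buffers first, cancel dirtiness afterwards. */
static bool toy_truncate_old_order(struct toy_page *page)
{
    bool freed = toy_try_to_free_buffers(page);

    toy_cancel_dirty_page(page);
    return freed;
}

/* Fixed order: cancel dirtiness first so the buffers can be freed. */
static bool toy_truncate_new_order(struct toy_page *page)
{
    toy_cancel_dirty_page(page);
    return toy_try_to_free_buffers(page);
}
```

With a page that starts out dirty and has buffers attached, toy_truncate_old_order() leaves the buffers in place, while toy_truncate_new_order() frees them.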
But it sounds like I probably misunderstood something, because I thought
that Martin had acknowledged that this patch actually worked for him.
Which sounded very similar to your setup (he has a 32M ARM box too, no?)
And your failure sounds a lot like one that David Miller is reporting. At
the same time, my own shared file mmap tests on my own machines obviously
work fine (I lower the dirty-writeback thresholds to force writeback more
easily, and then mmap a file and write and rewrite to it in memory, and
truncate it).
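
For anyone who wants to try something similar, a test along those lines might look roughly like the sketch below. This is not the actual test program, just an assumption about its general shape: the function name, the fill pattern and the sizes are invented, and it does not lower the writeback thresholds, so passing proves little on a machine that is not under memory pressure.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write a pattern through a shared mapping several times, truncate the file
 * to `keep` bytes while it is still mapped, then read the file back with
 * pread() and compare. The file at `path` is created and removed again.
 * Returns 0 if the surviving bytes match, -1 on any error or mismatch.
 */
static int mmap_truncate_check(const char *path, size_t len, size_t keep)
{
    unsigned char buf[4096];
    unsigned char *map;
    struct stat st;
    size_t off = 0;
    int trunc_err;
    int ret = -1;
    int fd;

    assert(len > 0 && keep <= len);

    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) != 0)
        goto out;

    map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        goto out;

    /* Write and rewrite; the final pass leaves byte i equal to (i & 0xff). */
    for (unsigned int pass = 0; pass < 4; pass++)
        for (size_t i = 0; i < len; i++)
            map[i] = (unsigned char)(i + 3 - pass);

    /* Truncate while the mapping is still live, then drop the mapping. */
    trunc_err = ftruncate(fd, (off_t)keep);
    if (munmap(map, len) != 0 || trunc_err != 0)
        goto out;

    if (fstat(fd, &st) != 0 || (size_t)st.st_size != keep)
        goto out;

    /* Read the surviving part back through the normal file interface. */
    while (off < keep) {
        size_t want = keep - off < sizeof(buf) ? keep - off : sizeof(buf);
        ssize_t got = pread(fd, buf, want, (off_t)off);

        if (got <= 0)
            goto out;
        for (size_t j = 0; j < (size_t)got; j++)
            if (buf[j] != (unsigned char)(off + j))
                goto out;
        off += (size_t)got;
    }
    ret = 0;
out:
    close(fd);
    unlink(path);
    return ret;
}
```

Running it in a loop on the affected filesystem, with a `keep` value that is not a multiple of the page size, is closer to what apt does to pkgcache.bin than a single call.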
Maybe it's a mount option issue? I've got data=ordered on my machine, are
you perhaps running with something else?
Linus
---
commit 3e67c0987d7567ad666641164a153dca9a43b11d
Author: Andrew Morton <[email protected]>
Date: Thu Dec 21 11:00:33 2006 -0800
[PATCH] truncate: clear page dirtiness before running try_to_free_buffers()
truncate presently invalidates the dirty page's buffer_heads then shoots down
the page. But try_to_free_buffers() will now bale out because the page is
dirty.
Net effect: the LRU gets filled with dirty pages which have invalidated
buffer_heads attached. They have no ->mapping and hence cannot be cleaned.
The machine leaks memory at an enormous rate.
Fix this by cleaning the page before running try_to_free_buffers(), so
try_to_free_buffers() can do its work.
Also, remember to do dirty-page-accounting in cancel_dirty_page() so the
machine won't wedge up trying to write non-existent dirty pages.
Probably still wrong, but now less so.
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
diff --git a/mm/truncate.c b/mm/truncate.c
index bf9e296..89a5c35 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -60,11 +60,12 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
WARN_ON(++warncount < 5);
}
- if (TestClearPageDirty(page) && account_size)
+ if (TestClearPageDirty(page) && account_size) {
+ dec_zone_page_state(page, NR_FILE_DIRTY);
task_io_account_cancelled_write(account_size);
+ }
}
-
/*
* If truncate cannot remove the fs-private metadata from the page, the page
* becomes anonymous. It will be left on the LRU and may even be mapped into
@@ -81,11 +82,11 @@ truncate_complete_page(struct address_space *mapping, struct page *page)
if (page->mapping != mapping)
return;
+ cancel_dirty_page(page, PAGE_CACHE_SIZE);
+
if (PagePrivate(page))
do_invalidatepage(page, 0);
- cancel_dirty_page(page, PAGE_CACHE_SIZE);
-
ClearPageUptodate(page);
ClearPageMappedToDisk(page);
remove_from_page_cache(page);
* Linus Torvalds <[email protected]> [2006-12-21 20:54]:
> But it sounds like I probably misunderstood something, because I thought
> that Martin had acknowledged that this patch actually worked for him.
That's what I thought too but now I can confirm what Gordon sees. But
it's pretty weird. Our testcase is to run Debian installer on the
NSLU2 arm device and apt-get would either segfault or hang at this
particular spot in the installation (when apt is first run). With
your patch, apt works correctly where it normally fails (at least for
me). I stopped the installation at this point and repeated it several
more times to make sure it's really working. And, yes, I can repeat
this result.
This time, however, I let the installer continue and it seems that
with your patch apt now works where it failed in the past, but it
hangs later on. It's pretty weird because I cannot even kill the
process:
sh-3.1# ps aux | grep 31126
root 31126 5.7 20.6 16240 6076 ? R+ 04:45 0:21 apt-get -o APT::Status-Fd=4 -o APT::Keep-Fds::=5 -o APT::Keep-Fds::=6 -q -y -f install popularity-contest
root 31157 0.0 1.6 1516 492 ttyS0 S+ 04:51 0:00 grep 31126
sh-3.1# kill -9 31126
sh-3.1# kill -9 31126
sh-3.1# ps aux | grep 31126
root 31126 5.6 20.6 16240 6076 ? R+ 04:45 0:21 apt-get -o APT::Status-Fd=4 -o APT::Keep-Fds::=5 -o APT::Keep-Fds::=6 -q -y -f install popularity-contest
root 31159 0.0 1.6 1516 492 ttyS0 S+ 04:51 0:00 grep 31126
sh-3.1#
> Which sounded very similar to your setup (he has a 32M ARM box too, no?)
It's the same device, a Linksys NSLU2.
> Author: Andrew Morton <[email protected]>
This patch makes it even worse for me.
> - if (TestClearPageDirty(page) && account_size)
> + if (TestClearPageDirty(page) && account_size) {
> + dec_zone_page_state(page, NR_FILE_DIRTY);
> task_io_account_cancelled_write(account_size);
> + }
This hunk (on top of git from about 2 days ago and your latest patch)
results in the installer hanging right at the start. The Linux kernel
boots fine, the debian-installer is loaded into a ramdisk but when
ncurses is being started it just hangs. Reverting this hunk makes it
start again.
Does that help or confuse you even more?
--
Martin Michlmayr
http://www.cyrius.com/
* Gordon Farquharson <[email protected]> [2006-12-21 21:20]:
> generating these files, pkgcache.bin grows to 12582912 bytes, and when
> apt-get finishes, pkgcache.bin is 6425533 bytes and srcpkgcache.bin is
> 64254483 bytes. This time, when apt-get exited, it had only created
> pkgcache.bin which was still 12582912 bytes.
Yes, same here:
sh-3.1# ls -l /var/cache/apt/
total 5252
drwxr-xr-x 3 root root 12288 Dec 22 04:41 archives
-rw-r--r-- 1 root root 12582912 Dec 22 04:45 pkgcache.bin
-rw-r--r-- 1 root root 8554 Dec 22 04:45 srcpkgcache.bin
Gordon, does it fail for you where it normally does (installing
initramfs-tools) or much later? For me, the installer was able to
install initramfs-tools and the kernel, but apt now hangs at "Select
and install software".
--
Martin Michlmayr
http://www.cyrius.com/
* Martin Michlmayr <[email protected]> [2006-12-22 11:00]:
> This time, however, I let the installer continue and it seems that
> with your patch apt now works where it failed in the past, but it
> hangs later on. It's pretty weird because I cannot even kill the
> process:
Okay, it's really weird. So apt-get just hangs doing nothing and I
cannot even kill it. I just tried to download strace via wget and
immediately when I started wget, the hanging apt-get process
continued.
--
Martin Michlmayr
http://www.cyrius.com/
* Martin Michlmayr <[email protected]> [2006-12-22 11:06]:
> Okay, it's really weird. So apt-get just hangs doing nothing and I
> cannot even kill it. I just tried to download strace via wget and
> immediately when I started wget, the hanging apt-get process
> continued.
... and now that we've completed this step, the apt cache has suddenly
been reduced (see Gordon's mail for an explanation) and it segfaults:
sh-3.1# ls -l /var/cache/apt/
total 12524
drwxr-xr-x 3 root root 12288 Dec 22 04:41 archives
-rw-r--r-- 1 root root 6426885 Dec 22 05:03 pkgcache.bin
-rw-r--r-- 1 root root 6426835 Dec 22 05:03 srcpkgcache.bin
sh-3.1# apt-get -f install
Reading package lists... Done
Segmentation faulty tree... 50%
--
Martin Michlmayr
http://www.cyrius.com/
On Fri, 22 Dec 2006 11:00:04 +0100
Martin Michlmayr <[email protected]> wrote:
> > - if (TestClearPageDirty(page) && account_size)
> > + if (TestClearPageDirty(page) && account_size) {
> > + dec_zone_page_state(page, NR_FILE_DIRTY);
> > task_io_account_cancelled_write(account_size);
> > + }
>
> This hunk (on top of git from about 2 days ago and your latest patch)
> results in the installer hanging right at the start.
You'll need this also:
From: Andrew Morton <[email protected]>
Only (un)account for IO and page-dirtying for devices which have real backing
store (ie: not tmpfs or ramdisks).
Cc: "David S. Miller" <[email protected]>
Cc: Linus Torvalds <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
mm/truncate.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff -puN mm/truncate.c~truncate-dirty-memory-accounting-fix mm/truncate.c
--- a/mm/truncate.c~truncate-dirty-memory-accounting-fix
+++ a/mm/truncate.c
@@ -60,7 +60,8 @@ void cancel_dirty_page(struct page *page
WARN_ON(++warncount < 5);
}
- if (TestClearPageDirty(page) && account_size) {
+ if (TestClearPageDirty(page) && account_size &&
+ mapping_cap_account_dirty(page->mapping)) {
dec_zone_page_state(page, NR_FILE_DIRTY);
task_io_account_cancelled_write(account_size);
}
_
* Martin Michlmayr <[email protected]> [2006-12-22 11:10]:
> > immediately when I started wget, the hanging apt-get process
> > continued.
> ... and now that we've completed this step, the apt cache has suddenly
> been reduced (see Gordon's mail for an explanation) and it segfaults:
One of my questions was why apt-get worked to install the
initramfs-tools, the kernel and some other packages but later hung
while it was building the cache (which clearly it had built already to
install some packages): before the installer offers to install
additional packages, it changes the apt sources, which leads to apt
rebuilding the cache, and here it hangs.
Remember how I said that downloading a file with wget prompts apt to
work again? Apparently any filesystem access will do (I just ran
find / > /dev/null). Gordon, can you confirm this?
--
Martin Michlmayr
http://www.cyrius.com/
* Andrew Morton <[email protected]> [2006-12-22 02:17]:
> > This hunk (on top of git from about 2 days ago and your latest patch)
> > results in the installer hanging right at the start.
>
> You'll need this also:
It starts again, thanks.
--
Martin Michlmayr
http://www.cyrius.com/
With all three patches I have corruption....
diff --git a/fs/buffer.c b/fs/buffer.c
index d1f1b54..263f88e 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *pag
int ret = 0;
BUG_ON(!PageLocked(page));
- if (PageWriteback(page))
+ if (PageDirty(page) || PageWriteback(page))
return 0;
if (mapping == NULL) { /* can this still happen? */
@@ -2845,22 +2845,6 @@ int try_to_free_buffers(struct page *pag
spin_lock(&mapping->private_lock);
ret = drop_buffers(page, &buffers_to_free);
spin_unlock(&mapping->private_lock);
- if (ret) {
- /*
- * If the filesystem writes its buffers by hand (eg ext3)
- * then we can have clean buffers against a dirty page. We
- * clean the page here; otherwise later reattachment of buffers
- * could encounter a non-uptodate page, which is unresolvable.
- * This only applies in the rare case where try_to_free_buffers
- * succeeds but the page is not freed.
- *
- * Also, during truncate, discard_buffer will have marked all
- * the page's buffers clean. We discover that here and clean
- * the page also.
- */
- if (test_clear_page_dirty(page))
- task_io_account_cancelled_write(PAGE_CACHE_SIZE);
- }
out:
if (buffers_to_free) {
struct buffer_head *bh = buffers_to_free;
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index ed2c223..4f4cd13 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct
static void truncate_huge_page(struct page *page)
{
- clear_page_dirty(page);
+ cancel_dirty_page(page, /* No IO accounting for huge pages? */0);
ClearPageUptodate(page);
remove_from_page_cache(page);
put_page(page);
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 9d774d0..8879f1d 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -61,31 +61,6 @@ ({ \
})
#endif
-#ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_DIRTY
-#define ptep_test_and_clear_dirty(__vma, __address, __ptep) \
-({ \
- pte_t __pte = *__ptep; \
- int r = 1; \
- if (!pte_dirty(__pte)) \
- r = 0; \
- else \
- set_pte_at((__vma)->vm_mm, (__address), (__ptep), \
- pte_mkclean(__pte)); \
- r; \
-})
-#endif
-
-#ifndef __HAVE_ARCH_PTEP_CLEAR_DIRTY_FLUSH
-#define ptep_clear_flush_dirty(__vma, __address, __ptep) \
-({ \
- int __dirty; \
- __dirty = ptep_test_and_clear_dirty(__vma, __address, __ptep); \
- if (__dirty) \
- flush_tlb_page(__vma, __address); \
- __dirty; \
-})
-#endif
-
#ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
#define ptep_get_and_clear(__mm, __address, __ptep) \
({ \
diff --git a/include/asm-i386/pgtable.h b/include/asm-i386/pgtable.h
index e6a4723..b61d6f9 100644
--- a/include/asm-i386/pgtable.h
+++ b/include/asm-i386/pgtable.h
@@ -300,18 +300,20 @@ do { \
flush_tlb_page(vma, address); \
} while (0)
-#define __HAVE_ARCH_PTEP_CLEAR_DIRTY_FLUSH
-#define ptep_clear_flush_dirty(vma, address, ptep) \
-({ \
- int __dirty; \
- __dirty = pte_dirty(*(ptep)); \
- if (__dirty) { \
- clear_bit(_PAGE_BIT_DIRTY, &(ptep)->pte_low); \
- pte_update_defer((vma)->vm_mm, (address), (ptep)); \
- flush_tlb_page(vma, address); \
- } \
- __dirty; \
-})
+/*
+ * "ptep_exchange()" can be used to atomically change a set of
+ * page table protection bits, returning the old ones (the dirty
+ * and accessed bits in particular, since they are set by hw).
+ *
+ * "ptep_flush_dirty()" then returns the dirty status of the
+ * page (on x86-64, we just look at the dirty bit in the returned
+ * pte, but some other architectures have the dirty bits in
+ * other places than the page tables).
+ */
+#define ptep_exchange(vma, address, ptep, old, new) \
+ (old).pte_low = xchg(&(ptep)->pte_low, (new).pte_low);
+#define ptep_flush_dirty(vma, address, ptep, old) \
+ pte_dirty(old)
#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
#define ptep_clear_flush_young(vma, address, ptep) \
diff --git a/include/asm-x86_64/pgtable.h b/include/asm-x86_64/pgtable.h
index 59901c6..07754b5 100644
--- a/include/asm-x86_64/pgtable.h
+++ b/include/asm-x86_64/pgtable.h
@@ -283,12 +283,20 @@ static inline pte_t pte_clrhuge(pte_t pt
struct vm_area_struct;
-static inline int ptep_test_and_clear_dirty(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
-{
- if (!pte_dirty(*ptep))
- return 0;
- return test_and_clear_bit(_PAGE_BIT_DIRTY, &ptep->pte);
-}
+/*
+ * "ptep_exchange()" can be used to atomically change a set of
+ * page table protection bits, returning the old ones (the dirty
+ * and accessed bits in particular, since they are set by hw).
+ *
+ * "ptep_flush_dirty()" then returns the dirty status of the
+ * page (on x86-64, we just look at the dirty bit in the returned
+ * pte, but some other architectures have the dirty bits in
+ * other places than the page tables).
+ */
+#define ptep_exchange(vma, address, ptep, old, new) \
+ (old).pte = xchg(&(ptep)->pte, (new).pte);
+#define ptep_flush_dirty(vma, address, ptep, old) \
+ pte_dirty(old)
static inline int ptep_test_and_clear_young(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
{
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 4830a3b..350878a 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -253,15 +253,11 @@ #define ClearPageUncached(page) clear_bi
struct page; /* forward declaration */
-int test_clear_page_dirty(struct page *page);
+extern void cancel_dirty_page(struct page *page, unsigned int account_size);
+
int test_clear_page_writeback(struct page *page);
int test_set_page_writeback(struct page *page);
-static inline void clear_page_dirty(struct page *page)
-{
- test_clear_page_dirty(page);
-}
-
static inline void set_page_writeback(struct page *page)
{
test_set_page_writeback(page);
diff --git a/mm/memory.c b/mm/memory.c
index c00bac6..79cecab 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1842,6 +1842,33 @@ void unmap_mapping_range(struct address_
}
EXPORT_SYMBOL(unmap_mapping_range);
+static void check_last_page(struct address_space *mapping, loff_t size)
+{
+ pgoff_t index;
+ unsigned int offset;
+ struct page *page;
+
+ if (!mapping)
+ return;
+ offset = size & ~PAGE_MASK;
+ if (!offset)
+ return;
+ index = size >> PAGE_SHIFT;
+ page = find_lock_page(mapping, index);
+ if (page) {
+ unsigned int check = 0;
+ unsigned char *kaddr = kmap_atomic(page, KM_USER0);
+ do {
+ check += kaddr[offset++];
+ } while (offset < PAGE_SIZE);
+ kunmap_atomic(kaddr,KM_USER0);
+ unlock_page(page);
+ page_cache_release(page);
+ if (check)
+ printk("%s: BADNESS: truncate check %u\n", current->comm, check);
+ }
+}
+
/**
* vmtruncate - unmap mappings "freed" by truncate() syscall
* @inode: inode of the file used
@@ -1875,6 +1902,7 @@ do_expand:
goto out_sig;
if (offset > inode->i_sb->s_maxbytes)
goto out_big;
+ check_last_page(mapping, inode->i_size);
i_size_write(inode, offset);
out_truncate:
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 237107c..b3a198c 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -845,38 +845,6 @@ int set_page_dirty_lock(struct page *pag
EXPORT_SYMBOL(set_page_dirty_lock);
/*
- * Clear a page's dirty flag, while caring for dirty memory accounting.
- * Returns true if the page was previously dirty.
- */
-int test_clear_page_dirty(struct page *page)
-{
- struct address_space *mapping = page_mapping(page);
- unsigned long flags;
-
- if (!mapping)
- return TestClearPageDirty(page);
-
- write_lock_irqsave(&mapping->tree_lock, flags);
- if (TestClearPageDirty(page)) {
- radix_tree_tag_clear(&mapping->page_tree,
- page_index(page), PAGECACHE_TAG_DIRTY);
- write_unlock_irqrestore(&mapping->tree_lock, flags);
- /*
- * We can continue to use `mapping' here because the
- * page is locked, which pins the address_space
- */
- if (mapping_cap_account_dirty(mapping)) {
- page_mkclean(page);
- dec_zone_page_state(page, NR_FILE_DIRTY);
- }
- return 1;
- }
- write_unlock_irqrestore(&mapping->tree_lock, flags);
- return 0;
-}
-EXPORT_SYMBOL(test_clear_page_dirty);
-
-/*
* Clear a page's dirty flag, while caring for dirty memory accounting.
* Returns true if the page was previously dirty.
*
diff --git a/mm/rmap.c b/mm/rmap.c
index d8a842a..a028803 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -432,7 +432,7 @@ static int page_mkclean_one(struct page
{
struct mm_struct *mm = vma->vm_mm;
unsigned long address;
- pte_t *pte, entry;
+ pte_t *ptep;
spinlock_t *ptl;
int ret = 0;
@@ -440,22 +440,24 @@ static int page_mkclean_one(struct page
if (address == -EFAULT)
goto out;
- pte = page_check_address(page, mm, address, &ptl);
- if (!pte)
- goto out;
-
- if (!pte_dirty(*pte) && !pte_write(*pte))
- goto unlock;
-
- entry = ptep_get_and_clear(mm, address, pte);
- entry = pte_mkclean(entry);
- entry = pte_wrprotect(entry);
- ptep_establish(vma, address, pte, entry);
- lazy_mmu_prot_update(entry);
- ret = 1;
-
-unlock:
- pte_unmap_unlock(pte, ptl);
+ ptep = page_check_address(page, mm, address, &ptl);
+ if (ptep) {
+ pte_t old, new;
+
+ old = *ptep;
+ new = pte_wrprotect(pte_mkclean(old));
+ if (!pte_same(old, new)) {
+ for (;;) {
+ flush_cache_page(vma, address, page_to_pfn(page));
+ ptep_exchange(vma, address, ptep, old, new);
+ if (pte_same(old, new))
+ break;
+ ret |= ptep_flush_dirty(vma, address, ptep, old);
+ flush_tlb_page(vma, address);
+ }
+ }
+ pte_unmap_unlock(ptep, ptl);
+ }
out:
return ret;
}
diff --git a/mm/truncate.c b/mm/truncate.c
index 9bfb8e8..4a38dd1 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -51,6 +51,22 @@ static inline void truncate_partial_page
do_invalidatepage(page, partial);
}
+void cancel_dirty_page(struct page *page, unsigned int account_size)
+{
+ /* If we're cancelling the page, it had better not be mapped any more */
+ if (page_mapped(page)) {
+ static unsigned int warncount;
+
+ WARN_ON(++warncount < 5);
+ }
+
+ if (TestClearPageDirty(page) && account_size &&
+ mapping_cap_account_dirty(page->mapping)) {
+ dec_zone_page_state(page, NR_FILE_DIRTY);
+ task_io_account_cancelled_write(account_size);
+ }
+}
+
/*
* If truncate cannot remove the fs-private metadata from the page, the page
* becomes anonymous. It will be left on the LRU and may even be mapped into
@@ -67,11 +83,11 @@ truncate_complete_page(struct address_sp
if (page->mapping != mapping)
return;
+ cancel_dirty_page(page, PAGE_CACHE_SIZE);
+
if (PagePrivate(page))
do_invalidatepage(page, 0);
- if (test_clear_page_dirty(page))
- task_io_account_cancelled_write(PAGE_CACHE_SIZE);
ClearPageUptodate(page);
ClearPageMappedToDisk(page);
remove_from_page_cache(page);
@@ -350,7 +366,6 @@ int invalidate_inode_pages2_range(struct
for (i = 0; !ret && i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];
pgoff_t page_index;
- int was_dirty;
lock_page(page);
if (page->mapping != mapping) {
@@ -386,12 +401,8 @@ int invalidate_inode_pages2_range(struct
PAGE_CACHE_SIZE, 0);
}
}
- was_dirty = test_clear_page_dirty(page);
- if (!invalidate_complete_page2(mapping, page)) {
- if (was_dirty)
- set_page_dirty(page);
+ if (!invalidate_complete_page2(mapping, page))
ret = -EIO;
- }
unlock_page(page);
}
pagevec_release(&pvec);
On Fri, 2006-12-22 at 02:17 -0800, Andrew Morton wrote:
> On Fri, 22 Dec 2006 11:00:04 +0100
> Martin Michlmayr <[email protected]> wrote:
>
> > > - if (TestClearPageDirty(page) && account_size)
> > > + if (TestClearPageDirty(page) && account_size) {
> > > + dec_zone_page_state(page, NR_FILE_DIRTY);
> > > task_io_account_cancelled_write(account_size);
> > > + }
> >
> > This hunk (on top of git from about 2 days ago and your latest patch)
> > results in the installer hanging right at the start.
>
> You'll need this also:
>
> From: Andrew Morton <[email protected]>
>
> Only (un)account for IO and page-dirtying for devices which have real backing
> store (ie: not tmpfs or ramdisks).
>
> Cc: "David S. Miller" <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> ---
>
> mm/truncate.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff -puN mm/truncate.c~truncate-dirty-memory-accounting-fix mm/truncate.c
> --- a/mm/truncate.c~truncate-dirty-memory-accounting-fix
> +++ a/mm/truncate.c
> @@ -60,7 +60,8 @@ void cancel_dirty_page(struct page *page
> WARN_ON(++warncount < 5);
> }
>
> - if (TestClearPageDirty(page) && account_size) {
> + if (TestClearPageDirty(page) && account_size &&
> + mapping_cap_account_dirty(page->mapping)) {
> dec_zone_page_state(page, NR_FILE_DIRTY);
> task_io_account_cancelled_write(account_size);
> }
> _
>
* Andrei Popa <[email protected]> [2006-12-22 14:24]:
> With all three patches I have corruption....
I've completed one installation with Linus' patch plus the two from
Andrew successfully, but I'm currently trying again... but I really
need a better testcase since an installation takes about an hour.
Andrei, which torrent do you download as a testcase? It would be good
if someone could suggest a torrent which is legal and not too large.
--
Martin Michlmayr
http://www.cyrius.com/
* Martin Michlmayr <[email protected]> [2006-12-22 13:32]:
> I've completed one installation with Linus' patch plus the two from
> Andrew successfully, but I'm currently trying again...
... and it failed.
--
Martin Michlmayr
http://www.cyrius.com/
Marc Haber wrote:
> After updating to 2.6.19, Debian's apt control file
> /var/cache/apt/pkgcache.bin corrupts pretty frequently - like in under
> six hours. In that situation, "aptitude update" segfaults. When I
> delete the file and have apt recreate it, things are fine again for a
> few hours before the file is broken again and the segfault start over.
> In all cases, umounting the file system and doing an fsck does not
> show issues with the file system.
Are you using wireless networking of any kind? If so which driver and
security key system? Might be useful if you could post 'dmesg' output so
that people can see the other hardware that you have.
Daniel
On Fri, 2006-12-22 at 13:59 +0100, Martin Michlmayr wrote:
> * Martin Michlmayr <[email protected]> [2006-12-22 13:32]:
> > I've completed one installation with Linus' patch plus the two from
> > Andrew successfully, but I'm currently trying again...
>
> .... and it failed.
Since you are on ARM you might want to try with the page_mkclean_one
cleanup patch too.
Arjan agreed that the loop is not needed; we clear the pte, flush on all
CPUs and then re-establish the pte. Any race will fault and be
serialised on the pte lock.
FWIW - with today's -git and Andrew's second cancel_dirty_page() patch:
http://lkml.org/lkml/2006/12/22/49
I am unable to trigger any corruption. Earlier I could trigger it again
by raising the number of seeds from 3 to 6 (I am currently at 10 seeds).
From: Peter Zijlstra <[email protected]>
fix page_mkclean_one()
- add flush_cache_page() for all those virtual indexed cache
architectures.
- handle s390.
Signed-off-by: Peter Zijlstra <[email protected]>
---
mm/rmap.c | 38 +++++++++++++++++++++++++-------------
1 file changed, 25 insertions(+), 13 deletions(-)
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -432,7 +432,7 @@ static int page_mkclean_one(struct page
{
struct mm_struct *mm = vma->vm_mm;
unsigned long address;
- pte_t *pte, entry;
+ pte_t *pte;
spinlock_t *ptl;
int ret = 0;
@@ -444,17 +444,18 @@ static int page_mkclean_one(struct page
if (!pte)
goto out;
- if (!pte_dirty(*pte) && !pte_write(*pte))
- goto unlock;
+ if (pte_dirty(*pte) || pte_write(*pte)) {
+ pte_t entry;
- entry = ptep_get_and_clear(mm, address, pte);
- entry = pte_mkclean(entry);
- entry = pte_wrprotect(entry);
- ptep_establish(vma, address, pte, entry);
- lazy_mmu_prot_update(entry);
- ret = 1;
+ flush_cache_page(vma, address, pte_pfn(*pte));
+ entry = ptep_clear_flush(vma, address, pte);
+ entry = pte_wrprotect(entry);
+ entry = pte_mkclean(entry);
+ set_pte_at(vma, address, pte, entry);
+ lazy_mmu_prot_update(entry);
+ ret = 1;
+ }
-unlock:
pte_unmap_unlock(pte, ptl);
out:
return ret;
@@ -489,6 +490,8 @@ int page_mkclean(struct page *page)
if (mapping)
ret = page_mkclean_file(mapping, page);
}
+ if (page_test_and_clear_dirty(page))
+ ret = 1;
return ret;
}
A cleanup of try_to_unmap. I have not identified any races that this
would solve, but it is included for consistency's sake.
Also includes a small s390 optimization by moving
page_test_and_clear_dirty() out of the vma iteration.
From: Peter Zijlstra <[email protected]>
We clear the page in the following sequence:
ClearPageDirty - lock ptl, clear pte, unlock ptl
hence we should dirty in the opposite order:
lock ptl, clear pte, unlock ptl - SetPageDirty
try_to_unmap_one violates this by doing the SetPageDirty under the ptl.
Also move page_test_and_clear_dirty() to try_to_unmap().
Signed-off-by: Peter Zijlstra <[email protected]>
---
mm/rmap.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c
+++ linux-2.6/mm/rmap.c
@@ -590,8 +590,6 @@ void page_remove_rmap(struct page *page)
* Leaving it set also helps swapoff to reinstate ptes
* faster for those pages still in swapcache.
*/
- if (page_test_and_clear_dirty(page))
- set_page_dirty(page);
__dec_zone_page_state(page,
PageAnon(page) ? NR_ANON_PAGES : NR_FILE_MAPPED);
}
@@ -610,6 +608,7 @@ static int try_to_unmap_one(struct page
pte_t pteval;
spinlock_t *ptl;
int ret = SWAP_AGAIN;
+ struct page *dirty_page = NULL;
address = vma_address(page, vma);
if (address == -EFAULT)
@@ -636,7 +635,7 @@ static int try_to_unmap_one(struct page
/* Move the dirty bit to the physical page now the pte is gone. */
if (pte_dirty(pteval))
- set_page_dirty(page);
+ dirty_page = page;
/* Update high watermark before we lower rss */
update_hiwater_rss(mm);
@@ -687,6 +686,8 @@ static int try_to_unmap_one(struct page
out_unmap:
pte_unmap_unlock(pte, ptl);
+ if (dirty_page)
+ set_page_dirty(dirty_page);
out:
return ret;
}
@@ -918,6 +919,9 @@ int try_to_unmap(struct page *page, int
else
ret = try_to_unmap_file(page, migration);
+ if (page_test_and_clear_dirty(page))
+ set_page_dirty(page);
+
if (!page_mapped(page))
ret = SWAP_SUCCESS;
return ret;
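The ordering rule from the patch description above (note the dirty state
while holding the lock, but only call set_page_dirty() after dropping it)
can be shown with a small userspace toy. This is a sketch under loud
assumptions: toy_page, toy_pte and the toy_* functions are made-up names,
a pthread mutex stands in for the page table lock, and none of this is
kernel code.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page. */
struct toy_page {
	bool dirty;
	bool dirtied_under_lock;	/* set if the ordering rule is violated */
};

/* Hypothetical stand-in for a pte plus its page table lock. */
struct toy_pte {
	pthread_mutex_t lock;	/* plays the role of the ptl */
	bool locked;		/* tracked by hand so the toy can check ordering */
	bool present;
	bool dirty;
};

/* Mark the page dirty and record whether the "ptl" was held at the time. */
static void toy_set_page_dirty(struct toy_page *page, const struct toy_pte *pte)
{
	if (pte->locked)
		page->dirtied_under_lock = true;
	page->dirty = true;
}

/* Clear the pte under the lock; transfer its dirty bit after unlocking. */
void toy_unmap_one(struct toy_page *page, struct toy_pte *pte)
{
	struct toy_page *dirty_page = NULL;

	pthread_mutex_lock(&pte->lock);
	pte->locked = true;
	if (pte->present) {
		if (pte->dirty)
			dirty_page = page;	/* remember, do not act yet */
		pte->present = false;
		pte->dirty = false;
	}
	pte->locked = false;
	pthread_mutex_unlock(&pte->lock);

	/* Same shape as the patch: dirty the page once the lock is gone. */
	if (dirty_page)
		toy_set_page_dirty(dirty_page, pte);
}
```

Calling toy_unmap_one() on a present, dirty toy_pte leaves the page dirty
with dirtied_under_lock still false, which is the property the patch is
after.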
On 12/21/06, Linus Torvalds <[email protected]> wrote:
> Andrew located at least one bug: we run cancel_dirty_page() too late in
> "truncate_complete_page()", which means that do_invalidatepage() ends up
> not clearing the page cache.
>
> His patch is appended.
Thanks. I'll try this out later today.
> But it sounds like I probably misunderstood something, because I thought
> that Martin had acknowledged that this patch actually worked for him.
> Which sounded very similar to your setup (he has a 32M ARM box too, no?)
Yup, we have the same machines (Linksys NSLU2) and are running the
same test case (installing Debian). However, I'm not sure what kernel
version he had used for his latest test. I presumed 2.6.20-git,
whereas I had used 2.6.19.
> Maybe it's mount option issue? I've got data=ordered on my machine, are
> you perhaps running with something else?
We are also using ordered.
/dev/scsi/host0/bus0/target0/lun0/part1 /target ext3 rw,data=ordered 0 0
Gordon
--
Gordon Farquharson
On 12/22/06, Martin Michlmayr <[email protected]> wrote:
> sh-3.1# ls -l /var/cache/apt/
> total 5252
> drwxr-xr-x 3 root root 12288 Dec 22 04:41 archives
> -rw-r--r-- 1 root root 12582912 Dec 22 04:45 pkgcache.bin
> -rw-r--r-- 1 root root 8554 Dec 22 04:45 srcpkgcache.bin
This listing is a little different to what I got. For me,
srcpkgcache.bin did not exist when apt eventually finished. Did you
notice whether the install took a lot longer than usual ?
> Gordon, does it fail for you where it normally does (installing
> initramfs-tools) or much later? For me, the installer was able to
> install initramfs-tools and the kernel, but apt now hangs at "Select
> and install software".
apt didn't hang for me, it just took 20 to 30 minutes to complete
building the package database. Usually, it takes less than a minute.
The installer stopped because it could not find a kernel to install. I
have seen this failure mode before and, as you have previously pointed
out, it is probably the same problem (corrupted apt cache files), just a
different manifestation.
Gordon
--
Gordon Farquharson
On Fri, Dec 22, 2006 at 01:32:49PM +0100, Martin Michlmayr wrote:
> * Andrei Popa <[email protected]> [2006-12-22 14:24]:
> > With all three patches I have corruption....
>
> I've completed one installation with Linus' patch plus the two from
> Andrew successfully, but I'm currently trying again... but I really
> need a better testcase since an installation takes about an hour.
> Andrei, which torrent do you download as a testcase? It would be good
> if someone could suggest a torrent which is legal and not too large.
Hi everyone,
I have been reading this thread for the last few days, but have been
silent. I have 3 torrents here for testing, if you want.
You can easily reproduce with "rtorrent", if you:
- Have a completely downloaded one, no matter what size
- Corrupt the download with
dd if=/dev/zero of=download.file bs=16k count=1
- Restart 'rtorrent', hash-check fails
- It will download 1 piece that was corrupted.
The important part here is that rtorrent transfers one piece,
using its own code sequence to write to the file.
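One caveat about the dd step: without conv=notrunc, dd truncates
download.file to 16 KiB instead of damaging only its first piece. If the
goal is to zero the first 16 KiB and keep the rest of the file, a few
lines of C can do that in place. This is only a sketch; zero_prefix() is
a made-up helper name, not part of rtorrent or any other tool mentioned
here.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Overwrite the first `len` bytes of `path` with zeroes. The file is
 * opened without O_TRUNC, so data past `len` is left alone. If the file
 * is shorter than `len`, it grows to `len`.
 * Returns 0 on success, -1 on error. Hypothetical helper.
 */
int zero_prefix(const char *path, size_t len)
{
	char buf[4096];
	size_t done = 0;
	int fd = open(path, O_WRONLY);

	if (fd < 0)
		return -1;
	memset(buf, 0, sizeof(buf));
	while (done < len) {
		size_t chunk = len - done;
		ssize_t n;

		if (chunk > sizeof(buf))
			chunk = sizeof(buf);
		n = pwrite(fd, buf, chunk, (off_t)done);
		if (n <= 0) {
			close(fd);
			return -1;
		}
		done += (size_t)n;
	}
	/* Push the zeroes to disk before reporting success. */
	if (fsync(fd) < 0) {
		close(fd);
		return -1;
	}
	return close(fd);
}
```

Calling zero_prefix("download.file", 16 * 1024) and then restarting
rtorrent should make the hash check fail for the damaged piece only.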
Let me offer to test until Saturday afternoon CET,
I have a cloned git repository, downloaded torrent files and "apt".
My systems that are affected are:
Linux oscar 2.6.18 SMP (2x450Mhz Intel P3)
(rolled back to 2.6.18 but can boot latest git)
Linux tony 2.6.20-git UP
(can be tested using all kinds of "apt" operations)
Both machines are using:
IDE -> MD-RAID1 -> LVM -> EXT3 (data=ordered)
SCSI -> MD-RAID5 -> .....
I don't want to disturb your technical discussion,
just offering some help in testing.
Regards,
Patrick
On 12/22/06, Martin Michlmayr <[email protected]> wrote:
> ... and now that we've completed this step, the apt cache has suddenly
> been reduced (see Gordon's mail for an explanation) and it segfaults:
>
> sh-3.1# ls -l /var/cache/apt/
> total 12524
> drwxr-xr-x 3 root root 12288 Dec 22 04:41 archives
> -rw-r--r-- 1 root root 6426885 Dec 22 05:03 pkgcache.bin
> -rw-r--r-- 1 root root 6426835 Dec 22 05:03 srcpkgcache.bin
> sh-3.1# apt-get -f install
> Reading package lists... Done
> Segmentation faulty tree... 50%
I think that we are seeing different manifestations of apt's response
to corrupted cache files. There does not appear to be any pattern to
which manifestation occurs. Maybe it depends on where in the cache
file the corruption is located, i.e. when the corruption occurs. Based
on the kernel gurus' current knowledge of the problem, would you expect
the corruption to occur at the same point in a file, or is it possible
that the corruption could occur at different points on successive
Debian installer attempts on a UP, non PREEMPT system ?
Gordon
--
Gordon Farquharson
On Fri, Dec 22, 2006 at 08:30:06AM -0500, Daniel Drake wrote:
> Marc Haber wrote:
> >After updating to 2.6.19, Debian's apt control file
> >/var/cache/apt/pkgcache.bin corrupts pretty frequently - like in under
> >six hours. In that situation, "aptitude update" segfaults. When I
> >delete the file and have apt recreate it, things are fine again for a
> >few hours before the file is broken again and the segfault start over.
> >In all cases, umounting the file system and doing an fsck does not
> >show issues with the file system.
>
> Are you using wireless networking of any kind?
Since the system in question is a colocated server box, I am pretty
sure that there is no wireless networking.
> Might be useful if you could post 'dmesg' output so that people can
> see the other hardware that you have.
I have attached what I could scrape from syslog.
Greetings
Marc
On Sat, Dec 16, 2006 at 06:43:10PM +0000, Martin Michlmayr wrote:
> * Marc Haber <[email protected]> [2006-12-09 10:26]:
> > Unfortunately, I am lacking the knowledge needed to do this in an
> > informed way. I am neither familiar enough with git nor do I possess
> > the necessary C powers.
>
> I wonder if what you're seeing is related to
> http://lkml.org/lkml/2006/12/16/73
>
> You said that you don't see any corruption with 2.6.18. Can you try
> to apply the patch from
> http://www2.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=d08b3851da41d0ee60851f2c75b118e1f7a5fc89
> to 2.6.18 to see if the corruption shows up?
Since I am no longer seeing the issue after easing the memory load, I
doubt that this would make sense.
Greetings
Marc
* Gordon Farquharson <[email protected]> [2006-12-22 08:30]:
> Based on the kernel gurus' current knowledge of the problem, would
> you expect the corruption to occur at the same point in a file, or
> is it possible that the corruption could occur at different points
> on successive Debian installer attempts on a UP, non PREEMPT system?
Seems like it can occur anywhere. In fact, some people see apt
problems because of filesystem corruption on the NSLU2 after they have
already installed Debian. I've only seen this once myself and failed
many times to find a reproducible situation.
--
Martin Michlmayr
http://www.cyrius.com/
On Mon, 18 Dec 2006, Gene Heskett wrote:
>
> What about the mm/rmap.c one liner, in or out?
The one that just removes the "pte_mkclean()"? That's definitely out, it
was just a test-patch to verify that the pte dirty bits seemed to matter
at all (and they do).
Linus
On Fri, 22 Dec 2006, Peter Zijlstra wrote:
>
> fix page_mkclean_one()
>
> - add flush_cache_page() for all those virtual indexed cache
> architectures.
I think the flush_cache_page() should be after we've actually flushed it
from the TLB and re-inserted it (this is one reason why I did the
"ptep_exchange()" version of this). Otherwise somebody can still write to
the page _after_ the cache flush..
> - handle s390.
Yeah, that looks like the proper way to handle that.
That said, it looks like we still see corruption. You may not, but Martin
and Andrei still report problems, even with all the patches (including the
last one from Andrew that avoids "dirty" going negative under some
circumstances, and explains the "slow and/or never completed" case that
Gordon and Martin saw).
The good news is that I think the code now is cleaner and more
understandable. The bad news is that nothing we've ever tried seems to
have fixed the _problem_.
And I don't think it's page_mkclean(). Especially not since the ARM people
are seeing this under UP without PREEMPT. In that kind of scenario, the
only possible races tend to be from things that actually block:
"set_page_dirty()" (which blocks on IO in balancing), memory allocations,
and obviously doing actual IO.
And it's not a virtual cache problem, since others see it on x86.
Of course, since it's quite possibly two different issues, maybe the
virtual cache flush is required in order to force write-back to memory
(which in turn is required for the DMA for the actual write!). So the ARM
issue certainly could be due to the flush_cache_page() thing...
Linus
* Peter Zijlstra <[email protected]> [2006-12-22 14:25]:
> > .... and it failed.
> Since you are on ARM you might want to try with the page_mkclean_one
> cleanup patch too.
I've already tried it and it didn't work. I just tried it again
together with Linus' patch and the two from Andrew and it still fails.
(For reference, the patch is attached.)
--
Martin Michlmayr
http://www.cyrius.com/
On Fri, 2006-12-22 at 13:32 +0100, Martin Michlmayr wrote:
> * Andrei Popa <[email protected]> [2006-12-22 14:24]:
> > With all three patches I have corruption....
>
> I've completed one installation with Linus' patch plus the two from
> Andrew successfully, but I'm currently trying again... but I really
> need a better testcase since an installation takes about an hour.
> Andrei, which torrent do you download as a testcase? It would be good
> if someone could suggest a torrent which is legal and not too large.
It's a 1.4GB torrent split into 84 rar files and there are many
seeders. I download at ~5MB/sec. The torrent is private.
On 12/22/06, Martin Michlmayr <[email protected]> wrote:
> * Peter Zijlstra <[email protected]> [2006-12-22 14:25]:
> > > .... and it failed.
> > Since you are on ARM you might want to try with the page_mkclean_one
> > cleanup patch too.
>
> I've already tried it and it didn't work. I just tried it again
> together with Linus' patch and the two from Andrew and it still fails.
> (For reference, the patch is attached.)
I can confirm this behaviour with 2.6.19 and the patches mentioned
above (cumulative patch for 2.6.19 appended to the end of this email).
Is there any way to provide any debugging information that may help
solve the problem ? Would it help to know the nature of the corruption
e.g. an analysis of the corruption in the file ? I have previously
asked apt developers if they wanted to look at the corrupted cache
files, but there were no takers then.
BTW, I decided to try Linus's test program [1] on ARM (I don't think
that anybody had tried it on ARM before).
Since we see file corruption with 2.6.18 + [PATCH] mm: tracking shared
dirty pages [2], I ran Linus's program on machines with the following
setups:
2.6.18 + the following patches
mm: tracking shared dirty pages [2]
mm: balance dirty pages [3]
mm: optimize the new mprotect() code a bit [4]
mm: small cleanup of install_page() [5]
mm: fixup do_wp_page() [6]
mm: msync() cleanup [7]
$ ./mm-test | od -x
0000000 aaaa aaaa aaaa aaaa aaaa 0000 0000 0000
0000020 0000 0000 5555 5555 5555 5555 5555 5555
0000040 5555 5555 5555 5555
0000050
2.6.18 (no mm patches)
$ ./mm-test | od -x
0000000 aaaa aaaa aaaa aaaa aaaa aaaa aaaa aaaa
0000020 aaaa aaaa 5555 5555 5555 5555 5555 5555
0000040 5555 5555 5555 5555
0000050
I don't know if this helps at all.
Gordon
[1] http://lkml.org/lkml/2006/12/19/200
[2] http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=d08b3851da41d0ee60851f2c75b118e1f7a5fc89
[3] http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=edc79b2a46ed854595e40edcf3f8b37f9f14aa3f
[4] http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=c1e6098b23bb46e2b488fe9a26f831f867157483
[5] http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=e88dd6c11c5aef74d8b74a062767add53315533b
[6] http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=ee6a6457886a80415db209e87033b63f2b06558c
[7] http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=204ec841fbea3e5138168edbc3a76d46747cc987
diff -Naupr linux-2.6.19.orig/fs/buffer.c linux-2.6.19/fs/buffer.c
--- linux-2.6.19.orig/fs/buffer.c 2006-11-29 14:57:37.000000000 -0700
+++ linux-2.6.19/fs/buffer.c 2006-12-21 01:16:31.000000000 -0700
@@ -2832,7 +2832,7 @@ int try_to_free_buffers(struct page *pag
int ret = 0;
BUG_ON(!PageLocked(page));
- if (PageWriteback(page))
+ if (PageDirty(page) || PageWriteback(page))
return 0;
if (mapping == NULL) { /* can this still happen? */
@@ -2843,17 +2843,6 @@ int try_to_free_buffers(struct page *pag
spin_lock(&mapping->private_lock);
ret = drop_buffers(page, &buffers_to_free);
spin_unlock(&mapping->private_lock);
- if (ret) {
- /*
- * If the filesystem writes its buffers by hand (eg ext3)
- * then we can have clean buffers against a dirty page. We
- * clean the page here; otherwise later reattachment of buffers
- * could encounter a non-uptodate page, which is unresolvable.
- * This only applies in the rare case where try_to_free_buffers
- * succeeds but the page is not freed.
- */
- clear_page_dirty(page);
- }
out:
if (buffers_to_free) {
struct buffer_head *bh = buffers_to_free;
diff -Naupr linux-2.6.19.orig/fs/hugetlbfs/inode.c linux-2.6.19/fs/hugetlbfs/inode.c
--- linux-2.6.19.orig/fs/hugetlbfs/inode.c 2006-11-29 14:57:37.000000000 -0700
+++ linux-2.6.19/fs/hugetlbfs/inode.c 2006-12-21 01:15:21.000000000 -0700
@@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct
static void truncate_huge_page(struct page *page)
{
- clear_page_dirty(page);
+ cancel_dirty_page(page, /* No IO accounting for huge pages? */0);
ClearPageUptodate(page);
remove_from_page_cache(page);
put_page(page);
diff -Naupr linux-2.6.19.orig/include/linux/page-flags.h linux-2.6.19/include/linux/page-flags.h
--- linux-2.6.19.orig/include/linux/page-flags.h 2006-11-29 14:57:37.000000000 -0700
+++ linux-2.6.19/include/linux/page-flags.h 2006-12-21 01:15:21.000000000 -0700
@@ -253,15 +253,11 @@ static inline void SetPageUptodate(struc
struct page; /* forward declaration */
-int test_clear_page_dirty(struct page *page);
+extern void cancel_dirty_page(struct page *page, unsigned int account_size);
+
int test_clear_page_writeback(struct page *page);
int test_set_page_writeback(struct page *page);
-static inline void clear_page_dirty(struct page *page)
-{
- test_clear_page_dirty(page);
-}
-
static inline void set_page_writeback(struct page *page)
{
test_set_page_writeback(page);
diff -Naupr linux-2.6.19.orig/mm/memory.c linux-2.6.19/mm/memory.c
--- linux-2.6.19.orig/mm/memory.c 2006-11-29 14:57:37.000000000 -0700
+++ linux-2.6.19/mm/memory.c 2006-12-21 01:15:21.000000000 -0700
@@ -1832,6 +1832,33 @@ void unmap_mapping_range(struct address_
}
EXPORT_SYMBOL(unmap_mapping_range);
+static void check_last_page(struct address_space *mapping, loff_t size)
+{
+ pgoff_t index;
+ unsigned int offset;
+ struct page *page;
+
+ if (!mapping)
+ return;
+ offset = size & ~PAGE_MASK;
+ if (!offset)
+ return;
+ index = size >> PAGE_SHIFT;
+ page = find_lock_page(mapping, index);
+ if (page) {
+ unsigned int check = 0;
+ unsigned char *kaddr = kmap_atomic(page, KM_USER0);
+ do {
+ check += kaddr[offset++];
+ } while (offset < PAGE_SIZE);
+ kunmap_atomic(kaddr,KM_USER0);
+ unlock_page(page);
+ page_cache_release(page);
+ if (check)
+ printk("%s: BADNESS: truncate check %u\n", current->comm, check);
+ }
+}
+
/**
* vmtruncate - unmap mappings "freed" by truncate() syscall
* @inode: inode of the file used
@@ -1865,6 +1892,7 @@ do_expand:
goto out_sig;
if (offset > inode->i_sb->s_maxbytes)
goto out_big;
+ check_last_page(mapping, inode->i_size);
i_size_write(inode, offset);
out_truncate:
diff -Naupr linux-2.6.19.orig/mm/page-writeback.c linux-2.6.19/mm/page-writeback.c
--- linux-2.6.19.orig/mm/page-writeback.c 2006-11-29 14:57:37.000000000 -0700
+++ linux-2.6.19/mm/page-writeback.c 2006-12-21 01:26:53.000000000 -0700
@@ -843,39 +843,6 @@ int set_page_dirty_lock(struct page *pag
EXPORT_SYMBOL(set_page_dirty_lock);
/*
- * Clear a page's dirty flag, while caring for dirty memory accounting.
- * Returns true if the page was previously dirty.
- */
-int test_clear_page_dirty(struct page *page)
-{
- struct address_space *mapping = page_mapping(page);
- unsigned long flags;
-
- if (mapping) {
- write_lock_irqsave(&mapping->tree_lock, flags);
- if (TestClearPageDirty(page)) {
- radix_tree_tag_clear(&mapping->page_tree,
- page_index(page),
- PAGECACHE_TAG_DIRTY);
- write_unlock_irqrestore(&mapping->tree_lock, flags);
- /*
- * We can continue to use `mapping' here because the
- * page is locked, which pins the address_space
- */
- if (mapping_cap_account_dirty(mapping)) {
- page_mkclean(page);
- dec_zone_page_state(page, NR_FILE_DIRTY);
- }
- return 1;
- }
- write_unlock_irqrestore(&mapping->tree_lock, flags);
- return 0;
- }
- return TestClearPageDirty(page);
-}
-EXPORT_SYMBOL(test_clear_page_dirty);
-
-/*
* Clear a page's dirty flag, while caring for dirty memory accounting.
* Returns true if the page was previously dirty.
*
diff -Naupr linux-2.6.19.orig/mm/rmap.c linux-2.6.19/mm/rmap.c
--- linux-2.6.19.orig/mm/rmap.c 2006-11-29 14:57:37.000000000 -0700
+++ linux-2.6.19/mm/rmap.c 2006-12-22 23:25:09.000000000 -0700
@@ -432,7 +432,7 @@ static int page_mkclean_one(struct page
{
struct mm_struct *mm = vma->vm_mm;
unsigned long address;
- pte_t *pte, entry;
+ pte_t *pte;
spinlock_t *ptl;
int ret = 0;
@@ -444,17 +444,18 @@ static int page_mkclean_one(struct page
if (!pte)
goto out;
- if (!pte_dirty(*pte) && !pte_write(*pte))
- goto unlock;
+ if (pte_dirty(*pte) || pte_write(*pte)) {
+ pte_t entry;
- entry = ptep_get_and_clear(mm, address, pte);
- entry = pte_mkclean(entry);
- entry = pte_wrprotect(entry);
- ptep_establish(vma, address, pte, entry);
- lazy_mmu_prot_update(entry);
- ret = 1;
+ flush_cache_page(vma, address, pte_pfn(*pte));
+ entry = ptep_clear_flush(vma, address, pte);
+ entry = pte_wrprotect(entry);
+ entry = pte_mkclean(entry);
+ set_pte_at(vma, address, pte, entry);
+ lazy_mmu_prot_update(entry);
+ ret = 1;
+ }
-unlock:
pte_unmap_unlock(pte, ptl);
out:
return ret;
@@ -489,6 +490,8 @@ int page_mkclean(struct page *page)
if (mapping)
ret = page_mkclean_file(mapping, page);
}
+ if (page_test_and_clear_dirty(page))
+ ret = 1;
return ret;
}
@@ -587,8 +590,6 @@ void page_remove_rmap(struct page *page)
* Leaving it set also helps swapoff to reinstate ptes
* faster for those pages still in swapcache.
*/
- if (page_test_and_clear_dirty(page))
- set_page_dirty(page);
__dec_zone_page_state(page,
PageAnon(page) ? NR_ANON_PAGES : NR_FILE_MAPPED);
}
@@ -607,6 +608,7 @@ static int try_to_unmap_one(struct page
pte_t pteval;
spinlock_t *ptl;
int ret = SWAP_AGAIN;
+ struct page *dirty_page = NULL;
address = vma_address(page, vma);
if (address == -EFAULT)
@@ -633,7 +635,7 @@ static int try_to_unmap_one(struct page
/* Move the dirty bit to the physical page now the pte is gone. */
if (pte_dirty(pteval))
- set_page_dirty(page);
+ dirty_page = page;
/* Update high watermark before we lower rss */
update_hiwater_rss(mm);
@@ -684,6 +686,8 @@ static int try_to_unmap_one(struct page
out_unmap:
pte_unmap_unlock(pte, ptl);
+ if (dirty_page)
+ set_page_dirty(dirty_page);
out:
return ret;
}
@@ -915,6 +919,9 @@ int try_to_unmap(struct page *page, int
else
ret = try_to_unmap_file(page, migration);
+ if (page_test_and_clear_dirty(page))
+ set_page_dirty(page);
+
if (!page_mapped(page))
ret = SWAP_SUCCESS;
return ret;
diff -Naupr linux-2.6.19.orig/mm/truncate.c linux-2.6.19/mm/truncate.c
--- linux-2.6.19.orig/mm/truncate.c 2006-11-29 14:57:37.000000000 -0700
+++ linux-2.6.19/mm/truncate.c 2006-12-23 13:21:42.000000000 -0700
@@ -50,6 +50,21 @@ static inline void truncate_partial_page
do_invalidatepage(page, partial);
}
+void cancel_dirty_page(struct page *page, unsigned int account_size)
+{
+ /* If we're cancelling the page, it had better not be mapped any more */
+ if (page_mapped(page)) {
+ static unsigned int warncount;
+
+ WARN_ON(++warncount < 5);
+ }
+
+ if (TestClearPageDirty(page) && account_size &&
+ mapping_cap_account_dirty(page->mapping))
+ dec_zone_page_state(page, NR_FILE_DIRTY);
+}
+
+
/*
* If truncate cannot remove the fs-private metadata from the page, the page
* becomes anonymous. It will be left on the LRU and may even be mapped into
@@ -66,10 +81,11 @@ truncate_complete_page(struct address_sp
if (page->mapping != mapping)
return;
+ cancel_dirty_page(page, PAGE_CACHE_SIZE);
+
if (PagePrivate(page))
do_invalidatepage(page, 0);
- clear_page_dirty(page);
ClearPageUptodate(page);
ClearPageMappedToDisk(page);
remove_from_page_cache(page);
@@ -348,7 +364,6 @@ int invalidate_inode_pages2_range(struct
for (i = 0; !ret && i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];
pgoff_t page_index;
- int was_dirty;
lock_page(page);
if (page->mapping != mapping) {
@@ -384,12 +399,8 @@ int invalidate_inode_pages2_range(struct
PAGE_CACHE_SIZE, 0);
}
}
- was_dirty = test_clear_page_dirty(page);
- if (!invalidate_complete_page2(mapping, page)) {
- if (was_dirty)
- set_page_dirty(page);
+ if (!invalidate_complete_page2(mapping, page))
ret = -EIO;
- }
unlock_page(page);
}
pagevec_release(&pvec);
--
Gordon Farquharson
On Sun, 24 Dec 2006, Gordon Farquharson wrote:
>
> Is there any way to provide any debugging information that may help
> solve the problem ?
I think we have people working on this. I know I'm trying to even come up
with an idea of what is going on. I don't think we know yet.
> Would it help to know the nature of the corruption e.g. an analysis
> of the corruption in the file ?
I actually think we know that, because Andrei already gave details. The
corruption seems to be basically a few pages that get zeroes at the end
rather than the expected contents. That's consistent with the page being
written out once, but then _not_ getting written out again despite being
dirtied some more.
But if you see any other pattern, please holler, because that would be
interesting.
> BTW, I decided to try Linus's test program [1] on ARM (I don't think
> that anybody had tried it on ARM before).
You get the expected results, and in fact, I'd be very surprised if you
didn't. It's something subtler than that going on.
I now _suspect_ that we're talking about something like
- we started a writeout. The IO is still pending, and the page was
marked clean and is now in the "writeback" phase.
- a write happens to the page, and the page gets marked dirty again.
Marking the page dirty also marks all the _buffers_ in the page dirty,
but they were actually already dirty, because the IO hasn't completed
yet.
- the IO from the _previous_ write completes, and marks the buffers clean
again.
And no, that's not actually what is going on. The thing is, we actually
clear the buffer dirty bits when we start the IO, not when we end it, but
I think it is going to be this _kind_ of situation, where we missed
something, and marked it clean too late, and thus cleared a dirty bit.
I don't think it's a page table issue any more, it just doesn't look
likely with the ARM UP corruption. It's also not apparently even on a
cacheline boundary, so it probably is really a dirty bit that got cleared
wrongly due to some race with IO.
But right now we're all clueless. I personally suspect it's not even a new
bug: it's probably an old bug that simply didn't matter before.
Linus
On Sun, 24 Dec 2006 00:43:54 -0800 (PST)
Linus Torvalds <[email protected]> wrote:
> I now _suspect_ that we're talking about something like
>
> - we started a writeout. The IO is still pending, and the page was
> marked clean and is now in the "writeback" phase.
> - a write happens to the page, and the page gets marked dirty again.
> Marking the page dirty also marks all the _buffers_ in the page dirty,
> but they were actually already dirty, because the IO hasn't completed
> yet.
> - the IO from the _previous_ write completes, and marks the buffers clean
> again.
Some things for the testers to try, please:
- mount the fs with ext2 with the no-buffer-head option. That means either:
grub.conf: rootfstype=ext2 rootflags=nobh
/etc/fstab: ext2 nobh
- mount the fs with ext3 data=writeback, nobh
grub.conf: rootfstype=ext3 rootflags=nobh,data=writeback (I hope this works)
/etc/fstab: ext3 data=writeback,nobh
if that still fails we can rule out buffer_head funnies.
On Sun, 24 Dec 2006, Andrew Morton wrote:
>
> > I now _suspect_ that we're talking about something like
> >
> > - we started a writeout. The IO is still pending, and the page was
> > marked clean and is now in the "writeback" phase.
> > - a write happens to the page, and the page gets marked dirty again.
> > Marking the page dirty also marks all the _buffers_ in the page dirty,
> > but they were actually already dirty, because the IO hasn't completed
> > yet.
> > - the IO from the _previous_ write completes, and marks the buffers clean
> > again.
>
> Some things for the testers to try, please:
>
> - mount the fs with ext2 with the no-buffer-head option. That means either:
[ snip snip ]
This is definitely worth testing, but the exact scenario I outlined is
probably not the thing that happens. It was really meant to be more of an
example of the _kind_ of situation I think we might have.
That would explain why we didn't see this before: we simply didn't mark
pages clean all that aggressively, and an app like rtorrent would normally
have caused its flushes to happen _synchronously_ by using msync() (even
if the IO itself was done asynchronously, all the dirty bit stuff would be
synchronous wrt any rtorrent behaviour).
And the things that /did/ use to clean pages asynchronously (VM scanning)
would always actually look at the "young" bit (aka "accessed") and not
even touch the dirty bit if an application had accessed the page recently,
so that basically avoided any likely races, because we'd touch the dirty
bit ONLY if the page was "cold".
So this is why I'm saying that it might be an old bug, and it would be
just the new pattern of handling dirty bits that triggers it.
But avoiding buffer heads and testing that part is worth doing. Just to
remove one thing from the equation.
Linus
On Sun, 2006-12-24 at 00:57 -0800, Andrew Morton wrote:
> On Sun, 24 Dec 2006 00:43:54 -0800 (PST)
> Linus Torvalds <[email protected]> wrote:
>
> > I now _suspect_ that we're talking about something like
> >
> > - we started a writeout. The IO is still pending, and the page was
> > marked clean and is now in the "writeback" phase.
> > - a write happens to the page, and the page gets marked dirty again.
> > Marking the page dirty also marks all the _buffers_ in the page dirty,
> > but they were actually already dirty, because the IO hasn't completed
> > yet.
> > - the IO from the _previous_ write completes, and marks the buffers clean
> > again.
>
> Some things for the testers to try, please:
>
> - mount the fs with ext2 with the no-buffer-head option. That means either:
>
> grub.conf: rootfstype=ext2 rootflags=nobh
> /etc/fstab: ext2 nobh
ierdnac ~ # mount
/dev/sda7 on / type ext2 (rw,noatime,nobh)
I have corruption.
>
> - mount the fs with ext3 data=writeback, nobh
>
> grub.conf: rootfstype=ext3 rootflags=nobh,data=writeback (I hope this works)
> /etc/fstab: ext2 data=writeback,nobh
ierdnac ~ # mount
/dev/sda7 on / type ext3 (rw,noatime,nobh)
ierdnac ~ # dmesg|grep EXT3
EXT3-fs: mounted filesystem with writeback data mode.
EXT3 FS on sda7, internal journal
I don't have corruption. I tested twice.
>
> if that still fails we can rule out buffer_head funnies.
>
On Sun, 2006-12-24 at 14:14 +0200, Andrei Popa wrote:
> On Sun, 2006-12-24 at 00:57 -0800, Andrew Morton wrote:
> > On Sun, 24 Dec 2006 00:43:54 -0800 (PST)
> > Linus Torvalds <[email protected]> wrote:
> >
> > > I now _suspect_ that we're talking about something like
> > >
> > > - we started a writeout. The IO is still pending, and the page was
> > > marked clean and is now in the "writeback" phase.
> > > - a write happens to the page, and the page gets marked dirty again.
> > > Marking the page dirty also marks all the _buffers_ in the page dirty,
> > > but they were actually already dirty, because the IO hasn't completed
> > > yet.
> > > - the IO from the _previous_ write completes, and marks the buffers clean
> > > again.
> >
> > Some things for the testers to try, please:
> >
> > - mount the fs with ext2 with the no-buffer-head option. That means either:
> >
> > grub.conf: rootfstype=ext2 rootflags=nobh
> > /etc/fstab: ext2 nobh
>
> ierdnac ~ # mount
> /dev/sda7 on / type ext2 (rw,noatime,nobh)
>
> I have corruption.
>
> >
> > - mount the fs with ext3 data=writeback, nobh
> >
> > grub.conf: rootfstype=ext3 rootflags=nobh,data=writeback (I hope this works)
> > /etc/fstab: ext2 data=writeback,nobh
>
> ierdnac ~ # mount
> /dev/sda7 on / type ext3 (rw,noatime,nobh)
>
> ierdnac ~ # dmesg|grep EXT3
> EXT3-fs: mounted filesystem with writeback data mode.
> EXT3 FS on sda7, internal journal
>
> I don't have corruption. I tested twice.
>
I also tested with ext3 ordered, nobh and I have file corruption...
> >
> > if that still fails we can rule out buffer_head funnies.
> >
On Sun, 24 Dec 2006 14:26:01 +0200
Andrei Popa <[email protected]> wrote:
> I also tested with ext3 ordered, nobh and I have file corruption...
ordered+nobh isn't a possible combination. The filesystem probably ignored
nobh. nobh mode only makes sense with data=writeback.
On Sun, 24 Dec 2006 14:14:38 +0200
Andrei Popa <[email protected]> wrote:
> > - mount the fs with ext2 with the no-buffer-head option. That means either:
> >
> > grub.conf: rootfstype=ext2 rootflags=nobh
> > /etc/fstab: ext2 nobh
>
> ierdnac ~ # mount
> /dev/sda7 on / type ext2 (rw,noatime,nobh)
>
> I have corruption.
>
> >
> > - mount the fs with ext3 data=writeback, nobh
> >
> > grub.conf: rootfstype=ext3 rootflags=nobh,data=writeback (I hope this works)
> > /etc/fstab: ext2 data=writeback,nobh
>
> ierdnac ~ # mount
> /dev/sda7 on / type ext3 (rw,noatime,nobh)
>
> ierdnac ~ # dmesg|grep EXT3
> EXT3-fs: mounted filesystem with writeback data mode.
> EXT3 FS on sda7, internal journal
>
> I don't have corruption. I tested twice.
This is a surprising result. Can you please retest ext3 data=writeback,nobh?
* Andrew Morton <[email protected]> [2006-12-24 00:57]:
> /etc/fstab: ext2 nobh
> /etc/fstab: ext3 data=writeback,nobh
It seems that busybox mount ignores the nobh option but both ext2 and
ext3 data=writeback work for me. This is with plain 2.6.19 which
normally always fails.
--
Martin Michlmayr
http://www.cyrius.com/
On Sun, 2006-12-24 at 04:31 -0800, Andrew Morton wrote:
> On Sun, 24 Dec 2006 14:14:38 +0200
> Andrei Popa <[email protected]> wrote:
>
> > > - mount the fs with ext2 with the no-buffer-head option. That means either:
> > >
> > > grub.conf: rootfstype=ext2 rootflags=nobh
> > > /etc/fstab: ext2 nobh
> >
> > ierdnac ~ # mount
> > /dev/sda7 on / type ext2 (rw,noatime,nobh)
> >
> > I have corruption.
> >
> > >
> > > - mount the fs with ext3 data=writeback, nobh
> > >
> > > grub.conf: rootfstype=ext3 rootflags=nobh,data=writeback (I hope this works)
> > > /etc/fstab: ext2 data=writeback,nobh
> >
> > ierdnac ~ # mount
> > /dev/sda7 on / type ext3 (rw,noatime,nobh)
> >
> > ierdnac ~ # dmesg|grep EXT3
> > EXT3-fs: mounted filesystem with writeback data mode.
> > EXT3 FS on sda7, internal journal
> >
> > I don't have corruption. I tested twice.
>
> This is a surprising result. Can you please retest ext3 data=writeback,nobh?
Yes, no corruption. Also tested only with data=writeback and had no
corruption.
On Sun, 24 Dec 2006, Andrei Popa wrote:
> On Sun, 2006-12-24 at 04:31 -0800, Andrew Morton wrote:
> > Andrei Popa <[email protected]> wrote:
> > > /dev/sda7 on / type ext3 (rw,noatime,nobh)
> > >
> > > I don't have corruption. I tested twice.
> >
> > This is a surprising result. Can you please retest ext3 data=writeback,nobh?
>
> Yes, no corruption. Also tested only with data=writeback and had no
> corruption.
Ok, so it would seem to be writeback related _somehow_. However, most of
the differences (I _thought_) in ext3 actually show up only if you have
*both* "nobh" and "data=writeback", and as far as I can tell, just a
simple "data=writeback" should still use the bog-standard
"block_write_full_page()".
Andrew?
Although as far as I can see, then ext2 should work as-is too (since it
too also just uses "block_write_full_page()" without anything fancy).
Strange.
How about this particularly stupid diff? (please test with something that
_would_ cause corruption normally).
It is _entirely_ untested, but what it tries to do is to simply serialize
any writeback in progress with any process that tries to re-map a shared
page into its address space and dirty it. I haven't tested it, and maybe
it misses some case, but it looks like a good way to try to avoid races
with marking pages dirty and the writeback phase ..
Linus
---
diff --git a/mm/memory.c b/mm/memory.c
index 563792f..64ed10b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1544,6 +1544,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (!pte_same(*page_table, orig_pte))
goto unlock;
}
+ wait_on_page_writeback(old_page);
dirty_page = old_page;
get_page(dirty_page);
reuse = 1;
@@ -2215,6 +2216,7 @@ retry:
page_cache_release(new_page);
return VM_FAULT_SIGBUS;
}
+ wait_on_page_writeback(new_page);
}
}
On Sun, 24 Dec 2006 09:16:06 -0800 (PST)
Linus Torvalds <[email protected]> wrote:
>
>
> On Sun, 24 Dec 2006, Andrei Popa wrote:
>
> > On Sun, 2006-12-24 at 04:31 -0800, Andrew Morton wrote:
> > > Andrei Popa <[email protected]> wrote:
> > > > /dev/sda7 on / type ext3 (rw,noatime,nobh)
> > > >
> > > > I don't have corruption. I tested twice.
> > >
> > > This is a surprising result. Can you please retest ext3 data=writeback,nobh?
> >
> > Yes, no corruption. Also tested only with data=writeback and had no
> > corruption.
>
> Ok, so it would seem to be writeback related _somehow_. However, most of
> the differences (I _thought_) in ext3 actually show up only if you have
> *both* "nobh" and "data=writeback", and as far as I can tell, just a
> simple "data=writeback" should still use the bog-standard
> "block_write_full_page()".
>
> Andrew?
>
> Although as far as I can see, then ext2 should work as-is too (since it
> too also just uses "block_write_full_page()" without anything fancy).
ext2 uses the multipage-bio assembly code for writeback whereas ext3
doesn't. But ext3 doesn't use that code in data=ordered mode, of course.
Still, this:
--- a/fs/ext2/inode.c~a
+++ a/fs/ext2/inode.c
@@ -693,7 +693,7 @@ const struct address_space_operations ex
.commit_write = generic_commit_write,
.bmap = ext2_bmap,
.direct_IO = ext2_direct_IO,
- .writepages = ext2_writepages,
+// .writepages = ext2_writepages,
.migratepage = buffer_migrate_page,
};
@@ -711,7 +711,7 @@ const struct address_space_operations ex
.commit_write = nobh_commit_write,
.bmap = ext2_bmap,
.direct_IO = ext2_direct_IO,
- .writepages = ext2_writepages,
+// .writepages = ext2_writepages,
.migratepage = buffer_migrate_page,
};
_
will switch it off for ext2.
> Strange.
>
> How about this particularly stupid diff? (please test with something that
> _would_ cause corruption normally).
>
> It is _entirely_ untested, but what it tries to do is to simply serialize
> any writeback in progress with any process that tries to re-map a shared
> page into its address space and dirty it. I haven't tested it, and maybe
> it misses some case, but it looks like a good way to try to avoid races
> with marking pages dirty and the writeback phase ..
>
> Linus
> ---
> diff --git a/mm/memory.c b/mm/memory.c
> index 563792f..64ed10b 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1544,6 +1544,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
> if (!pte_same(*page_table, orig_pte))
> goto unlock;
> }
> + wait_on_page_writeback(old_page);
> dirty_page = old_page;
> get_page(dirty_page);
> reuse = 1;
> @@ -2215,6 +2216,7 @@ retry:
> page_cache_release(new_page);
> return VM_FAULT_SIGBUS;
> }
> + wait_on_page_writeback(new_page);
> }
> }
yup. Also, we could perhaps lock the target page during pagefaults..
On Sun, 24 Dec 2006, Linus Torvalds wrote:
>
> How about this particularly stupid diff? (please test with something that
> _would_ cause corruption normally).
Actually, here's an even more stupid diff, which actually to some degree
seems to capture the real problem better.
Peter, tell me I'm crazy, but with the new rules, the following condition
is a bug:
- shared mapping
- writable
- not already marked dirty in the PTE
because that combination means that the hardware can mark the PTE dirty
without us even realizing (and thus not marking the "struct page *"
dirty).
(The above is actually a valid situation for IO mappings, but not for
"real" mappings. And IO mappings should never take page faults, I think).
So, with that in mind, I wrote this stupid patch (for 32-bit x86, since I
used my Mac Mini for testing rather than my main machine - but the x86-64
version should be pretty much identical)..
And you know what, Peter? It triggers for me. I get
WARNING at mm/memory.c:2274 do_no_page()
[<c0103d4a>] show_trace_log_lvl+0x1a/0x2f
[<c010436c>] show_trace+0x12/0x14
[<c01043f0>] dump_stack+0x16/0x18
[<c0159790>] __handle_mm_fault+0x38d/0x919
[<c011c8c4>] do_page_fault+0x1ff/0x507
[<c03fabcc>] error_code+0x7c/0x84
which seems to say that do_no_page() can be used to insert shared and
non-dirty, but still writable, pages.
But maybe my patch is just bogus, and I didn't think it through.
Peter, I realize it's Christmas Eve, but let's face it, Santa appreciates
good boys and girls, and we all want tons of loot. So please be good, and
waste some time looking at this and tell me why I'm either wrong, or
there's a real smoking gun here.. ;)
Linus
---
diff --git a/include/asm-i386/pgtable.h b/include/asm-i386/pgtable.h
index e6a4723..1389bb7 100644
--- a/include/asm-i386/pgtable.h
+++ b/include/asm-i386/pgtable.h
@@ -494,7 +494,13 @@ do { \
* The i386 doesn't have any external MMU info: the kernel page
* tables contain all the necessary information.
*/
-#define update_mmu_cache(vma,address,pte) do { } while (0)
+#define bad_shared_pte(pte) (pte_write(pte) && !pte_dirty(pte))
+#define update_mmu_cache(vma,address,pte) do { \
+ static int __cnt; \
+ WARN_ON(((vma)->vm_flags & VM_SHARED) \
+ && bad_shared_pte(pte) \
+ && ++__cnt < 5); \
+} while (0)
#endif /* !__ASSEMBLY__ */
#ifdef CONFIG_FLATMEM
On Sun, 24 Dec 2006, Linus Torvalds wrote:
>
> Peter, tell me I'm crazy, but with the new rules, the following condition
> is a bug:
>
> - shared mapping
> - writable
> - not already marked dirty in the PTE
Ok, so how about this diff.
I'm actually feeling good about this one. It really looks like
"do_no_page()" was simply buggy, and that this explains everything.
Please please please test. Throw all the other patches away (with the
possible exception of the "update_mmu_cache()" sanity checker, which is
still interesting in case some _other_ place does this too).
Don't do the "wait_on_page_writeback()" thing, because it changes timings
and might hide things for the wrong reasons. Just apply this on top of a
known failing kernel, and test.
Linus
---
diff --git a/mm/memory.c b/mm/memory.c
index 563792f..cf429c4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2247,21 +2249,23 @@ retry:
if (pte_none(*page_table)) {
flush_icache_page(vma, new_page);
entry = mk_pte(new_page, vma->vm_page_prot);
- if (write_access)
- entry = maybe_mkwrite(pte_mkdirty(entry), vma);
- set_pte_at(mm, address, page_table, entry);
if (anon) {
inc_mm_counter(mm, anon_rss);
lru_cache_add_active(new_page);
page_add_new_anon_rmap(new_page, vma, address);
+ if (write_access)
+ entry = maybe_mkwrite(pte_mkdirty(entry), vma);
} else {
inc_mm_counter(mm, file_rss);
page_add_file_rmap(new_page);
+ entry = pte_wrprotect(entry);
if (write_access) {
dirty_page = new_page;
get_page(dirty_page);
+ entry = maybe_mkwrite(pte_mkdirty(entry), vma);
}
}
+ set_pte_at(mm, address, page_table, entry);
} else {
/* One of our sibling threads was faster, back out. */
page_cache_release(new_page);
On 12/24/06, Linus Torvalds <[email protected]> wrote:
> How about this particularly stupid diff? (please test with something that
> _would_ cause corruption normally).
>
> It is _entirely_ untested, but what it tries to do is to simply serialize
> any writeback in progress with any process that tries to re-map a shared
> page into its address space and dirty it. I haven't tested it, and maybe
> it misses some case, but it looks like a good way to try to avoid races
> with marking pages dirty and the writeback phase ..
The apt cache files (/var/cache/apt/*.bin) still get corrupted with
this patch and 2.6.19.
Gordon
diff -Naupr linux-2.6.19.orig/fs/buffer.c linux-2.6.19/fs/buffer.c
--- linux-2.6.19.orig/fs/buffer.c 2006-11-29 14:57:37.000000000 -0700
+++ linux-2.6.19/fs/buffer.c 2006-12-21 01:16:31.000000000 -0700
@@ -2832,7 +2832,7 @@ int try_to_free_buffers(struct page *pag
int ret = 0;
BUG_ON(!PageLocked(page));
- if (PageWriteback(page))
+ if (PageDirty(page) || PageWriteback(page))
return 0;
if (mapping == NULL) { /* can this still happen? */
@@ -2843,17 +2843,6 @@ int try_to_free_buffers(struct page *pag
spin_lock(&mapping->private_lock);
ret = drop_buffers(page, &buffers_to_free);
spin_unlock(&mapping->private_lock);
- if (ret) {
- /*
- * If the filesystem writes its buffers by hand (eg ext3)
- * then we can have clean buffers against a dirty page. We
- * clean the page here; otherwise later reattachment of buffers
- * could encounter a non-uptodate page, which is unresolvable.
- * This only applies in the rare case where try_to_free_buffers
- * succeeds but the page is not freed.
- */
- clear_page_dirty(page);
- }
out:
if (buffers_to_free) {
struct buffer_head *bh = buffers_to_free;
diff -Naupr linux-2.6.19.orig/fs/hugetlbfs/inode.c linux-2.6.19/fs/hugetlbfs/inode.c
--- linux-2.6.19.orig/fs/hugetlbfs/inode.c 2006-11-29 14:57:37.000000000 -0700
+++ linux-2.6.19/fs/hugetlbfs/inode.c 2006-12-21 01:15:21.000000000 -0700
@@ -176,7 +176,7 @@ static int hugetlbfs_commit_write(struct
static void truncate_huge_page(struct page *page)
{
- clear_page_dirty(page);
+ cancel_dirty_page(page, /* No IO accounting for huge pages? */0);
ClearPageUptodate(page);
remove_from_page_cache(page);
put_page(page);
diff -Naupr linux-2.6.19.orig/include/linux/page-flags.h linux-2.6.19/include/linux/page-flags.h
--- linux-2.6.19.orig/include/linux/page-flags.h 2006-11-29 14:57:37.000000000 -0700
+++ linux-2.6.19/include/linux/page-flags.h 2006-12-21 01:15:21.000000000 -0700
@@ -253,15 +253,11 @@ static inline void SetPageUptodate(struc
struct page; /* forward declaration */
-int test_clear_page_dirty(struct page *page);
+extern void cancel_dirty_page(struct page *page, unsigned int account_size);
+
int test_clear_page_writeback(struct page *page);
int test_set_page_writeback(struct page *page);
-static inline void clear_page_dirty(struct page *page)
-{
- test_clear_page_dirty(page);
-}
-
static inline void set_page_writeback(struct page *page)
{
test_set_page_writeback(page);
diff -Naupr linux-2.6.19.orig/mm/memory.c linux-2.6.19/mm/memory.c
--- linux-2.6.19.orig/mm/memory.c 2006-11-29 14:57:37.000000000 -0700
+++ linux-2.6.19/mm/memory.c 2006-12-24 11:04:03.000000000 -0700
@@ -1534,6 +1534,7 @@ static int do_wp_page(struct mm_struct *
if (!pte_same(*page_table, orig_pte))
goto unlock;
}
+ wait_on_page_writeback(old_page);
dirty_page = old_page;
get_page(dirty_page);
reuse = 1;
@@ -1832,6 +1833,33 @@ void unmap_mapping_range(struct address_
}
EXPORT_SYMBOL(unmap_mapping_range);
+static void check_last_page(struct address_space *mapping, loff_t size)
+{
+ pgoff_t index;
+ unsigned int offset;
+ struct page *page;
+
+ if (!mapping)
+ return;
+ offset = size & ~PAGE_MASK;
+ if (!offset)
+ return;
+ index = size >> PAGE_SHIFT;
+ page = find_lock_page(mapping, index);
+ if (page) {
+ unsigned int check = 0;
+ unsigned char *kaddr = kmap_atomic(page, KM_USER0);
+ do {
+ check += kaddr[offset++];
+ } while (offset < PAGE_SIZE);
+ kunmap_atomic(kaddr,KM_USER0);
+ unlock_page(page);
+ page_cache_release(page);
+ if (check)
+ printk("%s: BADNESS: truncate check %u\n",
+ current->comm, check);
+ }
+}
+
/**
* vmtruncate - unmap mappings "freed" by truncate() syscall
* @inode: inode of the file used
@@ -1865,6 +1893,7 @@ do_expand:
goto out_sig;
if (offset > inode->i_sb->s_maxbytes)
goto out_big;
+ check_last_page(mapping, inode->i_size);
i_size_write(inode, offset);
out_truncate:
@@ -2206,6 +2235,7 @@ retry:
page_cache_release(new_page);
return VM_FAULT_SIGBUS;
}
+ wait_on_page_writeback(new_page);
}
}
diff -Naupr linux-2.6.19.orig/mm/page-writeback.c linux-2.6.19/mm/page-writeback.c
--- linux-2.6.19.orig/mm/page-writeback.c 2006-11-29 14:57:37.000000000 -0700
+++ linux-2.6.19/mm/page-writeback.c 2006-12-21 01:26:53.000000000 -0700
@@ -843,39 +843,6 @@ int set_page_dirty_lock(struct page *pag
EXPORT_SYMBOL(set_page_dirty_lock);
/*
- * Clear a page's dirty flag, while caring for dirty memory accounting.
- * Returns true if the page was previously dirty.
- */
-int test_clear_page_dirty(struct page *page)
-{
- struct address_space *mapping = page_mapping(page);
- unsigned long flags;
-
- if (mapping) {
- write_lock_irqsave(&mapping->tree_lock, flags);
- if (TestClearPageDirty(page)) {
- radix_tree_tag_clear(&mapping->page_tree,
- page_index(page),
- PAGECACHE_TAG_DIRTY);
- write_unlock_irqrestore(&mapping->tree_lock, flags);
- /*
- * We can continue to use `mapping' here because the
- * page is locked, which pins the address_space
- */
- if (mapping_cap_account_dirty(mapping)) {
- page_mkclean(page);
- dec_zone_page_state(page, NR_FILE_DIRTY);
- }
- return 1;
- }
- write_unlock_irqrestore(&mapping->tree_lock, flags);
- return 0;
- }
- return TestClearPageDirty(page);
-}
-EXPORT_SYMBOL(test_clear_page_dirty);
-
-/*
* Clear a page's dirty flag, while caring for dirty memory accounting.
* Returns true if the page was previously dirty.
*
diff -Naupr linux-2.6.19.orig/mm/rmap.c linux-2.6.19/mm/rmap.c
--- linux-2.6.19.orig/mm/rmap.c 2006-11-29 14:57:37.000000000 -0700
+++ linux-2.6.19/mm/rmap.c 2006-12-22 23:25:09.000000000 -0700
@@ -432,7 +432,7 @@ static int page_mkclean_one(struct page
{
struct mm_struct *mm = vma->vm_mm;
unsigned long address;
- pte_t *pte, entry;
+ pte_t *pte;
spinlock_t *ptl;
int ret = 0;
@@ -444,17 +444,18 @@ static int page_mkclean_one(struct page
if (!pte)
goto out;
- if (!pte_dirty(*pte) && !pte_write(*pte))
- goto unlock;
+ if (pte_dirty(*pte) || pte_write(*pte)) {
+ pte_t entry;
- entry = ptep_get_and_clear(mm, address, pte);
- entry = pte_mkclean(entry);
- entry = pte_wrprotect(entry);
- ptep_establish(vma, address, pte, entry);
- lazy_mmu_prot_update(entry);
- ret = 1;
+ flush_cache_page(vma, address, pte_pfn(*pte));
+ entry = ptep_clear_flush(vma, address, pte);
+ entry = pte_wrprotect(entry);
+ entry = pte_mkclean(entry);
+ set_pte_at(vma, address, pte, entry);
+ lazy_mmu_prot_update(entry);
+ ret = 1;
+ }
-unlock:
pte_unmap_unlock(pte, ptl);
out:
return ret;
@@ -489,6 +490,8 @@ int page_mkclean(struct page *page)
if (mapping)
ret = page_mkclean_file(mapping, page);
}
+ if (page_test_and_clear_dirty(page))
+ ret = 1;
return ret;
}
@@ -587,8 +590,6 @@ void page_remove_rmap(struct page *page)
* Leaving it set also helps swapoff to reinstate ptes
* faster for those pages still in swapcache.
*/
- if (page_test_and_clear_dirty(page))
- set_page_dirty(page);
__dec_zone_page_state(page,
PageAnon(page) ? NR_ANON_PAGES :
NR_FILE_MAPPED);
}
@@ -607,6 +608,7 @@ static int try_to_unmap_one(struct page
pte_t pteval;
spinlock_t *ptl;
int ret = SWAP_AGAIN;
+ struct page *dirty_page = NULL;
address = vma_address(page, vma);
if (address == -EFAULT)
@@ -633,7 +635,7 @@ static int try_to_unmap_one(struct page
/* Move the dirty bit to the physical page now the pte is gone. */
if (pte_dirty(pteval))
- set_page_dirty(page);
+ dirty_page = page;
/* Update high watermark before we lower rss */
update_hiwater_rss(mm);
@@ -684,6 +686,8 @@ static int try_to_unmap_one(struct page
out_unmap:
pte_unmap_unlock(pte, ptl);
+ if (dirty_page)
+ set_page_dirty(dirty_page);
out:
return ret;
}
@@ -915,6 +919,9 @@ int try_to_unmap(struct page *page, int
else
ret = try_to_unmap_file(page, migration);
+ if (page_test_and_clear_dirty(page))
+ set_page_dirty(page);
+
if (!page_mapped(page))
ret = SWAP_SUCCESS;
return ret;
diff -Naupr linux-2.6.19.orig/mm/truncate.c linux-2.6.19/mm/truncate.c
--- linux-2.6.19.orig/mm/truncate.c 2006-11-29 14:57:37.000000000 -0700
+++ linux-2.6.19/mm/truncate.c 2006-12-23 13:21:42.000000000 -0700
@@ -50,6 +50,21 @@ static inline void truncate_partial_page
do_invalidatepage(page, partial);
}
+void cancel_dirty_page(struct page *page, unsigned int account_size)
+{
+ /* If we're cancelling the page, it had better not be mapped any more */
+ if (page_mapped(page)) {
+ static unsigned int warncount;
+
+ WARN_ON(++warncount < 5);
+ }
+
+ if (TestClearPageDirty(page) && account_size &&
+ mapping_cap_account_dirty(page->mapping))
+ dec_zone_page_state(page, NR_FILE_DIRTY);
+}
+
+
/*
* If truncate cannot remove the fs-private metadata from the page, the page
* becomes anonymous. It will be left on the LRU and may even be mapped into
@@ -66,10 +81,11 @@ truncate_complete_page(struct address_sp
if (page->mapping != mapping)
return;
+ cancel_dirty_page(page, PAGE_CACHE_SIZE);
+
if (PagePrivate(page))
do_invalidatepage(page, 0);
- clear_page_dirty(page);
ClearPageUptodate(page);
ClearPageMappedToDisk(page);
remove_from_page_cache(page);
@@ -348,7 +364,6 @@ int invalidate_inode_pages2_range(struct
for (i = 0; !ret && i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];
pgoff_t page_index;
- int was_dirty;
lock_page(page);
if (page->mapping != mapping) {
@@ -384,12 +399,8 @@ int invalidate_inode_pages2_range(struct
PAGE_CACHE_SIZE, 0);
}
}
- was_dirty = test_clear_page_dirty(page);
- if (!invalidate_complete_page2(mapping, page)) {
- if (was_dirty)
- set_page_dirty(page);
+ if (!invalidate_complete_page2(mapping, page))
ret = -EIO;
- }
unlock_page(page);
}
pagevec_release(&pvec);
--
Gordon Farquharson
On Sun, 24 Dec 2006, Gordon Farquharson wrote:
>
> The apt cache files (/var/cache/apt/*.bin) still get corrupted with
> this patch and 2.6.19.
Yeah, if my guess about do_no_page() is right, _none_ of the previous
patches should have ANY effect what-so-ever. In fact, I'd say that even
the "ext3 works in writeback mode" thing that Andrei reports is probably a
total fluke brought on by timing changes rather than anything else.
So please try the latest patch instead (on top of anything that shows
corruption reliably - the patch should be _totally_ independent of all the
other issues, and I think it will apply cleanly on top of 2.6.18.3 and
2.6.19 too, so anything that shows corruption is a fine target - but try
to choose something that has been the "best" at corrupting things for you,
to make the testing as good as possible).
Patch included here again (although I think you were cc'd on my previous
email too, so you should already have it, and our emails just crossed)
And if this doesn't fix it, I don't know what will..
Linus
---
diff --git a/mm/memory.c b/mm/memory.c
index 563792f..cf429c4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2247,21 +2249,23 @@ retry:
if (pte_none(*page_table)) {
flush_icache_page(vma, new_page);
entry = mk_pte(new_page, vma->vm_page_prot);
- if (write_access)
- entry = maybe_mkwrite(pte_mkdirty(entry), vma);
- set_pte_at(mm, address, page_table, entry);
if (anon) {
inc_mm_counter(mm, anon_rss);
lru_cache_add_active(new_page);
page_add_new_anon_rmap(new_page, vma, address);
+ if (write_access)
+ entry = maybe_mkwrite(pte_mkdirty(entry), vma);
} else {
inc_mm_counter(mm, file_rss);
page_add_file_rmap(new_page);
+ entry = pte_wrprotect(entry);
if (write_access) {
dirty_page = new_page;
get_page(dirty_page);
+ entry = maybe_mkwrite(pte_mkdirty(entry), vma);
}
}
+ set_pte_at(mm, address, page_table, entry);
} else {
/* One of our sibling threads was faster, back out. */
page_cache_release(new_page);
On Sun, 2006-12-24 at 11:35 -0800, Linus Torvalds wrote:
>
> On Sun, 24 Dec 2006, Gordon Farquharson wrote:
> >
> > The apt cache files (/var/cache/apt/*.bin) still get corrupted with
> > this patch and 2.6.19.
>
> Yeah, if my guess about do_no_page() is right, _none_ of the previous
> patches should have ANY effect what-so-ever. In fact, I'd say that even
> the "ext3 works in writeback mode" thing that Andrei reports is probably a
> total fluke brought on by timing changes rather than anything else.
>
> So please try the latest patch instead (on top of anything that shows
> corruption reliably - the patch should be _totally_ independent of all the
> other issues, and I think it will apply cleanly on top of 2.6.18.3 and
> 2.6.19 too, so anything that shows corruption is a fine target - but try
> to choose something that has been the "best" at corrupting things for you,
> to make the testing as good as possible).
>
> Patch included here again (although I think you were cc'd on my previous
> email too, so you should already have it, and our emails just crossed)
>
> And if this doesn't fix it, I don't know what will..
With latest git and patches:
http://lkml.org/lkml/diff/2006/12/24/56/1
http://lkml.org/lkml/diff/2006/12/24/61/1
Hash check on download completion found bad chunks, consider using
"safe_sync".
>
> Linus
>
> ---
> diff --git a/mm/memory.c b/mm/memory.c
> index 563792f..cf429c4 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2247,21 +2249,23 @@ retry:
> if (pte_none(*page_table)) {
> flush_icache_page(vma, new_page);
> entry = mk_pte(new_page, vma->vm_page_prot);
> - if (write_access)
> - entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> - set_pte_at(mm, address, page_table, entry);
> if (anon) {
> inc_mm_counter(mm, anon_rss);
> lru_cache_add_active(new_page);
> page_add_new_anon_rmap(new_page, vma, address);
> + if (write_access)
> + entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> } else {
> inc_mm_counter(mm, file_rss);
> page_add_file_rmap(new_page);
> + entry = pte_wrprotect(entry);
> if (write_access) {
> dirty_page = new_page;
> get_page(dirty_page);
> + entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> }
> }
> + set_pte_at(mm, address, page_table, entry);
> } else {
> /* One of our sibling threads was faster, back out. */
> page_cache_release(new_page);
On Sun, 24 Dec 2006, Andrei Popa wrote:
>
> Hash check on download completion found bad chunks, consider using
> "safe_sync".
Dang. Did you get any warning messages from the kernel?
Linus
On Sun, 2006-12-24 at 12:24 -0800, Linus Torvalds wrote:
>
> On Sun, 24 Dec 2006, Andrei Popa wrote:
> >
> > Hash check on download completion found bad chunks, consider using
> > "safe_sync".
>
> Dang. Did you get any warning messages from the kernel?
>
only these:
ACPI: EC: evaluating _Q80
ACPI: EC: evaluating _Q80
ACPI: EC: evaluating _Q80
but I don't think that has anything to do with it...
> Linus
On 12/24/06, Linus Torvalds <[email protected]> wrote:
> Ok, so how about this diff.
>
> I'm actually feeling good about this one. It really looks like
> "do_no_page()" was simply buggy, and that this explains everything.
I tested with just this patch and 2.6.19 and no change. Sorry Linus,
no early Christmas present :-(
Gordon
--
Gordon Farquharson
> Quoting Linus Torvalds <[email protected]>:
> Subject: Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)
>
> Peter, tell me I'm crazy, but with the new rules, the following condition
> is a bug:
>
> - shared mapping
> - writable
> - not already marked dirty in the PTE
>
> because that combination means that the hardware can mark the PTE dirty
> without us even realizing (and thus not marking the "struct page *"
> dirty).
Er.
Sorry about butting in, and I'm not sure I understand all of the discussion,
but this reminded me of an old issue with COW that created what looks
like a vaguely similar data corruption on infiniband. We solved this for
infiniband with MADV_DONTFORK, but I always wondered why it does not affect
other parts of the kernel. Small reminder from that discussion:
down mmap sem
get user pages
up mmap sem
page becomes shared, and COW (e.g. fork)
process writes to first byte of page <----- gets a copy
Now we have a problem: the struct page that we got from get user pages
no longer refers to the page that is mapped in our process.
For example: if at some point we map this page for DMA, and
hardware writes to the last byte of the page -----> the process does not
see this data.
So for infiniband, what we do is a combination of
- prevent page from becoming COW while hardware might DMA to this page, and
- ask users not to write to page if hardware might DMA to same page
(even if it's using different bytes).
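For reference, a minimal userspace sketch of the MADV_DONTFORK approach mentioned above, assuming Linux 2.6.16 or later and glibc. `dma_buffer_alloc` is a hypothetical helper name used only here, not an infiniband API; real code would still have to register the buffer with the hardware after this step.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* for MAP_ANONYMOUS and MADV_DONTFORK on glibc */
#endif
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: map `len` bytes of anonymous memory and mark the
 * range MADV_DONTFORK, so that a later fork() does not copy the range
 * into the child and the parent's pages are not made copy-on-write.
 * Returns NULL on failure.
 */
static void *dma_buffer_alloc(size_t len)
{
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return NULL;

    /* Exclude the range from fork(); undo the mapping if that fails. */
    if (madvise(buf, len, MADV_DONTFORK) != 0) {
        munmap(buf, len);
        return NULL;
    }
    return buf;
}
```

A child created by fork() after this call does not get the range at all, so touching it in the child faults. That is the price for keeping the parent's pages out of COW.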
I just wondered: is there some chance something like this could be happening in
the fs code?
HTH,
--
MST
* Linus Torvalds <[email protected]> [2006-12-24 11:35]:
> And if this doesn't fix it, I don't know what will..
Sorry, but it still fails (on top of plain 2.6.19).
--
Martin Michlmayr
http://www.cyrius.com/
Linus Torvalds wrote:
>
> On Sun, 24 Dec 2006, Linus Torvalds wrote:
>
>>Peter, tell me I'm crazy, but with the new rules, the following condition
>>is a bug:
>>
>> - shared mapping
>> - writable
>> - not already marked dirty in the PTE
>
>
> Ok, so how about this diff.
>
> I'm actually feeling good about this one. It really looks like
> "do_no_page()" was simply buggy, and that this explains everything.
Still trying to catch up here, so I'm not going to reply to any old
stuff and just start at the tip of the thread... Other than to say
that I really like cancel_page_dirty ;)
I think your patch is quite right so that's a good catch. But I'm not
too surprised that it does not help the problem, because I don't
think we have started shedding any old pte_dirty tests at
unmap/reclaim-time, have we? So the dirty bit isn't going to get lost,
as such.
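A toy model of what the patch changes for file-backed pages, purely as a reading aid. It is a sketch under stated assumptions, not kernel code: the flag values and the `toy_` names are hypothetical and exist only for this example, and it assumes the VMA is writable, so that maybe_mkwrite() would set the write bit.

```c
#include <stdbool.h>

/* Hypothetical flag bits for a toy PTE; NOT the kernel's definitions. */
#define TOY_PTE_WRITE 0x1u
#define TOY_PTE_DIRTY 0x2u

/*
 * True for the combination called a bug in the thread:
 * shared mapping, writable, not already marked dirty in the PTE.
 */
static bool toy_pte_is_suspect(unsigned int pte, bool shared_mapping)
{
    return shared_mapping &&
           (pte & TOY_PTE_WRITE) != 0 &&
           (pte & TOY_PTE_DIRTY) == 0;
}

/*
 * Toy version of the file-backed branch in the patch: start from a
 * write-protected entry, and make it writable only together with
 * marking it dirty (assumes a writable VMA).
 */
static unsigned int toy_file_fault_pte(unsigned int pte, bool write_access)
{
    pte &= ~TOY_PTE_WRITE;                    /* models pte_wrprotect() */
    if (write_access)
        pte |= TOY_PTE_DIRTY | TOY_PTE_WRITE; /* models maybe_mkwrite(pte_mkdirty()) */
    return pte;
}
```

In this model a read fault on a shared file page ends up write-protected and clean, so a later write has to fault again and the dirtying gets noticed. A write fault ends up writable and dirty together, and neither case matches the suspect combination.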
I was hoping that you've almost narrowed it down to the filesystem
writeback code, with the last few mails?
Nick
> Please please please test. Throw all the other patches away (with the
> possible exception of the "update_mmu_cache()" sanity checker, which is
> still interesting in case some _other_ place does this too).
>
> Don't do the "wait_on_page_writeback()" thing, because it changes timings
> and might hide things for the wrong reasons. Just apply this on top of a
> known failing kernel, and test.
>
> Linus
>
> ---
> diff --git a/mm/memory.c b/mm/memory.c
> index 563792f..cf429c4 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2247,21 +2249,23 @@ retry:
> if (pte_none(*page_table)) {
> flush_icache_page(vma, new_page);
> entry = mk_pte(new_page, vma->vm_page_prot);
> - if (write_access)
> - entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> - set_pte_at(mm, address, page_table, entry);
> if (anon) {
> inc_mm_counter(mm, anon_rss);
> lru_cache_add_active(new_page);
> page_add_new_anon_rmap(new_page, vma, address);
> + if (write_access)
> + entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> } else {
> inc_mm_counter(mm, file_rss);
> page_add_file_rmap(new_page);
> + entry = pte_wrprotect(entry);
> if (write_access) {
> dirty_page = new_page;
> get_page(dirty_page);
> + entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> }
> }
> + set_pte_at(mm, address, page_table, entry);
> } else {
> /* One of our sibling threads was faster, back out. */
> page_cache_release(new_page);
>
--
SUSE Labs, Novell Inc.
Linus Torvalds wrote:
> I don't think it's a page table issue any more, it just doesn't look
> likely with the ARM UP corruption. It's also not apparently even on a
> cacheline boundary, so it probably is really a dirty bit that got cleared
> wrong due to some race with IO.
So, until now it's only been reported for SMP on i386?
I'm seeing the issue on my Pentium-M Notebook (Thinkpad R52) over
here, UP kernel, no preempt.
I've first seen it with 2.6.20-rc1, but am running 2.6.20-rc2 now.
The corruption pattern looks like the one already reported, rtorrent
hash check fails (for some files it succeeds at first, but
fails after "echo 1 > /proc/sys/vm/drop_caches"), the corruption is
zeroes at the end of page instead of data.
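To make the reported pattern easier to look for, here is a rough sketch of a scanner that counts pages whose tail is zero-filled. The function names, the `min_tail` threshold, and the idea of treating such pages as suspects are assumptions made for this sketch. Zero tails also occur in perfectly good data, so a hit is only a hint to compare against a known-good copy.

```c
#include <stddef.h>

/* Return how many bytes at the end of buf[0..len) are zero. */
static size_t zero_tail_len(const unsigned char *buf, size_t len)
{
    size_t n = 0;
    while (n < len && buf[len - 1 - n] == 0)
        n++;
    return n;
}

/*
 * Count whole pages in data[0..size) that end in at least `min_tail`
 * zero bytes without being entirely zero. A trailing partial page is
 * ignored. Returns 0 if page_size is 0.
 */
static size_t count_suspect_pages(const unsigned char *data, size_t size,
                                  size_t page_size, size_t min_tail)
{
    size_t suspects = 0;

    if (page_size == 0)
        return 0;

    for (size_t off = 0; off + page_size <= size; off += page_size) {
        size_t tail = zero_tail_len(data + off, page_size);
        if (tail >= min_tail && tail < page_size)
            suspects++;
    }
    return suspects;
}
```

One way to use it would be to read a downloaded file into memory and call count_suspect_pages() with the system page size, once before and once after dropping the page cache as described above, and then compare the two counts.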
ii rtorrent 0.6.4-1 ncurses BitTorrent client based on LibTorren
ii libtorrent9 0.10.4-1 a C++ BitTorrent library
.config:
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.20-rc2
# Mon Dec 25 14:00:03 2006
#
CONFIG_X86_32=y
CONFIG_GENERIC_TIME=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_X86=y
CONFIG_MMU=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_DMI=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
#
# Code maturity level options
#
CONFIG_EXPERIMENTAL=y
CONFIG_BROKEN_ON_SMP=y
CONFIG_INIT_ENV_ARG_LIMIT=32
#
# General setup
#
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
# CONFIG_IPC_NS is not set
CONFIG_POSIX_MQUEUE=y
# CONFIG_BSD_PROCESS_ACCT is not set
# CONFIG_TASKSTATS is not set
# CONFIG_UTS_NS is not set
# CONFIG_AUDIT is not set
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
# CONFIG_SYSFS_DEPRECATED is not set
CONFIG_RELAY=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
CONFIG_SYSCTL=y
# CONFIG_EMBEDDED is not set
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
# CONFIG_KALLSYMS_ALL is not set
# CONFIG_KALLSYMS_EXTRA_PASS is not set
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_EPOLL=y
CONFIG_SHMEM=y
CONFIG_SLAB=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_RT_MUTEXES=y
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0
# CONFIG_SLOB is not set
#
# Loadable module support
#
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
CONFIG_MODULE_FORCE_UNLOAD=y
# CONFIG_MODVERSIONS is not set
# CONFIG_MODULE_SRCVERSION_ALL is not set
CONFIG_KMOD=y
#
# Block layer
#
CONFIG_BLOCK=y
CONFIG_LBD=y
CONFIG_BLK_DEV_IO_TRACE=y
# CONFIG_LSF is not set
#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
CONFIG_DEFAULT_AS=y
# CONFIG_DEFAULT_DEADLINE is not set
# CONFIG_DEFAULT_CFQ is not set
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="anticipatory"
#
# Processor type and features
#
# CONFIG_SMP is not set
CONFIG_X86_PC=y
# CONFIG_X86_ELAN is not set
# CONFIG_X86_VOYAGER is not set
# CONFIG_X86_NUMAQ is not set
# CONFIG_X86_SUMMIT is not set
# CONFIG_X86_BIGSMP is not set
# CONFIG_X86_VISWS is not set
# CONFIG_X86_GENERICARCH is not set
# CONFIG_X86_ES7000 is not set
# CONFIG_PARAVIRT is not set
# CONFIG_M386 is not set
# CONFIG_M486 is not set
# CONFIG_M586 is not set
# CONFIG_M586TSC is not set
# CONFIG_M586MMX is not set
# CONFIG_M686 is not set
# CONFIG_MPENTIUMII is not set
# CONFIG_MPENTIUMIII is not set
CONFIG_MPENTIUMM=y
# CONFIG_MCORE2 is not set
# CONFIG_MPENTIUM4 is not set
# CONFIG_MK6 is not set
# CONFIG_MK7 is not set
# CONFIG_MK8 is not set
# CONFIG_MCRUSOE is not set
# CONFIG_MEFFICEON is not set
# CONFIG_MWINCHIPC6 is not set
# CONFIG_MWINCHIP2 is not set
# CONFIG_MWINCHIP3D is not set
# CONFIG_MGEODEGX1 is not set
# CONFIG_MGEODE_LX is not set
# CONFIG_MCYRIXIII is not set
# CONFIG_MVIAC3_2 is not set
# CONFIG_X86_GENERIC is not set
CONFIG_X86_CMPXCHG=y
CONFIG_X86_XADD=y
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
# CONFIG_ARCH_HAS_ILOG2_U32 is not set
# CONFIG_ARCH_HAS_ILOG2_U64 is not set
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_INVLPG=y
CONFIG_X86_BSWAP=y
CONFIG_X86_POPAD_OK=y
CONFIG_X86_CMPXCHG64=y
CONFIG_X86_GOOD_APIC=y
CONFIG_X86_INTEL_USERCOPY=y
CONFIG_X86_USE_PPRO_CHECKSUM=y
CONFIG_X86_TSC=y
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
CONFIG_X86_UP_APIC=y
CONFIG_X86_UP_IOAPIC=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_MCE=y
CONFIG_X86_MCE_NONFATAL=y
CONFIG_X86_MCE_P4THERMAL=y
CONFIG_VM86=y
# CONFIG_TOSHIBA is not set
# CONFIG_I8K is not set
# CONFIG_X86_REBOOTFIXUPS is not set
# CONFIG_MICROCODE is not set
# CONFIG_X86_MSR is not set
# CONFIG_X86_CPUID is not set
#
# Firmware Drivers
#
# CONFIG_EDD is not set
# CONFIG_DELL_RBU is not set
CONFIG_DCDBAS=m
CONFIG_NOHIGHMEM=y
# CONFIG_HIGHMEM4G is not set
# CONFIG_HIGHMEM64G is not set
CONFIG_PAGE_OFFSET=0xC0000000
CONFIG_ARCH_FLATMEM_ENABLE=y
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_SELECT_MEMORY_MODEL=y
CONFIG_ARCH_POPULATES_NODE_MAP=y
CONFIG_SELECT_MEMORY_MODEL=y
CONFIG_FLATMEM_MANUAL=y
# CONFIG_DISCONTIGMEM_MANUAL is not set
# CONFIG_SPARSEMEM_MANUAL is not set
CONFIG_FLATMEM=y
CONFIG_FLAT_NODE_MEM_MAP=y
CONFIG_SPARSEMEM_STATIC=y
CONFIG_SPLIT_PTLOCK_CPUS=4
# CONFIG_RESOURCES_64BIT is not set
# CONFIG_MATH_EMULATION is not set
CONFIG_MTRR=y
# CONFIG_EFI is not set
# CONFIG_SECCOMP is not set
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
CONFIG_HZ_300=y
# CONFIG_HZ_1000 is not set
CONFIG_HZ=300
# CONFIG_KEXEC is not set
# CONFIG_RELOCATABLE is not set
CONFIG_PHYSICAL_ALIGN=0x100000
CONFIG_COMPAT_VDSO=y
#
# Power management options (ACPI, APM)
#
CONFIG_PM=y
# CONFIG_PM_LEGACY is not set
# CONFIG_PM_DEBUG is not set
# CONFIG_PM_SYSFS_DEPRECATED is not set
CONFIG_SOFTWARE_SUSPEND=y
CONFIG_PM_STD_PARTITION=""
#
# ACPI (Advanced Configuration and Power Interface) Support
#
CONFIG_ACPI=y
CONFIG_ACPI_SLEEP=y
CONFIG_ACPI_SLEEP_PROC_FS=y
# CONFIG_ACPI_SLEEP_PROC_SLEEP is not set
CONFIG_ACPI_AC=y
CONFIG_ACPI_BATTERY=y
CONFIG_ACPI_BUTTON=y
CONFIG_ACPI_VIDEO=y
CONFIG_ACPI_HOTKEY=m
CONFIG_ACPI_FAN=y
CONFIG_ACPI_DOCK=y
CONFIG_ACPI_PROCESSOR=y
CONFIG_ACPI_THERMAL=y
# CONFIG_ACPI_ASUS is not set
CONFIG_ACPI_IBM=m
# CONFIG_ACPI_TOSHIBA is not set
# CONFIG_ACPI_CUSTOM_DSDT is not set
CONFIG_ACPI_BLACKLIST_YEAR=0
# CONFIG_ACPI_DEBUG is not set
CONFIG_ACPI_EC=y
CONFIG_ACPI_POWER=y
CONFIG_ACPI_SYSTEM=y
CONFIG_X86_PM_TIMER=y
# CONFIG_ACPI_CONTAINER is not set
# CONFIG_ACPI_SBS is not set
#
# APM (Advanced Power Management) BIOS Support
#
# CONFIG_APM is not set
#
# CPU Frequency scaling
#
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_TABLE=y
# CONFIG_CPU_FREQ_DEBUG is not set
CONFIG_CPU_FREQ_STAT=y
# CONFIG_CPU_FREQ_STAT_DETAILS is not set
CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=y
CONFIG_CPU_FREQ_GOV_USERSPACE=y
CONFIG_CPU_FREQ_GOV_ONDEMAND=y
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=y
#
# CPUFreq processor drivers
#
# CONFIG_X86_ACPI_CPUFREQ is not set
# CONFIG_X86_POWERNOW_K6 is not set
# CONFIG_X86_POWERNOW_K7 is not set
# CONFIG_X86_POWERNOW_K8 is not set
# CONFIG_X86_GX_SUSPMOD is not set
CONFIG_X86_SPEEDSTEP_CENTRINO=y
CONFIG_X86_SPEEDSTEP_CENTRINO_ACPI=y
CONFIG_X86_SPEEDSTEP_CENTRINO_TABLE=y
CONFIG_X86_SPEEDSTEP_ICH=y
CONFIG_X86_SPEEDSTEP_SMI=y
# CONFIG_X86_P4_CLOCKMOD is not set
# CONFIG_X86_CPUFREQ_NFORCE2 is not set
# CONFIG_X86_LONGRUN is not set
# CONFIG_X86_LONGHAUL is not set
#
# shared options
#
# CONFIG_X86_ACPI_CPUFREQ_PROC_INTF is not set
CONFIG_X86_SPEEDSTEP_LIB=y
# CONFIG_X86_SPEEDSTEP_RELAXED_CAP_CHECK is not set
#
# Bus options (PCI, PCMCIA, EISA, MCA, ISA)
#
CONFIG_PCI=y
# CONFIG_PCI_GOBIOS is not set
# CONFIG_PCI_GOMMCONFIG is not set
# CONFIG_PCI_GODIRECT is not set
CONFIG_PCI_GOANY=y
CONFIG_PCI_BIOS=y
CONFIG_PCI_DIRECT=y
CONFIG_PCI_MMCONFIG=y
CONFIG_PCIEPORTBUS=y
# CONFIG_HOTPLUG_PCI_PCIE is not set
CONFIG_PCIEAER=y
CONFIG_PCI_MSI=y
# CONFIG_PCI_MULTITHREAD_PROBE is not set
# CONFIG_PCI_DEBUG is not set
CONFIG_HT_IRQ=y
CONFIG_ISA_DMA_API=y
CONFIG_ISA=y
# CONFIG_EISA is not set
# CONFIG_MCA is not set
# CONFIG_SCx200 is not set
#
# PCCARD (PCMCIA/CardBus) support
#
CONFIG_PCCARD=y
# CONFIG_PCMCIA_DEBUG is not set
CONFIG_PCMCIA=y
CONFIG_PCMCIA_LOAD_CIS=y
CONFIG_PCMCIA_IOCTL=y
CONFIG_CARDBUS=y
#
# PC-card bridges
#
CONFIG_YENTA=y
CONFIG_YENTA_O2=y
CONFIG_YENTA_RICOH=y
CONFIG_YENTA_TI=y
CONFIG_YENTA_ENE_TUNE=y
CONFIG_YENTA_TOSHIBA=y
# CONFIG_PD6729 is not set
# CONFIG_I82092 is not set
# CONFIG_I82365 is not set
# CONFIG_TCIC is not set
CONFIG_PCMCIA_PROBE=y
CONFIG_PCCARD_NONSTATIC=y
#
# PCI Hotplug Support
#
CONFIG_HOTPLUG_PCI=y
# CONFIG_HOTPLUG_PCI_FAKE is not set
# CONFIG_HOTPLUG_PCI_COMPAQ is not set
CONFIG_HOTPLUG_PCI_IBM=y
CONFIG_HOTPLUG_PCI_ACPI=y
CONFIG_HOTPLUG_PCI_ACPI_IBM=y
# CONFIG_HOTPLUG_PCI_CPCI is not set
# CONFIG_HOTPLUG_PCI_SHPC is not set
#
# Executable file formats
#
CONFIG_BINFMT_ELF=y
CONFIG_BINFMT_AOUT=y
CONFIG_BINFMT_MISC=y
#
# Networking
#
CONFIG_NET=y
#
# Networking options
#
# CONFIG_NETDEBUG is not set
CONFIG_PACKET=y
CONFIG_PACKET_MMAP=y
CONFIG_UNIX=y
CONFIG_XFRM=y
# CONFIG_XFRM_USER is not set
# CONFIG_XFRM_SUB_POLICY is not set
# CONFIG_NET_KEY is not set
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
# CONFIG_IP_ADVANCED_ROUTER is not set
CONFIG_IP_FIB_HASH=y
# CONFIG_IP_PNP is not set
# CONFIG_NET_IPIP is not set
# CONFIG_NET_IPGRE is not set
# CONFIG_IP_MROUTE is not set
# CONFIG_ARPD is not set
CONFIG_SYN_COOKIES=y
# CONFIG_INET_AH is not set
# CONFIG_INET_ESP is not set
# CONFIG_INET_IPCOMP is not set
# CONFIG_INET_XFRM_TUNNEL is not set
# CONFIG_INET_TUNNEL is not set
CONFIG_INET_XFRM_MODE_TRANSPORT=y
CONFIG_INET_XFRM_MODE_TUNNEL=y
CONFIG_INET_XFRM_MODE_BEET=y
CONFIG_INET_DIAG=y
CONFIG_INET_TCP_DIAG=y
CONFIG_TCP_CONG_ADVANCED=y
CONFIG_TCP_CONG_BIC=y
CONFIG_TCP_CONG_CUBIC=y
CONFIG_TCP_CONG_WESTWOOD=y
# CONFIG_TCP_CONG_HTCP is not set
CONFIG_TCP_CONG_HSTCP=y
# CONFIG_TCP_CONG_HYBLA is not set
CONFIG_TCP_CONG_VEGAS=y
# CONFIG_TCP_CONG_SCALABLE is not set
# CONFIG_TCP_CONG_LP is not set
# CONFIG_TCP_CONG_VENO is not set
# CONFIG_DEFAULT_BIC is not set
CONFIG_DEFAULT_CUBIC=y
# CONFIG_DEFAULT_HTCP is not set
# CONFIG_DEFAULT_VEGAS is not set
# CONFIG_DEFAULT_WESTWOOD is not set
# CONFIG_DEFAULT_RENO is not set
CONFIG_DEFAULT_TCP_CONG="cubic"
# CONFIG_TCP_MD5SIG is not set
#
# IP: Virtual Server Configuration
#
# CONFIG_IP_VS is not set
CONFIG_IPV6=y
# CONFIG_IPV6_PRIVACY is not set
CONFIG_IPV6_ROUTER_PREF=y
CONFIG_IPV6_ROUTE_INFO=y
# CONFIG_INET6_AH is not set
# CONFIG_INET6_ESP is not set
# CONFIG_INET6_IPCOMP is not set
# CONFIG_IPV6_MIP6 is not set
# CONFIG_INET6_XFRM_TUNNEL is not set
CONFIG_INET6_TUNNEL=y
CONFIG_INET6_XFRM_MODE_TRANSPORT=y
CONFIG_INET6_XFRM_MODE_TUNNEL=y
CONFIG_INET6_XFRM_MODE_BEET=y
# CONFIG_INET6_XFRM_MODE_ROUTEOPTIMIZATION is not set
CONFIG_IPV6_SIT=y
CONFIG_IPV6_TUNNEL=y
# CONFIG_IPV6_MULTIPLE_TABLES is not set
# CONFIG_NETWORK_SECMARK is not set
CONFIG_NETFILTER=y
# CONFIG_NETFILTER_DEBUG is not set
CONFIG_BRIDGE_NETFILTER=y
#
# Core Netfilter Configuration
#
CONFIG_NETFILTER_NETLINK=y
CONFIG_NETFILTER_NETLINK_QUEUE=y
CONFIG_NETFILTER_NETLINK_LOG=y
# CONFIG_NF_CONNTRACK_ENABLED is not set
CONFIG_NETFILTER_XTABLES=y
CONFIG_NETFILTER_XT_TARGET_CLASSIFY=y
# CONFIG_NETFILTER_XT_TARGET_DSCP is not set
CONFIG_NETFILTER_XT_TARGET_MARK=y
CONFIG_NETFILTER_XT_TARGET_NFQUEUE=y
# CONFIG_NETFILTER_XT_TARGET_NFLOG is not set
CONFIG_NETFILTER_XT_MATCH_COMMENT=y
# CONFIG_NETFILTER_XT_MATCH_DCCP is not set
# CONFIG_NETFILTER_XT_MATCH_DSCP is not set
# CONFIG_NETFILTER_XT_MATCH_ESP is not set
# CONFIG_NETFILTER_XT_MATCH_LENGTH is not set
CONFIG_NETFILTER_XT_MATCH_LIMIT=y
CONFIG_NETFILTER_XT_MATCH_MAC=y
CONFIG_NETFILTER_XT_MATCH_MARK=y
# CONFIG_NETFILTER_XT_MATCH_POLICY is not set
CONFIG_NETFILTER_XT_MATCH_MULTIPORT=y
CONFIG_NETFILTER_XT_MATCH_PHYSDEV=y
CONFIG_NETFILTER_XT_MATCH_PKTTYPE=y
# CONFIG_NETFILTER_XT_MATCH_QUOTA is not set
CONFIG_NETFILTER_XT_MATCH_REALM=y
# CONFIG_NETFILTER_XT_MATCH_SCTP is not set
# CONFIG_NETFILTER_XT_MATCH_STATISTIC is not set
# CONFIG_NETFILTER_XT_MATCH_STRING is not set
CONFIG_NETFILTER_XT_MATCH_TCPMSS=y
# CONFIG_NETFILTER_XT_MATCH_HASHLIMIT is not set
#
# IP: Netfilter Configuration
#
CONFIG_IP_NF_QUEUE=y
CONFIG_IP_NF_IPTABLES=y
CONFIG_IP_NF_MATCH_IPRANGE=y
CONFIG_IP_NF_MATCH_TOS=y
# CONFIG_IP_NF_MATCH_RECENT is not set
CONFIG_IP_NF_MATCH_ECN=y
CONFIG_IP_NF_MATCH_AH=y
# CONFIG_IP_NF_MATCH_TTL is not set
CONFIG_IP_NF_MATCH_OWNER=y
CONFIG_IP_NF_MATCH_ADDRTYPE=y
CONFIG_IP_NF_FILTER=y
CONFIG_IP_NF_TARGET_REJECT=y
CONFIG_IP_NF_TARGET_LOG=y
# CONFIG_IP_NF_TARGET_ULOG is not set
CONFIG_IP_NF_TARGET_TCPMSS=y
CONFIG_IP_NF_MANGLE=y
CONFIG_IP_NF_TARGET_TOS=y
CONFIG_IP_NF_TARGET_ECN=y
# CONFIG_IP_NF_TARGET_TTL is not set
# CONFIG_IP_NF_RAW is not set
# CONFIG_IP_NF_ARPTABLES is not set
#
# IPv6: Netfilter Configuration (EXPERIMENTAL)
#
CONFIG_IP6_NF_QUEUE=y
# CONFIG_IP6_NF_IPTABLES is not set
#
# Bridge: Netfilter Configuration
#
# CONFIG_BRIDGE_NF_EBTABLES is not set
#
# DCCP Configuration (EXPERIMENTAL)
#
# CONFIG_IP_DCCP is not set
#
# SCTP Configuration (EXPERIMENTAL)
#
# CONFIG_IP_SCTP is not set
#
# TIPC Configuration (EXPERIMENTAL)
#
# CONFIG_TIPC is not set
# CONFIG_ATM is not set
CONFIG_BRIDGE=y
CONFIG_VLAN_8021Q=y
# CONFIG_DECNET is not set
CONFIG_LLC=y
# CONFIG_LLC2 is not set
# CONFIG_IPX is not set
# CONFIG_ATALK is not set
# CONFIG_X25 is not set
# CONFIG_LAPB is not set
# CONFIG_ECONET is not set
# CONFIG_WAN_ROUTER is not set
#
# QoS and/or fair queueing
#
CONFIG_NET_SCHED=y
CONFIG_NET_SCH_FIFO=y
# CONFIG_NET_SCH_CLK_JIFFIES is not set
# CONFIG_NET_SCH_CLK_GETTIMEOFDAY is not set
CONFIG_NET_SCH_CLK_CPU=y
#
# Queueing/Scheduling
#
CONFIG_NET_SCH_CBQ=y
CONFIG_NET_SCH_HTB=y
# CONFIG_NET_SCH_HFSC is not set
CONFIG_NET_SCH_PRIO=y
CONFIG_NET_SCH_RED=y
CONFIG_NET_SCH_SFQ=y
# CONFIG_NET_SCH_TEQL is not set
CONFIG_NET_SCH_TBF=y
CONFIG_NET_SCH_GRED=y
CONFIG_NET_SCH_DSMARK=y
CONFIG_NET_SCH_NETEM=y
CONFIG_NET_SCH_INGRESS=y
#
# Classification
#
CONFIG_NET_CLS=y
CONFIG_NET_CLS_BASIC=y
CONFIG_NET_CLS_TCINDEX=y
CONFIG_NET_CLS_ROUTE4=y
CONFIG_NET_CLS_ROUTE=y
# CONFIG_NET_CLS_FW is not set
CONFIG_NET_CLS_U32=y
# CONFIG_CLS_U32_PERF is not set
# CONFIG_CLS_U32_MARK is not set
# CONFIG_NET_CLS_RSVP is not set
# CONFIG_NET_CLS_RSVP6 is not set
# CONFIG_NET_EMATCH is not set
# CONFIG_NET_CLS_ACT is not set
# CONFIG_NET_CLS_POLICE is not set
# CONFIG_NET_CLS_IND is not set
# CONFIG_NET_ESTIMATOR is not set
#
# Network testing
#
# CONFIG_NET_PKTGEN is not set
# CONFIG_HAMRADIO is not set
# CONFIG_IRDA is not set
CONFIG_BT=y
CONFIG_BT_L2CAP=y
CONFIG_BT_SCO=y
CONFIG_BT_RFCOMM=y
CONFIG_BT_RFCOMM_TTY=y
CONFIG_BT_BNEP=y
CONFIG_BT_BNEP_MC_FILTER=y
CONFIG_BT_BNEP_PROTO_FILTER=y
CONFIG_BT_HIDP=y
#
# Bluetooth device drivers
#
CONFIG_BT_HCIUSB=m
CONFIG_BT_HCIUSB_SCO=y
# CONFIG_BT_HCIUART is not set
# CONFIG_BT_HCIBCM203X is not set
# CONFIG_BT_HCIBPA10X is not set
# CONFIG_BT_HCIBFUSB is not set
# CONFIG_BT_HCIDTL1 is not set
# CONFIG_BT_HCIBT3C is not set
# CONFIG_BT_HCIBLUECARD is not set
# CONFIG_BT_HCIBTUART is not set
# CONFIG_BT_HCIVHCI is not set
CONFIG_IEEE80211=y
# CONFIG_IEEE80211_DEBUG is not set
CONFIG_IEEE80211_CRYPT_WEP=y
CONFIG_IEEE80211_CRYPT_CCMP=y
CONFIG_IEEE80211_CRYPT_TKIP=y
CONFIG_IEEE80211_SOFTMAC=y
# CONFIG_IEEE80211_SOFTMAC_DEBUG is not set
CONFIG_WIRELESS_EXT=y
#
# Device Drivers
#
#
# Generic Driver Options
#
# CONFIG_STANDALONE is not set
# CONFIG_PREVENT_FIRMWARE_BUILD is not set
CONFIG_FW_LOADER=y
# CONFIG_DEBUG_DRIVER is not set
# CONFIG_SYS_HYPERVISOR is not set
#
# Connector - unified userspace <-> kernelspace linker
#
CONFIG_CONNECTOR=y
# CONFIG_PROC_EVENTS is not set
#
# Memory Technology Devices (MTD)
#
CONFIG_MTD=m
# CONFIG_MTD_DEBUG is not set
# CONFIG_MTD_CONCAT is not set
CONFIG_MTD_PARTITIONS=y
# CONFIG_MTD_REDBOOT_PARTS is not set
#
# User Modules And Translation Layers
#
CONFIG_MTD_CHAR=m
CONFIG_MTD_BLOCK=m
# CONFIG_MTD_BLOCK_RO is not set
CONFIG_FTL=m
CONFIG_NFTL=m
# CONFIG_NFTL_RW is not set
CONFIG_INFTL=m
CONFIG_RFD_FTL=m
# CONFIG_SSFDC is not set
#
# RAM/ROM/Flash chip drivers
#
CONFIG_MTD_CFI=m
CONFIG_MTD_JEDECPROBE=m
CONFIG_MTD_GEN_PROBE=m
# CONFIG_MTD_CFI_ADV_OPTIONS is not set
CONFIG_MTD_MAP_BANK_WIDTH_1=y
CONFIG_MTD_MAP_BANK_WIDTH_2=y
CONFIG_MTD_MAP_BANK_WIDTH_4=y
# CONFIG_MTD_MAP_BANK_WIDTH_8 is not set
# CONFIG_MTD_MAP_BANK_WIDTH_16 is not set
# CONFIG_MTD_MAP_BANK_WIDTH_32 is not set
CONFIG_MTD_CFI_I1=y
CONFIG_MTD_CFI_I2=y
# CONFIG_MTD_CFI_I4 is not set
# CONFIG_MTD_CFI_I8 is not set
CONFIG_MTD_CFI_INTELEXT=m
CONFIG_MTD_CFI_AMDSTD=m
CONFIG_MTD_CFI_STAA=m
CONFIG_MTD_CFI_UTIL=m
CONFIG_MTD_RAM=m
CONFIG_MTD_ROM=m
# CONFIG_MTD_ABSENT is not set
# CONFIG_MTD_OBSOLETE_CHIPS is not set
#
# Mapping drivers for chip access
#
CONFIG_MTD_COMPLEX_MAPPINGS=y
# CONFIG_MTD_PHYSMAP is not set
# CONFIG_MTD_PNC2000 is not set
# CONFIG_MTD_NETSC520 is not set
# CONFIG_MTD_TS5500 is not set
# CONFIG_MTD_SBC_GXX is not set
# CONFIG_MTD_AMD76XROM is not set
# CONFIG_MTD_ICHXROM is not set
# CONFIG_MTD_SCB2_FLASH is not set
# CONFIG_MTD_NETtel is not set
# CONFIG_MTD_L440GX is not set
# CONFIG_MTD_PCI is not set
# CONFIG_MTD_PLATRAM is not set
#
# Self-contained MTD device drivers
#
# CONFIG_MTD_PMC551 is not set
# CONFIG_MTD_SLRAM is not set
# CONFIG_MTD_PHRAM is not set
# CONFIG_MTD_MTDRAM is not set
CONFIG_MTD_BLOCK2MTD=m
#
# Disk-On-Chip Device Drivers
#
# CONFIG_MTD_DOC2000 is not set
# CONFIG_MTD_DOC2001 is not set
# CONFIG_MTD_DOC2001PLUS is not set
#
# NAND Flash Device Drivers
#
CONFIG_MTD_NAND=m
# CONFIG_MTD_NAND_VERIFY_WRITE is not set
# CONFIG_MTD_NAND_ECC_SMC is not set
CONFIG_MTD_NAND_IDS=m
# CONFIG_MTD_NAND_DISKONCHIP is not set
# CONFIG_MTD_NAND_CS553X is not set
# CONFIG_MTD_NAND_NANDSIM is not set
#
# OneNAND Flash Device Drivers
#
# CONFIG_MTD_ONENAND is not set
#
# Parallel port support
#
CONFIG_PARPORT=y
CONFIG_PARPORT_PC=y
# CONFIG_PARPORT_SERIAL is not set
CONFIG_PARPORT_PC_FIFO=y
# CONFIG_PARPORT_PC_SUPERIO is not set
# CONFIG_PARPORT_PC_PCMCIA is not set
# CONFIG_PARPORT_GSC is not set
# CONFIG_PARPORT_AX88796 is not set
# CONFIG_PARPORT_1284 is not set
#
# Plug and Play support
#
CONFIG_PNP=y
# CONFIG_PNP_DEBUG is not set
#
# Protocols
#
# CONFIG_ISAPNP is not set
# CONFIG_PNPBIOS is not set
CONFIG_PNPACPI=y
#
# Block devices
#
# CONFIG_BLK_DEV_FD is not set
# CONFIG_BLK_DEV_XD is not set
# CONFIG_PARIDE is not set
# CONFIG_BLK_CPQ_DA is not set
# CONFIG_BLK_CPQ_CISS_DA is not set
# CONFIG_BLK_DEV_DAC960 is not set
# CONFIG_BLK_DEV_UMEM is not set
# CONFIG_BLK_DEV_COW_COMMON is not set
CONFIG_BLK_DEV_LOOP=y
# CONFIG_BLK_DEV_CRYPTOLOOP is not set
CONFIG_BLK_DEV_NBD=y
# CONFIG_BLK_DEV_SX8 is not set
# CONFIG_BLK_DEV_UB is not set
# CONFIG_BLK_DEV_RAM is not set
# CONFIG_BLK_DEV_INITRD is not set
# CONFIG_CDROM_PKTCDVD is not set
# CONFIG_ATA_OVER_ETH is not set
#
# Misc devices
#
# CONFIG_IBM_ASM is not set
# CONFIG_SGI_IOC4 is not set
# CONFIG_TIFM_CORE is not set
# CONFIG_MSI_LAPTOP is not set
#
# ATA/ATAPI/MFM/RLL support
#
# CONFIG_IDE is not set
#
# SCSI device support
#
# CONFIG_RAID_ATTRS is not set
CONFIG_SCSI=y
# CONFIG_SCSI_TGT is not set
# CONFIG_SCSI_NETLINK is not set
CONFIG_SCSI_PROC_FS=y
#
# SCSI support type (disk, tape, CD-ROM)
#
CONFIG_BLK_DEV_SD=y
# CONFIG_CHR_DEV_ST is not set
# CONFIG_CHR_DEV_OSST is not set
CONFIG_BLK_DEV_SR=y
# CONFIG_BLK_DEV_SR_VENDOR is not set
CONFIG_CHR_DEV_SG=y
# CONFIG_CHR_DEV_SCH is not set
#
# Some SCSI devices (e.g. CD jukebox) support multiple LUNs
#
CONFIG_SCSI_MULTI_LUN=y
CONFIG_SCSI_CONSTANTS=y
# CONFIG_SCSI_LOGGING is not set
CONFIG_SCSI_SCAN_ASYNC=y
#
# SCSI Transports
#
# CONFIG_SCSI_SPI_ATTRS is not set
# CONFIG_SCSI_FC_ATTRS is not set
# CONFIG_SCSI_ISCSI_ATTRS is not set
# CONFIG_SCSI_SAS_ATTRS is not set
# CONFIG_SCSI_SAS_LIBSAS is not set
#
# SCSI low-level drivers
#
# CONFIG_ISCSI_TCP is not set
# CONFIG_BLK_DEV_3W_XXXX_RAID is not set
# CONFIG_SCSI_3W_9XXX is not set
# CONFIG_SCSI_7000FASST is not set
# CONFIG_SCSI_ACARD is not set
# CONFIG_SCSI_AHA152X is not set
# CONFIG_SCSI_AHA1542 is not set
# CONFIG_SCSI_AACRAID is not set
# CONFIG_SCSI_AIC7XXX is not set
# CONFIG_SCSI_AIC7XXX_OLD is not set
# CONFIG_SCSI_AIC79XX is not set
# CONFIG_SCSI_AIC94XX is not set
# CONFIG_SCSI_DPT_I2O is not set
# CONFIG_SCSI_ADVANSYS is not set
# CONFIG_SCSI_IN2000 is not set
# CONFIG_SCSI_ARCMSR is not set
# CONFIG_MEGARAID_NEWGEN is not set
# CONFIG_MEGARAID_LEGACY is not set
# CONFIG_MEGARAID_SAS is not set
# CONFIG_SCSI_HPTIOP is not set
# CONFIG_SCSI_BUSLOGIC is not set
# CONFIG_SCSI_DMX3191D is not set
# CONFIG_SCSI_DTC3280 is not set
# CONFIG_SCSI_EATA is not set
# CONFIG_SCSI_FUTURE_DOMAIN is not set
# CONFIG_SCSI_GDTH is not set
# CONFIG_SCSI_GENERIC_NCR5380 is not set
# CONFIG_SCSI_GENERIC_NCR5380_MMIO is not set
# CONFIG_SCSI_IPS is not set
# CONFIG_SCSI_INITIO is not set
# CONFIG_SCSI_INIA100 is not set
# CONFIG_SCSI_PPA is not set
# CONFIG_SCSI_IMM is not set
# CONFIG_SCSI_NCR53C406A is not set
# CONFIG_SCSI_STEX is not set
# CONFIG_SCSI_SYM53C8XX_2 is not set
# CONFIG_SCSI_IPR is not set
# CONFIG_SCSI_PAS16 is not set
# CONFIG_SCSI_PSI240I is not set
# CONFIG_SCSI_QLOGIC_FAS is not set
# CONFIG_SCSI_QLOGIC_1280 is not set
# CONFIG_SCSI_QLA_FC is not set
# CONFIG_SCSI_QLA_ISCSI is not set
# CONFIG_SCSI_LPFC is not set
# CONFIG_SCSI_SYM53C416 is not set
# CONFIG_SCSI_DC395x is not set
# CONFIG_SCSI_DC390T is not set
# CONFIG_SCSI_T128 is not set
# CONFIG_SCSI_U14_34F is not set
# CONFIG_SCSI_ULTRASTOR is not set
# CONFIG_SCSI_NSP32 is not set
# CONFIG_SCSI_DEBUG is not set
# CONFIG_SCSI_SRP is not set
#
# PCMCIA SCSI adapter support
#
# CONFIG_PCMCIA_AHA152X is not set
# CONFIG_PCMCIA_FDOMAIN is not set
# CONFIG_PCMCIA_NINJA_SCSI is not set
# CONFIG_PCMCIA_QLOGIC is not set
# CONFIG_PCMCIA_SYM53C500 is not set
#
# Serial ATA (prod) and Parallel ATA (experimental) drivers
#
CONFIG_ATA=y
CONFIG_SATA_AHCI=y
# CONFIG_SATA_SVW is not set
CONFIG_ATA_PIIX=y
# CONFIG_SATA_MV is not set
# CONFIG_SATA_NV is not set
# CONFIG_PDC_ADMA is not set
# CONFIG_SATA_QSTOR is not set
# CONFIG_SATA_PROMISE is not set
# CONFIG_SATA_SX4 is not set
# CONFIG_SATA_SIL is not set
# CONFIG_SATA_SIL24 is not set
# CONFIG_SATA_SIS is not set
# CONFIG_SATA_ULI is not set
# CONFIG_SATA_VIA is not set
# CONFIG_SATA_VITESSE is not set
# CONFIG_PATA_ALI is not set
# CONFIG_PATA_AMD is not set
# CONFIG_PATA_ARTOP is not set
# CONFIG_PATA_ATIIXP is not set
# CONFIG_PATA_CMD64X is not set
# CONFIG_PATA_CS5520 is not set
# CONFIG_PATA_CS5530 is not set
# CONFIG_PATA_CS5535 is not set
# CONFIG_PATA_CYPRESS is not set
# CONFIG_PATA_EFAR is not set
# CONFIG_ATA_GENERIC is not set
# CONFIG_PATA_HPT366 is not set
# CONFIG_PATA_HPT37X is not set
# CONFIG_PATA_HPT3X2N is not set
# CONFIG_PATA_HPT3X3 is not set
# CONFIG_PATA_IT821X is not set
# CONFIG_PATA_JMICRON is not set
# CONFIG_PATA_LEGACY is not set
# CONFIG_PATA_TRIFLEX is not set
# CONFIG_PATA_MARVELL is not set
# CONFIG_PATA_MPIIX is not set
# CONFIG_PATA_OLDPIIX is not set
# CONFIG_PATA_NETCELL is not set
# CONFIG_PATA_NS87410 is not set
# CONFIG_PATA_OPTI is not set
# CONFIG_PATA_OPTIDMA is not set
# CONFIG_PATA_PCMCIA is not set
# CONFIG_PATA_PDC_OLD is not set
# CONFIG_PATA_QDI is not set
# CONFIG_PATA_RADISYS is not set
# CONFIG_PATA_RZ1000 is not set
# CONFIG_PATA_SC1200 is not set
# CONFIG_PATA_SERVERWORKS is not set
# CONFIG_PATA_PDC2027X is not set
# CONFIG_PATA_SIL680 is not set
# CONFIG_PATA_SIS is not set
# CONFIG_PATA_VIA is not set
# CONFIG_PATA_WINBOND is not set
# CONFIG_PATA_WINBOND_VLB is not set
#
# Old CD-ROM drivers (not SCSI, not IDE)
#
# CONFIG_CD_NO_IDESCSI is not set
#
# Multi-device support (RAID and LVM)
#
CONFIG_MD=y
# CONFIG_BLK_DEV_MD is not set
CONFIG_BLK_DEV_DM=y
# CONFIG_DM_DEBUG is not set
CONFIG_DM_CRYPT=y
CONFIG_DM_SNAPSHOT=y
# CONFIG_DM_MIRROR is not set
# CONFIG_DM_ZERO is not set
# CONFIG_DM_MULTIPATH is not set
#
# Fusion MPT device support
#
# CONFIG_FUSION is not set
# CONFIG_FUSION_SPI is not set
# CONFIG_FUSION_FC is not set
# CONFIG_FUSION_SAS is not set
#
# IEEE 1394 (FireWire) support
#
CONFIG_IEEE1394=y
#
# Subsystem Options
#
# CONFIG_IEEE1394_VERBOSEDEBUG is not set
# CONFIG_IEEE1394_OUI_DB is not set
CONFIG_IEEE1394_EXTRA_CONFIG_ROMS=y
CONFIG_IEEE1394_CONFIG_ROM_IP1394=y
# CONFIG_IEEE1394_EXPORT_FULL_API is not set
#
# Device Drivers
#
# CONFIG_IEEE1394_PCILYNX is not set
CONFIG_IEEE1394_OHCI1394=m
#
# Protocol Drivers
#
# CONFIG_IEEE1394_VIDEO1394 is not set
CONFIG_IEEE1394_SBP2=y
# CONFIG_IEEE1394_SBP2_PHYS_DMA is not set
CONFIG_IEEE1394_ETH1394=y
# CONFIG_IEEE1394_DV1394 is not set
CONFIG_IEEE1394_RAWIO=y
#
# I2O device support
#
# CONFIG_I2O is not set
#
# Network device support
#
CONFIG_NETDEVICES=y
# CONFIG_DUMMY is not set
CONFIG_BONDING=y
# CONFIG_EQUALIZER is not set
CONFIG_TUN=y
# CONFIG_NET_SB1000 is not set
#
# ARCnet devices
#
# CONFIG_ARCNET is not set
#
# PHY device support
#
# CONFIG_PHYLIB is not set
#
# Ethernet (10 or 100Mbit)
#
CONFIG_NET_ETHERNET=y
CONFIG_MII=y
# CONFIG_HAPPYMEAL is not set
# CONFIG_SUNGEM is not set
# CONFIG_CASSINI is not set
# CONFIG_NET_VENDOR_3COM is not set
# CONFIG_LANCE is not set
# CONFIG_NET_VENDOR_SMC is not set
# CONFIG_NET_VENDOR_RACAL is not set
#
# Tulip family network device support
#
# CONFIG_NET_TULIP is not set
# CONFIG_AT1700 is not set
# CONFIG_DEPCA is not set
# CONFIG_HP100 is not set
# CONFIG_NET_ISA is not set
CONFIG_NET_PCI=y
CONFIG_PCNET32=y
# CONFIG_PCNET32_NAPI is not set
CONFIG_AMD8111_ETH=y
CONFIG_AMD8111E_NAPI=y
# CONFIG_ADAPTEC_STARFIRE is not set
# CONFIG_AC3200 is not set
# CONFIG_APRICOT is not set
# CONFIG_B44 is not set
# CONFIG_FORCEDETH is not set
# CONFIG_CS89x0 is not set
# CONFIG_DGRS is not set
# CONFIG_EEPRO100 is not set
CONFIG_E100=y
# CONFIG_FEALNX is not set
# CONFIG_NATSEMI is not set
# CONFIG_NE2K_PCI is not set
# CONFIG_8139CP is not set
CONFIG_8139TOO=y
CONFIG_8139TOO_PIO=y
# CONFIG_8139TOO_TUNE_TWISTER is not set
# CONFIG_8139TOO_8129 is not set
# CONFIG_8139_OLD_RX_RESET is not set
# CONFIG_SIS900 is not set
# CONFIG_EPIC100 is not set
# CONFIG_SUNDANCE is not set
# CONFIG_TLAN is not set
# CONFIG_VIA_RHINE is not set
# CONFIG_NET_POCKET is not set
#
# Ethernet (1000 Mbit)
#
# CONFIG_ACENIC is not set
# CONFIG_DL2K is not set
# CONFIG_E1000 is not set
# CONFIG_NS83820 is not set
# CONFIG_HAMACHI is not set
# CONFIG_YELLOWFIN is not set
# CONFIG_R8169 is not set
# CONFIG_SIS190 is not set
# CONFIG_SKGE is not set
# CONFIG_SKY2 is not set
# CONFIG_SK98LIN is not set
# CONFIG_VIA_VELOCITY is not set
CONFIG_TIGON3=y
# CONFIG_BNX2 is not set
# CONFIG_QLA3XXX is not set
#
# Ethernet (10000 Mbit)
#
# CONFIG_CHELSIO_T1 is not set
# CONFIG_IXGB is not set
CONFIG_S2IO=m
# CONFIG_S2IO_NAPI is not set
# CONFIG_MYRI10GE is not set
# CONFIG_NETXEN_NIC is not set
#
# Token Ring devices
#
# CONFIG_TR is not set
#
# Wireless LAN (non-hamradio)
#
CONFIG_NET_RADIO=y
CONFIG_NET_WIRELESS_RTNETLINK=y
#
# Obsolete Wireless cards support (pre-802.11)
#
# CONFIG_STRIP is not set
# CONFIG_ARLAN is not set
# CONFIG_WAVELAN is not set
# CONFIG_PCMCIA_WAVELAN is not set
# CONFIG_PCMCIA_NETWAVE is not set
#
# Wireless 802.11 Frequency Hopping cards support
#
# CONFIG_PCMCIA_RAYCS is not set
#
# Wireless 802.11b ISA/PCI cards support
#
# CONFIG_IPW2100 is not set
CONFIG_IPW2200=m
CONFIG_IPW2200_MONITOR=y
CONFIG_IPW2200_RADIOTAP=y
CONFIG_IPW2200_PROMISCUOUS=y
CONFIG_IPW2200_QOS=y
# CONFIG_IPW2200_DEBUG is not set
# CONFIG_AIRO is not set
# CONFIG_HERMES is not set
# CONFIG_ATMEL is not set
#
# Wireless 802.11b Pcmcia/Cardbus cards support
#
# CONFIG_AIRO_CS is not set
# CONFIG_PCMCIA_WL3501 is not set
#
# Prism GT/Duette 802.11(a/b/g) PCI/Cardbus support
#
# CONFIG_PRISM54 is not set
# CONFIG_USB_ZD1201 is not set
CONFIG_HOSTAP=m
CONFIG_HOSTAP_FIRMWARE=y
CONFIG_HOSTAP_FIRMWARE_NVRAM=y
# CONFIG_HOSTAP_PLX is not set
# CONFIG_HOSTAP_PCI is not set
CONFIG_HOSTAP_CS=m
# CONFIG_BCM43XX is not set
CONFIG_ZD1211RW=m
# CONFIG_ZD1211RW_DEBUG is not set
CONFIG_NET_WIRELESS=y
#
# PCMCIA network device support
#
# CONFIG_NET_PCMCIA is not set
#
# Wan interfaces
#
# CONFIG_WAN is not set
# CONFIG_FDDI is not set
# CONFIG_HIPPI is not set
# CONFIG_PLIP is not set
CONFIG_PPP=y
CONFIG_PPP_MULTILINK=y
CONFIG_PPP_FILTER=y
CONFIG_PPP_ASYNC=y
CONFIG_PPP_SYNC_TTY=y
CONFIG_PPP_DEFLATE=y
CONFIG_PPP_BSDCOMP=y
# CONFIG_PPP_MPPE is not set
CONFIG_PPPOE=y
# CONFIG_SLIP is not set
CONFIG_SLHC=y
# CONFIG_NET_FC is not set
# CONFIG_SHAPER is not set
CONFIG_NETCONSOLE=y
CONFIG_NETPOLL=y
# CONFIG_NETPOLL_RX is not set
# CONFIG_NETPOLL_TRAP is not set
CONFIG_NET_POLL_CONTROLLER=y
#
# ISDN subsystem
#
# CONFIG_ISDN is not set
#
# Telephony Support
#
# CONFIG_PHONE is not set
#
# Input device support
#
CONFIG_INPUT=y
# CONFIG_INPUT_FF_MEMLESS is not set
#
# Userland interfaces
#
CONFIG_INPUT_MOUSEDEV=y
CONFIG_INPUT_MOUSEDEV_PSAUX=y
CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024
CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768
# CONFIG_INPUT_JOYDEV is not set
# CONFIG_INPUT_TSDEV is not set
# CONFIG_INPUT_EVDEV is not set
# CONFIG_INPUT_EVBUG is not set
#
# Input Device Drivers
#
CONFIG_INPUT_KEYBOARD=y
CONFIG_KEYBOARD_ATKBD=y
# CONFIG_KEYBOARD_SUNKBD is not set
# CONFIG_KEYBOARD_LKKBD is not set
# CONFIG_KEYBOARD_XTKBD is not set
# CONFIG_KEYBOARD_NEWTON is not set
# CONFIG_KEYBOARD_STOWAWAY is not set
CONFIG_INPUT_MOUSE=y
CONFIG_MOUSE_PS2=y
# CONFIG_MOUSE_SERIAL is not set
# CONFIG_MOUSE_INPORT is not set
# CONFIG_MOUSE_LOGIBM is not set
# CONFIG_MOUSE_PC110PAD is not set
# CONFIG_MOUSE_VSXXXAA is not set
# CONFIG_INPUT_JOYSTICK is not set
# CONFIG_INPUT_TOUCHSCREEN is not set
CONFIG_INPUT_MISC=y
# CONFIG_INPUT_PCSPKR is not set
# CONFIG_INPUT_WISTRON_BTNS is not set
CONFIG_INPUT_UINPUT=m
#
# Hardware I/O ports
#
CONFIG_SERIO=y
CONFIG_SERIO_I8042=y
# CONFIG_SERIO_SERPORT is not set
# CONFIG_SERIO_CT82C710 is not set
# CONFIG_SERIO_PARKBD is not set
# CONFIG_SERIO_PCIPS2 is not set
CONFIG_SERIO_LIBPS2=y
# CONFIG_SERIO_RAW is not set
# CONFIG_GAMEPORT is not set
#
# Character devices
#
CONFIG_VT=y
CONFIG_VT_CONSOLE=y
CONFIG_HW_CONSOLE=y
# CONFIG_VT_HW_CONSOLE_BINDING is not set
# CONFIG_SERIAL_NONSTANDARD is not set
#
# Serial drivers
#
CONFIG_SERIAL_8250=y
# CONFIG_SERIAL_8250_CONSOLE is not set
CONFIG_SERIAL_8250_PCI=y
CONFIG_SERIAL_8250_PNP=y
# CONFIG_SERIAL_8250_CS is not set
CONFIG_SERIAL_8250_NR_UARTS=4
CONFIG_SERIAL_8250_RUNTIME_UARTS=4
# CONFIG_SERIAL_8250_EXTENDED is not set
#
# Non-8250 serial port support
#
CONFIG_SERIAL_CORE=y
# CONFIG_SERIAL_JSM is not set
CONFIG_UNIX98_PTYS=y
# CONFIG_LEGACY_PTYS is not set
CONFIG_PRINTER=m
# CONFIG_LP_CONSOLE is not set
CONFIG_PPDEV=m
# CONFIG_TIPAR is not set
#
# IPMI
#
# CONFIG_IPMI_HANDLER is not set
#
# Watchdog Cards
#
# CONFIG_WATCHDOG is not set
CONFIG_HW_RANDOM=y
CONFIG_HW_RANDOM_INTEL=y
CONFIG_HW_RANDOM_AMD=y
CONFIG_HW_RANDOM_GEODE=y
CONFIG_HW_RANDOM_VIA=y
# CONFIG_NVRAM is not set
CONFIG_RTC=y
# CONFIG_DTLK is not set
# CONFIG_R3964 is not set
# CONFIG_APPLICOM is not set
# CONFIG_SONYPI is not set
CONFIG_AGP=y
# CONFIG_AGP_ALI is not set
# CONFIG_AGP_ATI is not set
# CONFIG_AGP_AMD is not set
# CONFIG_AGP_AMD64 is not set
CONFIG_AGP_INTEL=y
# CONFIG_AGP_NVIDIA is not set
# CONFIG_AGP_SIS is not set
# CONFIG_AGP_SWORKS is not set
# CONFIG_AGP_VIA is not set
# CONFIG_AGP_EFFICEON is not set
CONFIG_DRM=y
# CONFIG_DRM_TDFX is not set
# CONFIG_DRM_R128 is not set
CONFIG_DRM_RADEON=m
# CONFIG_DRM_I810 is not set
# CONFIG_DRM_I830 is not set
# CONFIG_DRM_I915 is not set
# CONFIG_DRM_MGA is not set
# CONFIG_DRM_SIS is not set
# CONFIG_DRM_VIA is not set
# CONFIG_DRM_SAVAGE is not set
#
# PCMCIA character devices
#
# CONFIG_SYNCLINK_CS is not set
# CONFIG_CARDMAN_4000 is not set
# CONFIG_CARDMAN_4040 is not set
# CONFIG_MWAVE is not set
# CONFIG_PC8736x_GPIO is not set
# CONFIG_NSC_GPIO is not set
# CONFIG_CS5535_GPIO is not set
# CONFIG_RAW_DRIVER is not set
CONFIG_HPET=y
# CONFIG_HPET_RTC_IRQ is not set
CONFIG_HPET_MMAP=y
# CONFIG_HANGCHECK_TIMER is not set
#
# TPM devices
#
CONFIG_TCG_TPM=y
CONFIG_TCG_TIS=y
CONFIG_TCG_NSC=y
CONFIG_TCG_ATMEL=y
CONFIG_TCG_INFINEON=y
# CONFIG_TELCLOCK is not set
#
# I2C support
#
CONFIG_I2C=y
CONFIG_I2C_CHARDEV=y
#
# I2C Algorithms
#
CONFIG_I2C_ALGOBIT=y
# CONFIG_I2C_ALGOPCF is not set
# CONFIG_I2C_ALGOPCA is not set
#
# I2C Hardware Bus support
#
# CONFIG_I2C_ALI1535 is not set
# CONFIG_I2C_ALI1563 is not set
# CONFIG_I2C_ALI15X3 is not set
# CONFIG_I2C_AMD756 is not set
# CONFIG_I2C_AMD8111 is not set
# CONFIG_I2C_ELEKTOR is not set
CONFIG_I2C_I801=y
CONFIG_I2C_I810=y
# CONFIG_I2C_PIIX4 is not set
# CONFIG_I2C_NFORCE2 is not set
# CONFIG_I2C_OCORES is not set
# CONFIG_I2C_PARPORT is not set
# CONFIG_I2C_PARPORT_LIGHT is not set
# CONFIG_I2C_PROSAVAGE is not set
# CONFIG_I2C_SAVAGE4 is not set
# CONFIG_SCx200_ACB is not set
# CONFIG_I2C_SIS5595 is not set
# CONFIG_I2C_SIS630 is not set
# CONFIG_I2C_SIS96X is not set
# CONFIG_I2C_STUB is not set
# CONFIG_I2C_VIA is not set
# CONFIG_I2C_VIAPRO is not set
# CONFIG_I2C_VOODOO3 is not set
# CONFIG_I2C_PCA_ISA is not set
#
# Miscellaneous I2C Chip support
#
# CONFIG_SENSORS_DS1337 is not set
# CONFIG_SENSORS_DS1374 is not set
CONFIG_SENSORS_EEPROM=m
# CONFIG_SENSORS_PCF8574 is not set
# CONFIG_SENSORS_PCA9539 is not set
# CONFIG_SENSORS_PCF8591 is not set
# CONFIG_SENSORS_MAX6875 is not set
# CONFIG_I2C_DEBUG_CORE is not set
# CONFIG_I2C_DEBUG_ALGO is not set
# CONFIG_I2C_DEBUG_BUS is not set
# CONFIG_I2C_DEBUG_CHIP is not set
#
# SPI support
#
# CONFIG_SPI is not set
# CONFIG_SPI_MASTER is not set
#
# Dallas's 1-wire bus
#
# CONFIG_W1 is not set
#
# Hardware Monitoring support
#
CONFIG_HWMON=y
# CONFIG_HWMON_VID is not set
# CONFIG_SENSORS_ABITUGURU is not set
# CONFIG_SENSORS_ADM1021 is not set
# CONFIG_SENSORS_ADM1025 is not set
# CONFIG_SENSORS_ADM1026 is not set
# CONFIG_SENSORS_ADM1031 is not set
# CONFIG_SENSORS_ADM9240 is not set
# CONFIG_SENSORS_K8TEMP is not set
# CONFIG_SENSORS_ASB100 is not set
# CONFIG_SENSORS_ATXP1 is not set
# CONFIG_SENSORS_DS1621 is not set
# CONFIG_SENSORS_F71805F is not set
# CONFIG_SENSORS_FSCHER is not set
# CONFIG_SENSORS_FSCPOS is not set
# CONFIG_SENSORS_GL518SM is not set
# CONFIG_SENSORS_GL520SM is not set
# CONFIG_SENSORS_IT87 is not set
# CONFIG_SENSORS_LM63 is not set
# CONFIG_SENSORS_LM75 is not set
# CONFIG_SENSORS_LM77 is not set
# CONFIG_SENSORS_LM78 is not set
# CONFIG_SENSORS_LM80 is not set
# CONFIG_SENSORS_LM83 is not set
# CONFIG_SENSORS_LM85 is not set
# CONFIG_SENSORS_LM87 is not set
# CONFIG_SENSORS_LM90 is not set
# CONFIG_SENSORS_LM92 is not set
# CONFIG_SENSORS_MAX1619 is not set
# CONFIG_SENSORS_PC87360 is not set
# CONFIG_SENSORS_PC87427 is not set
# CONFIG_SENSORS_SIS5595 is not set
# CONFIG_SENSORS_SMSC47M1 is not set
# CONFIG_SENSORS_SMSC47M192 is not set
# CONFIG_SENSORS_SMSC47B397 is not set
# CONFIG_SENSORS_VIA686A is not set
# CONFIG_SENSORS_VT1211 is not set
# CONFIG_SENSORS_VT8231 is not set
# CONFIG_SENSORS_W83781D is not set
# CONFIG_SENSORS_W83791D is not set
# CONFIG_SENSORS_W83792D is not set
# CONFIG_SENSORS_W83793 is not set
# CONFIG_SENSORS_W83L785TS is not set
# CONFIG_SENSORS_W83627HF is not set
# CONFIG_SENSORS_W83627EHF is not set
CONFIG_SENSORS_HDAPS=m
# CONFIG_HWMON_DEBUG_CHIP is not set
#
# Multimedia devices
#
# CONFIG_VIDEO_DEV is not set
#
# Digital Video Broadcasting Devices
#
# CONFIG_DVB is not set
# CONFIG_USB_DABUSB is not set
#
# Graphics support
#
CONFIG_FIRMWARE_EDID=y
CONFIG_FB=m
CONFIG_FB_DDC=m
CONFIG_FB_CFB_FILLRECT=m
CONFIG_FB_CFB_COPYAREA=m
CONFIG_FB_CFB_IMAGEBLIT=m
# CONFIG_FB_MACMODES is not set
# CONFIG_FB_BACKLIGHT is not set
CONFIG_FB_MODE_HELPERS=y
# CONFIG_FB_TILEBLITTING is not set
# CONFIG_FB_CIRRUS is not set
# CONFIG_FB_PM2 is not set
# CONFIG_FB_CYBER2000 is not set
# CONFIG_FB_ARC is not set
# CONFIG_FB_VGA16 is not set
# CONFIG_FB_HGA is not set
# CONFIG_FB_S1D13XXX is not set
# CONFIG_FB_NVIDIA is not set
# CONFIG_FB_RIVA is not set
# CONFIG_FB_I810 is not set
# CONFIG_FB_INTEL is not set
# CONFIG_FB_MATROX is not set
CONFIG_FB_RADEON=m
CONFIG_FB_RADEON_I2C=y
CONFIG_FB_RADEON_DEBUG=y
# CONFIG_FB_ATY128 is not set
# CONFIG_FB_ATY is not set
# CONFIG_FB_SAVAGE is not set
# CONFIG_FB_SIS is not set
# CONFIG_FB_NEOMAGIC is not set
# CONFIG_FB_KYRO is not set
# CONFIG_FB_3DFX is not set
# CONFIG_FB_VOODOO1 is not set
# CONFIG_FB_CYBLA is not set
# CONFIG_FB_TRIDENT is not set
# CONFIG_FB_GEODE is not set
# CONFIG_FB_VIRTUAL is not set
#
# Console display driver support
#
CONFIG_VGA_CONSOLE=y
CONFIG_VGACON_SOFT_SCROLLBACK=y
CONFIG_VGACON_SOFT_SCROLLBACK_SIZE=64
CONFIG_VIDEO_SELECT=y
# CONFIG_MDA_CONSOLE is not set
CONFIG_DUMMY_CONSOLE=y
CONFIG_FRAMEBUFFER_CONSOLE=m
# CONFIG_FRAMEBUFFER_CONSOLE_ROTATION is not set
CONFIG_FONTS=y
# CONFIG_FONT_8x8 is not set
# CONFIG_FONT_8x16 is not set
# CONFIG_FONT_6x11 is not set
# CONFIG_FONT_7x14 is not set
# CONFIG_FONT_PEARL_8x8 is not set
# CONFIG_FONT_ACORN_8x8 is not set
# CONFIG_FONT_MINI_4x6 is not set
# CONFIG_FONT_SUN8x16 is not set
CONFIG_FONT_SUN12x22=y
# CONFIG_FONT_10x18 is not set
#
# Logo configuration
#
# CONFIG_LOGO is not set
CONFIG_BACKLIGHT_LCD_SUPPORT=y
CONFIG_BACKLIGHT_CLASS_DEVICE=m
CONFIG_BACKLIGHT_DEVICE=y
CONFIG_LCD_CLASS_DEVICE=m
CONFIG_LCD_DEVICE=y
#
# Sound
#
CONFIG_SOUND=y
#
# Advanced Linux Sound Architecture
#
CONFIG_SND=y
CONFIG_SND_TIMER=y
CONFIG_SND_PCM=y
# CONFIG_SND_SEQUENCER is not set
CONFIG_SND_OSSEMUL=y
CONFIG_SND_MIXER_OSS=y
CONFIG_SND_PCM_OSS=y
# CONFIG_SND_PCM_OSS_PLUGINS is not set
CONFIG_SND_RTCTIMER=y
# CONFIG_SND_DYNAMIC_MINORS is not set
# CONFIG_SND_SUPPORT_OLD_API is not set
CONFIG_SND_VERBOSE_PROCFS=y
# CONFIG_SND_VERBOSE_PRINTK is not set
# CONFIG_SND_DEBUG is not set
#
# Generic devices
#
CONFIG_SND_AC97_CODEC=y
# CONFIG_SND_DUMMY is not set
# CONFIG_SND_MTPAV is not set
# CONFIG_SND_MTS64 is not set
# CONFIG_SND_SERIAL_U16550 is not set
# CONFIG_SND_MPU401 is not set
#
# ISA devices
#
# CONFIG_SND_ADLIB is not set
# CONFIG_SND_AD1816A is not set
# CONFIG_SND_AD1848 is not set
# CONFIG_SND_ALS100 is not set
# CONFIG_SND_AZT2320 is not set
# CONFIG_SND_CMI8330 is not set
# CONFIG_SND_CS4231 is not set
# CONFIG_SND_CS4232 is not set
# CONFIG_SND_CS4236 is not set
# CONFIG_SND_DT019X is not set
# CONFIG_SND_ES968 is not set
# CONFIG_SND_ES1688 is not set
# CONFIG_SND_ES18XX is not set
# CONFIG_SND_GUSCLASSIC is not set
# CONFIG_SND_GUSEXTREME is not set
# CONFIG_SND_GUSMAX is not set
# CONFIG_SND_INTERWAVE is not set
# CONFIG_SND_INTERWAVE_STB is not set
# CONFIG_SND_OPL3SA2 is not set
# CONFIG_SND_OPTI92X_AD1848 is not set
# CONFIG_SND_OPTI92X_CS4231 is not set
# CONFIG_SND_OPTI93X is not set
# CONFIG_SND_MIRO is not set
# CONFIG_SND_SB8 is not set
# CONFIG_SND_SB16 is not set
# CONFIG_SND_SBAWE is not set
# CONFIG_SND_SGALAXY is not set
# CONFIG_SND_SSCAPE is not set
# CONFIG_SND_WAVEFRONT is not set
#
# PCI devices
#
# CONFIG_SND_AD1889 is not set
# CONFIG_SND_ALS300 is not set
# CONFIG_SND_ALS4000 is not set
# CONFIG_SND_ALI5451 is not set
# CONFIG_SND_ATIIXP is not set
# CONFIG_SND_ATIIXP_MODEM is not set
# CONFIG_SND_AU8810 is not set
# CONFIG_SND_AU8820 is not set
# CONFIG_SND_AU8830 is not set
# CONFIG_SND_AZT3328 is not set
# CONFIG_SND_BT87X is not set
# CONFIG_SND_CA0106 is not set
# CONFIG_SND_CMIPCI is not set
# CONFIG_SND_CS4281 is not set
# CONFIG_SND_CS46XX is not set
# CONFIG_SND_CS5535AUDIO is not set
# CONFIG_SND_DARLA20 is not set
# CONFIG_SND_GINA20 is not set
# CONFIG_SND_LAYLA20 is not set
# CONFIG_SND_DARLA24 is not set
# CONFIG_SND_GINA24 is not set
# CONFIG_SND_LAYLA24 is not set
# CONFIG_SND_MONA is not set
# CONFIG_SND_MIA is not set
# CONFIG_SND_ECHO3G is not set
# CONFIG_SND_INDIGO is not set
# CONFIG_SND_INDIGOIO is not set
# CONFIG_SND_INDIGODJ is not set
# CONFIG_SND_EMU10K1 is not set
# CONFIG_SND_EMU10K1X is not set
# CONFIG_SND_ENS1370 is not set
# CONFIG_SND_ENS1371 is not set
# CONFIG_SND_ES1938 is not set
# CONFIG_SND_ES1968 is not set
# CONFIG_SND_FM801 is not set
CONFIG_SND_HDA_INTEL=y
# CONFIG_SND_HDSP is not set
# CONFIG_SND_HDSPM is not set
# CONFIG_SND_ICE1712 is not set
# CONFIG_SND_ICE1724 is not set
CONFIG_SND_INTEL8X0=y
# CONFIG_SND_INTEL8X0M is not set
# CONFIG_SND_KORG1212 is not set
# CONFIG_SND_MAESTRO3 is not set
# CONFIG_SND_MIXART is not set
# CONFIG_SND_NM256 is not set
# CONFIG_SND_PCXHR is not set
# CONFIG_SND_RIPTIDE is not set
# CONFIG_SND_RME32 is not set
# CONFIG_SND_RME96 is not set
# CONFIG_SND_RME9652 is not set
# CONFIG_SND_SONICVIBES is not set
# CONFIG_SND_TRIDENT is not set
# CONFIG_SND_VIA82XX is not set
# CONFIG_SND_VIA82XX_MODEM is not set
# CONFIG_SND_VX222 is not set
# CONFIG_SND_YMFPCI is not set
CONFIG_SND_AC97_POWER_SAVE=y
#
# USB devices
#
# CONFIG_SND_USB_AUDIO is not set
# CONFIG_SND_USB_USX2Y is not set
#
# PCMCIA devices
#
# CONFIG_SND_VXPOCKET is not set
# CONFIG_SND_PDAUDIOCF is not set
#
# Open Sound System
#
# CONFIG_SOUND_PRIME is not set
CONFIG_AC97_BUS=y
#
# HID Devices
#
CONFIG_HID=y
#
# USB support
#
CONFIG_USB_ARCH_HAS_HCD=y
CONFIG_USB_ARCH_HAS_OHCI=y
CONFIG_USB_ARCH_HAS_EHCI=y
CONFIG_USB=y
# CONFIG_USB_DEBUG is not set
#
# Miscellaneous USB options
#
CONFIG_USB_DEVICEFS=y
# CONFIG_USB_BANDWIDTH is not set
# CONFIG_USB_DYNAMIC_MINORS is not set
# CONFIG_USB_SUSPEND is not set
CONFIG_USB_MULTITHREAD_PROBE=y
# CONFIG_USB_OTG is not set
#
# USB Host Controller Drivers
#
CONFIG_USB_EHCI_HCD=m
# CONFIG_USB_EHCI_SPLIT_ISO is not set
# CONFIG_USB_EHCI_ROOT_HUB_TT is not set
# CONFIG_USB_EHCI_TT_NEWSCHED is not set
# CONFIG_USB_ISP116X_HCD is not set
# CONFIG_USB_OHCI_HCD is not set
CONFIG_USB_UHCI_HCD=m
# CONFIG_USB_SL811_HCD is not set
#
# USB Device Class drivers
#
# CONFIG_USB_ACM is not set
CONFIG_USB_PRINTER=y
#
# NOTE: USB_STORAGE enables SCSI, and 'SCSI disk support'
#
#
# may also be needed; see USB_STORAGE Help for more information
#
CONFIG_USB_STORAGE=m
# CONFIG_USB_STORAGE_DEBUG is not set
# CONFIG_USB_STORAGE_DATAFAB is not set
# CONFIG_USB_STORAGE_FREECOM is not set
# CONFIG_USB_STORAGE_DPCM is not set
# CONFIG_USB_STORAGE_USBAT is not set
# CONFIG_USB_STORAGE_SDDR09 is not set
# CONFIG_USB_STORAGE_SDDR55 is not set
# CONFIG_USB_STORAGE_JUMPSHOT is not set
# CONFIG_USB_STORAGE_ALAUDA is not set
# CONFIG_USB_STORAGE_KARMA is not set
CONFIG_USB_LIBUSUAL=y
#
# USB Input Devices
#
CONFIG_USB_HID=y
# CONFIG_USB_HID_POWERBOOK is not set
# CONFIG_HID_FF is not set
# CONFIG_USB_HIDDEV is not set
# CONFIG_USB_AIPTEK is not set
# CONFIG_USB_WACOM is not set
# CONFIG_USB_ACECAD is not set
# CONFIG_USB_KBTAB is not set
# CONFIG_USB_POWERMATE is not set
# CONFIG_USB_TOUCHSCREEN is not set
# CONFIG_USB_YEALINK is not set
# CONFIG_USB_XPAD is not set
# CONFIG_USB_ATI_REMOTE is not set
# CONFIG_USB_ATI_REMOTE2 is not set
# CONFIG_USB_KEYSPAN_REMOTE is not set
# CONFIG_USB_APPLETOUCH is not set
#
# USB Imaging devices
#
# CONFIG_USB_MDC800 is not set
# CONFIG_USB_MICROTEK is not set
#
# USB Network Adapters
#
# CONFIG_USB_CATC is not set
# CONFIG_USB_KAWETH is not set
# CONFIG_USB_PEGASUS is not set
# CONFIG_USB_RTL8150 is not set
CONFIG_USB_USBNET_MII=y
CONFIG_USB_USBNET=y
CONFIG_USB_NET_AX8817X=y
CONFIG_USB_NET_CDCETHER=m
# CONFIG_USB_NET_GL620A is not set
CONFIG_USB_NET_NET1080=m
# CONFIG_USB_NET_PLUSB is not set
# CONFIG_USB_NET_MCS7830 is not set
CONFIG_USB_NET_RNDIS_HOST=m
CONFIG_USB_NET_CDC_SUBSET=m
# CONFIG_USB_ALI_M5632 is not set
# CONFIG_USB_AN2720 is not set
CONFIG_USB_BELKIN=y
CONFIG_USB_ARMLINUX=y
# CONFIG_USB_EPSON2888 is not set
CONFIG_USB_NET_ZAURUS=m
CONFIG_USB_MON=y
#
# USB port drivers
#
# CONFIG_USB_USS720 is not set
#
# USB Serial Converter support
#
CONFIG_USB_SERIAL=y
# CONFIG_USB_SERIAL_CONSOLE is not set
CONFIG_USB_SERIAL_GENERIC=y
# CONFIG_USB_SERIAL_AIRCABLE is not set
# CONFIG_USB_SERIAL_AIRPRIME is not set
# CONFIG_USB_SERIAL_ARK3116 is not set
# CONFIG_USB_SERIAL_BELKIN is not set
# CONFIG_USB_SERIAL_WHITEHEAT is not set
# CONFIG_USB_SERIAL_DIGI_ACCELEPORT is not set
# CONFIG_USB_SERIAL_CP2101 is not set
# CONFIG_USB_SERIAL_CYPRESS_M8 is not set
# CONFIG_USB_SERIAL_EMPEG is not set
# CONFIG_USB_SERIAL_FTDI_SIO is not set
# CONFIG_USB_SERIAL_FUNSOFT is not set
# CONFIG_USB_SERIAL_VISOR is not set
# CONFIG_USB_SERIAL_IPAQ is not set
# CONFIG_USB_SERIAL_IR is not set
# CONFIG_USB_SERIAL_EDGEPORT is not set
# CONFIG_USB_SERIAL_EDGEPORT_TI is not set
# CONFIG_USB_SERIAL_GARMIN is not set
# CONFIG_USB_SERIAL_IPW is not set
# CONFIG_USB_SERIAL_KEYSPAN_PDA is not set
# CONFIG_USB_SERIAL_KEYSPAN is not set
# CONFIG_USB_SERIAL_KLSI is not set
# CONFIG_USB_SERIAL_KOBIL_SCT is not set
# CONFIG_USB_SERIAL_MCT_U232 is not set
# CONFIG_USB_SERIAL_MOS7720 is not set
# CONFIG_USB_SERIAL_MOS7840 is not set
# CONFIG_USB_SERIAL_NAVMAN is not set
CONFIG_USB_SERIAL_PL2303=y
CONFIG_USB_SERIAL_HP4X=y
# CONFIG_USB_SERIAL_SAFE is not set
# CONFIG_USB_SERIAL_SIERRAWIRELESS is not set
# CONFIG_USB_SERIAL_TI is not set
# CONFIG_USB_SERIAL_CYBERJACK is not set
# CONFIG_USB_SERIAL_XIRCOM is not set
# CONFIG_USB_SERIAL_OPTION is not set
# CONFIG_USB_SERIAL_OMNINET is not set
# CONFIG_USB_SERIAL_DEBUG is not set
#
# USB Miscellaneous drivers
#
# CONFIG_USB_EMI62 is not set
# CONFIG_USB_EMI26 is not set
# CONFIG_USB_ADUTUX is not set
# CONFIG_USB_AUERSWALD is not set
# CONFIG_USB_RIO500 is not set
# CONFIG_USB_LEGOTOWER is not set
# CONFIG_USB_LCD is not set
# CONFIG_USB_LED is not set
# CONFIG_USB_CYPRESS_CY7C63 is not set
# CONFIG_USB_CYTHERM is not set
# CONFIG_USB_PHIDGET is not set
# CONFIG_USB_IDMOUSE is not set
# CONFIG_USB_FTDI_ELAN is not set
# CONFIG_USB_APPLEDISPLAY is not set
# CONFIG_USB_SISUSBVGA is not set
# CONFIG_USB_LD is not set
# CONFIG_USB_TRANCEVIBRATOR is not set
# CONFIG_USB_TEST is not set
#
# USB DSL modem support
#
#
# USB Gadget Support
#
# CONFIG_USB_GADGET is not set
#
# MMC/SD Card support
#
# CONFIG_MMC is not set
#
# LED devices
#
# CONFIG_NEW_LEDS is not set
#
# LED drivers
#
#
# LED Triggers
#
#
# InfiniBand support
#
# CONFIG_INFINIBAND is not set
#
# EDAC - error detection and reporting (RAS) (EXPERIMENTAL)
#
CONFIG_EDAC=y
#
# Reporting subsystems
#
# CONFIG_EDAC_DEBUG is not set
CONFIG_EDAC_MM_EDAC=y
# CONFIG_EDAC_AMD76X is not set
# CONFIG_EDAC_E7XXX is not set
# CONFIG_EDAC_E752X is not set
# CONFIG_EDAC_I82875P is not set
# CONFIG_EDAC_I82860 is not set
# CONFIG_EDAC_R82600 is not set
CONFIG_EDAC_POLL=y
#
# Real Time Clock
#
# CONFIG_RTC_CLASS is not set
#
# DMA Engine support
#
# CONFIG_DMA_ENGINE is not set
#
# DMA Clients
#
#
# DMA Devices
#
#
# Virtualization
#
# CONFIG_KVM is not set
#
# File systems
#
CONFIG_EXT2_FS=y
CONFIG_EXT2_FS_XATTR=y
CONFIG_EXT2_FS_POSIX_ACL=y
CONFIG_EXT2_FS_SECURITY=y
# CONFIG_EXT2_FS_XIP is not set
CONFIG_EXT3_FS=y
CONFIG_EXT3_FS_XATTR=y
CONFIG_EXT3_FS_POSIX_ACL=y
CONFIG_EXT3_FS_SECURITY=y
# CONFIG_EXT4DEV_FS is not set
CONFIG_JBD=y
# CONFIG_JBD_DEBUG is not set
CONFIG_FS_MBCACHE=y
CONFIG_REISERFS_FS=y
# CONFIG_REISERFS_CHECK is not set
# CONFIG_REISERFS_PROC_INFO is not set
# CONFIG_REISERFS_FS_XATTR is not set
# CONFIG_JFS_FS is not set
CONFIG_FS_POSIX_ACL=y
# CONFIG_XFS_FS is not set
# CONFIG_GFS2_FS is not set
# CONFIG_OCFS2_FS is not set
CONFIG_MINIX_FS=y
# CONFIG_ROMFS_FS is not set
CONFIG_INOTIFY=y
CONFIG_INOTIFY_USER=y
# CONFIG_QUOTA is not set
CONFIG_DNOTIFY=y
# CONFIG_AUTOFS_FS is not set
CONFIG_AUTOFS4_FS=y
# CONFIG_FUSE_FS is not set
#
# CD-ROM/DVD Filesystems
#
CONFIG_ISO9660_FS=y
CONFIG_JOLIET=y
# CONFIG_ZISOFS is not set
CONFIG_UDF_FS=y
CONFIG_UDF_NLS=y
#
# DOS/FAT/NT Filesystems
#
CONFIG_FAT_FS=y
CONFIG_MSDOS_FS=y
CONFIG_VFAT_FS=y
CONFIG_FAT_DEFAULT_CODEPAGE=437
CONFIG_FAT_DEFAULT_IOCHARSET="iso8859-1"
CONFIG_NTFS_FS=y
# CONFIG_NTFS_DEBUG is not set
CONFIG_NTFS_RW=y
#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
CONFIG_PROC_SYSCTL=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
# CONFIG_TMPFS_POSIX_ACL is not set
# CONFIG_HUGETLBFS is not set
# CONFIG_HUGETLB_PAGE is not set
CONFIG_RAMFS=y
CONFIG_CONFIGFS_FS=y
#
# Miscellaneous filesystems
#
# CONFIG_ADFS_FS is not set
# CONFIG_AFFS_FS is not set
# CONFIG_HFS_FS is not set
# CONFIG_HFSPLUS_FS is not set
# CONFIG_BEFS_FS is not set
# CONFIG_BFS_FS is not set
# CONFIG_EFS_FS is not set
CONFIG_JFFS2_FS=m
CONFIG_JFFS2_FS_DEBUG=0
CONFIG_JFFS2_FS_WRITEBUFFER=y
# CONFIG_JFFS2_SUMMARY is not set
# CONFIG_JFFS2_FS_XATTR is not set
CONFIG_JFFS2_COMPRESSION_OPTIONS=y
CONFIG_JFFS2_ZLIB=y
CONFIG_JFFS2_RTIME=y
# CONFIG_JFFS2_RUBIN is not set
# CONFIG_JFFS2_CMODE_NONE is not set
CONFIG_JFFS2_CMODE_PRIORITY=y
# CONFIG_JFFS2_CMODE_SIZE is not set
CONFIG_CRAMFS=m
# CONFIG_VXFS_FS is not set
# CONFIG_HPFS_FS is not set
# CONFIG_QNX4FS_FS is not set
# CONFIG_SYSV_FS is not set
# CONFIG_UFS_FS is not set
#
# Network File Systems
#
CONFIG_NFS_FS=y
CONFIG_NFS_V3=y
CONFIG_NFS_V3_ACL=y
# CONFIG_NFS_V4 is not set
CONFIG_NFS_DIRECTIO=y
CONFIG_NFSD=y
CONFIG_NFSD_V2_ACL=y
CONFIG_NFSD_V3=y
CONFIG_NFSD_V3_ACL=y
# CONFIG_NFSD_V4 is not set
CONFIG_NFSD_TCP=y
CONFIG_LOCKD=y
CONFIG_LOCKD_V4=y
CONFIG_EXPORTFS=y
CONFIG_NFS_ACL_SUPPORT=y
CONFIG_NFS_COMMON=y
CONFIG_SUNRPC=y
# CONFIG_RPCSEC_GSS_KRB5 is not set
# CONFIG_RPCSEC_GSS_SPKM3 is not set
CONFIG_SMB_FS=y
# CONFIG_SMB_NLS_DEFAULT is not set
CONFIG_CIFS=y
# CONFIG_CIFS_STATS is not set
# CONFIG_CIFS_WEAK_PW_HASH is not set
# CONFIG_CIFS_XATTR is not set
# CONFIG_CIFS_DEBUG2 is not set
# CONFIG_CIFS_EXPERIMENTAL is not set
# CONFIG_NCP_FS is not set
# CONFIG_CODA_FS is not set
# CONFIG_AFS_FS is not set
# CONFIG_9P_FS is not set
#
# Partition Types
#
# CONFIG_PARTITION_ADVANCED is not set
CONFIG_MSDOS_PARTITION=y
#
# Native Language Support
#
CONFIG_NLS=y
CONFIG_NLS_DEFAULT="iso8859-1"
CONFIG_NLS_CODEPAGE_437=y
# CONFIG_NLS_CODEPAGE_737 is not set
# CONFIG_NLS_CODEPAGE_775 is not set
CONFIG_NLS_CODEPAGE_850=y
# CONFIG_NLS_CODEPAGE_852 is not set
# CONFIG_NLS_CODEPAGE_855 is not set
# CONFIG_NLS_CODEPAGE_857 is not set
# CONFIG_NLS_CODEPAGE_860 is not set
# CONFIG_NLS_CODEPAGE_861 is not set
# CONFIG_NLS_CODEPAGE_862 is not set
# CONFIG_NLS_CODEPAGE_863 is not set
# CONFIG_NLS_CODEPAGE_864 is not set
# CONFIG_NLS_CODEPAGE_865 is not set
# CONFIG_NLS_CODEPAGE_866 is not set
# CONFIG_NLS_CODEPAGE_869 is not set
# CONFIG_NLS_CODEPAGE_936 is not set
# CONFIG_NLS_CODEPAGE_950 is not set
CONFIG_NLS_CODEPAGE_932=y
# CONFIG_NLS_CODEPAGE_949 is not set
# CONFIG_NLS_CODEPAGE_874 is not set
# CONFIG_NLS_ISO8859_8 is not set
# CONFIG_NLS_CODEPAGE_1250 is not set
# CONFIG_NLS_CODEPAGE_1251 is not set
# CONFIG_NLS_ASCII is not set
CONFIG_NLS_ISO8859_1=y
# CONFIG_NLS_ISO8859_2 is not set
# CONFIG_NLS_ISO8859_3 is not set
# CONFIG_NLS_ISO8859_4 is not set
# CONFIG_NLS_ISO8859_5 is not set
# CONFIG_NLS_ISO8859_6 is not set
# CONFIG_NLS_ISO8859_7 is not set
# CONFIG_NLS_ISO8859_9 is not set
# CONFIG_NLS_ISO8859_13 is not set
# CONFIG_NLS_ISO8859_14 is not set
CONFIG_NLS_ISO8859_15=y
# CONFIG_NLS_KOI8_R is not set
# CONFIG_NLS_KOI8_U is not set
CONFIG_NLS_UTF8=y
#
# Distributed Lock Manager
#
# CONFIG_DLM is not set
#
# Instrumentation Support
#
# CONFIG_PROFILING is not set
# CONFIG_KPROBES is not set
#
# Kernel hacking
#
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
CONFIG_PRINTK_TIME=y
CONFIG_ENABLE_MUST_CHECK=y
CONFIG_MAGIC_SYSRQ=y
# CONFIG_UNUSED_SYMBOLS is not set
CONFIG_DEBUG_FS=y
# CONFIG_HEADERS_CHECK is not set
CONFIG_DEBUG_KERNEL=y
CONFIG_LOG_BUF_SHIFT=15
CONFIG_DETECT_SOFTLOCKUP=y
# CONFIG_SCHEDSTATS is not set
# CONFIG_DEBUG_SLAB is not set
# CONFIG_DEBUG_RT_MUTEXES is not set
# CONFIG_RT_MUTEX_TESTER is not set
# CONFIG_DEBUG_SPINLOCK is not set
# CONFIG_DEBUG_MUTEXES is not set
# CONFIG_DEBUG_RWSEMS is not set
# CONFIG_DEBUG_LOCK_ALLOC is not set
# CONFIG_PROVE_LOCKING is not set
# CONFIG_DEBUG_SPINLOCK_SLEEP is not set
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
# CONFIG_DEBUG_KOBJECT is not set
CONFIG_DEBUG_BUGVERBOSE=y
# CONFIG_DEBUG_INFO is not set
# CONFIG_DEBUG_VM is not set
# CONFIG_DEBUG_LIST is not set
# CONFIG_FRAME_POINTER is not set
CONFIG_FORCED_INLINING=y
# CONFIG_RCU_TORTURE_TEST is not set
CONFIG_EARLY_PRINTK=y
# CONFIG_DEBUG_STACKOVERFLOW is not set
# CONFIG_DEBUG_STACK_USAGE is not set
#
# Page alloc debug is incompatible with Software Suspend on i386
#
# CONFIG_DEBUG_RODATA is not set
CONFIG_4KSTACKS=y
CONFIG_X86_FIND_SMP_CONFIG=y
CONFIG_X86_MPPARSE=y
CONFIG_DOUBLEFAULT=y
#
# Security options
#
# CONFIG_KEYS is not set
# CONFIG_SECURITY is not set
#
# Cryptographic options
#
CONFIG_CRYPTO=y
CONFIG_CRYPTO_ALGAPI=y
CONFIG_CRYPTO_BLKCIPHER=y
CONFIG_CRYPTO_MANAGER=y
# CONFIG_CRYPTO_HMAC is not set
# CONFIG_CRYPTO_XCBC is not set
# CONFIG_CRYPTO_NULL is not set
# CONFIG_CRYPTO_MD4 is not set
# CONFIG_CRYPTO_MD5 is not set
# CONFIG_CRYPTO_SHA1 is not set
# CONFIG_CRYPTO_SHA256 is not set
# CONFIG_CRYPTO_SHA512 is not set
# CONFIG_CRYPTO_WP512 is not set
# CONFIG_CRYPTO_TGR192 is not set
# CONFIG_CRYPTO_GF128MUL is not set
CONFIG_CRYPTO_ECB=y
CONFIG_CRYPTO_CBC=y
# CONFIG_CRYPTO_LRW is not set
# CONFIG_CRYPTO_DES is not set
# CONFIG_CRYPTO_BLOWFISH is not set
# CONFIG_CRYPTO_TWOFISH is not set
# CONFIG_CRYPTO_TWOFISH_586 is not set
# CONFIG_CRYPTO_SERPENT is not set
CONFIG_CRYPTO_AES=y
# CONFIG_CRYPTO_AES_586 is not set
# CONFIG_CRYPTO_CAST5 is not set
# CONFIG_CRYPTO_CAST6 is not set
# CONFIG_CRYPTO_TEA is not set
CONFIG_CRYPTO_ARC4=y
# CONFIG_CRYPTO_KHAZAD is not set
# CONFIG_CRYPTO_ANUBIS is not set
# CONFIG_CRYPTO_DEFLATE is not set
CONFIG_CRYPTO_MICHAEL_MIC=y
# CONFIG_CRYPTO_CRC32C is not set
# CONFIG_CRYPTO_TEST is not set
#
# Hardware crypto devices
#
# CONFIG_CRYPTO_DEV_PADLOCK is not set
CONFIG_CRYPTO_DEV_GEODE=m
#
# Library routines
#
CONFIG_BITREVERSE=y
CONFIG_CRC_CCITT=y
# CONFIG_CRC16 is not set
CONFIG_CRC32=y
# CONFIG_LIBCRC32C is not set
CONFIG_ZLIB_INFLATE=y
CONFIG_ZLIB_DEFLATE=y
CONFIG_PLIST=y
CONFIG_IOMAP_COPY=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_X86_BIOS_REBOOT=y
CONFIG_KTIME_SCALAR=y
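To compare a config like the one above against the config of a kernel that does not show the problem, a small script can save some eyeballing. What follows is only a minimal sketch and not part of the original report: the helper names parse_kconfig and diff_kconfig are made up for illustration, and it assumes the usual .config line shapes ("CONFIG_FOO=value" and "# CONFIG_FOO is not set").

```python
import re


def parse_kconfig(text):
    """Return {option: value} for a kernel .config; unset options map to None."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        # Lines of the form "# CONFIG_FOO is not set" mark a disabled option.
        m = re.match(r"# (CONFIG_\w+) is not set$", line)
        if m:
            options[m.group(1)] = None
            continue
        # Lines of the form "CONFIG_FOO=y", "CONFIG_FOO=m", "CONFIG_FOO=15", ...
        m = re.match(r"(CONFIG_\w+)=(.*)$", line)
        if m:
            options[m.group(1)] = m.group(2)
        # Everything else (section comments, blank lines) is ignored.
    return options


def diff_kconfig(old, new):
    """Return a sorted list of (option, old_value, new_value) for options that differ."""
    keys = set(old) | set(new)
    return sorted(
        (key, old.get(key), new.get(key))
        for key in keys
        if old.get(key) != new.get(key)
    )


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python kconfig_diff.py config-old config-new
    with open(sys.argv[1]) as f_old, open(sys.argv[2]) as f_new:
        changes = diff_kconfig(parse_kconfig(f_old.read()), parse_kconfig(f_new.read()))
    for option, before, after in changes:
        print(option, before, "->", after)
```

An option that is missing from one file and "not set" in the other compares as equal here (both map to None), which is usually what one wants when comparing configs across kernel versions.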
dmesg:
[ 0.000000] Linux version 2.6.20-rc2 (ranma@navi) (gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)) #26 Mon Dec 25 14:00:08 CET 2006
[ 0.000000] BIOS-provided physical RAM map:
[ 0.000000] sanitize start
[ 0.000000] sanitize end
[ 0.000000] copy_e820_map() start: 0000000000000000 size: 000000000009f000 end: 000000000009f000 type: 1
[ 0.000000] copy_e820_map() type is E820_RAM
[ 0.000000] copy_e820_map() start: 000000000009f000 size: 0000000000001000 end: 00000000000a0000 type: 2
[ 0.000000] copy_e820_map() start: 00000000000dc000 size: 0000000000024000 end: 0000000000100000 type: 2
[ 0.000000] copy_e820_map() start: 0000000000100000 size: 000000001fde0000 end: 000000001fee0000 type: 1
[ 0.000000] copy_e820_map() type is E820_RAM
[ 0.000000] copy_e820_map() start: 000000001fee0000 size: 0000000000015000 end: 000000001fef5000 type: 3
[ 0.000000] copy_e820_map() start: 000000001fef5000 size: 000000000000b000 end: 000000001ff00000 type: 4
[ 0.000000] copy_e820_map() start: 000000001ff00000 size: 0000000000100000 end: 0000000020000000 type: 2
[ 0.000000] copy_e820_map() start: 00000000e0000000 size: 0000000010000000 end: 00000000f0000000 type: 2
[ 0.000000] copy_e820_map() start: 00000000f0008000 size: 0000000000004000 end: 00000000f000c000 type: 2
[ 0.000000] copy_e820_map() start: 00000000fec00000 size: 0000000000010000 end: 00000000fec10000 type: 2
[ 0.000000] copy_e820_map() start: 00000000fed14000 size: 0000000000006000 end: 00000000fed1a000 type: 2
[ 0.000000] copy_e820_map() start: 00000000fed20000 size: 0000000000070000 end: 00000000fed90000 type: 2
[ 0.000000] copy_e820_map() start: 00000000fee00000 size: 0000000000001000 end: 00000000fee01000 type: 2
[ 0.000000] copy_e820_map() start: 00000000ff000000 size: 0000000001000000 end: 0000000100000000 type: 2
[ 0.000000] BIOS-e820: 0000000000000000 - 000000000009f000 (usable)
[ 0.000000] BIOS-e820: 000000000009f000 - 00000000000a0000 (reserved)
[ 0.000000] BIOS-e820: 00000000000dc000 - 0000000000100000 (reserved)
[ 0.000000] BIOS-e820: 0000000000100000 - 000000001fee0000 (usable)
[ 0.000000] BIOS-e820: 000000001fee0000 - 000000001fef5000 (ACPI data)
[ 0.000000] BIOS-e820: 000000001fef5000 - 000000001ff00000 (ACPI NVS)
[ 0.000000] BIOS-e820: 000000001ff00000 - 0000000020000000 (reserved)
[ 0.000000] BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
[ 0.000000] BIOS-e820: 00000000f0008000 - 00000000f000c000 (reserved)
[ 0.000000] BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved)
[ 0.000000] BIOS-e820: 00000000fed14000 - 00000000fed1a000 (reserved)
[ 0.000000] BIOS-e820: 00000000fed20000 - 00000000fed90000 (reserved)
[ 0.000000] BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
[ 0.000000] BIOS-e820: 00000000ff000000 - 0000000100000000 (reserved)
[ 0.000000] 510MB LOWMEM available.
[ 0.000000] Entering add_active_range(0, 0, 130784) 0 entries of 256 used
[ 0.000000] Zone PFN ranges:
[ 0.000000] DMA 0 -> 4096
[ 0.000000] Normal 4096 -> 130784
[ 0.000000] early_node_map[1] active PFN ranges
[ 0.000000] 0: 0 -> 130784
[ 0.000000] On node 0 totalpages: 130784
[ 0.000000] DMA zone: 32 pages used for memmap
[ 0.000000] DMA zone: 0 pages reserved
[ 0.000000] DMA zone: 4064 pages, LIFO batch:0
[ 0.000000] Normal zone: 989 pages used for memmap
[ 0.000000] Normal zone: 125699 pages, LIFO batch:31
[ 0.000000] DMI present.
[ 0.000000] ACPI: RSDP (v002 IBM ) @ 0x000f6bf0
[ 0.000000] ACPI: XSDT (v001 IBM TP-76 0x00001270 LTP 0x00000000) @ 0x1fee6f9b
[ 0.000000] ACPI: FADT (v003 IBM TP-76 0x00001270 IBM 0x00000001) @ 0x1fee7000
[ 0.000000] ACPI: SSDT (v001 IBM TP-76 0x00001270 MSFT 0x0100000e) @ 0x1fee71b4
[ 0.000000] ACPI: ECDT (v001 IBM TP-76 0x00001270 IBM 0x00000001) @ 0x1fef4d46
[ 0.000000] ACPI: TCPA (v001 IBM TP-76 0x00001270 PTL 0x00000001) @ 0x1fef4d98
[ 0.000000] ACPI: MADT (v001 IBM TP-76 0x00001270 IBM 0x00000001) @ 0x1fef4dca
[ 0.000000] ACPI: MCFG (v001 IBM TP-76 0x00001270 IBM 0x00000001) @ 0x1fef4e24
[ 0.000000] ACPI: BOOT (v001 IBM TP-76 0x00001270 LTP 0x00000001) @ 0x1fef4fd8
[ 0.000000] ACPI: DSDT (v001 IBM TP-76 0x00001270 MSFT 0x0100000e) @ 0x00000000
[ 0.000000] ACPI: PM-Timer IO Port: 0x1008
[ 0.000000] ACPI: Local APIC address 0xfee00000
[ 0.000000] ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
[ 0.000000] Processor #0 6:13 APIC version 20
[ 0.000000] ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
[ 0.000000] ACPI: IOAPIC (id[0x01] address[0xfec00000] gsi_base[0])
[ 0.000000] IOAPIC[0]: apic_id 1, version 32, address 0xfec00000, GSI 0-23
[ 0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
[ 0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
[ 0.000000] ACPI: IRQ0 used by override.
[ 0.000000] ACPI: IRQ2 used by override.
[ 0.000000] ACPI: IRQ9 used by override.
[ 0.000000] Enabling APIC mode: Flat. Using 1 I/O APICs
[ 0.000000] Using ACPI (MADT) for SMP configuration information
[ 0.000000] Allocating PCI resources starting at 30000000 (gap: 20000000:c0000000)
[ 0.000000] Detected 1995.186 MHz processor.
[ 2.815181] Built 1 zonelists. Total pages: 129763
[ 2.815183] Kernel command line: root=/dev/sda6 resume=/dev/sda5 vga=ext parport=auto ide0=noprobe ide1=noprobe libata.atapi_enabled=1 ro
[ 2.815401] mapped APIC to ffff9000 (fee00000)
[ 2.815404] mapped IOAPIC to ffff8000 (fec00000)
[ 2.815406] Enabling fast FPU save and restore... done.
[ 2.815408] Enabling unmasked SIMD FPU exception support... done.
[ 2.815416] Initializing CPU#0
[ 2.815473] CPU 0 irqstacks, hard=c05f3000 soft=c05f2000
[ 2.815476] PID hash table entries: 2048 (order: 11, 8192 bytes)
[ 2.815491] is_hpet_capable()
[ 2.815493] trying to force-enable HPET
[ 2.815498] RCBA already mapped at f0008000
[ 2.815501] HPTC: RCBA Base is 0xf0008000, mapped at 0xffffc000 to 0xfffff000
[ 2.815505] HPTC: RCBA 0x3404 is 0x00000080n<3>Intel HPET force-enabled at 0xfed00000
[ 2.817499] Console: colour VGA+ 80x50
[ 2.821573] Dentry cache hash table entries: 65536 (order: 6, 262144 bytes)
[ 2.821816] Inode-cache hash table entries: 32768 (order: 5, 131072 bytes)
[ 2.831460] Memory: 512836k/523136k available (3392k kernel code, 9880k reserved, 1444k data, 200k init, 0k highmem)
[ 2.831572] virtual kernel memory layout:
[ 2.831573] fixmap : 0xfffb3000 - 0xfffff000 ( 304 kB)
[ 2.831574] vmalloc : 0xe0800000 - 0xfffb1000 ( 503 MB)
[ 2.831575] lowmem : 0xc0000000 - 0xdfee0000 ( 510 MB)
[ 2.831576] .init : 0xc05bb000 - 0xc05ed000 ( 200 kB)
[ 2.831577] .data : 0xc0450065 - 0xc05b90b8 (1444 kB)
[ 2.831579] .text : 0xc0100000 - 0xc0450065 (3392 kB)
[ 2.832061] Checking if this processor honours the WP bit even in supervisor mode... Ok.
[ 2.832297] hpet_enable
[ 2.832382] hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0
[ 2.832602] hpet0: 3 64-bit timers, 14318180 Hz
[ 2.833675] Using HPET for base-timer
[ 2.915669] Calibrating delay using timer specific routine.. 3994.20 BogoMIPS (lpj=6654729)
[ 2.915836] Mount-cache hash table entries: 512
[ 2.915981] CPU: After generic identify, caps: afe9fbff 00100000 00000000 00000000 00000180 00000000 00000000
[ 2.915990] CPU: L1 I cache: 32K, L1 D cache: 32K
[ 2.916101] CPU: L2 cache: 2048K
[ 2.916171] CPU: After all inits, caps: afe9fbff 00100000 00000000 00002040 00000180 00000000 00000000
[ 2.916176] Intel machine check architecture supported.
[ 2.916248] Intel machine check reporting enabled on CPU#0.
[ 2.916320] Compat vDSO mapped to ffffa000.
[ 2.916396] CPU: Intel(R) Pentium(R) M processor 2.00GHz stepping 08
[ 2.916543] Checking 'hlt' instruction... OK.
[ 2.929131] ACPI: Core revision 20060707
[ 2.945497] ENABLING IO-APIC IRQs
[ 2.945753] ..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1
[ 3.082402] NET: Registered protocol family 16
[ 3.082649] ACPI: ACPI Dock Station Driver
[ 3.082749] ACPI: bus type pci registered
[ 3.082824] PCI: Using MMCONFIG
[ 3.083561] Setting up standard PCI resources
[ 3.093993] ACPI: Interpreter enabled
[ 3.094065] ACPI: Using IOAPIC for interrupt routing
[ 3.094750] ACPI: PCI Interrupt Link [LNKA] (IRQs 3 4 5 6 7 9 10 *11)
[ 3.095684] ACPI: PCI Interrupt Link [LNKB] (IRQs 3 4 5 6 7 9 10 *11)
[ 3.096606] ACPI: PCI Interrupt Link [LNKC] (IRQs 3 4 5 6 7 9 10 *11)
[ 3.097521] ACPI: PCI Interrupt Link [LNKD] (IRQs 3 4 5 6 7 9 10 *11)
[ 3.098438] ACPI: PCI Interrupt Link [LNKE] (IRQs 3 4 5 6 7 9 10 *11)
[ 3.099369] ACPI: PCI Interrupt Link [LNKF] (IRQs 3 4 5 6 7 9 10 *11)
[ 3.100284] ACPI: PCI Interrupt Link [LNKG] (IRQs 3 4 *5 6 7 9 10 11)
[ 3.101201] ACPI: PCI Interrupt Link [LNKH] (IRQs 3 4 5 6 7 9 10 *11)
[ 3.101926] ACPI: PCI Root Bridge [PCI0] (0000:00)
[ 3.102000] PCI: Probing PCI hardware (bus 00)
[ 3.103425] HPTC: RCBA Base is 0xf0008000
[ 3.103498] HPTC: RCBA 0x3404 is 0x80
[ 3.103566] HPTC: HPTC enabled
[ 3.103635] HPTC: HPET located at 0xfed00000
[ 3.103707] PCI quirk: region 1000-107f claimed by ICH6 ACPI/GPIO/TCO
[ 3.103779] PCI quirk: region 1180-11bf claimed by ICH6 GPIO
[ 3.103989] Boot video device is 0000:01:00.0
[ 3.104433] PCI: Transparent bridge - 0000:00:1e.0
[ 3.104583] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
[ 3.109110] ACPI: Power Resource [PUBS] (on)
[ 3.110023] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.AGP_._PRT]
[ 3.110275] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.EXP0._PRT]
[ 3.110438] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.EXP2._PRT]
[ 3.110626] ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.PCI1._PRT]
[ 3.112300] Linux Plug and Play Support v0.97 (c) Adam Belay
[ 3.112376] pnp: PnP ACPI init
[ 3.115896] pnp: PnP ACPI: found 13 devices
[ 3.115984] intel_rng: FWH not detected
[ 3.116140] SCSI subsystem initialized
[ 3.116225] libata version 2.00 loaded.
[ 3.116258] usbcore: registered new interface driver usbfs
[ 3.116349] usbcore: registered new interface driver hub
[ 3.116441] usbcore: registered new device driver usb
[ 3.116549] PCI: Using ACPI for IRQ routing
[ 3.116621] PCI: If a device doesn't work, try "pci=routeirq". If it helps, post a report
[ 3.215491] Bluetooth: Core ver 2.11
[ 3.215587] NET: Registered protocol family 31
[ 3.215657] Bluetooth: HCI device and connection manager initialized
[ 3.215728] Bluetooth: HCI socket layer initialized
[ 3.216289] ieee1394: Initialized config rom entry `ip1394'
[ 3.216345] PCI: Bridge: 0000:00:01.0
[ 3.216417] IO window: 3000-3fff
[ 3.216487] MEM window: b0100000-b01fffff
[ 3.216557] PREFETCH window: c0000000-c7ffffff
[ 3.216626] PCI: Bridge: 0000:00:1c.0
[ 3.216694] IO window: disabled.
[ 3.216766] MEM window: b0200000-b02fffff
[ 3.216835] PREFETCH window: disabled.
[ 3.216906] PCI: Bridge: 0000:00:1c.2
[ 3.216976] IO window: 4000-4fff
[ 3.217047] MEM window: b2000000-b3ffffff
[ 3.217117] PREFETCH window: c8000000-c80fffff
[ 3.217190] PCI: Bus 12, cardbus bridge: 0000:0b:00.0
[ 3.217260] IO window: 00005000-000050ff
[ 3.217331] IO window: 00005400-000054ff
[ 3.217403] PREFETCH window: d0000000-d3ffffff
[ 3.217474] MEM window: b8000000-bbffffff
[ 3.217545] PCI: Bridge: 0000:00:1e.0
[ 3.217615] IO window: 5000-8fff
[ 3.217686] MEM window: b4000000-bfffffff
[ 3.217757] PREFETCH window: d0000000-d7ffffff
[ 3.217834] ACPI: PCI Interrupt 0000:00:01.0[A] -> GSI 16 (level, low) -> IRQ 16
[ 3.217973] PCI: Setting latency timer of device 0000:00:01.0 to 64
[ 3.217986] ACPI: PCI Interrupt 0000:00:1c.0[A] -> GSI 20 (level, low) -> IRQ 17
[ 3.218125] PCI: Setting latency timer of device 0000:00:1c.0 to 64
[ 3.218140] ACPI: PCI Interrupt 0000:00:1c.2[C] -> GSI 22 (level, low) -> IRQ 18
[ 3.218278] PCI: Setting latency timer of device 0000:00:1c.2 to 64
[ 3.218287] PCI: Setting latency timer of device 0000:00:1e.0 to 64
[ 3.218298] ACPI: PCI Interrupt 0000:0b:00.0[A] -> GSI 16 (level, low) -> IRQ 16
[ 3.218448] NET: Registered protocol family 2
[ 3.248830] IP route cache hash table entries: 4096 (order: 2, 16384 bytes)
[ 3.248952] TCP established hash table entries: 16384 (order: 4, 65536 bytes)
[ 3.249071] TCP bind hash table entries: 8192 (order: 3, 32768 bytes)
[ 3.249167] TCP: Hash tables configured (established 16384 bind 8192)
[ 3.249238] TCP reno registered
[ 3.258925] Simple Boot Flag at 0x35 set to 0x1
[ 3.259018] Machine check exception polling timer started.
[ 3.259287] Installing knfsd (copyright (C) 1996 [email protected]).
[ 3.259478] NTFS driver 2.1.27 [Flags: R/W].
[ 3.259606] io scheduler noop registered
[ 3.259714] io scheduler anticipatory registered (default)
[ 3.259858] io scheduler deadline registered
[ 3.259969] io scheduler cfq registered
[ 3.261652] PCI: Setting latency timer of device 0000:00:01.0 to 64
[ 3.261667] assign_interrupt_mode Found MSI capability
[ 3.261754] Allocate Port Service[0000:00:01.0:pcie00]
[ 3.261774] Allocate Port Service[0000:00:01.0:pcie03]
[ 3.261816] PCI: Setting latency timer of device 0000:00:1c.0 to 64
[ 3.261852] assign_interrupt_mode Found MSI capability
[ 3.261949] Allocate Port Service[0000:00:1c.0:pcie00]
[ 3.261967] Allocate Port Service[0000:00:1c.0:pcie02]
[ 3.261987] Allocate Port Service[0000:00:1c.0:pcie03]
[ 3.262059] PCI: Setting latency timer of device 0000:00:1c.2 to 64
[ 3.262095] assign_interrupt_mode Found MSI capability
[ 3.262197] Allocate Port Service[0000:00:1c.2:pcie00]
[ 3.262215] Allocate Port Service[0000:00:1c.2:pcie02]
[ 3.262233] Allocate Port Service[0000:00:1c.2:pcie03]
[ 3.262314] pci_hotplug: PCI Hot Plug PCI Core version: 0.5
[ 3.262387] ibmphpd: IBM Hot Plug PCI Controller Driver version: 0.6
[ 3.262462] acpiphp: ACPI Hot Plug PCI Controller Driver version: 0.5
[ 3.265962] decode_hpp: Could not get hotplug parameters. Use defaults
[ 3.266059] acpiphp: Slot [1] registered
[ 3.267122] acpiphp_ibm: ibm_find_acpi_device: Failed to get device information<3>acpiphp_ibm: ibm_find_acpi_device: Failed to get device information<3>acpiphp_ibm: ibm_find_acpi_device: Failed to get device information<3>acpiphp_ibm: ibm_acpiphp_init: acpi_walk_namespace failed
[ 3.269969] ACPI: AC Adapter [AC] (on-line)
[ 3.278158] ACPI: Battery Slot [BAT0] (battery present)
[ 3.278284] input: Power Button (FF) as /class/input/input0
[ 3.278358] ACPI: Power Button (FF) [PWRF]
[ 3.278459] input: Lid Switch as /class/input/input1
[ 3.278532] ACPI: Lid Switch [LID]
[ 3.278632] input: Sleep Button (CM) as /class/input/input2
[ 3.278706] ACPI: Sleep Button (CM) [SLPB]
[ 3.278995] ACPI: Video Device [VID] (multi-head: yes rom: no post: no)
[ 3.280635] ACPI: CPU0 (power states: C1[C1] C2[C2] C3[C3])
[ 3.280857] ACPI: Processor [CPU] (supports 8 throttling states)
[ 3.281966] ACPI: Thermal Zone [THM0] (63 C)
[ 3.283336] Real Time Clock Driver v1.12ac
[ 3.283432] Linux agpgart interface v0.101 (c) Dave Jones
[ 3.283522] agpgart: Detected an Intel 915GM Chipset.
[ 3.300594] agpgart: AGP aperture is 256M @ 0x0
[ 3.300695] [drm] Initialized drm 1.1.0 20060810
[ 3.300877] tpm_nsc tpm_nscl0: NSC TPM revision 2
[ 3.301002] Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing disabled
[ 3.301249] serial8250: ttyS1 at I/O 0x2f8 (irq = 3) is a NS16550A
[ 3.302001] pnp: Device 00:09 activated.
[ 3.302186] 00:09: ttyS0 at I/O 0x3f8 (irq = 4) is a NS16550A
[ 3.302352] ACPI: PCI Interrupt 0000:00:1e.3[B] -> GSI 23 (level, low) -> IRQ 19
[ 3.302495] ACPI: PCI interrupt for device 0000:00:1e.3 disabled
[ 3.302599] parport: PnPBIOS parport detected.
[ 3.302704] parport0: PC-style at 0x3bc (0x7bc), irq 7 [PCSPP(,...)]
[ 3.303238] loop: loaded (max 8 devices)
[ 3.303352] nbd: registered device at major 43
[ 3.303761] Ethernet Channel Bonding Driver: v3.1.1 (September 26, 2006)
[ 3.303837] bonding: Warning: either miimon or arp_interval and arp_ip_target module parameters must be specified, otherwise bonding will not detect link failures! see bonding.txt for details.
[ 3.304025] pcnet32.c:v1.33 27.Jun.2006 [email protected]
[ 3.304115] e100: Intel(R) PRO/100 Network Driver, 3.5.17-k2-NAPI
[ 3.304185] e100: Copyright(c) 1999-2006 Intel Corporation
[ 3.304292] tg3.c:v3.71 (December 15, 2006)
[ 3.304377] ACPI: PCI Interrupt 0000:02:00.0[A] -> GSI 16 (level, low) -> IRQ 16
[ 3.304520] PCI: Setting latency timer of device 0000:02:00.0 to 64
[ 0.399999] eth0: Tigon3 [partno(BCM95751M) rev 4101 PHY(5750)] (PCI Express) 10/100/1000Base-T Ethernet 00:0a:e4:c1:27:01
[ 0.399999] eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] Split[0] WireSpeed[1] TSOcap[1]
[ 0.399999] eth0: dma_rwctrl[76180000] dma_mask[64-bit]
[ 0.399999] PPP generic driver version 2.4.2
[ 0.399999] PPP Deflate Compression module registered
[ 0.399999] PPP BSD Compression module registered
[ 0.403333] NET: Registered protocol family 24
[ 0.403333] tun: Universal TUN/TAP device driver, 1.6
[ 0.403333] tun: (C) 1999-2004 Max Krasnyansky <[email protected]>
[ 0.403333] netconsole: not configured, aborting
[ 0.403333] ahci 0000:00:1f.2: version 2.0
[ 0.403333] ahci: probe of 0000:00:1f.2 failed with error -12
[ 0.403333] ata_piix 0000:00:1f.2: version 2.00ac7
[ 0.403333] ata_piix 0000:00:1f.2: MAP [ P0 P2 IDE IDE ]
[ 0.403333] PCI: Setting latency timer of device 0000:00:1f.2 to 64
[ 0.403333] ata1: SATA max UDMA/133 cmd 0x1F0 ctl 0x3F6 bmdma 0x18C0 irq 14
[ 0.403333] ata2: PATA max UDMA/100 cmd 0x170 ctl 0x376 bmdma 0x18C8 irq 15
[ 0.403333] scsi0 : ata_piix
[ 0.563333] ata1.00: ATA-6, max UDMA/100, 195371568 sectors: LBA
[ 0.563333] ata1.00: ata1: dev 0 multi count 16
[ 0.563333] ata1.00: applying bridge limits
[ 0.573333] ata1.00: configured for UDMA/100
[ 0.573333] scsi1 : ata_piix
[ 0.886666] ata2.00: ATAPI, max UDMA/33
[ 1.046666] ata2.00: configured for UDMA/33
[ 1.046666] scsi 0:0:0:0: Direct-Access ATA FUJITSU MHV2100A 0084 PQ: 0 ANSI: 5
[ 1.046666] SCSI device sda: 195371568 512-byte hdwr sectors (100030 MB)
[ 1.046666] sda: Write Protect is off
[ 1.046666] sda: Mode Sense: 00 3a 00 00
[ 1.046666] SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 1.046666] SCSI device sda: 195371568 512-byte hdwr sectors (100030 MB)
[ 1.046666] sda: Write Protect is off
[ 1.046666] sda: Mode Sense: 00 3a 00 00
[ 1.046666] SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 1.046666] sda: sda1 sda2 sda3 < sda5 sda6 sda7 > sda4
[ 1.113333] sd 0:0:0:0: Attached scsi disk sda
[ 1.113333] sd 0:0:0:0: Attached scsi generic sg0 type 0
[ 1.116666] scsi 1:0:0:0: CD-ROM MATSHITA DVD-RAM UJ-830S 1.02 PQ: 0 ANSI: 5
[ 1.123333] sr0: scsi3-mmc drive: 24x/24x writer dvd-ram cd/rw xa/form2 cdda tray
[ 1.123333] Uniform CD-ROM driver Revision: 3.20
[ 1.123333] sr 1:0:0:0: Attached scsi CD-ROM sr0
[ 1.123333] sr 1:0:0:0: Attached scsi generic sg1 type 5
[ 1.123333] ieee1394: raw1394: /dev/raw1394 device initialized
[ 1.123333] Yenta: CardBus bridge found at 0000:0b:00.0 [1014:0532]
[ 1.249999] Yenta: ISA IRQ mask 0x0438, PCI irq 16
[ 1.249999] Socket status: 30000006
[ 1.249999] pcmcia: parent PCI bridge I/O window: 0x5000 - 0x8fff
[ 1.249999] cs: IO port probe 0x5000-0x8fff: clean.
[ 1.249999] pcmcia: parent PCI bridge Memory window: 0xb4000000 - 0xbfffffff
[ 1.249999] pcmcia: parent PCI bridge Memory window: 0xd0000000 - 0xd7ffffff
[ 1.503333] usbcore: registered new interface driver usblp
[ 1.503333] drivers/usb/class/usblp.c: v0.13: USB Printer Device Class driver
[ 1.503333] usbcore: registered new interface driver libusual
[ 1.503333] usbcore: registered new interface driver usbhid
[ 1.503333] drivers/usb/input/hid-core.c: v2.6:USB HID core driver
[ 1.503333] usbcore: registered new interface driver asix
[ 1.503333] usbcore: registered new interface driver usbserial
[ 1.503333] drivers/usb/serial/usb-serial.c: USB Serial support registered for generic
[ 1.503333] usbcore: registered new interface driver usbserial_generic
[ 1.503333] drivers/usb/serial/usb-serial.c: USB Serial Driver core
[ 1.503333] drivers/usb/serial/usb-serial.c: USB Serial support registered for hp4X
[ 1.503333] usbcore: registered new interface driver hp4X
[ 1.503333] drivers/usb/serial/hp4x.c: HP4x (48/49) Generic Serial driver v1.00
[ 1.503333] drivers/usb/serial/usb-serial.c: USB Serial support registered for pl2303
[ 1.503333] usbcore: registered new interface driver pl2303
[ 1.503333] drivers/usb/serial/pl2303.c: Prolific PL2303 USB to serial adaptor driver
[ 1.503333] PNP: PS/2 Controller [PNP0303:KBD,PNP0f13:MOU] at 0x60,0x64 irq 1,12
[ 1.509999] serio: i8042 KBD port at 0x60,0x64 irq 1
[ 1.509999] serio: i8042 AUX port at 0x60,0x64 irq 12
[ 1.509999] mice: PS/2 mouse device common for all mice
[ 1.513333] input: AT Translated Set 2 keyboard as /class/input/input3
[ 1.519999] i2c /dev entries driver
[ 1.519999] ACPI: PCI Interrupt 0000:00:1f.3[A] -> GSI 23 (level, low) -> IRQ 19
[ 1.519999] device-mapper: ioctl: 4.11.0-ioctl (2006-10-12) initialised: [email protected]
[ 1.519999] EDAC MC: Ver: 2.0.1 Dec 25 2006
[ 1.549999] Advanced Linux Sound Architecture Driver Version 1.0.14rc1 (Wed Dec 20 08:11:48 2006 UTC).
[ 1.549999] ACPI: PCI Interrupt 0000:00:1e.2[A] -> GSI 22 (level, low) -> IRQ 18
[ 1.549999] PCI: Setting latency timer of device 0000:00:1e.2 to 64
[ 1.816666] ACPI: EC: evaluating _Q75
[ 2.133333] Synaptics Touchpad, model: 1, fw: 5.9, id: 0x2c6ab1, caps: 0x884793/0x0
[ 2.133333] serio: Synaptics pass-through port at isa0060/serio1/input0
[ 2.176666] input: SynPS/2 Synaptics TouchPad as /class/input/input4
[ 2.473333] intel8x0_measure_ac97_clock: measured 53330 usecs
[ 2.473333] intel8x0: clocking to 48000
[ 2.473333] ALSA device list:
[ 2.473333] #0: Intel ICH6 with AD1981B at 0xb0000800, irq 18
[ 2.473333] netem: version 1.2
[ 2.473333] u32 classifier
[ 2.473333] Netfilter messages via NETLINK v0.30.
[ 2.473333] ip_tables: (C) 2000-2006 Netfilter Core Team
[ 2.553333] TCP bic registered
[ 2.553333] TCP cubic registered
[ 2.553333] TCP westwood registered
[ 2.553333] TCP highspeed registered
[ 2.553333] TCP vegas registered
[ 2.553333] NET: Registered protocol family 1
[ 2.553333] NET: Registered protocol family 10
[ 2.553333] IPv6 over IPv4 tunneling driver
[ 2.553333] NET: Registered protocol family 17
[ 2.633333] Bridge firewalling registered
[ 2.633333] Bluetooth: L2CAP ver 2.8
[ 2.633333] Bluetooth: L2CAP socket layer initialized
[ 2.633333] Bluetooth: SCO (Voice Link) ver 0.5
[ 2.633333] Bluetooth: SCO socket layer initialized
[ 2.633333] Bluetooth: RFCOMM socket layer initialized
[ 2.633333] Bluetooth: RFCOMM TTY layer initialized
[ 2.633333] Bluetooth: RFCOMM ver 1.8
[ 2.633333] Bluetooth: BNEP (Ethernet Emulation) ver 1.2
[ 2.633333] Bluetooth: BNEP filters: protocol multicast
[ 2.633333] Bluetooth: HIDP (Human Interface Emulation) ver 1.1
[ 2.633333] 802.1Q VLAN Support v1.8 Ben Greear <[email protected]>
[ 2.633333] All bugs added by David S. Miller <[email protected]>
[ 2.633333] ieee80211: 802.11 data/management/control stack, git-1.1.13
[ 2.633333] ieee80211: Copyright (C) 2004-2005 Intel Corporation <[email protected]>
[ 2.633333] ieee80211_crypt: registered algorithm 'NULL'
[ 2.633333] ieee80211_crypt: registered algorithm 'WEP'
[ 2.633333] ieee80211_crypt: registered algorithm 'CCMP'
[ 2.633333] ieee80211_crypt: registered algorithm 'TKIP'
[ 2.633333] speedstep-centrino with X86_SPEEDSTEP_CENTRINO_ACPIconfig is deprecated.
[ 2.633333] Use X86_ACPI_CPUFREQ (acpi-cpufreq instead.
[ 2.633333] Using IPI Shortcut mode
[ 2.633333] ACPI: (supports S0 S3 S4 S5)
[ 2.636666] Time: tsc clocksource has been installed.
[ 2.643333] Time: hpet clocksource has been installed.
[ 7.439999] IBM TrackPoint firmware: 0x0e, buttons: 3/3
[ 7.696665] input: TPPS/2 IBM TrackPoint as /class/input/input5
[ 7.703332] ACPI: EC: evaluating _Q75
[ 7.879999] kjournald starting. Commit interval 5 seconds
[ 7.879999] EXT3-fs: mounted filesystem with ordered data mode.
[ 7.879999] VFS: Mounted root (ext3 filesystem) readonly.
[ 7.879999] Freeing unused kernel memory: 200k freed
[ 11.453332] ACPI: PCI Interrupt 0000:0b:00.1[B] -> GSI 17 (level, low) -> IRQ 20
[ 11.506665] ohci1394: fw-host0: OHCI-1394 1.0 (PCI): IRQ=[20] MMIO=[b1000000-b10007ff] Max Packet=[2048] IR/IT contexts=[4/4]
[ 11.516665] eth1394: eth0: IEEE-1394 IPv4 over 1394 Ethernet (fw-host0)
[ 11.679998] cs: IO port probe 0x100-0x4ff: excluding 0x4d0-0x4d7
[ 11.683332] cs: IO port probe 0x800-0x8ff: clean.
[ 11.683332] cs: IO port probe 0xc00-0xcff: clean.
[ 11.683332] cs: IO port probe 0xa00-0xaff: clean.
[ 12.166665] Adding 1958000k swap on /dev/sda5. Priority:10 extents:1 across:1958000k
[ 12.319998] EXT3 FS on sda6, internal journal
[ 12.693332] ibm_acpi: ThinkPad EC firmware 76HT16WW-1.06
[ 12.693332] ibm_acpi: IBM ThinkPad ACPI Extras v0.13
[ 12.693332] ibm_acpi: http://ibm-acpi.sf.net/
[ 12.699998] ibm_acpi: fan_init: initial fan status is unknown, assuming it is in auto mode
[ 12.783332] ieee1394: Host added: ID:BUS[0-00:1023] GUID[000ae405314003e1]
[ 13.063332] kjournald starting. Commit interval 5 seconds
[ 13.063332] EXT3-fs: mounted filesystem with ordered data mode.
[ 13.066665] kjournald starting. Commit interval 5 seconds
[ 13.066665] EXT3 FS on sda7, internal journal
[ 13.066665] EXT3-fs: mounted filesystem with ordered data mode.
[ 13.493331] pcmcia: Detected deprecated PCMCIA ioctl usage from process: discover.
[ 13.493331] pcmcia: This interface will soon be removed from the kernel; please expect breakage unless you upgrade to new tools.
[ 13.493331] pcmcia: see http://www.kernel.org/pub/linux/utils/kernel/pcmcia/pcmcia.html for details.
[ 15.979998] ieee1394: Node removed: ID:BUS[0-00:1023] GUID[000ae405314003e1]
--
Tobias PGP: http://9ac7e0bc.uguu.de
On Sun, Dec 24, 2006 at 12:24:46PM -0800, Linus Torvalds wrote:
>
>
> On Sun, 24 Dec 2006, Andrei Popa wrote:
> >
> > Hash check on download completion found bad chunks, consider using
> > "safe_sync".
>
> Dang. Did you get any warning messages from the kernel?
>
> Linus
BTW, rmap.c patch is broken - needs at least
Signed-off-by: Al Viro <[email protected]>
---
diff --git a/mm/rmap.c b/mm/rmap.c
index 57306fa..669acb2 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -452,7 +452,7 @@ static int page_mkclean_one(struct page
 		entry = ptep_clear_flush(vma, address, pte);
 		entry = pte_wrprotect(entry);
 		entry = pte_mkclean(entry);
-		set_pte_at(vma, address, pte, entry);
+		set_pte_at(mm, address, pte, entry);
 		lazy_mmu_prot_update(entry);
 		ret = 1;
 	}
On Tue, Dec 26, 2006 at 05:51:55PM +0000, Al Viro wrote:
> On Sun, Dec 24, 2006 at 12:24:46PM -0800, Linus Torvalds wrote:
> >
> >
> > On Sun, 24 Dec 2006, Andrei Popa wrote:
> > >
> > > Hash check on download completion found bad chunks, consider using
> > > "safe_sync".
> >
> > Dang. Did you get any warning messages from the kernel?
> >
> > Linus
>
> BTW, rmap.c patch is broken - needs at least
... but that doesn't affect most of the architectures - only sparc64 and
some of powerpc. So it's definitely not enough.
On Tue, 26 Dec 2006, Nick Piggin wrote:
> Linus Torvalds wrote:
> >
> > Ok, so how about this diff.
> >
> > I'm actually feeling good about this one. It really looks like
> > "do_no_page()" was simply buggy, and that this explains everything.
>
> Still trying to catch up here, so I'm not going to reply to any old
> stuff and just start at the tip of the thread... Other than to say
> that I really like cancel_page_dirty ;)
Yeah, I think that part is a bit clearer about what's going on now.
> I think your patch is quite right so that's a good catch.
Actually, since people told me it didn't matter, I went back and looked at
_why_ - the thing is, "vma->vm_page_prot" should always be read-only
anyway, except for mappings that don't do dirty accounting at all, so I
think my patch only found cases that are unimportant (ie pages that get
faulted in on filesystems like ramfs that don't do any dirty page
accounting because they're all dirty anyway).
> But I'm not too surprised that it does not help the problem, because I
> don't think we have started shedding any old pte_dirty tests at
> unmap/reclaim-time, have we? So the dirty bit isn't going to get lost,
> as such.
True. We should no longer _need_ those dirty bit reclaims at
unmap/reclaim, but we still do them, so you're right, even if we were
buggy in this area, it should only really matter for the dirty page
counting, not for any lost data.
> I was hoping that you've almost narrowed it down to the filesystem
> writeback code, with the last few mails?
I think so, yes.
However, I've checked, and "rtorrent" really does seem to be fairly
well-behaved wrt any filesystem activity. It does
 - no threading. It's 100% single-threaded, and doesn't even appear to use
   signals.
 - exactly _one_ "ftruncate()", and it does it at the beginning, for the
   full final size.
   IOW, it's not anything subtle with truncate and dirty page cancel.
 - It never uses mprotect on the shared mappings, but it _does_ do:
      "mincore()" - but the return values don't much matter (it's used
         as a heuristic on which parts to hash, apparently)
         I double- and triple-checked this one, because I
         did make changes to "mincore()", but those didn't go
         into the affected kernels anyway (ie they are not in
         plain 2.6.19, nor in 2.6.18.3 either)
      "madvise(MADV_WILLNEED)"
      "msync(MS_ASYNC)" (or MS_SYNC if you use a command line flag)
      "munmap()" of course
 - it never seems to mix mmap() and write() - it does _only_ mmap.
 - it seems to mmap/munmap the shared files in nice 64-page chunks, all
   64-page aligned in the file (ie it does NOT create one big mapping, it
   has some kind of LRU of these 64-page chunks). The only exception being
   the last chunk, which it maps byte-accurate to the size.
 - I haven't checked whether it only ever has the same chunk mapped once
   at a time.
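For anyone who wants to exercise the same access pattern without running rtorrent, here is a rough userspace sketch of what the list above describes: one ftruncate() to the final size, then 64-page shared mappings that are dirtied, msync(MS_ASYNC)'ed and unmapped, with the last chunk mapped byte-accurate. The names write_chunked(), verify_file() and pattern_at() are made up for this sketch, the fill pattern is arbitrary, and reading the file back with read() is my own choice for checking the result. There is no claim that this triggers the corruption on its own; memory pressure and concurrent writeback are probably needed as well.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK_PAGES 64	/* chunk size in pages, as observed above */

/* Hypothetical fill pattern: the byte expected at file offset pos. */
static unsigned char pattern_at(size_t pos)
{
	return (unsigned char)(pos % 251);
}

/*
 * Fill the file at path through shared mappings only.
 * Returns 0 on success, -1 on failure.
 */
int write_chunked(const char *path, size_t total)
{
	long page = sysconf(_SC_PAGESIZE);
	size_t chunk, off, len, i;
	unsigned char *p;
	int fd;

	if (page <= 0)
		return -1;
	chunk = (size_t)page * CHUNK_PAGES;

	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;

	/* A single ftruncate() up front, to the full final size. */
	if (ftruncate(fd, (off_t)total) != 0) {
		close(fd);
		return -1;
	}

	for (off = 0; off < total; off += chunk) {
		/* The last chunk is mapped byte-accurate to the file size. */
		len = total - off < chunk ? total - off : chunk;
		p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
			 fd, (off_t)off);
		if (p == MAP_FAILED) {
			close(fd);
			return -1;
		}
		for (i = 0; i < len; i++)
			p[i] = pattern_at(off + i);
		msync(p, len, MS_ASYNC);
		munmap(p, len);
	}
	return close(fd);
}

/* Read the file back and compare against the pattern. 0 if intact. */
int verify_file(const char *path, size_t total)
{
	unsigned char buf[4096];
	size_t pos = 0;
	ssize_t n, i;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	while ((n = read(fd, buf, sizeof(buf))) > 0) {
		for (i = 0; i < n; i++, pos++) {
			if (pos >= total || buf[i] != pattern_at(pos)) {
				close(fd);
				return -1;
			}
		}
	}
	close(fd);
	return (n == 0 && pos == total) ? 0 : -1;
}
```

Calling write_chunked() on a scratch file and then verify_file() after a sync (or after dropping the page cache) would show whether any chunk came back with stale or zeroed data.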
Anyway, the _one_ half-way interesting thing is the fact that it doesn't
allocate any backing store at all for the file, and as such the page
writeback needs to create all the underlying buffers on the filesystem. I
really don't see why that would be a problem either, but I could imagine
that if we have some writeback bug where we can end up writing back the
_same_ page concurrently, we'd actually end up racing in the kernel, and
allocating two different backing stores, and then maybe the other one
would effectively "get lost" (and the earlier writeback would win the
race, explaining why we'd end up with zeroes at the end of a block).
Or something.
However, all the codepaths _seem_ to test for PG_writeback, and not even
try to start another writeback while the first one is still active.
What would also actually be interesting is whether somebody can reproduce
this on Reiserfs, for example. I _think_ all the reports I've seen are on
ext2 or ext3, and if this is somehow writeback-related, it could be some
bug that is just shared between the two by virtue of them still having a
lot of stuff in common.
Linus
From: Tobias Diedrich <[email protected]>
Date: Tue, 26 Dec 2006 17:17:00 +0100
> Linus Torvalds wrote:
> > I don't think it's a page table issue any more, it just doesn't look
> > likely with the ARM UP corruption. It's also not apparently even on a
> > cacheline boundary, so it probably is really a dirty bit that got cleared
> > wrong due to some race with IO.
>
> So, until now it's only been reported for SMP on i386?
> I'm seeing the issue on my Pentium-M Notebook (Thinkpad R52) over
> here, UP kernel, no preempt.
I've seen it on sparc64, UP kernel, no preempt.
On Tue, 26 Dec 2006, David Miller wrote:
>
> I've seen it on sparc64, UP kernel, no preempt.
Btw, having tried to debug the writeback code, there's one very special
case that just makes me go "hmm".
If we have a buffer that is "busy" when we try to write back a page, we
have this magic "wbc->sync_mode == WB_SYNC_NONE && wbc->nonblocking" mode,
where we won't wait for it, but instead we'll redirty the page and redo
the whole thing.
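To spell out that special case, here is a small userspace model of the decision, not kernel code. The struct and enum names are stand-ins I made up (SYNC_NONE and SYNC_ALL play the role of WB_SYNC_NONE and WB_SYNC_ALL), and buffer_busy stands for "the buffer lock is already held by somebody else".

```c
#include <stdbool.h>

/* Stand-ins for the kernel's WB_SYNC_NONE / WB_SYNC_ALL. */
enum sync_mode { SYNC_NONE, SYNC_ALL };

/* Only the two writeback_control fields that matter here. */
struct wb_control {
	enum sync_mode sync_mode;
	bool nonblocking;
};

enum bh_action {
	BH_LOCK_AND_WRITE,	/* take the buffer lock (waiting if needed) */
	BH_SKIP_REDIRTY,	/* don't wait: redirty the page and move on */
};

/* The "magic" mode: background writeback that must not sleep. */
static bool fully_nonblocking(const struct wb_control *wbc)
{
	return wbc->sync_mode == SYNC_NONE && wbc->nonblocking;
}

/* What happens to one mapped buffer during page writeback. */
enum bh_action writeback_buffer_action(const struct wb_control *wbc,
				       bool buffer_busy)
{
	if (fully_nonblocking(wbc) && buffer_busy)
		return BH_SKIP_REDIRTY;
	return BH_LOCK_AND_WRITE;
}
```

In this model the page is redirtied only when both conditions hold and the buffer is busy; every other combination ends up waiting for the buffer lock.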
Looking at the code, that should all work, but at the same time, it
triggers some of my debug messages about having a dirty page during
writeback, and one way to trigger that debug message is to try to run
rtorrent on the machine..
I dunno. With the writeback being suspicious, and the normal
"block_write_full_page()" path being implicated in at least ext2, I just
wonder. This is one of those "let's see if behaviour changes" patches,
that I'm just throwing out there..
Linus
---
diff --git a/fs/buffer.c b/fs/buffer.c
index 263f88e..4652ef1 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1653,19 +1653,7 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
 	do {
 		if (!buffer_mapped(bh))
 			continue;
-		/*
-		 * If it's a fully non-blocking write attempt and we cannot
-		 * lock the buffer then redirty the page. Note that this can
-		 * potentially cause a busy-wait loop from pdflush and kswapd
-		 * activity, but those code paths have their own higher-level
-		 * throttling.
-		 */
-		if (wbc->sync_mode != WB_SYNC_NONE || !wbc->nonblocking) {
-			lock_buffer(bh);
-		} else if (test_set_buffer_locked(bh)) {
-			redirty_page_for_writepage(wbc, page);
-			continue;
-		}
+		lock_buffer(bh);
 		if (test_clear_buffer_dirty(bh)) {
 			mark_buffer_async_write(bh);
 		} else {
I have corrupted files...
> ---
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 263f88e..4652ef1 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -1653,19 +1653,7 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
>  	do {
>  		if (!buffer_mapped(bh))
>  			continue;
> -		/*
> -		 * If it's a fully non-blocking write attempt and we cannot
> -		 * lock the buffer then redirty the page. Note that this can
> -		 * potentially cause a busy-wait loop from pdflush and kswapd
> -		 * activity, but those code paths have their own higher-level
> -		 * throttling.
> -		 */
> -		if (wbc->sync_mode != WB_SYNC_NONE || !wbc->nonblocking) {
> -			lock_buffer(bh);
> -		} else if (test_set_buffer_locked(bh)) {
> -			redirty_page_for_writepage(wbc, page);
> -			continue;
> -		}
> +		lock_buffer(bh);
>  		if (test_clear_buffer_dirty(bh)) {
>  			mark_buffer_async_write(bh);
>  		} else {
On 12/27/06, Linus Torvalds <[email protected]> wrote:
<snip>
> - It never uses mprotect on the shared mappings, but it _does_ do:
> "mincore()" - but the return values don't much matter (it's used
> as a heuristic on which parts to hash, apparently)
>
> I double- and triple-checked this one, because I
> did make changes to "mincore()", but those didn't go
> into the affected kernels anyway (ie they are not in
> plain 2.6.19, nor in 2.6.18.3 either)
Correct, mincore is only used to check if it should delay the hash checking.
> "madvise(MADV_WILLNEED)"
> "msync(MS_ASYNC)" (or MS_SYNC if you use a command line flag)
> "munmap()" of course
>
> - it never seems to mix mmap() and write() - it does _only_ mmap.
>
> - it seems to mmap/munmap the shared files in nice 64-page chunks, all
> 64-page aligned in the file (ie it does NOT create one big mapping, it
> has some kind of LRU of these 64-page chunks). The only exception being
> the last chunk, which it maps byte-accurate to the size.
The length of the chunks is only page aligned on single file torrents,
not so on multi-file torrents. I've attached a patch for rtorrent that
will extend the length to the page boundary.
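The attachment itself is not reproduced in this text, so the following is only a sketch of the general idea of extending a mapping length to the next page boundary, not the actual rtorrent change. The helper names round_up_len() and round_up_to_page() are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Round len up to the next multiple of align.
 * align must be a non-zero power of two (true for page sizes).
 */
size_t round_up_len(size_t len, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (len + align - 1) & ~(align - 1);
}

/* Same, using the running system's page size. Returns 0 on error. */
size_t round_up_to_page(size_t len)
{
	long page = sysconf(_SC_PAGESIZE);

	if (page <= 0)
		return 0;
	return round_up_len(len, (size_t)page);
}
```

With 4096-byte pages a 5000-byte chunk would be mapped as 8192 bytes. The extra bytes still lie inside the last page the mapping already touches, so this changes the length passed to mmap()/munmap() but not which pages are mapped.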
> - I haven't checked whether it only ever has the same chunk mapped once
> at a time.
This should be the case, but two mapped chunks may share a page,
sometimes with different r/w permissions.
Jari Sundell
On Tue, Dec 26, 2006 at 11:26:50AM -0800, Linus Torvalds wrote:
> What would also actually be interesting is whether somebody can reproduce
> this on Reiserfs, for example. I _think_ all the reports I've seen are on
> ext2 or ext3, and if this is somehow writeback-related, it could be some
> bug that is just shared between the two by virtue of them still having a
> lot of stuff in common.
>
> Linus
I do get this error on reiserfs ( old one, didn't try on reiser4 ).
Stock 2.6.19 plus reiser4 patch. Previously reported by me only in the
debian bts.
flo attenberger
---
Linux master 2.6.19 #1 PREEMPT Thu Dec 21 10:55:34 CET 2006 x86_64
GNU/Linux
#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.19
# Thu Dec 21 10:45:05 2006
#
CONFIG_X86_64=y
CONFIG_64BIT=y
CONFIG_X86=y
CONFIG_ZONE_DMA32=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_MMU=y
CONFIG_RWSEM_GENERIC_SPINLOCK=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_X86_CMPXCHG=y
CONFIG_EARLY_PRINTK=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_ARCH_POPULATES_NODE_MAP=y
CONFIG_DMI=y
CONFIG_AUDIT_ARCH=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
#
# Code maturity level options
#
CONFIG_EXPERIMENTAL=y
CONFIG_BROKEN_ON_SMP=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32
#
# General setup
#
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
# CONFIG_IPC_NS is not set
CONFIG_POSIX_MQUEUE=y
CONFIG_BSD_PROCESS_ACCT=y
# CONFIG_BSD_PROCESS_ACCT_V3 is not set
# CONFIG_TASKSTATS is not set
# CONFIG_UTS_NS is not set
# CONFIG_AUDIT is not set
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
# CONFIG_RELAY is not set
CONFIG_INITRAMFS_SOURCE=""
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_SYSCTL=y
# CONFIG_EMBEDDED is not set
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
# CONFIG_KALLSYMS_EXTRA_PASS is not set
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_EPOLL=y
CONFIG_SHMEM=y
CONFIG_SLAB=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_RT_MUTEXES=y
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0
# CONFIG_SLOB is not set
#
# Loadable module support
#
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
CONFIG_MODULE_FORCE_UNLOAD=y
CONFIG_MODVERSIONS=y
# CONFIG_MODULE_SRCVERSION_ALL is not set
CONFIG_KMOD=y
#
# Block layer
#
CONFIG_BLOCK=y
# CONFIG_LBD is not set
# CONFIG_BLK_DEV_IO_TRACE is not set
# CONFIG_LSF is not set
#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=m
CONFIG_IOSCHED_DEADLINE=m
CONFIG_IOSCHED_CFQ=y
# CONFIG_DEFAULT_AS is not set
# CONFIG_DEFAULT_DEADLINE is not set
CONFIG_DEFAULT_CFQ=y
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="cfq"
#
# Processor type and features
#
CONFIG_X86_PC=y
# CONFIG_X86_VSMP is not set
CONFIG_MK8=y
# CONFIG_MPSC is not set
# CONFIG_GENERIC_CPU is not set
CONFIG_X86_L1_CACHE_BYTES=64
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_X86_INTERNODE_CACHE_BYTES=64
CONFIG_X86_TSC=y
CONFIG_X86_GOOD_APIC=y
CONFIG_MICROCODE=m
CONFIG_MICROCODE_OLD_INTERFACE=y
CONFIG_X86_MSR=m
CONFIG_X86_CPUID=m
CONFIG_X86_IO_APIC=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_MTRR=y
# CONFIG_SMP is not set
# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y
CONFIG_PREEMPT_BKL=y
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_FLATMEM_ENABLE=y
CONFIG_SELECT_MEMORY_MODEL=y
CONFIG_FLATMEM_MANUAL=y
# CONFIG_DISCONTIGMEM_MANUAL is not set
# CONFIG_SPARSEMEM_MANUAL is not set
CONFIG_FLATMEM=y
CONFIG_FLAT_NODE_MEM_MAP=y
# CONFIG_SPARSEMEM_STATIC is not set
CONFIG_SPLIT_PTLOCK_CPUS=4
CONFIG_RESOURCES_64BIT=y
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
CONFIG_HPET_TIMER=y
CONFIG_IOMMU=y
# CONFIG_CALGARY_IOMMU is not set
CONFIG_SWIOTLB=y
CONFIG_X86_MCE=y
# CONFIG_X86_MCE_INTEL is not set
CONFIG_X86_MCE_AMD=y
CONFIG_KEXEC=y
# CONFIG_CRASH_DUMP is not set
CONFIG_PHYSICAL_START=0x200000
CONFIG_SECCOMP=y
# CONFIG_CC_STACKPROTECTOR is not set
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
CONFIG_REORDER=y
CONFIG_K8_NB=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_ISA_DMA_API=y
#
# Power management options
#
CONFIG_PM=y
CONFIG_PM_LEGACY=y
# CONFIG_PM_DEBUG is not set
CONFIG_PM_SYSFS_DEPRECATED=y
# CONFIG_SOFTWARE_SUSPEND is not set
#
# ACPI (Advanced Configuration and Power Interface) Support
#
CONFIG_ACPI=y
CONFIG_ACPI_SLEEP=y
CONFIG_ACPI_SLEEP_PROC_FS=y
# CONFIG_ACPI_SLEEP_PROC_SLEEP is not set
CONFIG_ACPI_AC=m
# CONFIG_ACPI_BATTERY is not set
CONFIG_ACPI_BUTTON=m
CONFIG_ACPI_VIDEO=m
CONFIG_ACPI_HOTKEY=m
CONFIG_ACPI_FAN=m
# CONFIG_ACPI_DOCK is not set
CONFIG_ACPI_PROCESSOR=m
CONFIG_ACPI_THERMAL=m
# CONFIG_ACPI_ASUS is not set
# CONFIG_ACPI_IBM is not set
# CONFIG_ACPI_TOSHIBA is not set
CONFIG_ACPI_BLACKLIST_YEAR=0
# CONFIG_ACPI_DEBUG is not set
CONFIG_ACPI_EC=y
CONFIG_ACPI_POWER=y
CONFIG_ACPI_SYSTEM=y
CONFIG_X86_PM_TIMER=y
# CONFIG_ACPI_CONTAINER is not set
# CONFIG_ACPI_SBS is not set
#
# CPU Frequency scaling
#
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_TABLE=m
# CONFIG_CPU_FREQ_DEBUG is not set
CONFIG_CPU_FREQ_STAT=m
CONFIG_CPU_FREQ_STAT_DETAILS=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set
CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE=y
CONFIG_CPU_FREQ_GOV_PERFORMANCE=m
CONFIG_CPU_FREQ_GOV_POWERSAVE=m
CONFIG_CPU_FREQ_GOV_USERSPACE=y
CONFIG_CPU_FREQ_GOV_ONDEMAND=m
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=m
#
# CPUFreq processor drivers
#
CONFIG_X86_POWERNOW_K8=m
CONFIG_X86_POWERNOW_K8_ACPI=y
# CONFIG_X86_SPEEDSTEP_CENTRINO is not set
CONFIG_X86_ACPI_CPUFREQ=m
#
# shared options
#
# CONFIG_X86_ACPI_CPUFREQ_PROC_INTF is not set
# CONFIG_X86_SPEEDSTEP_LIB is not set
#
# Bus options (PCI etc.)
#
CONFIG_PCI=y
CONFIG_PCI_DIRECT=y
CONFIG_PCI_MMCONFIG=y
# CONFIG_PCIEPORTBUS is not set
# CONFIG_PCI_MSI is not set
# CONFIG_PCI_DEBUG is not set
# CONFIG_HT_IRQ is not set
#
# PCCARD (PCMCIA/CardBus) support
#
# CONFIG_PCCARD is not set
#
# PCI Hotplug Support
#
CONFIG_HOTPLUG_PCI=m
CONFIG_HOTPLUG_PCI_FAKE=m
# CONFIG_HOTPLUG_PCI_ACPI is not set
# CONFIG_HOTPLUG_PCI_CPCI is not set
# CONFIG_HOTPLUG_PCI_SHPC is not set
#
# Executable file formats / Emulations
#
CONFIG_BINFMT_ELF=y
CONFIG_BINFMT_MISC=m
CONFIG_IA32_EMULATION=y
# CONFIG_IA32_AOUT is not set
CONFIG_COMPAT=y
CONFIG_SYSVIPC_COMPAT=y
#
# Networking
#
CONFIG_NET=y
#
# Networking options
#
# CONFIG_NETDEBUG is not set
CONFIG_PACKET=m
CONFIG_PACKET_MMAP=y
CONFIG_UNIX=y
CONFIG_XFRM=y
CONFIG_XFRM_USER=m
# CONFIG_XFRM_SUB_POLICY is not set
CONFIG_NET_KEY=m
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IP_ADVANCED_ROUTER=y
CONFIG_ASK_IP_FIB_HASH=y
# CONFIG_IP_FIB_TRIE is not set
CONFIG_IP_FIB_HASH=y
CONFIG_IP_MULTIPLE_TABLES=y
CONFIG_IP_ROUTE_FWMARK=y
CONFIG_IP_ROUTE_MULTIPATH=y
# CONFIG_IP_ROUTE_MULTIPATH_CACHED is not set
CONFIG_IP_ROUTE_VERBOSE=y
# CONFIG_IP_PNP is not set
CONFIG_NET_IPIP=m
CONFIG_NET_IPGRE=m
# CONFIG_NET_IPGRE_BROADCAST is not set
CONFIG_IP_MROUTE=y
CONFIG_IP_PIMSM_V1=y
CONFIG_IP_PIMSM_V2=y
CONFIG_ARPD=y
CONFIG_SYN_COOKIES=y
CONFIG_INET_AH=m
CONFIG_INET_ESP=m
CONFIG_INET_IPCOMP=m
CONFIG_INET_XFRM_TUNNEL=m
CONFIG_INET_TUNNEL=m
# CONFIG_INET_XFRM_MODE_TRANSPORT is not set
# CONFIG_INET_XFRM_MODE_TUNNEL is not set
# CONFIG_INET_XFRM_MODE_BEET is not set
CONFIG_INET_DIAG=m
CONFIG_INET_TCP_DIAG=m
CONFIG_TCP_CONG_ADVANCED=y
CONFIG_TCP_CONG_BIC=y
CONFIG_TCP_CONG_CUBIC=m
CONFIG_TCP_CONG_WESTWOOD=m
CONFIG_TCP_CONG_HTCP=m
# CONFIG_TCP_CONG_HSTCP is not set
# CONFIG_TCP_CONG_HYBLA is not set
# CONFIG_TCP_CONG_VEGAS is not set
# CONFIG_TCP_CONG_SCALABLE is not set
# CONFIG_TCP_CONG_LP is not set
# CONFIG_TCP_CONG_VENO is not set
CONFIG_DEFAULT_BIC=y
# CONFIG_DEFAULT_CUBIC is not set
# CONFIG_DEFAULT_HTCP is not set
# CONFIG_DEFAULT_VEGAS is not set
# CONFIG_DEFAULT_WESTWOOD is not set
# CONFIG_DEFAULT_RENO is not set
CONFIG_DEFAULT_TCP_CONG="bic"
#
# IP: Virtual Server Configuration
#
# CONFIG_IP_VS is not set
CONFIG_IPV6=m
CONFIG_IPV6_PRIVACY=y
# CONFIG_IPV6_ROUTER_PREF is not set
CONFIG_INET6_AH=m
CONFIG_INET6_ESP=m
CONFIG_INET6_IPCOMP=m
# CONFIG_IPV6_MIP6 is not set
CONFIG_INET6_XFRM_TUNNEL=m
CONFIG_INET6_TUNNEL=m
# CONFIG_INET6_XFRM_MODE_TRANSPORT is not set
# CONFIG_INET6_XFRM_MODE_TUNNEL is not set
# CONFIG_INET6_XFRM_MODE_BEET is not set
# CONFIG_INET6_XFRM_MODE_ROUTEOPTIMIZATION is not set
CONFIG_IPV6_SIT=m
# CONFIG_IPV6_TUNNEL is not set
# CONFIG_IPV6_MULTIPLE_TABLES is not set
# CONFIG_NETLABEL is not set
# CONFIG_NETWORK_SECMARK is not set
CONFIG_NETFILTER=y
# CONFIG_NETFILTER_DEBUG is not set
#
# Core Netfilter Configuration
#
CONFIG_NETFILTER_NETLINK=m
CONFIG_NETFILTER_NETLINK_QUEUE=m
CONFIG_NETFILTER_NETLINK_LOG=m
CONFIG_NETFILTER_XTABLES=m
CONFIG_NETFILTER_XT_TARGET_CLASSIFY=m
CONFIG_NETFILTER_XT_TARGET_CONNMARK=m
# CONFIG_NETFILTER_XT_TARGET_DSCP is not set
[...]
#
# SCSI low-level drivers
#
# CONFIG_ISCSI_TCP is not set
# CONFIG_BLK_DEV_3W_XXXX_RAID is not set
# CONFIG_SCSI_3W_9XXX is not set
# CONFIG_SCSI_ACARD is not set
# CONFIG_SCSI_AACRAID is not set
# CONFIG_SCSI_AIC7XXX is not set
# CONFIG_SCSI_AIC7XXX_OLD is not set
# CONFIG_SCSI_AIC79XX is not set
# CONFIG_SCSI_AIC94XX is not set
# CONFIG_SCSI_ARCMSR is not set
# CONFIG_MEGARAID_NEWGEN is not set
# CONFIG_MEGARAID_LEGACY is not set
# CONFIG_MEGARAID_SAS is not set
# CONFIG_SCSI_HPTIOP is not set
# CONFIG_SCSI_BUSLOGIC is not set
# CONFIG_SCSI_DMX3191D is not set
# CONFIG_SCSI_EATA is not set
# CONFIG_SCSI_FUTURE_DOMAIN is not set
# CONFIG_SCSI_GDTH is not set
# CONFIG_SCSI_IPS is not set
# CONFIG_SCSI_INITIO is not set
# CONFIG_SCSI_INIA100 is not set
# CONFIG_SCSI_PPA is not set
# CONFIG_SCSI_IMM is not set
# CONFIG_SCSI_STEX is not set
# CONFIG_SCSI_SYM53C8XX_2 is not set
# CONFIG_SCSI_IPR is not set
# CONFIG_SCSI_QLOGIC_1280 is not set
# CONFIG_SCSI_QLA_FC is not set
# CONFIG_SCSI_QLA_ISCSI is not set
# CONFIG_SCSI_LPFC is not set
# CONFIG_SCSI_DC395x is not set
# CONFIG_SCSI_DC390T is not set
# CONFIG_SCSI_DEBUG is not set
#
# Serial ATA (prod) and Parallel ATA (experimental) drivers
#
CONFIG_ATA=y
# CONFIG_SATA_AHCI is not set
# CONFIG_SATA_SVW is not set
# CONFIG_ATA_PIIX is not set
# CONFIG_SATA_MV is not set
# CONFIG_SATA_NV is not set
# CONFIG_PDC_ADMA is not set
# CONFIG_SATA_QSTOR is not set
CONFIG_SATA_PROMISE=m
# CONFIG_SATA_SX4 is not set
# CONFIG_SATA_SIL is not set
# CONFIG_SATA_SIL24 is not set
# CONFIG_SATA_SIS is not set
# CONFIG_SATA_ULI is not set
CONFIG_SATA_VIA=y
# CONFIG_SATA_VITESSE is not set
# CONFIG_PATA_ALI is not set
# CONFIG_PATA_AMD is not set
# CONFIG_PATA_ARTOP is not set
# CONFIG_PATA_ATIIXP is not set
# CONFIG_PATA_CMD64X is not set
# CONFIG_PATA_CS5520 is not set
# CONFIG_PATA_CS5530 is not set
# CONFIG_PATA_CYPRESS is not set
# CONFIG_PATA_EFAR is not set
# CONFIG_ATA_GENERIC is not set
# CONFIG_PATA_HPT366 is not set
# CONFIG_PATA_HPT37X is not set
# CONFIG_PATA_HPT3X2N is not set
[...]
# CONFIG_NETPOLL is not set
# CONFIG_NET_POLL_CONTROLLER is not set
#
# ISDN subsystem
#
CONFIG_ISDN=m
#
# Old ISDN4Linux
#
CONFIG_ISDN_I4L=m
CONFIG_ISDN_PPP=y
CONFIG_ISDN_PPP_VJ=y
CONFIG_ISDN_MPP=y
# CONFIG_IPPP_FILTER is not set
CONFIG_ISDN_PPP_BSDCOMP=m
CONFIG_ISDN_AUDIO=y
CONFIG_ISDN_TTY_FAX=y
#
# ISDN feature submodules
#
# CONFIG_ISDN_DRV_LOOP is not set
CONFIG_ISDN_DIVERSION=m
#
# ISDN4Linux hardware drivers
#
#
# Passive cards
#
CONFIG_ISDN_DRV_HISAX=m
#
# D-channel protocol features
#
CONFIG_HISAX_EURO=y
CONFIG_DE_AOC=y
# CONFIG_HISAX_NO_SENDCOMPLETE is not set
# CONFIG_HISAX_NO_LLC is not set
# CONFIG_HISAX_NO_KEYPAD is not set
# CONFIG_HISAX_1TR6 is not set
# CONFIG_HISAX_NI1 is not set
CONFIG_HISAX_MAX_CARDS=8
#
# HiSax supported cards
#
# CONFIG_HISAX_16_3 is not set
# CONFIG_HISAX_TELESPCI is not set
# CONFIG_HISAX_S0BOX is not set
CONFIG_HISAX_FRITZPCI=y
# CONFIG_HISAX_AVM_A1_PCMCIA is not set
# CONFIG_HISAX_ELSA is not set
# CONFIG_HISAX_DIEHLDIVA is not set
# CONFIG_HISAX_SEDLBAUER is not set
# CONFIG_HISAX_NETJET is not set
# CONFIG_HISAX_NETJET_U is not set
# CONFIG_HISAX_NICCY is not set
# CONFIG_HISAX_BKM_A4T is not set
# CONFIG_HISAX_SCT_QUADRO is not set
# CONFIG_HISAX_GAZEL is not set
# CONFIG_HISAX_HFC_PCI is not set
# CONFIG_HISAX_W6692 is not set
# CONFIG_HISAX_HFC_SX is not set
# CONFIG_HISAX_DEBUG is not set
#
# HiSax PCMCIA card service modules
#
#
# HiSax sub driver modules
#
# CONFIG_HISAX_ST5481 is not set
# CONFIG_HISAX_HFCUSB is not set
# CONFIG_HISAX_HFC4S8S is not set
CONFIG_HISAX_FRITZ_PCIPNP=m
#
# Active cards
#
# CONFIG_HYSDN is not set
#
# Siemens Gigaset
#
# CONFIG_ISDN_DRV_GIGASET is not set
#
# CAPI subsystem
#
CONFIG_ISDN_CAPI=m
# CONFIG_ISDN_DRV_AVMB1_VERBOSE_REASON is not set
CONFIG_ISDN_CAPI_MIDDLEWARE=y
CONFIG_ISDN_CAPI_CAPI20=m
CONFIG_ISDN_CAPI_CAPIFS_BOOL=y
CONFIG_ISDN_CAPI_CAPIFS=m
# CONFIG_ISDN_CAPI_CAPIDRV is not set
#
# CAPI hardware drivers
#
#
# Active AVM cards
#
# CONFIG_CAPI_AVM is not set
#
# Active Eicon DIVA Server cards
#
# CONFIG_CAPI_EICON is not set
#
# Telephony Support
#
# CONFIG_PHONE is not set
#
# Input device support
#
CONFIG_INPUT=y
# CONFIG_INPUT_FF_MEMLESS is not set
#
# Use[...]pport
#
# CONFIG_SPI is not set
# CONFIG_SPI_MASTER is not set
#
# Dallas's 1-wire bus
#
CONFIG_W1=m
#
# 1-wire Bus Masters
#
# CONFIG_W1_MASTER_MATROX is not set
# CONFIG_W1_MASTER_DS2490 is not set
# CONFIG_W1_MASTER_DS2482 is not set
#
# 1-wire Slaves
#
CONFIG_W1_SLAVE_THERM=m
CONFIG_W1_SLAVE_SMEM=m
CONFIG_W1_SLAVE_DS2433=m
# CONFIG_W1_SLAVE_DS2433_CRC is not set
#
# Hardware Monitoring support
#
CONFIG_HWMON=m
CONFIG_HWMON_VID=m
# CONFIG_SENSORS_ABITUGURU is not set
CONFIG_SENSORS_ADM1021=m
CONFIG_SENSORS_ADM1025=m
CONFIG_SENSORS_ADM1026=m
CONFIG_SENSORS_ADM1031=m
CONFIG_SENSORS_ADM9240=m
CONFIG_SENSORS_K8TEMP=m
CONFIG_SENSORS_ASB100=m
# CONFIG_SENSORS_ATXP1 is not set
CONFIG_SENSORS_DS1621=m
# CONFIG_SENSORS_F71805F is not set
CONFIG_SENSORS_FSCHER=m
CONFIG_SENSORS_FSCPOS=m
CONFIG_SENSORS_GL518SM=m
CONFIG_SENSORS_GL520SM=m
CONFIG_SENSORS_IT87=m
CONFIG_SENSORS_LM63=m
CONFIG_SENSORS_LM75=m
CONFIG_SENSORS_LM77=m
CONFIG_SENSORS_LM78=m
CONFIG_SENSORS_LM80=m
CONFIG_SENSORS_LM83=m
CONFIG_SENSORS_LM85=m
CONFIG_SENSORS_LM87=m
CONFIG_SENSORS_LM90=m
CONFIG_SENSORS_LM92=m
CONFIG_SENSORS_MAX1619=m
CONFIG_SENSORS_PC87360=m
CONFIG_SENSORS_SIS5595=m
CONFIG_SENSORS_SMSC47M1=m
# CONFIG_SENSORS_SMSC47M192 is not set
CONFIG_SENSORS_SMSC47B397=m
CONFIG_SENSORS_VIA686A=m
CONFIG_SENSORS_VT1211=m
CONFIG_SENSORS_VT8231=m
CONFIG_SENSORS_W83781D=m
# CONFIG_SENSORS_W83791D is not set
CONFIG_SENSORS_W83792D=m
CONFIG_SENSORS_W83L785TS=m
CONFIG_SENSORS_W83627HF=m
CONFIG_SENSORS_W83627EHF=m
# CONFIG_SENSORS_HDAPS is not set
# CONFIG_HWMON_DEBUG_CHIP is not set
#
# Multimedia devices
#
CONFIG_VIDEO_DEV=m
CONFIG_VIDEO_V4L1=y
CONFIG_VIDEO_V4L1_COMPAT=y
CONFIG_VIDEO_V4L2=y
#
# Video Capture Adapters
#
#
# Video Capture Adapters
#
# CONFIG_VIDEO_ADV_DEBUG is not set
CONFIG_VIDEO_HELPER_CHIPS_AUTO=y
CONFIG_VIDEO_TVAUDIO=m
CONFIG_VIDEO_TDA7432=m
CONFIG_VIDEO_TDA9875=m
CONFIG_VIDEO_MSP3400=m
# CONFIG_VIDEO_VIVI is not set
CONFIG_VIDEO_BT848=m
# CONFIG_VIDEO_BT848_DVB is not set
CONFIG_VIDEO_SAA6588=m
# CONFIG_VIDEO_BWQCAM is not set
# CONFIG_VIDEO_CQCAM is not set
# CONFIG_VIDEO_W9966 is not set
# CONFIG_VIDEO_CPIA is not set
# CONFIG_VIDEO_CPIA2 is not set
CONFIG_VIDEO_SAA5246A=m
CONFIG_VIDEO_[...]_UART=m
CONFIG_SND_AC97_CODEC=m
CONFIG_SND_AC97_BUS=m
CONFIG_SND_DUMMY=m
CONFIG_SND_VIRMIDI=m
# CONFIG_SND_MTPAV is not set
# CONFIG_SND_MTS64 is not set
# CONFIG_SND_SERIAL_U16550 is not set
CONFIG_SND_MPU401=m
#
# PCI devices
#
# CONFIG_SND_AD1889 is not set
# CONFIG_SND_ALS300 is not set
# CONFIG_SND_ALS4000 is not set
# CONFIG_SND_ALI5451 is not set
# CONFIG_SND_ATIIXP is not set
# CONFIG_SND_ATIIXP_MODEM is not set
# CONFIG_SND_AU8810 is not set
# CONFIG_SND_AU8820 is not set
# CONFIG_SND_AU8830 is not set
# CONFIG_SND_AZT3328 is not set
CONFIG_SND_BT87X=m
# CONFIG_SND_BT87X_OVERCLOCK is not set
# CONFIG_SND_CA0106 is not set
# CONFIG_SND_CMIPCI is not set
# CONFIG_SND_CS4281 is not set
# CONFIG_SND_CS46XX is not set
# CONFIG_SND_DARLA20 is not set
# CONFIG_SND_GINA20 is not set
# CONFIG_SND_LAYLA20 is not set
# CONFIG_SND_DARLA24 is not set
# CONFIG_SND_GINA24 is not set
# CONFIG_SND_LAYLA24 is not set
# CONFIG_SND_MONA is not set
# CONFIG_SND_MIA is not set
# CONFIG_SND_ECHO3G is not set
# CONFIG_SND_INDIGO is not set
# CONFIG_SND_INDIGOIO is not set
# CONFIG_SND_INDIGODJ is not set
# CONFIG_SND_EMU10K1 is not set
# CONFIG_SND_EMU10K1X is not set
# CONFIG_SND_ENS1370 is not set
CONFIG_SND_ENS1371=m
# CONFIG_SND_ES1938 is not set
# CONFIG_SND_ES1968 is not set
# CONFIG_SND_FM801 is not set
# CONFIG_SND_HDA_INTEL is not set
# CONFIG_SND_HDSP is not set
# CONFIG_SND_HDSPM is not set
# CONFIG_SND_ICE1712 is not set
# CONFIG_SND_ICE1724 is not set
# CONFIG_SND_INTEL8X0 is not set
# CONFIG_SND_INTEL8X0M is not set
# CONFIG_SND_KORG1212 is not set
# CONFIG_SND_MAESTRO3 is not set
# CONFIG_SND_MIXART is not set
# CONFIG_SND_NM256 is not set
# CONFIG_SND_PCXHR is not set
# CONFIG_SND_RIPTIDE is not set
# CONFIG_SND_RME32 is not set
# CONFIG_SND_RME96 is not set
# CONFIG_SND_RME9652 is not set
# CONFIG_SND_SONICVIBES is not set
# CONFIG_SND_TRIDENT is not set
CONFIG_SND_VIA82XX=m
# CONFIG_SND_VIA82XX_MODEM is not set
# CONFIG_SND_VX222 is not set
# CONFIG_SND_YMFPCI is not set
CONFIG_SND_AC97_POWER_SAVE=y
#
# USB devices
#
CONFIG_SND_USB_AUDIO=m
# CONFIG_SND_USB_USX2Y is not set
#
# Open Sound System
#
# CONFIG_SOUND_PRIME is not set
#
# USB support
#
CONFIG_USB_ARCH_HAS_HCD=y
CONFIG_USB_ARCH_HAS_OHCI=y
CONFIG_USB_ARCH_HAS_EHCI=y
CONFIG_USB=m
# CONFIG_USB_DEBUG is not set
#
# Miscellaneous USB options
#
CONFIG_USB_DEVICEFS=y
# CONFIG_USB_BANDWIDTH is not set
# CONFIG_USB_DYNAMIC_MINORS is not set
# CONFIG_USB_SUSPEND is not set
# CONFIG_USB_OTG is not set
#
# USB Host Controller Drivers
#
CONFIG_USB_EHCI_HCD=m
CONFIG_USB_EHCI_SPLIT_ISO=y
CONFIG_USB_EHCI_ROOT_HUB_TT=y
# CONFIG_USB_EHCI_TT_NEWSCHED is not set
# CONFIG_USB_ISP116X_HCD is not set
CONFIG_USB_OHCI_HCD=m
# CONFIG_USB_OHCI_BIG_ENDIAN is not set
CONFIG_USB_OHCI_LITTLE_ENDIAN=y
CONFIG_USB_UHCI_HCD=m
# CONFIG_USB_SL811_HCD is not set
#
# USB Device Class drivers
#
# CONFIG_USB_ACM is not set
CONFIG_USB_PRINTER=m
#
# NOTE: USB_STORAGE enables SCSI, and 'SCSI disk support'
#
#
# may also be needed; see USB_STORAGE Help for more information
#
CONFIG_USB_STORAGE=m
# CONFIG_USB_STORAGE_DEBUG is not set
# CONFIG_USB_STORAGE_DATAFAB is not set
# CONFIG_USB_STORAGE_FREECOM is not set
# CONFIG_USB_STORAGE_DPCM is not set
# CONFIG_USB_STORAGE_USBAT is not set
# CONFIG_USB_STORAGE_SDDR09 is not set
# CONFIG_USB_STORAGE_SDDR55 is not set
# CONFIG_USB_STORAGE_JUMPSHOT is not set
# CONFIG_USB_STORAGE_ALAUDA is not set
# CONFIG_USB_STORAGE_KARMA is not set
# CONFIG_USB_LIBUSUAL is not set
#
# USB Input Devices
#
CONFIG_USB_HID=m
CONFIG_USB_HIDINPUT=y
# CONFIG_USB_HIDINPUT_POWERBOOK is not set
# CONFIG_HID_FF is not set
CONFIG_USB_HIDDEV=y
#
# USB HID Boot Protocol drivers
#
# CONFIG_USB_KBD is not set
# CONFIG_USB_MOUSE is not set
# CONFIG_USB_AIPTEK is not set
# CONFIG_USB_WACOM is not set
# CONFIG_USB_ACECAD is not set
# CONFIG_USB_KBTAB is not set
# CONFIG_USB_POWERMATE is not set
# CONFIG_USB_TOUCHSCREEN is not set
# CONFIG_USB_YEALINK is not set
# CONFIG_USB_XPAD is not set
# CONFIG_USB_ATI_REMOTE is not set
# CONFIG_USB_ATI_REMOTE2 is not set
# CONFIG_USB_KEYSPAN_REMOTE is not set
# CONFIG_USB_APPLETOUCH is not set
#
# USB Imaging devices
#
# CONFIG_USB_MDC800 is not set
# CONFIG_USB_MICROTEK is not set
#
# USB Network Adapters
#
# CONFIG_USB_CATC is not set
# CONFIG_USB_KAWETH is not set
# CONFIG_USB_PEGASUS is not set
# CONFIG_USB_RTL8150 is not set
# CONFIG_USB_USBNET_MII is not set
# CONFIG_USB_USBNET is not set
# CONFIG_USB_MON is not set
#
# USB port drivers
#
# CONFIG_USB_USS720 is not set
#
# USB Serial Converter support
#
# CONFIG_USB_SERIAL is not set
#
# USB Miscellaneous drivers
#
# CONFIG_USB_EMI62 is not set
# CONFIG_USB_EMI26 is not set
# CONFIG_USB_ADUTUX is not set
# CONFIG_USB_AUERSWALD is not set
# CONFIG_USB_RIO500 is not set
# CONFIG_USB_LEGOTOWER is not set
# CONFIG_USB_LCD is not set
# CONFIG_USB_LED is not set
# CONFIG_USB_CYPRESS_CY7C63 is not set
# CONFIG_USB_CYTHERM is not set
# CONFIG_USB_PHIDGET is not set
# CONFIG_USB_IDMOUSE is not set
# CONFIG_USB_FTDI_ELAN is not set
# CONFIG_USB_APPLEDISPLAY is not set
# CONFIG_USB_SISUSBVGA is not set
# CONFIG_USB_LD is not set
# CONFIG_USB_TRANCEVIBRATOR is not set
# CONFIG_USB_TEST is not set
#
# USB DSL modem support
#
#
# USB Gadget Support
#
# CONFIG_USB_GADGET is not set
#
# MMC/SD Card support
#
# CONFIG_MMC is not set
#
# LED devices
#
# CONFIG_NEW_LEDS is not set
#
# LED drivers
#
#
# LED Triggers
#
#
# InfiniBand support
#
# CONFIG_INFINIBAND is not set
#
# EDAC - error detection and reporting (RAS) (EXPERIMENTAL)
#
# CONFIG_EDAC is not set
#
# Real Time Clock
#
CONFIG_RTC_LIB=m
CONFIG_RTC_CLASS=m
#
# RTC interfaces
#
CONFIG_RTC_INTF_SYSFS=m
CONFIG_RTC_INTF_PROC=m
CONFIG_RTC_INTF_DEV=m
# CONFIG_RTC_INTF_DEV_UIE_EMUL is not set
#
# RTC drivers
#
CONFIG_RTC_DRV_X1205=m
CONFIG_RTC_DRV_DS1307=m
CONFIG_RTC_DRV_DS1553=m
CONFIG_RTC_DRV_ISL1208=m
CONFIG_RTC_DRV_DS1672=m
CONFIG_RTC_DRV_DS1742=m
CONFIG_RTC_DRV_PCF8563=m
CONFIG_RTC_DRV_PCF8583=m
CONFIG_RTC_DRV_RS5C372=m
CONFIG_RTC_DRV_M48T86=m
CONFIG_RTC_DRV_TEST=m
CONFIG_RTC_DRV_V3020=m
#
# DMA Engine support
#
# CONFIG_DMA_ENGINE is not set
#
# DMA Clients
#
#
# DMA Devices
#
#
# Firmware Drivers
#
# CONFIG_EDD is not set
# CONFIG_DELL_RBU is not set
# CONFIG_DCDBAS is not set
#
# File systems
#
CONFIG_EXT2_FS=y
# CONFIG_EXT2_FS_XATTR is not set
# CONFIG_EXT2_FS_XIP is not set
CONFIG_EXT3_FS=m
# CONFIG_EXT3_FS_XATTR is not set
# CONFIG_EXT4DEV_FS is not set
CONFIG_JBD=m
# CONFIG_JBD_DEBUG is not set
CONFIG_REISER4_FS=y
# CONFIG_REISER4_DEBUG is not set
CONFIG_REISERFS_FS=y
# CONFIG_REISERFS_CHECK is not set
# CONFIG_REISERFS_PROC_INFO is not set
# CONFIG_REISERFS_FS_XATTR is not set
# CONFIG_JFS_FS is not set
CONFIG_FS_POSIX_ACL=y
# CONFIG_XFS_FS is not set
# CONFIG_GFS2_FS is not set
# CONFIG_OCFS2_FS is not set
CONFIG_MINIX_FS=m
CONFIG_ROMFS_FS=m
CONFIG_INOTIFY=y
CONFIG_INOTIFY_USER=y
# CONFIG_QUOTA is not set
CONFIG_DNOTIFY=y
CONFIG_AUTOFS_FS=m
CONFIG_AUTOFS4_FS=m
CONFIG_FUSE_FS=m
#
# CD-ROM/DVD Filesystems
#
CONFIG_ISO9660_FS=m
CONFIG_JOLIET=y
CONFIG_ZISOFS=y
CONFIG_ZISOFS_FS=m
CONFIG_UDF_FS=m
CONFIG_UDF_NLS=y
#
# DOS/FAT/NT Filesystems
#
CONFIG_FAT_FS=m
CONFIG_MSDOS_FS=m
CONFIG_VFAT_FS=m
CONFIG_FAT_DEFAULT_CODEPAGE=850
CONFIG_FAT_DEFAULT_IOCHARSET="iso8859-15"
CONFIG_NTFS_FS=m
# CONFIG_NTFS_DEBUG is not set
CONFIG_NTFS_RW=y
#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
CONFIG_PROC_SYSCTL=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
# CONFIG_TMPFS_POSIX_ACL is not set
# CONFIG_HUGETLBFS is not set
# CONFIG_HUGETLB_PAGE is not set
CONFIG_RAMFS=y
# CONFIG_CONFIGFS_FS is not set
#
# Miscellaneous filesystems
#
# CONFIG_ADFS_FS is not set
# CONFIG_AFFS_FS is not set
# CONFIG_HFS_FS is not set
# CONFIG_HFSPLUS_FS is not set
# CONFIG_BEFS_FS is not set
# CONFIG_BFS_FS is not set
# CONFIG_EFS_FS is not set
CONFIG_CRAMFS=m
# CONFIG_VXFS_FS is not set
# CONFIG_HPFS_FS is not set
# CONFIG_QNX4FS_FS is not set
# CONFIG_SYSV_FS is not set
# CONFIG_UFS_FS is not set
#
# Network File Systems
#
CONFIG_NFS_FS=m
CONFIG_NFS_V3=y
# CONFIG_NFS_V3_ACL is not set
CONFIG_NFS_V4=y
# CONFIG_NFS_DIRECTIO is not set
CONFIG_NFSD=m
CONFIG_NFSD_V3=y
[...]
CONFIG_CRYPTO_HASH=y
CONFIG_CRYPTO_MANAGER=y
CONFIG_CRYPTO_HMAC=y
CONFIG_CRYPTO_NULL=m
CONFIG_CRYPTO_MD4=m
CONFIG_CRYPTO_MD5=y
CONFIG_CRYPTO_SHA1=m
CONFIG_CRYPTO_SHA256=m
CONFIG_CRYPTO_SHA512=m
CONFIG_CRYPTO_WP512=m
CONFIG_CRYPTO_TGR192=m
CONFIG_CRYPTO_ECB=m
CONFIG_CRYPTO_CBC=m
CONFIG_CRYPTO_DES=m
CONFIG_CRYPTO_BLOWFISH=m
CONFIG_CRYPTO_TWOFISH=m
CONFIG_CRYPTO_TWOFISH_COMMON=m
CONFIG_CRYPTO_TWOFISH_X86_64=m
CONFIG_CRYPTO_SERPENT=m
CONFIG_CRYPTO_AES=m
CONFIG_CRYPTO_AES_X86_64=m
CONFIG_CRYPTO_CAST5=m
CONFIG_CRYPTO_CAST6=m
CONFIG_CRYPTO_TEA=m
CONFIG_CRYPTO_ARC4=m
CONFIG_CRYPTO_KHAZAD=m
CONFIG_CRYPTO_ANUBIS=m
CONFIG_CRYPTO_DEFLATE=m
CONFIG_CRYPTO_MICHAEL_MIC=m
CONFIG_CRYPTO_CRC32C=m
CONFIG_CRYPTO_TEST=m
#
# Hardware crypto devices
#
#
# Library routines
#
CONFIG_CRC_CCITT=m
CONFIG_CRC16=m
CONFIG_CRC32=m
CONFIG_LIBCRC32C=m
CONFIG_ZLIB_INFLATE=y
CONFIG_ZLIB_DEFLATE=y
CONFIG_TEXTSEARCH=y
CONFIG_TEXTSEARCH_KMP=m
CONFIG_TEXTSEARCH_BM=m
CONFIG_TEXTSEARCH_FSM=m
CONFIG_PLIST=y
On 12/27/06, [email protected] <[email protected]> wrote:
> I do get this error on reiserfs ( old one, didn't try on reiser4 ).
> Stock 2.6.19 plus reiser4 patch. Previously reported by me only in the
> debian bts.
I've had reports of corrupted data on earlier kernel releases with
reiserfs3, which were fixed by upgrading to reiserfs4.
Jari Sundell
On Thu, 2006-12-21 at 12:01 -0800, Linus Torvalds wrote:
> What do you guys think? Does something like this work out for S/390 too? I
> tried to make that "ptep_flush_dirty()" concept work for architectures
> that hide the dirty bit somewhere else too, but..
For s390 there are two aspects to consider:
1) the pte values are 100% software controlled. They only change because
a cpu stored a value to it or issued one of the specialized instructions
(csp, ipte and idte). The ptep_flush_dirty would be a nop for s390.
2) ptep_exchange is a bit dangerous. For s390 we need a lock that
protects the software controlled updates of the ptes. The reason is the
ipte instruction. It is implemented by the machine microcode in a
non-atomic way in regard to the memory. It reads the byte of the pte
that contains the invalid bit, flushes the tlb entries for it and then
writes back the byte with the invalid bit set. The microcode makes sure
that this pte cannot be used to form a new tlb entry on any cpu while the
ipte is in progress.
That means compare-and-swap semantics on ptes won't work together with
the ipte optimization. As long as there is the pte lock that protects
all software accesses to the pte we are fine. But if any code expects
that ptep_exchange does something like an xchg, things break.
--
blue skies,
Martin.
Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH
"Reality continues to ruin my life." - Calvin.
On Tue, 26 Dec 2006, David Miller wrote:
>
> I've seen it on sparc64, UP kernel, no preempt.
Ok, I still don't have a clue, but I think I at least have a new
test-case.
It can probably be improved upon, but this would _seem_ to trigger the
problem. Can people check?
You'd want to make sure you get page-put activity, by making TARGETSIZE be
big enough to cause memory pressure (and rather than making it bigger, you
might want to make your memory smaller instead, to make it run more
quickly. Either using "mem=128M" or big compiles or something...).
If it finds corruption, you'll see something like
Writing chunk 183858/183859 (99%)
Chunk ..
Chunk 120887 corrupted
Chunk 122372 corrupted
Chunk ...
Checking chunk 183858/183859 (99%)
otherwise it will just say
Writing chunk 183858/183859 (99%)
Checking chunk 183858/183859 (99%)
and exit.
I didn't spend a lot of time verifying this, but I _was_ able to cause
those "Chunk xxx corrupted" messages with this. There's probably a more
efficient, better way to do it, but this is better than trying to use
rtorrent, and also makes any worries about what rtorrent does go away.
Of course, maybe it's this test-program that is buggy now, although it
looks trivial enough that I don't think it is.
I think my earlier stress-tester may not have triggered this, because it
just did all its writing in a linear order, so any LRU logic will happen
to write back old pages that we are no longer touching. The randomization
(and using a chunksize that isn't a multiple of a page-size) makes sure
that we're actually going to have lots of rewriting going on.
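To see why a chunk size that is not a multiple of the page size forces pages to be rewritten, here is a small sketch that checks which chunks cross a page boundary. The helper names are made up for illustration, and the 4096-byte page size used in the note below is an assumption (the test program itself never names a page size).

```c
#include <assert.h>
#include <stddef.h>

/* Does chunk number nr cross a page boundary, given chunks of
 * chunk_size bytes packed back to back starting at file offset 0? */
static int chunk_straddles_page(size_t nr, size_t chunk_size, size_t page_size)
{
	size_t first, last;

	assert(chunk_size > 0 && page_size > 0);
	first = nr * chunk_size;		/* first byte of the chunk */
	last = first + chunk_size - 1;		/* last byte of the chunk */
	return first / page_size != last / page_size;
}

/* How many of the first nr_chunks chunks cross a page boundary? */
static size_t count_straddlers(size_t nr_chunks, size_t chunk_size,
			       size_t page_size)
{
	size_t i, n = 0;

	for (i = 0; i < nr_chunks; i++)
		n += chunk_straddles_page(i, chunk_size, page_size);
	return n;
}
```

With 1460-byte chunks and 4096-byte pages, chunk 2 (bytes 2920..4379) is the first one to cross a boundary, so pages 0 and 1 are each dirtied by more than one chunk. Because the chunks are written in random order, those writes to the same page can be far apart in time, with writeback possibly happening in between.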
I think the test-case could probably be improved by having a munmap() and
page-cache flush in between the writing and the checking, to see whether
that shows the corruption easier (and possibly without having to start
paging in order to throw the pages out, which would simplify testing a
lot). But I haven't tested. I decided to post this asap, now that I've
recreated the corruption with something else, and something that is
possibly easier to analyze..
Linus
----
#include <sys/mman.h>
#include <sys/fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <time.h>

#define TARGETSIZE (256 << 20)
#define CHUNKSIZE (1460)
#define NRCHUNKS (TARGETSIZE / CHUNKSIZE)
#define SIZE (NRCHUNKS * CHUNKSIZE)

static void fillmem(void *start, int nr)
{
	memset(start, nr, CHUNKSIZE);
}

static void checkmem(void *start, int nr)
{
	unsigned char c = nr, *p = start;
	int i;
	for (i = 0; i < CHUNKSIZE; i++) {
		if (*p++ != c) {
			printf("Chunk %d corrupted \n", nr);
			return;
		}
	}
}

int main(int argc, char **argv)
{
	char *mapping;
	int fd, i;
	static int chunkorder[NRCHUNKS];

	/*
	 * Make some random ordering of writing the chunks to the
	 * memory map..
	 *
	 * Start with fully ordered..
	 */
	for (i = 0; i < NRCHUNKS; i++)
		chunkorder[i] = i;

	/* ..and then mix it up randomly */
	srandom(time(NULL));
	for (i = 0; i < NRCHUNKS; i++) {
		int index = (unsigned int) random() % NRCHUNKS;
		int nr = chunkorder[index];
		chunkorder[index] = chunkorder[i];
		chunkorder[i] = nr;
	}

	fd = open("mapfile", O_RDWR | O_TRUNC | O_CREAT, 0666);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, SIZE) < 0)
		return -1;
	mapping = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (-1 == (int)(long)mapping)
		return -1;

	for (i = 0; i < NRCHUNKS; i++) {
		int chunk = chunkorder[i];
		printf("Writing chunk %d/%d (%d%%) \r", i, NRCHUNKS, 100*i/NRCHUNKS);
		fillmem(mapping + chunk * CHUNKSIZE, chunk);
	}
	printf("\n");

	for (i = 0; i < NRCHUNKS; i++) {
		int chunk = i;
		printf("Checking chunk %d/%d (%d%%) \r", i, NRCHUNKS, 100*i/NRCHUNKS);
		checkmem(mapping + chunk * CHUNKSIZE, chunk);
	}
	printf("\n");

	return 0;
}
On Wed, 27 Dec 2006, Linus Torvalds wrote:
>
> I think the test-case could probably be improved by having a munmap() and
> page-cache flush in between the writing and the checking, to see whether
> that shows the corruption easier (and possibly without having to start
> paging in order to throw the pages out, which would simplify testing a
> lot).
I think the page-writeout is implicated, because I do seem to need it, but
the page-cache flush does seem to make corruption _easier_ to see. I now
seem to be able to trigger it with a 100MB file on a 256MB machine in a minute
or so, with this slight modification.
I still don't see _why_, though. But maybe smarter people than me can see
it..
Linus
---
#include <sys/mman.h>
#include <sys/fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <time.h>

#define TARGETSIZE (100 << 20)
#define CHUNKSIZE (1460)
#define NRCHUNKS (TARGETSIZE / CHUNKSIZE)
#define SIZE (NRCHUNKS * CHUNKSIZE)

static void fillmem(void *start, int nr)
{
	memset(start, nr, CHUNKSIZE);
}

static void checkmem(void *start, int nr)
{
	unsigned char c = nr, *p = start;
	int i;
	for (i = 0; i < CHUNKSIZE; i++) {
		if (*p++ != c) {
			printf("Chunk %d corrupted \n", nr);
			return;
		}
	}
}

static char *remap(int fd, char *mapping)
{
	if (mapping) {
		munmap(mapping, SIZE);
		posix_fadvise(fd, 0, SIZE, POSIX_FADV_DONTNEED);
	}
	return mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}

int main(int argc, char **argv)
{
	char *mapping;
	int fd, i;
	static int chunkorder[NRCHUNKS];

	/*
	 * Make some random ordering of writing the chunks to the
	 * memory map..
	 *
	 * Start with fully ordered..
	 */
	for (i = 0; i < NRCHUNKS; i++)
		chunkorder[i] = i;

	/* ..and then mix it up randomly */
	srandom(time(NULL));
	for (i = 0; i < NRCHUNKS; i++) {
		int index = (unsigned int) random() % NRCHUNKS;
		int nr = chunkorder[index];
		chunkorder[index] = chunkorder[i];
		chunkorder[i] = nr;
	}

	fd = open("mapfile", O_RDWR | O_TRUNC | O_CREAT, 0666);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, SIZE) < 0)
		return -1;
	mapping = remap(fd, NULL);
	if (-1 == (int)(long)mapping)
		return -1;

	for (i = 0; i < NRCHUNKS; i++) {
		int chunk = chunkorder[i];
		printf("Writing chunk %d/%d (%d%%) \r", i, NRCHUNKS, 100*i/NRCHUNKS);
		fillmem(mapping + chunk * CHUNKSIZE, chunk);
	}
	printf("\n");

	/* Unmap, drop, and remap.. */
	mapping = remap(fd, mapping);

	/* .. and check */
	for (i = 0; i < NRCHUNKS; i++) {
		int chunk = i;
		printf("Checking chunk %d/%d (%d%%) \r", i, NRCHUNKS, 100*i/NRCHUNKS);
		checkmem(mapping + chunk * CHUNKSIZE, chunk);
	}
	printf("\n");

	return 0;
}
On Thu, 28 Dec 2006, Martin Schwidefsky wrote:
>
> For s390 there are two aspects to consider:
> 1) the pte values are 100% software controlled.
That's fine. In that situation, you shouldn't need any atomic ops at all,
I think all our sw page-table operations are already done under the pte
lock.
The reason x86 needs to be careful is exactly the fact that the hardware
will obviously do a lot on its own, and the hardware is _not_ going to
honor our page table locking ;)
In an all-sw situation, a lot of this should be easier. S390 has _other_
things that are inconvenient (the strange "dirty bit is not in the page
tables" thing that makes it look different from everybody else), but hey,
it's a balance..
So for s390, ptep_exchange() in my example should be able to be a simple
"load old value and store new one", assuming everybody honors the pte lock
(and they _should_).
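[ A minimal user-space sketch of that distinction, not kernel code. The names toy_pte_t, toy_ptep_exchange_locked() and toy_ptep_exchange_atomic() are made up for illustration. The first variant assumes the caller already holds the pte lock and that nothing else (no hardware walker) touches the entry; the second is what you need when something can modify the entry behind your back. ]

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef unsigned long toy_pte_t;	/* hypothetical stand-in for a pte value */

/*
 * Software-controlled entries: the caller holds the pte lock, so a plain
 * "load old value and store new one" cannot race with anybody.
 */
static toy_pte_t toy_ptep_exchange_locked(toy_pte_t *ptep, toy_pte_t newval)
{
	toy_pte_t old;

	assert(ptep != NULL);
	old = *ptep;
	*ptep = newval;
	return old;
}

/*
 * Entries that something else may update without taking the lock: the
 * read and the write have to happen as one atomic step.
 */
static toy_pte_t toy_ptep_exchange_atomic(_Atomic toy_pte_t *ptep,
					  toy_pte_t newval)
{
	assert(ptep != NULL);
	return atomic_exchange(ptep, newval);
}
```

Both variants return the old value and leave the new one in place; they differ only in what they assume about concurrent writers.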
Linus
From: Linus Torvalds <[email protected]>
Date: Wed, 27 Dec 2006 16:42:40 -0800 (PST)
> That's fine. In that situation, you shouldn't need any atomic ops at all,
> I think all our sw page-table operations are already done under the pte
> lock.
This is true, but there is one subtlety to this I want to
point out in passing.
That lock can possibly only protect a page of PTEs.
When NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS, the locking is done per page
of PTEs, not for all of the page tables of an address space at once.
What this means is that it's really difficult to forcefully block out
all page table operations for a given mm, and I actually needed to do
something like this on sparc64 (when growing the TLB lookup hash
table, you can't let any PTEs change state while the table is
changing). For my case, I added a spinlock to mm->context since
actually what I need is to block modifications to the hash table
itself during PTE changes.
From: Linus Torvalds <[email protected]>
Date: Wed, 27 Dec 2006 16:39:43 -0800 (PST)
>
>
> On Wed, 27 Dec 2006, Linus Torvalds wrote:
> >
> > I think the test-case could probably be improved by having a munmap() and
> > page-cache flush in between the writing and the checking, to see whether
> > that shows the corruption easier (and possibly without having to start
> > paging in order to throw the pages out, which would simplify testing a
> > lot).
>
> I think the page-writeout is implicated, because I do seem to need it, but
> the page-cache flush does seem to make corruption _easier_ to see. I now
> seem about to trigger it with a 100MB file on a 256MB machine in a minute
> or so, with this slight modification.
>
> I still don't see _why_, though. But maybe smarter people than me can see
> it..
FWIW this program definitely triggers the bug for me.
On Wed, 27 Dec 2006, David Miller wrote:
> >
> > I still don't see _why_, though. But maybe smarter people than me can see
> > it..
>
> FWIW this program definitely triggers the bug for me.
Ok, now that I have something simple to do repeatable stuff with, I can
say what the pattern is.. It's not all that surprising, but it's still
worth just stating for the record.
What happens is that when I do the "packetized writes" in random order,
the _last_ write to a page occasionally just goes missing. It's not always
at the end of a page, as shown by for example:
- A whole chunk got dropped:
Chunk 2094 corrupted (0-1459) (1624-3083)
Expected 46, got 0
Written as (30912)55414(10000)
That "Written as (x)y(z)" line means that the corrupted chunk was
written as chunk #y, and the preceding and following chunks (that were
_not_ corrupt) on the page were written as #x and #z respectively.
In other words, the missing chunk (which is still zero) was written
much later than the ones that were ok, and never hit the disk. It's a
contiguous chunk in the middle of the page (chunks are 1460 bytes in
size)
The first line means that all bytes of the chunk (0-1459) were
corrupted, and the values in parentheses are the offsets within a page.
In other words, this was a chunk in the _middle_ of a page.
- The missing data can also be at the beginning or ends of pages:
Beginning of the chunk missing, it was at the end of a page (page
offsets 3288-4095) and the _next_ page got written out fine:
Chunk 2126 corrupted (0-807) (3288-4095)
Expected 78, got 0
Written as (32713)55573(14301)
End of a chunk missing, it was the beginning of a page (and the
_previous_ page that contained the beginning of the chunk was written
out fine)
Chunk 2179 corrupted (1252-1459) (0-207)
Expected 131, got 0
Written as (45189)55489(15515)
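[ The page offsets in these reports can be reproduced by hand. A minimal sketch, assuming 4096-byte pages and a page-aligned mapping (which mmap() gives you), so that the offset within a page depends only on the file offset. toy_page_offset() and TOY_PAGE_SIZE are hypothetical names, not part of the test program. ]

```c
#include <assert.h>

#define CHUNKSIZE 1460
#define TOY_PAGE_SIZE 4096	/* assumption: 4 KiB pages */

/* Offset within its page of byte 'off' of chunk number 'nr'. */
static unsigned int toy_page_offset(unsigned int nr, unsigned int off)
{
	assert(off < CHUNKSIZE);
	return (nr * CHUNKSIZE + off) % TOY_PAGE_SIZE;
}
```

For chunk 2094 this gives 1624 for byte 0 and 3083 for byte 1459, matching the first report; for chunk 2179, byte 1252 lands on offset 0 of the following page.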
Now, the reason I say this isn't surprising is that this is entirely
consistent with the dirty bit being dropped on the floor somewhere, and
likely through some interaction with the previous changes being in the
process of being written out.
Something (incorrectly) ends up deciding that it doesn't need to write the
page, since it's already written, or alternatively clears the dirty bit
too late (clears it because an _earlier_ write finished, never mind that
the new dirty data didn't make it).
I also figured out that it's not the low-memory situation that does it, it
really must be the "page_mkclean()" triggering. Because I can do
echo 5 > /proc/sys/vm/dirty_ratio
echo 3 > /proc/sys/vm/dirty_background_ratio
to make it clean the pages much more aggressively than the default, and I
can see corruption on my 256MB machine with just a 40MB shared file, and
70MB of memory consistently free.
So this thing is definitely giving some answers. It's NOT about low
memory, and it very much seems to be about the whole "balance_dirty_ratio"
thing. I don't think I triggered the actual low-memory stuff once in that
situation..
So I have some more data on the behaviour, but I _still_ don't see the
reason behind it. It's probably something really obvious once it's pointed
out..
[ Modified test-program that tells you where the corruption happens (and
when the missing parts were supposed to be written out) appended, in
case people care. ]
Linus
---
#include <sys/mman.h>
#include <sys/fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <time.h>
#define TARGETSIZE (100 << 20)
#define CHUNKSIZE (1460)
#define NRCHUNKS (TARGETSIZE / CHUNKSIZE)
#define SIZE (NRCHUNKS * CHUNKSIZE)
static void fillmem(void *start, int nr)
{
	memset(start, nr, CHUNKSIZE);
}
#define page_offset(buf, off) (0xfff & ((unsigned)(unsigned long)(buf)+(off)))
static int chunkorder[NRCHUNKS];
static int order(int nr)
{
	int i;
	if (nr < 0 || nr >= NRCHUNKS)
		return -1;
	for (i = 0; i < NRCHUNKS; i++)
		if (chunkorder[i] == nr)
			return i;
	return -2;
}
static void checkmem(void *buf, int nr)
{
	unsigned int start = ~0u, end = 0;
	unsigned char c = nr, *p = buf, differs = 0;
	int i;
	for (i = 0; i < CHUNKSIZE; i++) {
		unsigned char got = *p++;
		if (got != c) {
			if (i < start)
				start = i;
			if (i > end)
				end = i;
			differs = got;
		}
	}
	if (start < end) {
		printf("Chunk %d corrupted (%u-%u) (%u-%u) \n", nr, start, end,
			page_offset(buf, start), page_offset(buf, end));
		printf("Expected %u, got %u\n", c, differs);
		printf("Written as (%d)%d(%d)\n", order(nr-1), order(nr), order(nr+1));
	}
}
static char *remap(int fd, char *mapping)
{
	if (mapping) {
		munmap(mapping, SIZE);
		posix_fadvise(fd, 0, SIZE, POSIX_FADV_DONTNEED);
	}
	return mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}
int main(int argc, char **argv)
{
	char *mapping;
	int fd, i;
	/*
	 * Make some random ordering of writing the chunks to the
	 * memory map..
	 *
	 * Start with fully ordered..
	 */
	for (i = 0; i < NRCHUNKS; i++)
		chunkorder[i] = i;
	/* ..and then mix it up randomly */
	srandom(time(NULL));
	for (i = 0; i < NRCHUNKS; i++) {
		int index = (unsigned int) random() % NRCHUNKS;
		int nr = chunkorder[index];
		chunkorder[index] = chunkorder[i];
		chunkorder[i] = nr;
	}
	fd = open("mapfile", O_RDWR | O_TRUNC | O_CREAT, 0666);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, SIZE) < 0)
		return -1;
	mapping = remap(fd, NULL);
	if (-1 == (int)(long)mapping)
		return -1;
	for (i = 0; i < NRCHUNKS; i++) {
		int chunk = chunkorder[i];
		printf("Writing chunk %d/%d (%d%%) \r", i, NRCHUNKS, 100*i/NRCHUNKS);
		fillmem(mapping + chunk * CHUNKSIZE, chunk);
	}
	printf("\n");
	/* Unmap, drop, and remap.. */
	mapping = remap(fd, mapping);
	/* .. and check */
	for (i = 0; i < NRCHUNKS; i++) {
		int chunk = i;
		printf("Checking chunk %d/%d (%d%%) \r", i, NRCHUNKS, 100*i/NRCHUNKS);
		checkmem(mapping + chunk * CHUNKSIZE, chunk);
	}
	printf("\n");
	return 0;
}
On 12/27/06, Linus Torvalds <[email protected]> wrote:
> [ Modified test-program that tells you where the corruption happens (and
> when the missing parts were supposed to be written out) appended, in
> case people care. ]
For the record, this is the output from a run on our ARM machine (32
MB RAM) with 2.6.18 + the following patches:
mm: tracking shared dirty pages
mm: balance dirty pages
mm: optimize the new mprotect() code a bit
mm: small cleanup of install_page()
mm: fixup do_wp_page()
mm: msync() cleanup
Is it at all surprising that the second offset within a page can be
less than the first offset within a page? e.g.
Chunk 260 corrupted (1-1455) (2769-127)
$ ./linus-test
Writing chunk 279/280 (99%)
Chunk 256 corrupted (1-1455) (1025-2479)
Expected 0, got 1
Written as (82)175(56)
Chunk 258 corrupted (1-1455) (3945-1303)
Expected 2, got 3
Written as (56)51(20)
Chunk 260 corrupted (1-1455) (2769-127)
Expected 4, got 5
Written as (20)30(18)
Chunk 262 corrupted (1-1455) (1593-3047)
Expected 6, got 7
Written as (18)196(158)
Chunk 264 corrupted (1-1455) (417-1871)
Expected 8, got 9
Written as (158)133(146)
Chunk 266 corrupted (1-1455) (3337-695)
Expected 10, got 11
Written as (146)43(77)
Chunk 268 corrupted (1-1455) (2161-3615)
Expected 12, got 13
Written as (77)251(211)
Chunk 270 corrupted (1-1455) (985-2439)
Expected 14, got 15
Written as (211)257(231)
Chunk 272 corrupted (1-1455) (3905-1263)
Expected 16, got 17
Written as (231)254(154)
Chunk 274 corrupted (1-1455) (2729-87)
Expected 18, got 19
Written as (154)11(85)
Chunk 276 corrupted (1-1455) (1553-3007)
Expected 20, got 21
Written as (85)230(134)
Chunk 278 corrupted (1-1455) (377-1831)
Expected 22, got 23
Written as (134)233(103)
Checking chunk 279/280 (99%)
Gordon
--
Gordon Farquharson
On Wed, 27 Dec 2006, Gordon Farquharson wrote:
>
> Is it at all surprising that the second offset within a page can be
> less than the first offset within a page? e.g.
>
> Chunk 260 corrupted (1-1455) (2769-127)
No, that just means that it went over to the next page (so you actually
had two consecutive pages that weren't written out).
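[ A small sketch of that check, assuming 4096-byte pages and a page-aligned mapping. toy_range_crosses_page() is a hypothetical helper: it reports whether bytes first..last of a chunk straddle a page boundary, which is exactly the case where the second printed page offset comes out smaller than the first. ]

```c
#include <assert.h>
#include <stdbool.h>

#define CHUNKSIZE 1460
#define TOY_PAGE_SIZE 4096	/* assumption: 4 KiB pages */

/* True if bytes first..last of chunk 'nr' do not all lie in one page. */
static bool toy_range_crosses_page(unsigned int nr, unsigned int first,
				   unsigned int last)
{
	unsigned long a, b;

	assert(first <= last && last < CHUNKSIZE);
	a = (unsigned long)nr * CHUNKSIZE + first;
	b = (unsigned long)nr * CHUNKSIZE + last;
	return a / TOY_PAGE_SIZE != b / TOY_PAGE_SIZE;
}
```

With the numbers from the report, chunk 260 (bytes 1-1455) crosses a page boundary, while chunk 256 (same byte range) does not.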
That said, your output is very different from mine in another way. You
don't have zeroes in your pages, rather the thing seems to have data from
the next block (ie the chunk that should have 20 is reported as having 21
etc). You also have your offsets shifted up by one (ie offset 0 looks ok
for you, and then you have a strange pattern of corruption at bytes
1..1455 instead of 0..1459).
You also seem to have an example of the _earlier_ writes being corrupted,
rather than the later ones. For example (but it's also a page-crosser, so
maybe that's part of it):
Chunk 274 corrupted (1-1455) (2729-87)
Expected 18, got 19
Written as (154)11(85)
says that block chunk 274 is the corrupt one, but it was written fairly
early as #11, and the blocks around it (chunks 273 and 275) were actually
written later.
For all I know, my test-program is buggy wrt the ordering printouts,
though. Did you perhaps change the logic in any way?
Linus
[Oops - forgot to hit "Reply to All" first time round.]
Hi Linus
On 12/27/06, Linus Torvalds <[email protected]> wrote:
> For all I know, my test-program is buggy wrt the ordering printouts,
> though. Did you perhaps change the logic in any way?
I don't think so. I did reduce the target size
#define TARGETSIZE (100 << 12)
to make the program finish a little quicker, and for some reason I get
linus-test.c: In function 'remap':
linus-test.c:61: error: 'POSIX_FADV_DONTNEED' undeclared (first use in
this function)
when I compile the program, so I replaced POSIX_FADV_DONTNEED with 4
as defined in /usr/include/bits/fcntl.h.
Other than these two changes, the program is identical to the version
you posted.
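For reference, this is how TARGETSIZE determines the chunk count and the mapped size. It is a minimal sketch that mirrors the NRCHUNKS and SIZE macros of the test program; toy_nrchunks() and toy_mapped_size() are hypothetical helper names.

```c
#include <assert.h>

#define CHUNKSIZE 1460

/* Number of whole chunks that fit in 'targetsize' bytes (NRCHUNKS). */
static unsigned int toy_nrchunks(unsigned long targetsize)
{
	assert(targetsize > 0);
	return (unsigned int)(targetsize / CHUNKSIZE);
}

/* Bytes actually mapped: a whole number of chunks (SIZE). */
static unsigned long toy_mapped_size(unsigned long targetsize)
{
	return (unsigned long)toy_nrchunks(targetsize) * CHUNKSIZE;
}
```

With (100 << 12) this gives 280 chunks and a 408800-byte file; the original (100 << 20) gives 71820 chunks.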
I have run the program a few times, and the output is pretty
consistent. However, when I increase the target size, the difference
between the expected and actual values is larger.
Written as (749)935(738)
Chunk 1113 corrupted (1-1455) (2965-323)
Expected 89, got 93
Written as (935)738(538)
Chunk 1114 corrupted (1-1455) (329-1783)
Expected 90, got 94
Written as (738)538(678)
Chunk 1115 corrupted (1-1455) (1789-3243)
Expected 91, got 95
Written as (538)678(989)
Chunk 1120 corrupted (1-1455) (897-2351)
Expected 96, got 100
Written as (537)265(1005)
Chunk 1121 corrupted (1-1455) (2357-3811)
Expected 97, got 101
Written as (265)1005(-1)
--- linus-test.c.orig 2006-12-28 06:17:24.000000000 +0100
+++ linus-test.c 2006-12-28 06:18:24.000000000 +0100
@@ -6,7 +6,7 @@
 #include <stdio.h>
 #include <time.h>
-#define TARGETSIZE (100 << 20)
+#define TARGETSIZE (100 << 14)
 #define CHUNKSIZE (1460)
 #define NRCHUNKS (TARGETSIZE / CHUNKSIZE)
 #define SIZE (NRCHUNKS * CHUNKSIZE)
@@ -61,7 +61,7 @@
 {
 	if (mapping) {
 		munmap(mapping, SIZE);
-		posix_fadvise(fd, 0, SIZE, POSIX_FADV_DONTNEED);
+		posix_fadvise(fd, 0, SIZE, 4);
 	}
 	return mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
Gordon
--
Gordon Farquharson
From: "Gordon Farquharson" <[email protected]>
Date: Wed, 27 Dec 2006 22:20:20 -0700
> and for some reason I get
>
> linus-test.c: In function 'remap':
> linus-test.c:61: error: 'POSIX_FADV_DONTNEED' undeclared (first use in
> this function)
>
> when I compile the program, so I replaced POSIX_FADV_DONTNEED with 4
> as defined in /usr/include/bits/fcntl.h.
Me too, I added "-D_POSIX_C_SOURCE=200112" to "fix" this.
Perhaps Linus's GCC sets that by default and ours doesn't.
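The same effect can be had in the source instead of on the command line. A minimal sketch, assuming a C library whose headers only expose POSIX_FADV_DONTNEED when a POSIX.1-2001 feature-test macro is in effect; the define has to come before the first #include. toy_drop_cache() is a hypothetical wrapper, not part of the test program.

```c
#define _POSIX_C_SOURCE 200112L	/* must precede every #include */
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/*
 * Ask the kernel to drop cached pages for the first 'len' bytes of 'fd'.
 * posix_fadvise() returns 0 on success or an error number (it does not
 * set errno), so the result is passed straight back to the caller.
 */
static int toy_drop_cache(int fd, off_t len)
{
	int err;

	assert(fd >= 0);
	err = posix_fadvise(fd, 0, len, POSIX_FADV_DONTNEED);
	if (err)
		fprintf(stderr, "posix_fadvise: %s\n", strerror(err));
	return err;
}
```

This avoids hard-coding the value 4, which is specific to one platform's headers.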
Hi David
On 12/27/06, David Miller <[email protected]> wrote:
> Me too, I added "-D_POSIX_C_SOURCE=200112" to "fix" this.
That works for me. Thanks for the tip.
Gordon
--
Gordon Farquharson
Linus Torvalds wrote on Wednesday, December 27, 2006 7:05 PM
> On Wed, 27 Dec 2006, David Miller wrote:
> > >
> > > I still don't see _why_, though. But maybe smarter people than me can see
> > > it..
> >
> > FWIW this program definitely triggers the bug for me.
>
> Ok, now that I have something simple to do repeatable stuff with, I can
> say what the pattern is.. It's not all that surprising, but it's still
> worth just stating for the record.
Running the test code, git bisect points its finger at this commit. Reverting
this commit on top of 2.6.20-rc2 doesn't trigger the bug from the test code.
edc79b2a46ed854595e40edcf3f8b37f9f14aa3f is first bad commit
commit edc79b2a46ed854595e40edcf3f8b37f9f14aa3f
Author: Peter Zijlstra <[email protected]>
Date: Mon Sep 25 23:30:58 2006 -0700
[PATCH] mm: balance dirty pages
Now that we can detect writers of shared mappings, throttle them. Avoids OOM
by surprise.
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Chen, Kenneth wrote on Wednesday, December 27, 2006 9:55 PM
> Linus Torvalds wrote on Wednesday, December 27, 2006 7:05 PM
> > On Wed, 27 Dec 2006, David Miller wrote:
> > > >
> > > > I still don't see _why_, though. But maybe smarter people than me can see
> > > > it..
> > >
> > > FWIW this program definitely triggers the bug for me.
> >
> > Ok, now that I have something simple to do repeatable stuff with, I can
> > say what the pattern is.. It's not all that surprising, but it's still
> > worth just stating for the record.
>
>
> Running the test code, git bisect points its finger at this commit. Reverting
> this commit on top of 2.6.20-rc2 doesn't trigger the bug from the test code.
>
> edc79b2a46ed854595e40edcf3f8b37f9f14aa3f is first bad commit
> commit edc79b2a46ed854595e40edcf3f8b37f9f14aa3f
> Author: Peter Zijlstra <[email protected]>
> Date: Mon Sep 25 23:30:58 2006 -0700
>
> [PATCH] mm: balance dirty pages
>
> Now that we can detect writers of shared mappings, throttle them. Avoids OOM
> by surprise.
Oh, never mind :-( I just didn't create enough writeout pressure when
testing this. I just saw the bug trigger on a kernel I previously thought
was OK.
From: "Chen, Kenneth W" <[email protected]>
Date: Wed, 27 Dec 2006 22:10:52 -0800
> Chen, Kenneth wrote on Wednesday, December 27, 2006 9:55 PM
> > Linus Torvalds wrote on Wednesday, December 27, 2006 7:05 PM
> > > On Wed, 27 Dec 2006, David Miller wrote:
> > > > >
> > > > > I still don't see _why_, though. But maybe smarter people than me can see
> > > > > it..
> > > >
> > > > FWIW this program definitely triggers the bug for me.
> > >
> > > Ok, now that I have something simple to do repeatable stuff with, I can
> > > say what the pattern is.. It's not all that surprising, but it's still
> > > worth just stating for the record.
> >
> >
> > Running the test code, git bisect points its finger at this commit. Reverting
> > this commit on top of 2.6.20-rc2 doesn't trigger the bug from the test code.
> >
> > edc79b2a46ed854595e40edcf3f8b37f9f14aa3f is first bad commit
> > commit edc79b2a46ed854595e40edcf3f8b37f9f14aa3f
> > Author: Peter Zijlstra <[email protected]>
> > Date: Mon Sep 25 23:30:58 2006 -0700
> >
> > [PATCH] mm: balance dirty pages
> >
> > Now that we can detect writers of shared mappings, throttle them. Avoids OOM
> > by surprise.
>
>
> Oh, never mind :-( I just didn't create enough writeout pressure when
> testing this. I just saw the bug trigger on a kernel I previously thought
> was OK.
Besides, I'm pretty sure that from the Debian bug entry it's been
established that the dirty-page tracking changes from a few releases
ago introduced this problem.
On Wed, 2006-12-27 at 19:04 -0800, Linus Torvalds wrote:
>
> On Wed, 27 Dec 2006, David Miller wrote:
> > >
> > > I still don't see _why_, though. But maybe smarter people than me can see
> > > it..
> >
> > FWIW this program definitely triggers the bug for me.
>
> Ok, now that I have something simple to do repeatable stuff with, I can
> say what the pattern is.. It's not all that surprising, but it's still
> worth just stating for the record.
>
> What happens is that when I do the "packetized writes" in random order,
> the _last_ write to a page occasionally just goes missing. It's not always
> at the end of a page, as shown by for example:
>
> - A whole chunk got dropped:
>
> Chunk 2094 corrupted (0-1459) (1624-3083)
> Expected 46, got 0
> Written as (30912)55414(10000)
>
> That "Written as (x)y(z)" line means that the corrupted chunk was
> written as chunk #y, and the preceding and following chunks (that were
> _not_ corrupt) on the page were written as #x and #z respectively.
>
> In other words, the missing chunk (which is still zero) was written
> much later than the ones that were ok, and never hit the disk. It's a
> contiguous chunk in the middle of the page (chunks are 1460 bytes in
> size)
>
> The first line means that all bytes of the chunk (0-1459) were
> corrupted, and the values in parentheses are the offsets within a page.
> In other words, this was a chunk in the _middle_ of a page.
>
> - The missing data can also be at the beginning or ends of pages:
>
> Beginning of the chunk missing, it was at the end of a page (page
> offsets 3288-4095) and the _next_ page got written out fine:
>
> Chunk 2126 corrupted (0-807) (3288-4095)
> Expected 78, got 0
> Written as (32713)55573(14301)
>
> End of a chunk missing, it was the beginning of a page (and the
> _previous_ page that contained the beginning of the chunk was written
> out fine)
>
> Chunk 2179 corrupted (1252-1459) (0-207)
> Expected 131, got 0
> Written as (45189)55489(15515)
>
> Now, the reason I say this isn't surprising is that this is entirely
> consistent with the dirty bit being dropped on the floor somewhere, and
> likely through some interaction with the previous changes being in the
> process of being written out.
>
> Something (incorrectly) ends up deciding that it doesn't need to write the
> page, since it's already written, or alternatively clears the dirty bit
> too late (clears it because an _earlier_ write finished, never mind that
> the new dirty data didn't make it).
There might be a narrow race between set_page_dirty and clear_page_dirty.
The test program is a process that writes and then reads data, while pdflush
may write the data to disk asynchronously. After pdflush writes a page to disk,
it will call clear_page_dirty (possibly from a softirq) to clear the dirty bit
once the completion interrupt arrives. But just after the page has been written
to disk, and before the interrupt reports the result, the test process might
change the data and unmap the area. When the area is unmapped, the page is
marked dirty again, but immediately afterwards the interrupt arrives and the
dirty bit is cleared, so the later data is never written to disk.
The function zap_pte_range checks the pte and sets the page dirty if needed,
but it does not hold the page lock. If the page lock were taken before setting
the page dirty, the race might be avoided.
Yanmin
On Wed, Dec 27, 2006 at 10:20:20PM -0700, Gordon Farquharson wrote:
> I have run the program a few times, and the output is pretty
> consistent. However, when I increase the target size, the difference
> between the expected and actual values is larger.
>
> Written as (749)935(738)
> Chunk 1113 corrupted (1-1455) (2965-323)
> Expected 89, got 93
This is not the corruption Linus is after. Note that the corruption starts
at offset '1'. Also note that:
89 = 1113 & 255
93 = 1113 & 255 | (1113 >> 8)
and if you look at glibc's memset() function, you'll notice that's exactly
what you expect if you pass a non-8bit value to it. Ergo, what you're
seeing is utterly expected given glibc's memset() implementation on ARM.
Fixing Linus' test program to pass nr & 255 to memset results in clean
passes on 2.6.9 on TheCus N2100 (IOP8032x) and 2.6.16.9 StrongARM
machines (as would be expected.)
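The arithmetic above, and the workaround, can be written down as a small sketch. The "observed" formula is taken directly from the two lines above and describes what was reported on the affected ARM systems; it is not a statement about how that memset() is implemented. toy_expected(), toy_observed() and toy_fillmem() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNKSIZE 1460

/* Byte value the checker expects for chunk 'nr'. */
static unsigned char toy_expected(int nr)
{
	return (unsigned char)(nr & 255);
}

/* Byte value reported on the affected systems, per the formula above. */
static unsigned char toy_observed(int nr)
{
	return (unsigned char)((nr & 255) | (nr >> 8));
}

/* Workaround: truncate the value before handing it to memset(). */
static void toy_fillmem(void *start, int nr)
{
	assert(start != NULL);
	memset(start, nr & 255, CHUNKSIZE);
}
```

For chunk 1113 this gives 89 expected and 93 observed. Chunk 257 gives 1 in both cases, which would explain why only every second chunk showed up in the earlier 280-chunk run.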
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:
On Wed, Dec 27, 2006 at 07:04:34PM -0800, Linus Torvalds wrote:
> [ Modified test-program that tells you where the corruption happens (and
> when the missing parts were supposed to be written out) appended, in
> case people care. ]
Hi
2.6.18 (and 2.6.18.6) is ok, 2.6.19-rc1 is broken. I tried some snapshots
between them but they hung before shell (2.6.18-git11, 2.6.18-git16,
2.6.18-git20, 2.6.18-git21). 2.6.18-git22 booted and was broken.
(UP, no preempt)
-Petri
On 12/28/06, Russell King <[email protected]> wrote:
> Fixing Linus' test program to pass nr & 255 to memset results in clean
> passes on 2.6.9 on TheCus N2100 (IOP8032x) and 2.6.16.9 StrongARM
> machines (as would be expected.)
Thanks for the fix, Russell.
I can now trigger the (real) problem by using a 25 MB file (100 << 18)
and the Linksys NSLU2 (ARM, IXP420 processor, 32 MB RAM).
$ ./linus-test
Writing chunk 17954/17955 (99%)
Chunk 514 corrupted (0-1459) (872-2331)
Expected 2, got 0
Written as (8479)11160(10312)
Chunk 516 corrupted (0-303) (3792-4095)
Expected 4, got 0
Written as (10312)10569(4426)
Chunk 959 corrupted (0-691) (3404-4095)
Expected 191, got 0
Written as (687)4881(1522)
Chunk 1895 corrupted (0-1459) (1900-3359)
Expected 103, got 0
Written as (7746)8389(6231)
Chunk 2702 corrupted (0-1459) (472-1931)
Expected 142, got 0
Written as (4866)7103(2409)
Chunk 3314 corrupted (0-1459) (1064-2523)
Expected 242, got 0
Written as (4287)7064(1730)
Chunk 4043 corrupted (0-1459) (444-1903)
Expected 203, got 0
Written as (6495)8509(4464)
Chunk 5180 corrupted (0-1459) (1584-3043)
Expected 60, got 0
Written as (11056)12826(10797)
Chunk 5672 corrupted (0-991) (3104-4095)
Expected 40, got 0
Written as (9944)4872(41)
Chunk 5793 corrupted (460-1459) (0-999)
Expected 161, got 0
Written as (7059)5038(4377)
Chunk 6089 corrupted (0-1459) (1620-3079)
Expected 201, got 0
Written as (4672)5230(4403)
Chunk 6545 corrupted (268-1459) (0-1191)
Expected 145, got 0
Written as (3701)5969(4668)
Chunk 7578 corrupted (0-1459) (584-2043)
Expected 154, got 0
Written as (10015)5082(1648)
Chunk 7880 corrupted (864-1459) (0-595)
Expected 200, got 0
Written as (17869)5064(4745)
Chunk 8086 corrupted (0-1459) (888-2347)
Expected 150, got 0
Written as (10206)11050(10374)
Chunk 8749 corrupted (0-1459) (2212-3671)
Expected 45, got 0
Written as (15263)7132(4825)
Chunk 9068 corrupted (0-1459) (1008-2467)
Expected 108, got 0
Written as (5557)7571(6771)
Chunk 9193 corrupted (812-1459) (0-647)
Expected 233, got 0
Written as (9238)7277(4757)
Chunk 10032 corrupted (576-1459) (0-883)
Expected 48, got 0
Written as (15741)10012(1753)
Chunk 10056 corrupted (0-1459) (1696-3155)
Expected 72, got 0
Written as (5379)7431(262)
Chunk 10395 corrupted (0-1459) (1020-2479)
Expected 155, got 0
Written as (21)7442(5902)
Chunk 10791 corrupted (0-1459) (1644-3103)
Expected 39, got 0
Written as (4753)5925(5926)
Chunk 10792 corrupted (0-991) (3104-4095)
Expected 40, got 0
Written as (5925)5926(8555)
Chunk 11036 corrupted (0-1103) (2992-4095)
Expected 28, got 0
Written as (13755)14449(7458)
Chunk 11387 corrupted (644-1459) (0-815)
Expected 123, got 0
Written as (10853)11459(9445)
Chunk 11586 corrupted (920-1459) (0-539)
Expected 66, got 0
Written as (3769)11691(11123)
Chunk 11882 corrupted (0-1459) (1160-2619)
Expected 106, got 0
Written as (10736)11696(2788)
Chunk 12397 corrupted (0-603) (3492-4095)
Expected 109, got 0
Written as (2352)7515(2437)
Chunk 12669 corrupted (0-795) (3300-4095)
Expected 125, got 0
Written as (1191)7661(5266)
Chunk 13162 corrupted (0-1459) (2184-3643)
Expected 106, got 0
Written as (9383)13662(11544)
Chunk 14653 corrupted (0-27) (4068-4095)
Expected 61, got 0
Written as (8100)9456(1275)
Chunk 17332 corrupted (0-367) (3728-4095)
Expected 180, got 0
Written as (760)12247(1244)
Chunk 17445 corrupted (0-1459) (772-2231)
Expected 37, got 0
Written as (8007)16481(14439)
Chunk 17556 corrupted (0-1007) (3088-4095)
Expected 148, got 0
Written as (10113)10657(10477)
Chunk 17859 corrupted (0-995) (3100-4095)
Expected 195, got 0
Written as (14472)14767(11426)
Checking chunk 17954/17955 (99%)
Gordon
--
Gordon Farquharson
I set up a qemu environment to test kernels: http://guichaz.free.fr/linux-bug/
I have corruption with every Fedora release kernel except the first, that is
2.4.22 works, but 2.6.5, 2.6.9, 2.6.11, 2.6.15 and 2.6.18-1.2798 exhibit
some corruption.
Command line to test:
qemu root_fs -snapshot -kernel FC-kernels/FC2-vmlinuz-2.6.5-1.358 -append 'rw root=/dev/hda'
I get this kind of corruption:
http://guichaz.free.fr/linux-bug/corruption.png
--
Guillaume
* Gordon Farquharson <[email protected]> [2006-12-28 07:15]:
> Thanks for the fix, Russell.
>
> I can now trigger the (real) problem by using a 25 MB file (100 << 18)
> and the Linksys NSLU2 (ARM, IXP420 processor, 32 MB RAM).
Me too (using 100 << 18). Interestingly, I don't seem to get any
corruption on a different ARM board, an IOP32x based machine with 128
MB RAM.
--
Martin Michlmayr
http://www.cyrius.com/
On Wed, 27 Dec 2006, Chen, Kenneth W wrote:
> >
> > Running the test code, git bisect points its finger at this commit. Reverting
> > this commit on top of 2.6.20-rc2 doesn't trigger the bug from the test code.
> >
> > [PATCH] mm: balance dirty pages
> >
> > Now that we can detect writers of shared mappings, throttle them. Avoids OOM
> > by surprise.
>
> Oh, never mind :-( I just didn't create enough writeout pressure when
> testing this. I just saw the bug trigger on a kernel I previously thought
> was OK.
Btw, this is an important point - people have long felt that the new page
balancing in 2.6.19 was to blame, but you've just confirmed the long-held
suspicion (at least by me) that it's not actually a new bug at all, it's
just that the dirty page balancing causes writeback to happen _earlier_,
and thus is better able to _show_ a bug that we've likely had for a long
long time.
Linus
On Thu, 28 Dec 2006, Zhang, Yanmin wrote:
>
> The test program is a process that writes and then reads data, while pdflush
> may write the data to disk asynchronously. After pdflush writes a page to disk,
> it will call clear_page_dirty (possibly from a softirq) to clear the dirty bit
> once the completion interrupt arrives.
That would indeed be a horrible bug. However, we don't do
"clear_page_dirty()" _after_ the IO has completed, we do it _before_ the
IO starts.
If you can actually find a place that does clear_page_dirty _after_ IO,
then yes, you've just found the bug. But I haven't found it.
So the rule is _always_:
- call "clear_page_dirty_for_io()" with the page lock held, and _before_
the IO starts.
- do "set_page_writeback()" before unlocking the page again
- do a "end_page_writeback()" when the IO actually finishes.
and any code sequence that doesn't honor those rules would be buggy. A
beer for anybody that finds it..
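[ Why the ordering matters can be shown with a toy user-space model. This is only a sketch under heavy simplification: a "page" is one int, there is no locking, and every toy_* name is made up. It is not kernel code. The flag clear_dirty_first selects between the rule above and the hypothetical buggy order where the dirty bit is cleared at IO completion. ]

```c
#include <assert.h>
#include <stdbool.h>

/* Toy page: one int of data plus the two flags the rule talks about. */
struct toy_page {
	bool dirty;
	bool writeback;
	int mem;	/* contents in memory */
	int disk;	/* contents that have reached the disk */
};

/* A writer stores new data and marks the page dirty. */
static void toy_write(struct toy_page *pg, int val)
{
	pg->mem = val;
	pg->dirty = true;
}

/* Start IO; returns the snapshot of the data handed to the "device". */
static int toy_start_io(struct toy_page *pg, bool clear_dirty_first)
{
	assert(!pg->writeback);
	if (clear_dirty_first)
		pg->dirty = false;	/* the rule: clear before IO starts */
	pg->writeback = true;
	return pg->mem;
}

/* IO completion: the snapshot is now on disk. */
static void toy_end_io(struct toy_page *pg, int snapshot,
		       bool clear_dirty_first)
{
	assert(pg->writeback);
	pg->disk = snapshot;
	pg->writeback = false;
	if (!clear_dirty_first)
		pg->dirty = false;	/* hypothetical bug: clear after IO */
}

/*
 * Write 1, start writeback, write 2 while the IO is in flight, complete
 * the IO. Returns true if the page ends up clean with stale data on disk.
 */
static bool toy_loses_data(bool clear_dirty_first)
{
	struct toy_page pg = { false, false, 0, 0 };
	int snapshot;

	toy_write(&pg, 1);
	snapshot = toy_start_io(&pg, clear_dirty_first);
	toy_write(&pg, 2);
	toy_end_io(&pg, snapshot, clear_dirty_first);
	return !pg.dirty && pg.disk != pg.mem;
}
```

With the dirty bit cleared before the IO, the second write leaves the page dirty and it gets written again later; with the bit cleared at completion, the second write is silently lost.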
Linus
On Thu, 28 Dec 2006, Russell King wrote:
>
> and if you look at glibc's memset() function, you'll notice that's exactly
> what you expect if you pass a non-8bit value to it. Ergo, what you're
> seeing is utterly expected given glibc's memset() implementation on ARM.
Guys, you _really_ should fix memset(). What you describe is a _bug_.
"memset()" takes an "int" as its argument (always has), and has to convert
it to a byte _itself_. It may not be common, but it's perfectly normal, to
pass it values outside 0-255 (negative values that still fit in a "signed
char" in particular are very normal, but my usage of "let the thing
truncate it itself" is also quite fine).
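[ For the record, what the C standard requires here: memset() converts its int argument to unsigned char and stores that in every byte. A minimal sketch, assuming 8-bit bytes and a conforming C library; toy_memset_byte() is a made-up helper name. ]

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Fill a small buffer with memset() and return the byte actually stored. */
static unsigned char toy_memset_byte(int value)
{
	unsigned char buf[16];
	size_t i;

	memset(buf, value, sizeof(buf));
	/* every byte of the buffer must hold the same value */
	for (i = 1; i < sizeof(buf); i++)
		assert(buf[i] == buf[0]);
	return buf[0];
}
```

On a conforming implementation toy_memset_byte(1113) is 89 and toy_memset_byte(-1) is 255.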
> Fixing Linus' test program to pass nr & 255 to memset
No. I'm almost certain that that is not a "fix", it's a workaround for a
serious bug in your glibc crap.
But it does explain all the unexpected strange behaviour (and the really
small writeback size - now it doesn't need any /proc/sys/vm/dirty_ratio
assumptions to be explicable).
Linus
On Tue, Dec 19, 2006 at 09:51:49AM +0100, Marc Haber wrote:
> On Sun, Dec 17, 2006 at 09:43:08PM -0800, Andrew Morton wrote:
> > Six hours here of fsx-linux plus high memory pressure on SMP on 1k
> > blocksize ext3, mainline. Zero failures. It's unlikely that this testing
> > would pass, yet people running normal workloads are able to easily trigger
> > failures. I suspect we're looking in the wrong place.
>
> I do not have a clue about memory management at all, but is it
> possible that you're testing on a box with too much memory? My box has
> only 256 MB, and I used to use mutt with a _huge_ inbox with mutt
> taking somewhat 150 MB. Add spamassassin and a reasonably busy mail
> server, and the box used to be like 150 MB in swap.
>
> I have tidied my inbox in the mean time and mutt's memory requirement
> has been reduced to somewhat 30 MB, which might be the cause that I
> don't see the issue that often any more.
After being up for ten days, I have now encountered the file
corruption of pkgcache.bin for the first time again. The 256 MB i386
box is about 26 MB into swap and is under very moderate load.
I am running plain vanilla 2.6.19.1. Is there a patch that I should
apply against 2.6.19.1 that would help in debugging?
Greetings
Marc
--
-----------------------------------------------------------------------------
Marc Haber | "I don't trust Computers. They | Mailadresse im Header
Mannheim, Germany | lose things." Winona Ryder | Fon: *49 621 72739834
Nordisch by Nature | How to make an American Quilt | Fax: *49 621 72739835
On Thu, Dec 28, 2006 at 09:27:12AM -0800, Linus Torvalds wrote:
> On Thu, 28 Dec 2006, Russell King wrote:
> > and if you look at glibc's memset() function, you'll notice that's exactly
> > what you expect if you pass a non-8bit value to it. Ergo, what you're
> > seeing is utterly expected given glibc's memset() implementation on ARM.
>
> Guys, you _really_ should fix memset(). What you describe is a _bug_.
Yup, but I have nothing to do with glibc because I refuse to do that
silly copyright assignment FSF thing. Hopefully someone else can
resolve it, but...
> > Fixing Linus' test program to pass nr & 255 to memset
>
> No. I'm almost certain that that is not a "fix", it's a workaround for a
> serious bug in your glibc crap.
_is_ a fix whether _you_ like it or not to work around the issue so
people can at least run your test program. I'm not saying it's a
proper fix though.
Of course, if you prefer to be misled by incorrect bug reports...
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:
On Thu, 28 Dec 2006, Marc Haber wrote:
>
> After being up for ten days, I have now encountered the file
> corruption of pkgcache.bin for the first time again. The 256 MB i386
> box is about 26 MB into swap and is under very moderate load.
>
> I am running plain vanilla 2.6.19.1. Is there a patch that I should
> apply against 2.6.19.1 that would help in debugging?
Not right now.
And I have a test-program that shows the corruption _much_ easier (at
least according to my own testing, and that of several reporters that back
me up), and that seems to show the corruption going way way back (ie going
back to Linux-2.6.5 at least, according to one tester).
So it just got a lot _easier_ to trigger in 2.6.19, but it's not a new
bug.
What we need now is actually looking at the source code, and people who
understand the VM, I'm afraid. I'm gathering traces now that I have a good
test-case. I'll post my trace tools once I've tested that they work, in
case others want to help.
(And hey, you don't have to be a VM expert to help: this could be a
learning experience. However, I'll warn you: this is _the_ most grotty
part of the whole kernel. It's not even ugly, it's just damn hard and
complex).
Linus
On Thu, 28 Dec 2006, Russell King wrote:
>
> Yup, but I have nothing to do with glibc because I refuse to do that
> silly copyright assignment FSF thing. Hopefully someone else can
> resolve it, but...
Yeah, me too.
> _is_ a fix whether _you_ like it or not to work around the issue so
> people can at least run your test program. I'm not saying it's a
> proper fix though.
My point was that it wasn't a "fix", it's a "workaround". The _fix_ would
be in glibc.
Nothing more..
Linus
On Thu, Dec 28, 2006 at 11:00:46AM -0800, Linus Torvalds wrote:
> And I have a test-program that shows the corruption _much_ easier (at
> least according to my own testing, and that of several reporters that back
> me up), and that seems to show the corruption going way way back (ie going
> back to Linux-2.6.5 at least, according to one tester).
That was a Fedora kernel. Has anyone seen the corruption in vanilla 2.6.18
(or older)?
Guillaume Chazarain wrote:
> I get this kind of corruption:
> http://guichaz.free.fr/linux-bug/corruption.png
Actually in qemu, I get three different behaviours:
- no corruption at all : with linux-2.4
- corruption only on the first chunks: before [PATCH] mm: balance dirty
pages as identified by Kenneth
- corruption of all chunks: after the balance dirty pages patch
Bisecting in linux-2.5 land I found
http://kernel.org/pub/linux/kernel/people/akpm/patches/2.5/2.5.66/2.5.66-mm3/broken-out/fadvise-flush-data.patch
to cause the corruption for me.
The attached patch fixes the corruption for me.
--
Guillaume
On Thu, 28 Dec 2006, Petri Kaukasoina wrote:
> > me up), and that seems to show the corruption going way way back (ie going
> > back to Linux-2.6.5 at least, according to one tester).
>
> That was a Fedora kernel. Has anyone seen the corruption in vanilla 2.6.18
> (or older)?
Well, that was a really _old_ fedora kernel. I guarantee you it didn't
have the page throttling patches in it, those were written this summer. So
it would either have to be Fedora carrying around another patch that just
happens to result in the same corruption for _years_, or it's the same
bug.
I bet it's the same bug, and it's been around for ages.
Linus
On Thu, 28 Dec 2006, Guillaume Chazarain wrote:
>
> The attached patch fixes the corruption for me.
Well, that's a good hint, but it's really just a symptom. You effectively
just made the test-program not even try to flush the data to disk, so the
page cache would stay in memory, and you'd not see the corruption as well.
So you basically disabled the code that tried to trigger the bug more
easily.
But the reason I say it's interesting is that "WB_SYNC_NONE" is very much
implicated in mm/page-writeback.c, and if there is a bug triggered by
WB_SYNC_NONE writebacks, then that would explain why page-writeback.c also
fails..
Linus
On Thu, Dec 28, 2006 at 11:21:21AM -0800, Linus Torvalds wrote:
>
>
> On Thu, 28 Dec 2006, Petri Kaukasoina wrote:
> > > me up), and that seems to show the corruption going way way back (ie going
> > > back to Linux-2.6.5 at least, according to one tester).
> >
> > That was a Fedora kernel. Has anyone seen the corruption in vanilla 2.6.18
> > (or older)?
>
> Well, that was a really _old_ fedora kernel. I guarantee you it didn't
> have the page throttling patches in it, those were written this summer. So
> it would either have to be Fedora carrying around another patch that just
> happens to result in the same corruption for _years_, or it's the same
> bug.
The only notable VM patch in Fedora kernels of that vintage that I recall
was Ingo's 4g/4g thing.
Dave
--
http://www.codemonkey.org.uk
On Thu, 28 Dec 2006 11:28:52 -0800 (PST)
Linus Torvalds <[email protected]> wrote:
>
>
> On Thu, 28 Dec 2006, Guillaume Chazarain wrote:
> >
> > The attached patch fixes the corruption for me.
>
> Well, that's a good hint, but it's really just a symptom. You effectively
> just made the test-program not even try to flush the data to disk, so the
> page cache would stay in memory, and you'd not see the corruption as well.
>
> So you basically disabled the code that tried to trigger the bug more
> easily.
>
> But the reason I say it's interesting is that "WB_SYNC_NONE" is very much
> implicated in mm/page-writeback.c, and if there is a bug triggered by
> WB_SYNC_NONE writebacks, then that would explain why page-writeback.c also
> fails..
>
It would be interesting to convert your app to do fsync() before
FADV_DONTNEED. That would take WB_SYNC_NONE out of the picture as well
(apart from pdflush activity).
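For readers following along, the suggested change can be sketched roughly as below. This is only an illustration of "fsync() before FADV_DONTNEED", not the actual test program; the helper name sketch_flush_then_drop() is made up for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the thread's test program): write back
 * everything for fd synchronously, then ask the kernel to drop the cached
 * pages, which should be clean by then.
 * Returns 0 on success, -1 if fsync() failed, or the positive error
 * number that posix_fadvise() returned.
 */
static int sketch_flush_then_drop(int fd)
{
	if (fsync(fd) != 0)
		return -1;
	/* offset 0 with len 0 covers the whole file */
	return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
}
```

Because fsync() waits for the data to reach disk, the later FADV_DONTNEED call has no dirty pages left that it would need to flush itself.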
On Thu, 2006-12-28 at 14:39 -0500, Dave Jones wrote:
> On Thu, Dec 28, 2006 at 11:21:21AM -0800, Linus Torvalds wrote:
> >
> >
> > On Thu, 28 Dec 2006, Petri Kaukasoina wrote:
> > > > me up), and that seems to show the corruption going way way back (ie going
> > > > back to Linux-2.6.5 at least, according to one tester).
> > >
> > > That was a Fedora kernel. Has anyone seen the corruption in vanilla 2.6.18
> > > (or older)?
> >
> > Well, that was a really _old_ fedora kernel. I guarantee you it didn't
> > have the page throttling patches in it, those were written this summer. So
> > it would either have to be Fedora carrying around another patch that just
> > happens to result in the same corruption for _years_, or it's the same
> > bug.
>
> The only notable VM patch in Fedora kernels of that vintage that I recall
> was Ingo's 4g/4g thing.
which does tlb flushes *all the time* so that even rules out (well
almost) a stale tlb somewhere...
On Thu, 28 Dec 2006, Andrew Morton wrote:
>
> It would be interesting to convert your app to do fsync() before
> FADV_DONTNEED. That would take WB_SYNC_NONE out of the picture as well
> (apart from pdflush activity).
I get corruption - but the whole point is that it's very much pdflush that
should be writing these pages out.
Andrew - give my test-program a try. It can run in about 1 minute if you
have a 256MB machine (I didn't, but "mem=256M" is my friend), and it seems
to very consistently cause corruption.
What I do is:
# Make sure we write back aggressively
echo 5 > /proc/sys/vm/dirty_ratio
as root, and then just run the thing. Tons of corruption. But the
corruption goes away if I just leave the default dirty ratio alone (but
then I can increase the file size to trigger it, of course - but that
also makes the test run a lot slower).
Now, with a pre-2.6.19 kernel, I bet you won't get the corruption as
easily (at least with the "fsync()"), but that's less to do with anything
new, and probably just because then you simply won't have any pdflushing
going on - since the kernel won't even notice that you have tons of dirty
pages ;)
It might also depend on the speed of your disk drive - the machine I test
this on has a slow 4200 rpm laptop drive in it, and that probably makes
things go south more easily. That's _especially_ true if this is related
to any "bdi_write_congested()" logic.
Now, it could also be related to various code snippets like
	...
	if (wbc->sync_mode != WB_SYNC_NONE)
		wait_on_page_writeback(page);
	if (PageWriteback(page) ||
			!clear_page_dirty_for_io(page)) {
		unlock_page(page);
		continue;
	}
	...
where the WB_SYNC_NONE case will hit the "PageWriteback()" and just not do
the writeback at all (but it also won't clear the dirty bit, so it's
certainly not an *OBVIOUS* bug).
We also have code like this ("pageout()"):
	if (clear_page_dirty_for_io(page)) {
		int res;
		struct writeback_control wbc = {
			.sync_mode = WB_SYNC_NONE,
			..
		}
		...
		res = mapping->a_ops->writepage(page, &wbc);
and in this case, if the "WB_SYNC_NONE" means that the "writepage()" call
won't do anything at all because of congestion, then that would be a _bad_
thing, and would certainly explain how something didn't get written out.
But that particular path should only trigger for the "shrink_page_list()"
case, and it's not the case I seem to be testing with my "low dirty_ratio"
testing.
Linus
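The test program is not reproduced in this part of the thread. As a rough guide to the kind of test being discussed (dirty a file through a shared mapping, unmap it, invalidate the page cache with posix_fadvise(), remap and verify), a minimal sketch might look like the following. The chunk size, the write order, and every sketch_* name are assumptions made for this example; the real test.c writes its chunks in a random order and logs each write.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define SKETCH_CHUNK 1500	/* hypothetical size, chosen so chunks straddle pages */

/* Fill byte for a chunk; never 0, so a lost write shows up as zeroes. */
static unsigned char sketch_fill(size_t chunk)
{
	return (unsigned char)(chunk % 255 + 1);
}

/* Returns the number of corrupted chunks, or -1 if setup failed. */
static long sketch_run(const char *path, size_t nchunks)
{
	size_t len = nchunks * SKETCH_CHUNK;
	long bad = 0;
	unsigned char *map;
	int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t)len) != 0)
		goto fail;
	map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto fail;

	/* Dirty the file through the shared mapping: odd chunks, then even. */
	for (size_t pass = 0; pass < 2; pass++)
		for (size_t c = 1 - pass; c < nchunks; c += 2)
			memset(map + c * SKETCH_CHUNK, sketch_fill(c), SKETCH_CHUNK);

	/* Unmap and ask for the cached pages to be dropped ... */
	munmap(map, len);
	posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

	/* ... then remap and check what comes back. */
	map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto fail;
	for (size_t c = 0; c < nchunks; c++)
		for (size_t i = 0; i < SKETCH_CHUNK; i++)
			if (map[c * SKETCH_CHUNK + i] != sketch_fill(c)) {
				bad++;
				break;
			}
	munmap(map, len);
	close(fd);
	unlink(path);
	return bad;
fail:
	close(fd);
	unlink(path);
	return -1;
}
```

On a correctly working kernel sketch_run() should report zero corrupted chunks. Triggering the bug discussed here would presumably need a file much larger than the dirty threshold allows, along the lines of the low dirty_ratio setup described above.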
On Thu, 28 Dec 2006, Linus Torvalds wrote:
>
> What we need now is actually looking at the source code, and people who
> understand the VM, I'm afraid. I'm gathering traces now that I have a good
> test-case. I'll post my trace tools once I've tested that they work, in
> case others want to help.
Ok, I've got the traces, but quite frankly, I doubt anybody is crazy
enough to want to trawl through them. It's a bit painful, since we're
talking thousands of pages to trigger this problem.
Also, I've used the PG_arch_1 flag, which is fine on x86[-64] and probably
ARM, but is used for other things on ia64, powerpc and sparc64. But here's
the patch in case anybody cares.
It wants a _big_ kernel buffer to capture all the crud into (which is why
I made the thing accept a bigger log buffer), and quite frankly, I'm not
at all sure that all the locking is ok (ie I could imagine that the
dcache-locking thing there in "is_interesting()" could deadlock, what do I
know..)
But I've captured some real data with this, which I'll describe
separately.
Linus
----
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 350878a..967dd80 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -91,6 +91,8 @@
#define PG_nosave_free 18 /* Used for system suspend/resume */
#define PG_buddy 19 /* Page is free, on buddy lists */
+#define SetPageInteresting(page) set_bit(PG_arch_1, &(page)->flags)
+#define PageInteresting(page) test_bit(PG_arch_1, &(page)->flags)
#if (BITS_PER_LONG > 32)
/*
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 5c26818..7735b83 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -79,7 +79,7 @@ config DEBUG_KERNEL
config LOG_BUF_SHIFT
int "Kernel log buffer size (16 => 64KB, 17 => 128KB)" if DEBUG_KERNEL
- range 12 21
+ range 12 24
default 17 if S390 || LOCKDEP
default 16 if X86_NUMAQ || IA64
default 15 if SMP
diff --git a/mm/filemap.c b/mm/filemap.c
index 8332c77..d6a0f56 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -116,6 +116,7 @@ void __remove_from_page_cache(struct page *page)
{
struct address_space *mapping = page->mapping;
+if (PageInteresting(page)) printk("Removing index %08x from page cache\n", page->index);
radix_tree_delete(&mapping->page_tree, page->index);
page->mapping = NULL;
mapping->nrpages--;
@@ -421,6 +422,23 @@ int filemap_write_and_wait_range(struct address_space *mapping,
return err;
}
+static noinline int is_interesting(struct address_space *mapping)
+{
+ struct inode *inode = mapping->host;
+ struct dentry *dentry;
+ int retval = 0;
+
+ spin_lock(&dcache_lock);
+ list_for_each_entry(dentry, &inode->i_dentry, d_alias) {
+ if (strcmp(dentry->d_name.name, "mapfile"))
+ continue;
+ retval = 1;
+ break;
+ }
+ spin_unlock(&dcache_lock);
+ return retval;
+}
+
/**
* add_to_page_cache - add newly allocated pagecache pages
* @page: page to add
@@ -439,6 +457,9 @@ int add_to_page_cache(struct page *page, struct address_space *mapping,
{
int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
+ if (is_interesting(mapping))
+ SetPageInteresting(page);
+
if (error == 0) {
write_lock_irq(&mapping->tree_lock);
error = radix_tree_insert(&mapping->page_tree, offset, page);
diff --git a/mm/memory.c b/mm/memory.c
index 563792f..14c9815 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -667,6 +667,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
tlb_remove_tlb_entry(tlb, pte, addr);
if (unlikely(!page))
continue;
+if (PageInteresting(page))
+ printk("Unmapped index %08x at %08x\n", page->index, addr);
if (unlikely(details) && details->nonlinear_vma
&& linear_page_index(details->nonlinear_vma,
addr) != page->index)
@@ -1605,6 +1607,7 @@ gotten:
*/
ptep_clear_flush(vma, address, page_table);
set_pte_at(mm, address, page_table, entry);
+if (PageInteresting(new_page)) printk("do_wp_page: mapping index %08x at %08lx\n", new_page->index, address);
update_mmu_cache(vma, address, entry);
lru_cache_add_active(new_page);
page_add_new_anon_rmap(new_page, vma, address);
@@ -2249,6 +2252,7 @@ retry:
entry = mk_pte(new_page, vma->vm_page_prot);
if (write_access)
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+if (PageInteresting(new_page)) printk("do_no_page: mapping index %08x at %08lx (%s)\n", new_page->index, address, write_access ? "write" : "read");
set_pte_at(mm, address, page_table, entry);
if (anon) {
inc_mm_counter(mm, anon_rss);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index b3a198c..0466601 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -813,6 +813,7 @@ int fastcall set_page_dirty(struct page *page)
if (!spd)
spd = __set_page_dirty_buffers;
#endif
+if (PageInteresting(page)) printk("Setting page %08x dirty\n", page->index);
return (*spd)(page);
}
if (!PageDirty(page)) {
@@ -867,6 +868,7 @@ int clear_page_dirty_for_io(struct page *page)
if (TestClearPageDirty(page)) {
if (mapping_cap_account_dirty(mapping)) {
+if (PageInteresting(page)) printk("cpd_for_io: index %08x\n", page->index);
page_mkclean(page);
dec_zone_page_state(page, NR_FILE_DIRTY);
}
diff --git a/mm/rmap.c b/mm/rmap.c
index 57306fa..e98e84c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -448,6 +448,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma)
if (pte_dirty(*pte) || pte_write(*pte)) {
pte_t entry;
+if (PageInteresting(page)) printk("cleaning index %08x at %08x\n", page->index, address);
flush_cache_page(vma, address, pte_pfn(*pte));
entry = ptep_clear_flush(vma, address, pte);
entry = pte_wrprotect(entry);
@@ -637,6 +638,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
goto out_unmap;
}
+if (PageInteresting(page)) printk("unmapping index %08x from %08lx\n", page->index, address);
/* Nuke the page table entry. */
flush_cache_page(vma, address, page_to_pfn(page));
pteval = ptep_clear_flush(vma, address, pte);
@@ -767,6 +769,7 @@ static void try_to_unmap_cluster(unsigned long cursor,
if (ptep_clear_flush_young(vma, address, pte))
continue;
+if (PageInteresting(page)) printk("Cluster-unmapping %08x from %08lx\n", page->index, address);
/* Nuke the page table entry. */
flush_cache_page(vma, address, pte_pfn(*pte));
pteval = ptep_clear_flush(vma, address, pte);
On Thu, Dec 28, 2006 at 01:24:30PM -0800, Linus Torvalds wrote:
> On Thu, 28 Dec 2006, Linus Torvalds wrote:
> >
> > What we need now is actually looking at the source code, and people who
> > understand the VM, I'm afraid. I'm gathering traces now that I have a good
> > test-case. I'll post my trace tools once I've tested that they work, in
> > case others want to help.
>
> Ok, I've got the traces, but quite frankly, I doubt anybody is crazy
> enough to want to trawl through them. It's a bit painful, since we're
> talking thousands of pages to trigger this problem.
>
> Also, I've used the PG_arch_1 flag, which is fine on x86[-64] and probably
> ARM, but is used for other things on ia64, powerpc and sparc64. But here's
> the patch in case anybody cares.
PG_arch_1 is used on ARM to flag pages that need a dcache flush prior to
hitting userspace, in the same way that sparc64 uses it. So ARM systems
should not have this patch applied.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:
On Thu, 2006-12-28 at 11:45 -0800, Andrew Morton wrote:
> On Thu, 28 Dec 2006 11:28:52 -0800 (PST)
> Linus Torvalds <[email protected]> wrote:
>
> >
> >
> > On Thu, 28 Dec 2006, Guillaume Chazarain wrote:
> > >
> > > The attached patch fixes the corruption for me.
> >
> > Well, that's a good hint, but it's really just a symptom. You effectively
> > just made the test-program not even try to flush the data to disk, so the
> > page cache would stay in memory, and you'd not see the corruption as well.
> >
> > So you basically disabled the code that tried to trigger the bug more
> > easily.
> >
> > But the reason I say it's interesting is that "WB_SYNC_NONE" is very much
> > implicated in mm/page-writeback.c, and if there is a bug triggered by
> > WB_SYNC_NONE writebacks, then that would explain why page-writeback.c also
> > fails..
> >
>
> It would be interesting to convert your app to do fsync() before
> FADV_DONTNEED. That would take WB_SYNC_NONE out of the picture as well
> (apart from pdflush activity).
I did fdatasync(), tried remapping before unmapping... nogo here.
From: Linus Torvalds <[email protected]>
Date: Thu, 28 Dec 2006 12:14:31 -0800 (PST)
> I get corruption - but the whole point is that it's very much pdflush that
> should be writing these pages out.
I think what might be happening is that pdflush writes them out fine,
however we don't trap writes by the application _during_ that writeout.
These corruptions look exactly as if:
1) pdflush begins writeback of page X
2) page goes to disk
3) application writes a chunk to the page
4) pdflush et al. think the page is clean, so it gets tossed, losing
the writes done in #3
So there's a missing PTE change in there, so that we never get proper
re-dirtying of the page if the application tries to write to the page
during the writeback.
It's something that will only occur with writeback and MAP_SHARED
writable access to the file pages. That's why we never see this
with normal filesystem writes, since those explicitly manage the
page dirty state.
I think the dirty balancing logic etc. isn't where the problems are,
to me it's a PTE state update issue for sure.
Ok,
with the ugly trace capture patch, I've actually captured this corruption
in action, I think.
I did a full trace of all pages involved in one run, and picked one
corruption at random:
Chunk 14465 corrupted (0-75) (01423fb4-01423fff)
Expected 129, got 0
Written as (5126)9509(15017)
That's the first 76 bytes of a chunk missing, and it's the last 76 bytes
on a page. It's page index 01423 in the mapped file, and bytes fb4-fff
within that page.
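The arithmetic in the paragraph above can be checked with a short sketch, assuming 4 kB pages as on i386. The sketch_* helpers are hypothetical names for this illustration and are not part of test.c or the kernel.

```c
#include <stdint.h>

/* Assumption: 4 kB pages, i.e. the low 12 bits are the byte within the page. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (1u << SKETCH_PAGE_SHIFT)

/* Page index: the byte offset with its low three hex digits dropped. */
static uint32_t sketch_page_index(uint32_t byte_offset)
{
	return byte_offset >> SKETCH_PAGE_SHIFT;
}

/* Position of the byte inside its page: the low three hex digits. */
static uint32_t sketch_page_byte(uint32_t byte_offset)
{
	return byte_offset & (SKETCH_PAGE_SIZE - 1);
}

/* Bytes from byte_offset up to and including the last byte of its page. */
static uint32_t sketch_bytes_to_page_end(uint32_t byte_offset)
{
	return SKETCH_PAGE_SIZE - sketch_page_byte(byte_offset);
}
```

For the chunk that starts at 01423fb4 this gives page index 1423, byte fb4 within the page, and 76 bytes left before the page ends, which matches the corrupted range (0-75) reported for the chunk.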
There were four chunks written to that page:
Writing chunk 14463/15800 (15%) (0142344c) (1)
Writing chunk 14462/15800 (30%) (01422e98) (2) (overflows into 00001423)
Writing chunk 14464/15800 (32%) (01423a00) (3)
Writing chunk 14465/15800 (60%) (01423fb4) (4) <--- LOST!
and the other three chunks checked out all right.
And here's the annotated trace as it concerns that page:
- here we write the first chunk to the page:
** (1) do_no_page: mapping index 00001423 at b7d1f44c (write)
** Setting page 00001423 dirty
- something flushes it out to disk:
** cpd_for_io: index 00001423
** cleaning index 00001423 at b7d1f000
- here we write the second chunk (which was split over the previous page
and the interesting one):
** (2) Setting page 00001422 dirty
** (2) Setting page 00001423 dirty
- and here we do a cleaning event
** cpd_for_io: index 00001423
** cleaning index 00001423 at b7d1f000
- here we write the third chunk:
** (3) Setting page 00001423 dirty
- here we write the fourth chunk:
** (4) NO DIRTY EVENT
- and a third flush to disk:
** cpd_for_io: index 00001423
** cleaning index 00001423 at b7d1f000
- here we unmap and flush:
** Unmapped index 00001423 at b7d1f000
** Removing index 00001423 from page cache
- here we remap to check:
** do_no_page: mapping index 00001423 at b7d1f000 (read)
** Unmapped index 00001423 at b7d1f000
- and finally, here I remove the file after the run:
** Removing index 00001423 from page cache
Now, the important thing to see here is:
- the missing write did not have a "Setting page 00001423 dirty" event
associated with it.
- but I can _see_ where the actual dirty event would be happening in the
logs, because I can see the dirty events of the other chunk writes
around it, so I know exactly where that fourth write happens. And
indeed, it _shouldn't_ get a dirty event, because the page is still
dirty from the write of chunk #3 to that page, which _did_ get a dirty
event.
I can see that, because the testing app writes the log of the pages it
writes, and this is the log around the fourth and final write:
...
Writing chunk 5338/15800 (60%) (0076eb48) PFN: 76e/76f
Writing chunk 960/15800 (60%) (00156300) PFN: 156
Writing chunk 14465/15800 (60%) (01423fb4) <----
Writing chunk 8594/15800 (60%) (00bf74a8) PFN: bf7
Writing chunk 556/15800 (60%) (000c62f0) PFN: c6
Writing chunk 15190/15800 (60%) (01526678) PFN: 1526
...
and I can match this up with the full log from the kernel, which looks
like this:
Setting page 0000076e dirty
Setting page 0000076f dirty
Setting page 00000156 dirty
Setting page 000000c6 dirty
Setting page 00001526 dirty
so I know exactly where the missing writes (to our page at pfn 1423,
and the pfn-bf7 page) happened.
- and the thing is, I can see a "cpd_for_io()" happening AFTER that
fourth write. Quite a long while after, in fact. So all of this looks
very fine indeed. We are not losing any dirty bits.
- EVEN MORE INTERESTING: write 3 makes it onto disk, and it really uses
the SAME dirty bit as write 4 did (which didn't make it out to disk!).
The event that clears the dirty bit that write 3 did happens AFTER
write 4 has happened!
So if we're not losing any dirty bits, what's going on?
I think we have some nasty interaction with the buffer heads. In
particular, I don't think it's the dirty page bits that are broken (I
_see_ that the PageDirty bit was set after write 4 was done to memory in
the kernel traces). So I think that a real writeback just doesn't happen,
because somebody has marked the buffer heads clean _after_ it started IO
on them.
I think "__mpage_writepage()" is buggy in this regard, for example. It
even has a comment about its crapola behaviour:
/*
* Must try to add the page before marking the buffer clean or
* the confused fail path above (OOM) will be very confused when
* it finds all bh marked clean (i.e. it will not write anything)
*/
however, I don't think that particular thing explains it, because I don't
think we use that function for the cases I'm looking at.
Anyway, I'll add tracing for page-writeback setting/cleaning too, in case
I might see anything new there..
Linus
From: Linus Torvalds <[email protected]>
Date: Thu, 28 Dec 2006 14:37:37 -0800 (PST)
> So if we're not losing any dirty bits, what's going on?
What happens when we writeback, to the PTEs?
page_mkclean_file() iterates the VMAs and when it finds a shared
one it goes:
entry = ptep_clear_flush(vma, address, pte);
entry = pte_wrprotect(entry);
entry = pte_mkclean(entry);
and that's fine, but that PTE is still marked writable, and
I think that's key.
What does the fault path do in this situation?
	if (write_access) {
		if (!pte_write(entry))
			return do_wp_page(mm, vma, address,
					pte, pmd, ptl, entry);
		entry = pte_mkdirty(entry);
	}
It does nothing to update the page dirty state, because it's
writable, it just sets the PTE dirty bit and that's it. Should
it be setting the page dirty here for SHARED cases?
So until vmscan actually unmaps the PTE completely, we have this
window in which the application can write to the PTE and the
page dirty state doesn't get updated.
Perhaps something later cleans up after this, f.e. by rechecking the
PTE dirty bit at the end of I/O or when vmscan unmaps the page.
I guess that should handle things, but the above logic definitely
stood out to me.
On Thu, 28 Dec 2006, David Miller wrote:
>
> What happens when we writeback, to the PTEs?
Not a damn thing.
We clear the PTE's _before_ we even start the write. The writeback does
nothing to them. If the user dirties the page while writeback is in
progress, we'll take the page fault and re-dirty it _again_.
> page_mkclean_file() iterates the VMAs and when it finds a shared
> one it goes:
>
> entry = ptep_clear_flush(vma, address, pte);
> entry = pte_wrprotect(entry);
> entry = pte_mkclean(entry);
>
> and that's fine, but that PTE is still marked writable, and
> I think that's key.
No it's not. It's right there. "pte_wrprotect(entry)". You even copied it
yourself.
> What does the fault path do in this situation?
>
> if (write_access) {
> if (!pte_write(entry))
> return do_wp_page(mm, vma, address,
> pte, pmd, ptl, entry);
So we call "do_wp_page()", and that does everything right.
Linus
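The disagreement in this exchange comes down to whether the PTE is write-protected when writeback starts. A toy model can make the difference concrete. It is not kernel code: it tracks one page with two flags, ignores the hardware PTE dirty bit entirely, and all toy_* names are invented for this sketch.

```c
#include <stdbool.h>

/* One page of a shared file mapping, reduced to the state that matters. */
struct toy_page {
	bool pte_writable;	/* may userspace store without faulting? */
	bool page_dirty;	/* does the kernel know the page needs writeback? */
	int memory;		/* contents in the page cache */
	int disk;		/* contents last written to disk */
};

/* Userspace store through the mapping. */
static void toy_store(struct toy_page *p, int value)
{
	if (!p->pte_writable) {
		/* write fault: the page is re-dirtied and the PTE reopened */
		p->page_dirty = true;
		p->pte_writable = true;
	}
	p->memory = value;
}

/* Writeback: clear the dirty state, optionally write-protect, then write. */
static void toy_writeback(struct toy_page *p, bool wrprotect)
{
	if (!p->page_dirty)
		return;		/* looks clean, nothing is written */
	p->page_dirty = false;
	if (wrprotect)
		p->pte_writable = false;
	p->disk = p->memory;
}

/* Store, write back, store again, then write back once more before eviction. */
static int toy_scenario(bool wrprotect)
{
	struct toy_page p = { false, false, 0, 0 };

	toy_store(&p, 1);
	toy_writeback(&p, wrprotect);
	toy_store(&p, 2);	/* lands after the page went to disk */
	toy_writeback(&p, wrprotect);
	return p.disk;		/* what a later read from disk would see */
}
```

With wrprotect set, the second store faults, the page is dirtied again, and the final contents reach disk. Without it, the second store goes unnoticed and is lost, which is the scenario David described and which the pte_wrprotect() call is there to prevent.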
On Thu, 28 Dec 2006, Anton Altaparmakov wrote:
>
> But are chunks 3 and 4 in separate buffer heads? Sorry could not see it
> immediately from the output you showed...
No, this is a 4kB filesystem. A single bh per page.
> It is just that there may be a different cause rather than buffer dirty
> state...
Sure.
> A shot in the dark I know but it could perhaps be that a "COW for
> MAP_PRIVATE" like event happens when the page is dirty already thus the
> second write never actually makes it to the shared page thus it never gets
> written out.
There are no private mappings anywhere, and no forks. Just a single mmap
(well, we unmap and remap in order to force the page cache to be
invalidated properly with the posix_fadvise() thing, but that's literally
the only user).
Linus
On Thu, 28 Dec 2006, Linus Torvalds wrote:
> Ok,
> with the ugly trace capture patch, I've actually captured this corruption
> in action, I think.
>
> I did a full trace of all pages involved in one run, and picked one
> corruption at random:
>
> Chunk 14465 corrupted (0-75) (01423fb4-01423fff)
> Expected 129, got 0
> Written as (5126)9509(15017)
>
> That's the first 76 bytes of a chunk missing, and it's the last 76 bytes
> on a page. It's page index 01423 in the mapped file, and bytes fb4-fff
> within that page.
>
> There were four chunks written to that page:
>
> Writing chunk 14463/15800 (15%) (0142344c) (1)
> Writing chunk 14462/15800 (30%) (01422e98) (2) (overflows into 00001423)
> Writing chunk 14464/15800 (32%) (01423a00) (3)
> Writing chunk 14465/15800 (60%) (01423fb4) (4) <--- LOST!
>
> and the other three chunks checked out all right.
>
> And here's the annotated trace as it concerns that page:
>
> - here we write the first chunk to the page:
> ** (1) do_no_page: mapping index 00001423 at b7d1f44c (write)
> ** Setting page 00001423 dirty
>
> - something flushes it out to disk:
> ** cpd_for_io: index 00001423
> ** cleaning index 00001423 at b7d1f000
>
> - here we write the second chunk (which was split over the previous page
> and the interesting one):
> ** (2) Setting page 00001422 dirty
> ** (2) Setting page 00001423 dirty
>
> - and here we do a cleaning event
> ** cpd_for_io: index 00001423
> ** cleaning index 00001423 at b7d1f000
>
> - here we write the third chunk:
> ** (3) Setting page 00001423 dirty
>
> - here we write the fourth chunk:
> ** (4) NO DIRTY EVENT
>
> - and a third flush to disk:
> ** cpd_for_io: index 00001423
> ** cleaning index 00001423 at b7d1f000
>
> - here we unmap and flush:
> ** Unmapped index 00001423 at b7d1f000
> ** Removing index 00001423 from page cache
>
> - here we remap to check:
> ** do_no_page: mapping index 00001423 at b7d1f000 (read)
> ** Unmapped index 00001423 at b7d1f000
>
> - and finally, here I remove the file after the run:
> ** Removing index 00001423 from page cache
>
> Now, the important thing to see here is:
>
> - the missing write did not have a "Setting page 00001423 dirty" event
> associated with it.
>
> - but I can _see_ where the actual dirty event would be happening in the
> logs, because I can see the dirty events of the other chunk writes
> around it, so I know exactly where that fourth write happens. And
> indeed, it _shouldn't_ get a dirty event, because the page is still
> dirty from the write of chunk #3 to that page, which _did_ get a dirty
> event.
>
> I can see that, because the testing app writes the log of the pages it
> writes, and this is the log around the fourth and final write:
>
> ...
> Writing chunk 5338/15800 (60%) (0076eb48) PFN: 76e/76f
> Writing chunk 960/15800 (60%) (00156300) PFN: 156
> Writing chunk 14465/15800 (60%) (01423fb4) <----
> Writing chunk 8594/15800 (60%) (00bf74a8) PFN: bf7
> Writing chunk 556/15800 (60%) (000c62f0) PFN: c6
> Writing chunk 15190/15800 (60%) (01526678) PFN: 1526
> ...
>
> and I can match this up with the full log from the kernel, which looks
> like this:
>
> Setting page 0000076e dirty
> Setting page 0000076f dirty
> Setting page 00000156 dirty
> Setting page 000000c6 dirty
> Setting page 00001526 dirty
>
> so I know exactly where the missing writes (to our page at pfn 1423,
> and the pfn-bf7 page) happened.
>
> - and the thing is, I can see a "cpd_for_io()" happening AFTER that
> fourth write. Quite a long while after, in fact. So all of this looks
> very fine indeed. We are not losing any dirty bits.
>
> - EVEN MORE INTERESTING: write 3 makes it onto disk, and it really uses
> the SAME dirty bit as write 4 did (which didn't make it out to disk!).
> The event that clears the dirty bit that write 3 did happens AFTER
> write 4 has happened!
>
> So if we're not losing any dirty bits, what's going on?
>
> I think we have some nasty interaction with the buffer heads. In
But are chunks 3 and 4 in separate buffer heads? Sorry could not see it
immediately from the output you showed...
It is just that there may be a different cause rather than buffer dirty
state...
A shot in the dark I know but it could perhaps be that a "COW for
MAP_PRIVATE" like event happens when the page is dirty already thus the
second write never actually makes it to the shared page thus it never gets
written out.
I am almost certainly totally barking up the wrong tree but I thought it
may be worth mentioning just in case there was a slip in the COW logic or
page writable state maintenance somewhere...
Best regards,
Anton
> particular, I don't think it's the dirty page bits that are broken (I
> _see_ that the PageDirty bit was set after write 4 was done to memory in
> the kernel traces). So I think that a real writeback just doesn't happen,
> because somebody has marked the buffer heads clean _after_ it started IO
> on them.
>
> I think "__mpage_writepage()" is buggy in this regard, for example. It
> even has a comment about its crapola behaviour:
>
> /*
> * Must try to add the page before marking the buffer clean or
> * the confused fail path above (OOM) will be very confused when
> * it finds all bh marked clean (i.e. it will not write anything)
> */
>
> however, I don't think that particular thing explains it, because I don't
> think we use that function for the cases I'm looking at.
>
> Anyway, I'll add tracing for page-writeback setting/cleaning too, in case
> I might see anything new there..
>
> Linus
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/
Btw,
much cleaned-up page tracing patch here, in case anybody cares (and
"test.c" attached, although I don't think it changed since last time).
The test.c output is a bit hard to read at times, since it will give
offsets in bytes as hex (ie "00a77664" means page frame 00000a77, and byte
664h within that page), while the kernel output is obviously the page
indexes (but the page fault _addresses_ can contain information about the
exact byte in a page, so you can match them up when some kernel event is
related to a page fault).
So both forms are necessary/logical, but it means that to match things up,
you often need to ignore the last three hex digits of the address that
"test.c" outputs.
This one also adds traces for the tags and the writeback activity, but
since I'm going out for birthday dinner, I won't have time to try to
actually analyse the trace I have.. Which is why I'm sending it out, in
the hope that somebody else is working on this corruption issue and is
interested..
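For anyone who wants to play with the idea outside the kernel, here is a rough userspace analogue of the PAGE_TRACE() macro from the patch: trace only objects that were flagged as interesting, and prefix every line with the page index and the call site. The struct, the flag field and the emit helper are stand-ins invented for this sketch; the real patch uses a page flag bit and printk().

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical stand-in for struct page: only what the sketch needs. */
struct sketch_page {
	unsigned long index;	/* page index within the file */
	int interesting;	/* set once, when the page enters the cache */
};

static char sketch_last_trace[256];	/* most recent formatted line */
static unsigned int sketch_trace_count;	/* number of lines emitted */

/* Format one trace line, remember it, count it and print it. */
static void sketch_trace_emit(const struct sketch_page *pg, const char *file,
			      int line, const char *fmt, ...)
{
	char msg[128];
	va_list ap;

	assert(pg != NULL && fmt != NULL);
	va_start(ap, fmt);
	vsnprintf(msg, sizeof(msg), fmt, ap);
	va_end(ap);

	snprintf(sketch_last_trace, sizeof(sketch_last_trace),
		 "PG %08lx: %s:%d %s", pg->index, file, line, msg);
	sketch_trace_count++;
	puts(sketch_last_trace);
}

/* Only pages flagged as interesting produce output. */
#define SKETCH_PAGE_TRACE(pg, ...) do {					\
	if ((pg)->interesting)						\
		sketch_trace_emit((pg), __FILE__, __LINE__, __VA_ARGS__); \
} while (0)
```

Calling SKETCH_PAGE_TRACE(&pg, "unmapped at %08lx", addr) on a flagged page emits one line; on any other page it does nothing.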
Linus
----
diff --git a/fs/buffer.c b/fs/buffer.c
index 263f88e..f5e132a 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -722,6 +722,7 @@ int __set_page_dirty_buffers(struct page *page)
set_buffer_dirty(bh);
bh = bh->b_this_page;
} while (bh != head);
+ PAGE_TRACE(page, "dirtied buffers");
}
spin_unlock(&mapping->private_lock);
@@ -734,6 +735,7 @@ int __set_page_dirty_buffers(struct page *page)
__inc_zone_page_state(page, NR_FILE_DIRTY);
task_io_account_write(PAGE_CACHE_SIZE);
}
+ PAGE_TRACE(page, "setting TAG_DIRTY");
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 350878a..0cf3dce 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -91,6 +91,14 @@
#define PG_nosave_free 18 /* Used for system suspend/resume */
#define PG_buddy 19 /* Page is free, on buddy lists */
+#define SetPageInteresting(page) set_bit(PG_arch_1, &(page)->flags)
+#define PageInteresting(page) test_bit(PG_arch_1, &(page)->flags)
+
+#define PAGE_TRACE(page, msg, arg...) do { \
+ if (PageInteresting(page)) \
+ printk(KERN_DEBUG "PG %08lx: %s:%d " msg "\n", \
+ (page)->index, __FILE__, __LINE__ ,##arg ); \
+} while (0)
#if (BITS_PER_LONG > 32)
/*
@@ -183,32 +191,38 @@ static inline void SetPageUptodate(struct page *page)
#define PageWriteback(page) test_bit(PG_writeback, &(page)->flags)
#define SetPageWriteback(page) \
do { \
- if (!test_and_set_bit(PG_writeback, \
- &(page)->flags)) \
+ if (!test_and_set_bit(PG_writeback, &(page)->flags)) { \
+ PAGE_TRACE(page, "set writeback"); \
inc_zone_page_state(page, NR_WRITEBACK); \
+ } \
} while (0)
#define TestSetPageWriteback(page) \
({ \
int ret; \
ret = test_and_set_bit(PG_writeback, \
&(page)->flags); \
- if (!ret) \
+ if (!ret) { \
+ PAGE_TRACE(page, "set writeback"); \
inc_zone_page_state(page, NR_WRITEBACK); \
+ } \
ret; \
})
#define ClearPageWriteback(page) \
do { \
- if (test_and_clear_bit(PG_writeback, \
- &(page)->flags)) \
+ if (test_and_clear_bit(PG_writeback, &(page)->flags)) { \
+ PAGE_TRACE(page, "end writeback"); \
dec_zone_page_state(page, NR_WRITEBACK); \
+ } \
} while (0)
#define TestClearPageWriteback(page) \
({ \
int ret; \
ret = test_and_clear_bit(PG_writeback, \
&(page)->flags); \
- if (ret) \
+ if (ret) { \
+ PAGE_TRACE(page, "end writeback"); \
dec_zone_page_state(page, NR_WRITEBACK); \
+ } \
ret; \
})
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 5c26818..7735b83 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -79,7 +79,7 @@ config DEBUG_KERNEL
config LOG_BUF_SHIFT
int "Kernel log buffer size (16 => 64KB, 17 => 128KB)" if DEBUG_KERNEL
- range 12 21
+ range 12 24
default 17 if S390 || LOCKDEP
default 16 if X86_NUMAQ || IA64
default 15 if SMP
diff --git a/mm/filemap.c b/mm/filemap.c
index 8332c77..4df7d35 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -116,6 +116,7 @@ void __remove_from_page_cache(struct page *page)
{
struct address_space *mapping = page->mapping;
+ PAGE_TRACE(page, "Removing page cache");
radix_tree_delete(&mapping->page_tree, page->index);
page->mapping = NULL;
mapping->nrpages--;
@@ -421,6 +422,23 @@ int filemap_write_and_wait_range(struct address_space *mapping,
return err;
}
+static noinline int is_interesting(struct address_space *mapping)
+{
+ struct inode *inode = mapping->host;
+ struct dentry *dentry;
+ int retval = 0;
+
+ spin_lock(&dcache_lock);
+ list_for_each_entry(dentry, &inode->i_dentry, d_alias) {
+ if (strcmp(dentry->d_name.name, "mapfile"))
+ continue;
+ retval = 1;
+ break;
+ }
+ spin_unlock(&dcache_lock);
+ return retval;
+}
+
/**
* add_to_page_cache - add newly allocated pagecache pages
* @page: page to add
@@ -439,6 +457,9 @@ int add_to_page_cache(struct page *page, struct address_space *mapping,
{
int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
+ if (is_interesting(mapping))
+ SetPageInteresting(page);
+
if (error == 0) {
write_lock_irq(&mapping->tree_lock);
error = radix_tree_insert(&mapping->page_tree, offset, page);
diff --git a/mm/memory.c b/mm/memory.c
index 563792f..20af32f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -667,6 +667,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
tlb_remove_tlb_entry(tlb, pte, addr);
if (unlikely(!page))
continue;
+ PAGE_TRACE(page, "unmapped at %08lx", addr);
if (unlikely(details) && details->nonlinear_vma
&& linear_page_index(details->nonlinear_vma,
addr) != page->index)
@@ -1605,6 +1606,7 @@ gotten:
*/
ptep_clear_flush(vma, address, page_table);
set_pte_at(mm, address, page_table, entry);
+ PAGE_TRACE(new_page, "write fault at %08lx", address);
update_mmu_cache(vma, address, entry);
lru_cache_add_active(new_page);
page_add_new_anon_rmap(new_page, vma, address);
@@ -2249,6 +2251,7 @@ retry:
entry = mk_pte(new_page, vma->vm_page_prot);
if (write_access)
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ PAGE_TRACE(new_page, "mapping at %08lx (%s)", address, write_access ? "write" : "read");
set_pte_at(mm, address, page_table, entry);
if (anon) {
inc_mm_counter(mm, anon_rss);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index b3a198c..15f3aaf 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -773,6 +773,7 @@ int __set_page_dirty_nobuffers(struct page *page)
__inc_zone_page_state(page, NR_FILE_DIRTY);
task_io_account_write(PAGE_CACHE_SIZE);
}
+ PAGE_TRACE(page, "setting TAG_DIRTY");
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
@@ -813,6 +814,7 @@ int fastcall set_page_dirty(struct page *page)
if (!spd)
spd = __set_page_dirty_buffers;
#endif
+ PAGE_TRACE(page, "setting dirty");
return (*spd)(page);
}
if (!PageDirty(page)) {
@@ -867,6 +869,7 @@ int clear_page_dirty_for_io(struct page *page)
if (TestClearPageDirty(page)) {
if (mapping_cap_account_dirty(mapping)) {
+ PAGE_TRACE(page, "clean_for_io");
page_mkclean(page);
dec_zone_page_state(page, NR_FILE_DIRTY);
}
@@ -886,10 +889,12 @@ int test_clear_page_writeback(struct page *page)
write_lock_irqsave(&mapping->tree_lock, flags);
ret = TestClearPageWriteback(page);
- if (ret)
+ if (ret) {
+ PAGE_TRACE(page, "clearing TAG_WRITEBACK");
radix_tree_tag_clear(&mapping->page_tree,
page_index(page),
PAGECACHE_TAG_WRITEBACK);
+ }
write_unlock_irqrestore(&mapping->tree_lock, flags);
} else {
ret = TestClearPageWriteback(page);
@@ -907,14 +912,18 @@ int test_set_page_writeback(struct page *page)
write_lock_irqsave(&mapping->tree_lock, flags);
ret = TestSetPageWriteback(page);
- if (!ret)
+ if (!ret) {
+ PAGE_TRACE(page, "setting TAG_WRITEBACK");
radix_tree_tag_set(&mapping->page_tree,
page_index(page),
PAGECACHE_TAG_WRITEBACK);
- if (!PageDirty(page))
+ }
+ if (!PageDirty(page)) {
+ PAGE_TRACE(page, "clearing TAG_DIRTY");
radix_tree_tag_clear(&mapping->page_tree,
page_index(page),
PAGECACHE_TAG_DIRTY);
+ }
write_unlock_irqrestore(&mapping->tree_lock, flags);
} else {
ret = TestSetPageWriteback(page);
diff --git a/mm/rmap.c b/mm/rmap.c
index 57306fa..e6b4676 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -448,6 +448,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma)
if (pte_dirty(*pte) || pte_write(*pte)) {
pte_t entry;
+ PAGE_TRACE(page, "cleaning PTE %08lx", address);
flush_cache_page(vma, address, pte_pfn(*pte));
entry = ptep_clear_flush(vma, address, pte);
entry = pte_wrprotect(entry);
@@ -637,6 +638,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
goto out_unmap;
}
+ PAGE_TRACE(page, "unmapping from %08lx", address);
/* Nuke the page table entry. */
flush_cache_page(vma, address, page_to_pfn(page));
pteval = ptep_clear_flush(vma, address, pte);
@@ -767,6 +769,7 @@ static void try_to_unmap_cluster(unsigned long cursor,
if (ptep_clear_flush_young(vma, address, pte))
continue;
+ PAGE_TRACE(page, "unmapping from %08lx", address);
/* Nuke the page table entry. */
flush_cache_page(vma, address, pte_pfn(*pte));
pteval = ptep_clear_flush(vma, address, pte);
On Thu, 28 Dec 2006 17:38:38 -0800 (PST)
Linus Torvalds wrote:
> in
> the hope that somebody else is working on this corruption issue and is
> interested..
What corruption issue? ;)
I'm finding that the corruption happens trivially with your test app, but
apparently doesn't happen at all with ext2 or ext3, data=writeback. Maybe
it will happen with increased rarity, but the difference is quite stark.
Removing the
err = walk_page_buffers(handle, page_bufs, 0, PAGE_CACHE_SIZE,
NULL, journal_dirty_data_fn);
from ext3_ordered_writepage() fixes things up.
The things which journal_submit_data_buffers() does after dropping all the
locks are ... disturbing - I don't think we have sufficient tests in there
to ensure that the buffer is still where we think it is after we retake
locks (they're slippery little buggers). But that wouldn't explain it
anyway.
It's inefficient that journal_dirty_data() will put these locked, clean
buffers onto BJ_SyncData instead of BJ_Locked, but
journal_submit_data_buffers() seems to dtrt with them.
So no theory yet. Maybe ext3 is just altering timing. But the difference
is really large..
Disabling all the WB_SYNC_NONE stuff and making everything go synchronous
everywhere has no effect. Disabling bdi_write_congested() has no effect.
> I think what might be happening is that pdflush writes them out fine,
> however we don't trap writes by the application _during_ that writeout.
Yeah. I believe that more exactly it happens if the very last
write to the page causes a writeback (due to dirty balancing)
while another writeback for the page is already happening.
As usual in these cases, I have zero proof.
> It's something that will only occur with writeback and MAP_SHARED
> writable access to the file pages.
It's the do_wp_page -> balance_dirty_pages -> generic_writepages
path for sure. Maybe it's enough to change
	if (wbc->sync_mode != WB_SYNC_NONE)
		wait_on_page_writeback(page);

	if (PageWriteback(page) ||
			!clear_page_dirty_for_io(page)) {
		unlock_page(page);
		continue;
	}

to

	if (wbc->sync_mode != WB_SYNC_NONE)
		wait_on_page_writeback(page);

	if (PageWriteback(page)) {
		redirty_page_for_writepage(wbc, page);
		unlock_page(page);
		continue;
	}

	if (!clear_page_dirty_for_io(page)) {
		unlock_page(page);
		continue;
	}

or something along those lines. Completely untested of course :-)
Segher
On Fri, 29 Dec 2006, Segher Boessenkool wrote:
>
> > I think what might be happening is that pdflush writes them out fine,
> > however we don't trap writes by the application _during_ that writeout.
>
> Yeah. I believe that more exactly it happens if the very last
> write to the page causes a writeback (due to dirty balancing)
> while another writeback for the page is already happening.
>
> As usual in these cases, I have zero proof.
I actually have proof to the contrary, ie I have traces that say "the
write was started" after the last write.
And the VM layer in this area is actually fairly sane and civilized. It
has a bit that says "writeback in progress", and if that bit is set, it
simply _will_not_ start a new write. It even has various BUG_ON()'s to
that effect.
So everything I have ever seen says that the VM layer is actually doing
everything right.
> It's the do_wp_page -> balance_dirty_pages -> generic_writepages
> path for sure. Maybe it's enough to change
>
> if (wbc->sync_mode != WB_SYNC_NONE)
> wait_on_page_writeback(page);
>
> if (PageWriteback(page) ||
> !clear_page_dirty_for_io(page)) {
> unlock_page(page);
> continue;
> }
Notice how this one basically says:
- if it's under writeback, don't even clear the page dirty flag.
Your suggested change:
> if (wbc->sync_mode != WB_SYNC_NONE)
> wait_on_page_writeback(page);
>
> if (PageWriteback(page)) {
> redirty_page_for_writepage(wbc, page);
makes no sense, because we simply never _did_ the "clear_page_dirty()" if
the thing was under writeback in the first place. That's how C
conditionals work. So there's no reason to "redirty" it, because it
wasn't cleaned in the first place.
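To make the short-circuit point concrete, here is a small userspace model. The struct and function names are invented for the sketch; only the shape of the condition is taken from the code quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two page bits involved. */
struct sketch_page {
	bool writeback;		/* models PG_writeback */
	bool dirty;		/* models PG_dirty */
	int clear_calls;	/* how often the dirty bit was test-and-cleared */
};

/* Models clear_page_dirty_for_io(): returns the old dirty bit, clears it. */
static bool sketch_clear_dirty_for_io(struct sketch_page *pg)
{
	bool was_dirty = pg->dirty;

	pg->clear_calls++;
	pg->dirty = false;
	return was_dirty;
}

/* Models the check in the writeback loop: true means "skip this page". */
static bool sketch_should_skip(struct sketch_page *pg)
{
	assert(pg != NULL);
	/* || short-circuits: the right side only runs if not under writeback. */
	return pg->writeback || !sketch_clear_dirty_for_io(pg);
}
```

For a page that is under writeback, sketch_should_skip() returns true with clear_calls still at zero and the dirty bit untouched, so there is nothing to redirty.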
I've double- and triple-checked the dirty bits, including having traces
that actually say that the IO was started (from a VM perspective) _after_
the last write was done. The IO just didn't hit the disk.
I'm personally fairly convinced that it's not a VM issue, but a "IO
issue". Either in a low-level filesystem or in some of the fs/buffer.c
helper routines.
But I'd love to be proven wrong.
I do have a few interesting details from the trace I haven't really
analyzed yet. Here's the trace for events on one of the pages that was
corrupted. Note how the events are numbered (there were 171640 events
total), so the thing you see is just a small set of events from the whole
big trace, but it's the ones that talk about _that_ particular page.
I've grouped them so that "consecutive" events group together. That just
means that no events on any other pages happened in between those events,
and it is usually a sign that it's really one single call-chain that
causes all the events.
For example, for the first group of three events (44366-44368), it's the
page fault that brings in the page, and since it's a write-fault, it will
not only map the page, it will mark the page itself dirty and then also
set the TAG_DIRTY on the mapping. So the "group" is just really a result
of one single event happening, which causes several things to happen to
that page. That's exactly what you'd expect.
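The grouping rule itself is simple enough to sketch in a few lines of C: walk the sorted event numbers for one page and start a new group whenever the number is not exactly one more than the previous one. The function name is made up for this illustration; it is not part of any tool used for the traces.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count runs of consecutive event numbers in a sorted array and report the
 * longest run. A run longer than one suggests that a single call chain
 * touched the page several times with no other traced page in between.
 */
static size_t sketch_count_groups(const unsigned long *events, size_t n,
				  size_t *longest)
{
	size_t groups = 0, run = 0, best = 0;

	assert(events != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (i > 0 && events[i] == events[i - 1] + 1) {
			run++;		/* continues the current group */
		} else {
			groups++;	/* gap: a new group starts here */
			run = 1;
		}
		if (run > best)
			best = run;
	}
	if (longest)
		*longest = best;
	return groups;
}
```

An unusually long run is only a hint, not proof, that no IO happened in between.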
Anyway, here is the list of events that page went through:
44366 PG 00000f6d: mm/memory.c:2254 mapping at b789fc54 (write)
44367 PG 00000f6d: mm/page-writeback.c:817 setting dirty
44368 PG 00000f6d: fs/buffer.c:738 setting TAG_DIRTY

64231 PG 00000f6d: mm/page-writeback.c:872 clean_for_io
64232 PG 00000f6d: mm/rmap.c:451 cleaning PTE b789f000
64233 PG 00000f6d: mm/page-writeback.c:914 set writeback
64234 PG 00000f6d: mm/page-writeback.c:916 setting TAG_WRITEBACK
64235 PG 00000f6d: mm/page-writeback.c:922 clearing TAG_DIRTY

67570 PG 00000f6d: mm/page-writeback.c:891 end writeback
67571 PG 00000f6d: mm/page-writeback.c:893 clearing TAG_WRITEBACK

76705 PG 00000f6d: mm/page-writeback.c:817 setting dirty
76706 PG 00000f6d: fs/buffer.c:725 dirtied buffers
76707 PG 00000f6d: fs/buffer.c:738 setting TAG_DIRTY

105267 PG 00000f6d: mm/page-writeback.c:872 clean_for_io
105268 PG 00000f6d: mm/rmap.c:451 cleaning PTE b789f000
105269 PG 00000f6d: mm/page-writeback.c:914 set writeback
105270 PG 00000f6d: mm/page-writeback.c:916 setting TAG_WRITEBACK
105271 PG 00000f6d: mm/page-writeback.c:922 clearing TAG_DIRTY
105272 PG 00000f6d: mm/page-writeback.c:891 end writeback
105273 PG 00000f6d: mm/page-writeback.c:893 clearing TAG_WRITEBACK

128032 PG 00000f6d: mm/memory.c:670 unmapped at b789f000

132662 PG 00000f6d: mm/filemap.c:119 Removing page cache

148278 PG 00000f6d: mm/memory.c:2254 mapping at b789f000 (read)

166326 PG 00000f6d: mm/memory.c:670 unmapped at b789f000

171958 PG 00000f6d: mm/filemap.c:119 Removing page cache

And notice that big grouping of seven events (105267-105273). The first
five events really _do_ make sense together: it's our page cleaning that
happens. But notice how the "end writeback" happens _immediately_.
Here's another page cleaning event for the page that preceded that page,
and did _not_ get corrupted:
105262 PG 00000f6c: mm/page-writeback.c:872 clean_for_io
105263 PG 00000f6c: mm/rmap.c:451 cleaning PTE b789e000
105264 PG 00000f6c: mm/page-writeback.c:914 set writeback
105265 PG 00000f6c: mm/page-writeback.c:916 setting TAG_WRITEBACK
105266 PG 00000f6c: mm/page-writeback.c:922 clearing TAG_DIRTY
108437 PG 00000f6c: mm/page-writeback.c:891 end writeback
108438 PG 00000f6c: mm/page-writeback.c:893 clearing TAG_WRITEBACK
and this looks a lot more like what you'd expect: other things happened in
between the "clear dirty, set writeback" stage and the "end writeback"
stage. That's what you'd expect to see if there was actually overlapping
IO and/or work.
(And notice that that _was_ what you saw even for the corrupted page for
the _first_ writeback: you saw the group-of-five that indicated a page
cleaning event had started, and then a group-of-two to indicate that the
writeback finished).
So I find this kind of pattern really suspicious. We have a missing
writeout, and my traces show (I didn't analyze this _particular_ one
closely, but I did the previous trace for another page that I posted) that
the writeback was actually started after the write that went missing was
done. AND I have this trace that seems to show that the writeback
basically completed immediately, with no other work in between.
That to me says: "somebody didn't actually write it out". The VM layer
asked the filesystem to do the write, but the filesystem just didn't do
it. I personally think it's because some buffer-head BH_dirty bit got
scrogged, but it could be some event that makes the filesystem simply not
do the IO because it thinks the "disk queues are too full", so it just
says "IO completed", without actually doing anything at all.
Now, the fact that it apparently happens for all of ext2, ext3
and reiserfs (but NOT apparently with "data=writeback"), makes me suspect
that there is some common interaction, and that it's somehow BH-related
(they all share much of the buffer head infrastructure). So it doesn't
look like it's just a bug in one random filesystem, I think it's a bug in
some buffer-head infrastructure/helper function.
So I don't think it's "core VM". I don't think it's the "page cache". I
think we handle the dirty state correctly at that level.
It looks more like "buffer cache" or "filesystem" to me by now.
(Btw, don't get me wrong - the above sequence numbers are in no way
*proof* of anything. You could get big groups for one page just because
something ended up being synchronous. I'll add some timestamps to my
traces to make it easier to see where there was real IO going on and where
there wasn't).
Linus
On Thu, 28 Dec 2006, Linus Torvalds wrote:
>
> So everything I have ever seen says that the VM layer is actually doing
> everything right.
That was true, but at the same time, it's not. Let me explain.
> That to me says: "somebody didn't actually write out out". The VM layer
> asked the filesystem to do the write, but the filesystem just didn't do
> it. I personally think it's because some buffer-head BH_dirty bit got
> scrogged
Ok, I have proof of this now.
Here's a trace (with cycle counts), and with a new trace event added: this
is for another corrupted page. I have:
49105 PG 000015d8 (14800): mm/page-writeback.c:872 clean_for_io
49106 PG 000015d8 (6900): mm/rmap.c:451 cleaning PTE b7fa6000
49107 PG 000015d8 (9900): mm/page-writeback.c:914 set writeback
49108 PG 000015d8 (6480): mm/page-writeback.c:916 setting TAG_WRITEBACK
49109 PG 000015d8 (7110): mm/page-writeback.c:922 clearing TAG_DIRTY
49110 PG 000015d8 (7190): fs/buffer.c:1713 no IO underway
49111 PG 000015d8 (6180): mm/page-writeback.c:891 end writeback
49112 PG 000015d8 (6460): mm/page-writeback.c:893 clearing TAG_WRITEBACK
where that first column is the trace event number again, and the "PG
000015d8" is that corrupted page. The thing in the parentheses is "CPU
cycles since last event", and the important part to note is that this is
indeed all one single thing with no actual IO anywhere. (~7000 CPU cycles
may sound like a lot, but (a) it's not that many cache misses and (b) a
lot of it is the logging overhead: back-to-back log events will take
about 3500 cycles just because of the actual ASCII printk() etc.)
Also, the new event is:
fs/buffer.c:1713 no IO underway
which is just the
if (nr_underway == 0)
case in fs/buffer.c
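Heavily simplified, that path looks something like the model below. This is only a sketch of the idea under my own assumptions (four buffers per page, no locking, no error handling, made-up names); the real __block_write_full_page() does a lot more.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_BUFFERS_PER_PAGE 4	/* assumption: 1 KiB blocks, 4 KiB page */

struct sketch_page {
	bool buffer_dirty[SKETCH_BUFFERS_PER_PAGE];	/* models BH_Dirty */
	bool writeback;					/* models PG_writeback */
};

/*
 * Submit IO for the buffers that are still dirty and return how many were
 * submitted. If none were, writeback is ended on the spot, which is the
 * "no IO underway" event in the trace.
 */
static size_t sketch_write_full_page(struct sketch_page *pg)
{
	size_t nr_underway = 0;

	assert(pg != NULL);
	pg->writeback = true;
	for (size_t i = 0; i < SKETCH_BUFFERS_PER_PAGE; i++) {
		if (pg->buffer_dirty[i]) {
			pg->buffer_dirty[i] = false;	/* handed to the block layer */
			nr_underway++;
		}
	}
	if (nr_underway == 0)
		pg->writeback = false;	/* nothing to wait for */
	return nr_underway;
}
```

With every buffer already clean, the function returns 0 and the page leaves writeback immediately, without a single block having been written.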
And I now finally really believe that I fully understand the corruption,
but I don't have a simple solution, much less a patch.
What the problem basically boils down to is that "set_page_dirty()" is
supposed to be a mark for dirtying THE WHOLE PAGE, but it really is not
"the whole page when the 'set_page_dirty()' itself happens", but more of a
"the next writepage() needs to write back the whole page" thing.
And that's not what "__set_page_dirty_buffers()" really does.
Because what "__set_page_dirty_buffers()" does is that AT THE TIME THE
"set_page_dirty()" IS CALLED, it will mark all the buffers on that page as
dirty. That may _sound_ like what we want, but it really isn't. Because by
the time "writepage()" is actually called (which can be MUCH MUCH later),
some internal filesystem activity may actually have cleaned one or more of
those buffers in the meantime, and now we call "writepage()" (which really
wants to write them _all_), and it will write only part of them, or none
at all.
So the VM thought that since it did a "writepage()", all the dirty state
at that point got written back. But it didn't - the filesystem could have
written back part or all of the page much earlier, and the writepage()
actually does nothing at all.
Both filesystem and VM actually _think_ they do the right thing, because
they simply have totally different expectations. The filesystem thinks
that it should care about dirty buffers (that got marked dirty _after_
they were dirtied), while the VM thinks that it cares about dirty
_pages_ (that got dirtied at any time _before_ "writepage()" was called).
Neither is really "wrong", per se, it's just that the two parts have
different expectations, and the _combination_ just doesn't work.
"set_page_dirty()" at some point meant "the writes have been done", but
these days it really means something else.
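The whole sequence can be replayed in a toy userspace model. Everything here is an assumption made for illustration: four buffers per page, one integer of "data" per buffer, invented names, and a filesystem flush that stands in for whatever internal activity cleans the buffers behind the VM's back.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_NBUF 4	/* assumption: four buffers per page */

struct sketch_page {
	int mem[SKETCH_NBUF];		/* page contents, one value per buffer */
	int disk[SKETCH_NBUF];		/* what has reached the disk */
	bool buffer_dirty[SKETCH_NBUF];	/* models BH_Dirty */
	bool page_dirty;		/* models PG_dirty */
	bool pte_writable;		/* stores go through without a fault */
	bool pte_dirty;			/* dirty bit in the page table */
};

/* Models __set_page_dirty_buffers(): dirties every buffer *now*. */
static void sketch_set_page_dirty(struct sketch_page *pg)
{
	pg->page_dirty = true;
	for (int i = 0; i < SKETCH_NBUF; i++)
		pg->buffer_dirty[i] = true;
}

/* A store through a shared mapping; only the first one takes a fault. */
static void sketch_mmap_store(struct sketch_page *pg, int buf, int val)
{
	assert(buf >= 0 && buf < SKETCH_NBUF);
	if (!pg->pte_writable) {
		sketch_set_page_dirty(pg);	/* at fault time, before the store */
		pg->pte_writable = true;
	}
	pg->mem[buf] = val;
	pg->pte_dirty = true;
}

/* Writes out whatever buffers are still dirty; returns how many. */
static int sketch_write_dirty_buffers(struct sketch_page *pg)
{
	int written = 0;

	for (int i = 0; i < SKETCH_NBUF; i++) {
		if (pg->buffer_dirty[i]) {
			pg->disk[i] = pg->mem[i];
			pg->buffer_dirty[i] = false;
			written++;
		}
	}
	return written;
}

/* Filesystem-internal flush: the VM is not told about it. */
static int sketch_fs_flush(struct sketch_page *pg)
{
	return sketch_write_dirty_buffers(pg);
}

/* VM writeback with the pre-fix clear_page_dirty_for_io() logic. */
static int sketch_vm_writeback(struct sketch_page *pg)
{
	if (!pg->page_dirty)
		return 0;
	pg->page_dirty = false;
	pg->pte_writable = false;	/* page_mkclean() */
	pg->pte_dirty = false;
	return sketch_write_dirty_buffers(pg);	/* writepage() */
}
```

Store 1, let the filesystem flush, store 2, then run VM writeback: writepage() finds no dirty buffers, the page and PTE end up clean, and the disk still holds the value from the first store.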
Now, the reason there is no trivial patch is not that this is conceptually
really hard to fix. I can see several different approaches to fixing it,
but they all really boil down to two alternatives:
(a) splitting the one "PG_dirty" bit up into two bits: the
"PG_write_scheduled" bit and the "PG_alldirty" bit.
The "PG_write_scheduled" bit would be the bit that the filesystem
would set when it has pending dirty data that it wrote itself (and
that may not cover the whole page), and is the part of PG_dirty that
sets the PAGECACHE_TAG_DIRTY. It's also what forces "writepage()" to
be called.
The "PG_alldirty" bit is just an additional "somebody else dirtied
random parts of this page, and we don't know what" flag, which is set
by "set_page_dirty()" in addition to doing the PG_write_scheduled
stuff. We would test-and-clear it at "writepage()" time, and pass it
in to "writepage()" to tell the writepage() function that it can't
just write out its own small limited notion of what is dirty.
(There are various variations on this whole theme: instead of having
a flag to "writepage()", we could split the "whole page" case out as
a separate callback or similar)
(b) making sure that all "set_page_dirty()" calls are _after_ the page
has been marked dirty (which in the case of memory mapped pages would
mean that we would _not_ call it when we mark the page writable at
all, we would call it when we _remove_ the dirty bit and mark it
unwritable). That would have the nice feature that it wouldn't
require any FS-level changes, which would be a nice thing - it would
basically make the VM dirty accounting work the way the FS layer now
already expects it to.
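To show what alternative (a) might look like in practice, here is a toy userspace model of the two-bit split. Nothing like this exists in the kernel: the two flags are the hypothetical bits named above, and the functions and the four-buffer page are invented for the sketch.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_NBUF 4	/* assumption: four buffers per page */

struct sketch_page {
	bool buffer_dirty[SKETCH_NBUF];	/* per-buffer dirty state owned by the FS */
	bool write_scheduled;		/* the proposed "PG_write_scheduled" */
	bool alldirty;			/* the proposed "PG_alldirty" */
};

/* The filesystem dirtied one buffer itself. */
static void sketch_fs_dirty_buffer(struct sketch_page *pg, int buf)
{
	assert(buf >= 0 && buf < SKETCH_NBUF);
	pg->buffer_dirty[buf] = true;
	pg->write_scheduled = true;
}

/* Somebody else (e.g. a shared mapping) may have dirtied anything. */
static void sketch_set_page_dirty(struct sketch_page *pg)
{
	pg->alldirty = true;
	pg->write_scheduled = true;
}

/* FS-internal flush of its own dirty buffers; leaves alldirty alone. */
static void sketch_fs_flush(struct sketch_page *pg)
{
	for (int i = 0; i < SKETCH_NBUF; i++)
		pg->buffer_dirty[i] = false;
}

/* writepage(): returns the number of buffers written. */
static int sketch_writepage(struct sketch_page *pg)
{
	bool whole_page = pg->alldirty;	/* test-and-clear */
	int written = 0;

	pg->alldirty = false;
	pg->write_scheduled = false;
	for (int i = 0; i < SKETCH_NBUF; i++) {
		if (whole_page || pg->buffer_dirty[i]) {
			pg->buffer_dirty[i] = false;
			written++;
		}
	}
	return written;
}
```

Because the "whole page" information lives in its own bit, an internal flush in between cannot lose it: writepage() still writes all four buffers.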
I think (b) is conceptually simpler, and I think I'll try it tomorrow
after I've slept on it. The reason I mention (a) at all is that I like the
conceptual notion of telling the filesystem ahead of time that "you're
going to get a full dirty page", because what (b) will inevitably lead to
is that the filesystem will maintain its own partial state, and then
effectively just before it gets the writepage() notification, it will be
told it was all pointless, because we just dirtied the whole thing.
IOW, the advantage of (a) is also its disadvantage: it tells the
filesystem more. The disadvantage is that it would require VFS interface
changes exactly to do that (ie the "mapping->set_page_dirty()" semantics
would also be split up into two, and it would now be a "prepare to write
the whole page during the next 'writepage()'" thing).
So to recap: the VM layer really expected "writepage()" to act as if it
wrote out the whole page. It doesn't. Not in the presence of the buffer
layer and the filesystem having written out some buffers independently of
the VM layer earlier.
I think this also explains why "data=ordered" broke, and "data=writeback"
didn't. When ext3 does "ordered" writebacks, it will do file data
writebacks on its own, in _its_ order. In contrast, when it does
"data=writeback", it will do the writebacks exactly as the VM presents
them, and won't write any buffers on its own - which makes the bug go
away, because now VM and FS end up agreeing about the semantics of
"writepage()".
Andrew, do you see anything wrong in my thinking?
Peter - on a VM level, the fix would be:
- remove the "set_page_dirty()" from the page fault path, and just set
the PAGECACHE_TAG_DIRTY instead.
- clear_page_dirty_for_io() would now need to check the mappings of the
page even if it wasn't marked PG_dirty (or we'd have another page flag
for the "page is dirty in page tables"), which is kind of a mixture of
(a) and (b) cases above, except we don't expose it to the FS.
- if it was dirty in the page tables, we do a "set_page_dirty()" after
cleaning the page tables, and then the rest of
"clear_page_dirty_for_io()" really boils down to a simple
"TestClearPageDirty(page)"
Hmm? I'd love it if somebody else wrote the patch and tested it, because
I'm getting sick and tired of this bug ;)
Linus
> On Thu, Dec 28, 2006 at 11:21:21AM -0800, Linus Torvalds wrote:
> >
> >
> > On Thu, 28 Dec 2006, Petri Kaukasoina wrote:
> > > > me up), and that seems to show the corruption going way way back (ie going
> > > > back to Linux-2.6.5 at least, according to one tester).
> > >
> > > That was a Fedora kernel. Has anyone seen the corruption in vanilla 2.6.18
> > > (or older)?
> >
> > Well, that was a really _old_ fedora kernel. I guarantee you it didn't
> > have the page throttling patches in it, those were written this summer. So
> > it would either have to be Fedora carrying around another patch that just
> > happens to result in the same corruption for _years_, or it's the same
> > bug.
>
> The only notable VM patch in Fedora kernels of that vintage that I recall
> was Ingo's 4g/4g thing.
>
> Dave
no the fedora 2.6.18 kernel is affected.
it carries the same -mm patches that Debian backported
for LSB 3.1 compliance.
--
maks
ps sorry for stripping cc, only downloaded that message raw.
On Fri, 29 Dec 2006, Linus Torvalds wrote:
>
> Hmm? I'd love it if somebody else wrote the patch and tested it, because
> I'm getting sick and tired of this bug ;)
Who the hell am I kidding? I haven't been able to sleep right for the last
few days over this bug. It was really getting to me.
And putting on the thinking cap, there's actually a fairly simple and
nonintrusive patch. It still has a tiny tiny race (see the comment), but I
bet nobody can really hit it in real life anyway, and I know several ways
to fix it, so I'm not really _that_ worried about it.
The patch is mostly a comment. The "real" meat of it is actually just a
few lines.
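Stripped of the comment, the logic can be modelled in userspace roughly like this. It is only a sketch under my own assumptions (invented names, four buffers per page, a mapping that does dirty accounting, no locking), not the kernel code itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_NBUF 4	/* assumption: four buffers per page */

struct sketch_page {
	bool buffer_dirty[SKETCH_NBUF];	/* models BH_Dirty */
	bool page_dirty;		/* models PG_dirty */
	bool pte_dirty;			/* dirty state in the page tables */
};

/* Models __set_page_dirty_buffers(). */
static void sketch_set_page_dirty(struct sketch_page *pg)
{
	pg->page_dirty = true;
	for (int i = 0; i < SKETCH_NBUF; i++)
		pg->buffer_dirty[i] = true;
}

/* Models page_mkclean(): cleans the PTE, reports whether it was dirty. */
static bool sketch_page_mkclean(struct sketch_page *pg)
{
	bool was_dirty = pg->pte_dirty;

	pg->pte_dirty = false;
	return was_dirty;
}

/* Models the patched clear_page_dirty_for_io() for an accounted mapping. */
static int sketch_clear_page_dirty_for_io(struct sketch_page *pg)
{
	assert(pg != NULL);
	if (sketch_page_mkclean(pg))
		sketch_set_page_dirty(pg);	/* re-dirty every buffer */
	if (pg->page_dirty) {
		pg->page_dirty = false;
		return 1;	/* caller goes on to writepage() */
	}
	return 0;
}
```

Even if the filesystem cleaned all the buffers earlier, a dirty PTE now re-dirties them right before writepage() runs, so the later store gets written.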
Can anybody get corruption with this thing applied? It goes on top of
plain v2.6.20-rc2.
Linus
----
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index b3a198c..ec01da1 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -862,17 +862,46 @@ int clear_page_dirty_for_io(struct page *page)
{
struct address_space *mapping = page_mapping(page);
- if (!mapping)
- return TestClearPageDirty(page);
-
- if (TestClearPageDirty(page)) {
- if (mapping_cap_account_dirty(mapping)) {
- page_mkclean(page);
+ if (mapping && mapping_cap_account_dirty(mapping)) {
+ /*
+ * Yes, Virginia, this is indeed insane.
+ *
+ * We use this sequence to make sure that
+ * (a) we account for dirty stats properly
+ * (b) we tell the low-level filesystem to
+ * mark the whole page dirty if it was
+ * dirty in a pagetable. Only to then
+ * (c) clean the page again and return 1 to
+ * cause the writeback.
+ *
+ * This way we avoid all nasty races with the
+ * dirty bit in multiple places and clearing
+ * them concurrently from different threads.
+ *
+ * Note! Normally the "set_page_dirty(page)"
+ * has no effect on the actual dirty bit - since
+ * that will already usually be set. But we
+ * need the side effects, and it can help us
+ * avoid races.
+ *
+ * We basically use the page "master dirty bit"
+ * as a serialization point for all the different
+ * threads doing their things.
+ *
+ * FIXME! We still have a race here: if somebody
+ * adds the page back to the page tables in
+ * between the "page_mkclean()" and the "TestClearPageDirty()",
+ * we might have it mapped without the dirty bit set.
+ */
+ if (page_mkclean(page))
+ set_page_dirty(page);
+ if (TestClearPageDirty(page)) {
dec_zone_page_state(page, NR_FILE_DIRTY);
+ return 1;
}
- return 1;
+ return 0;
}
- return 0;
+ return TestClearPageDirty(page);
}
EXPORT_SYMBOL(clear_page_dirty_for_io);
On Fri, 2006-12-29 at 02:48 -0800, Linus Torvalds wrote:
>
> On Fri, 29 Dec 2006, Linus Torvalds wrote:
> >
> > Hmm? I'd love it if somebody else wrote the patch and tested it, because
> > I'm getting sick and tired of this bug ;)
>
> Who the hell am I kidding? I haven't been able to sleep right for the last
> few days over this bug. It was really getting to me.
>
> And putting on the thinking cap, there's actually a fairly simple an
> nonintrusive patch. It still has a tiny tiny race (see the comment), but I
> bet nobody can really hit it in real life anyway, and I know several ways
> to fix it, so I'm not really _that_ worried about it.
>
> The patch is mostly a comment. The "real" meat of it is actually just a
> few lines.
>
> Can anybody get corruption with this thing applied? It goes on top of
> plain v2.6.20-rc2.
Tested with rtorrent and there is no corruption.
>
> Linus
>
> ----
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index b3a198c..ec01da1 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -862,17 +862,46 @@ int clear_page_dirty_for_io(struct page *page)
> {
> struct address_space *mapping = page_mapping(page);
>
> - if (!mapping)
> - return TestClearPageDirty(page);
> -
> - if (TestClearPageDirty(page)) {
> - if (mapping_cap_account_dirty(mapping)) {
> - page_mkclean(page);
> + if (mapping && mapping_cap_account_dirty(mapping)) {
> + /*
> + * Yes, Virginia, this is indeed insane.
> + *
> + * We use this sequence to make sure that
> + * (a) we account for dirty stats properly
> + * (b) we tell the low-level filesystem to
> + * mark the whole page dirty if it was
> + * dirty in a pagetable. Only to then
> + * (c) clean the page again and return 1 to
> + * cause the writeback.
> + *
> + * This way we avoid all nasty races with the
> + * dirty bit in multiple places and clearing
> + * them concurrently from different threads.
> + *
> + * Note! Normally the "set_page_dirty(page)"
> + * has no effect on the actual dirty bit - since
> + * that will already usually be set. But we
> + * need the side effects, and it can help us
> + * avoid races.
> + *
> + * We basically use the page "master dirty bit"
> + * as a serialization point for all the different
> + * threds doing their things.
> + *
> + * FIXME! We still have a race here: if somebody
> + * adds the page back to the page tables in
> + * between the "page_mkclean()" and the "TestClearPageDirty()",
> + * we might have it mapped without the dirty bit set.
> + */
> + if (page_mkclean(page))
> + set_page_dirty(page);
> + if (TestClearPageDirty(page)) {
> dec_zone_page_state(page, NR_FILE_DIRTY);
> + return 1;
> }
> - return 1;
> + return 0;
> }
> - return 0;
> + return TestClearPageDirty(page);
> }
> EXPORT_SYMBOL(clear_page_dirty_for_io);
>
Hey nice work Linus!
Linus Torvalds wrote:
>
> On Fri, 29 Dec 2006, Linus Torvalds wrote:
>
>>Hmm? I'd love it if somebody else wrote the patch and tested it, because
>>I'm getting sick and tired of this bug ;)
>
>
> Who the hell am I kidding? I haven't been able to sleep right for the last
> few days over this bug. It was really getting to me.
>
> And putting on the thinking cap, there's actually a fairly simple and
> nonintrusive patch.
Yeah *this* makes more sense. And in retrospect it was simple, we
can't just throw out pte dirtiness information if the page doesn't
have all buffers dirtied.
> It still has a tiny tiny race (see the comment), but I
> bet nobody can really hit it in real life anyway, and I know several ways
> to fix it, so I'm not really _that_ worried about it.
Well the race isn't a data loss one, is it? Just a case where the
pte may be dirty but the page dirty state not accounted for.
Can we fix it by just putting the page_mkclean back inside the
TestClearPageDirty check, and re-clearing PG_dirty after redoing
the set_page_dirty?
--
SUSE Labs, Novell Inc.
* Linus Torvalds <[email protected]> wrote:
> I do have a few interesting details from the trace I haven't really
> analyzed yet. Here's the trace for events on one of the pages that was
> corrupted. Note how the events are numbered (there were 171640 events
> total), so the thing you see is just a small set of events from the
> whole big trace, but it's the ones that talk about _that_ particular
> page.
i've extended the tracer in -rt to trace all relevant pagetable,
pagecache, buffer-cache and IO events and coupled the tracer to your
test.c code. The corruption happens here:
test-2126 0.... 3756170us+: trace_page (cf20ebd8 b6a2c000 0)
pdflush-2006 0.... 6432909us+: trace_page (cf20ebd8 b6a2c000 4200420)
test-2126 0.... 8135596us+: trace_page (cf20ebd8 b6a2c000 4200420)
test-2126 0D... 9012933us+: do_page_fault (8048900 4 b6a2c000)
test-2126 0.... 9023278us+: trace_page (cf262f24 b6a2c000 0)
test-2126 0.... 9023305us > sys_prctl (000000d8 b6a2c000 000000ac)
address 0xb6a2c000 is the one that shows the corruption. Now, this
address is mapped to page cf262f24 when the bug happened, but it had
page 0xcf20ebd8 mapped to it 3 seconds ago, which has this history:
test-2126 0.... 3756413us+: trace_page (cf20ebd8 0 0)
test-2126 0.... 3756469us+: trace_page (cf20ebd8 0 0)
test-2126 0.... 3757341us+: trace_page (cf20ebd8 10 0)
IRQ-14-402 0.... 3759332us+: trace_page (cf20ebd8 ffffffff 0)
IRQ-14-402 0.... 3759376us+: trace_page (cf20ebd8 ffffffff 0)
test-2126 0.... 5104662us+: trace_page (cf20ebd8 b6a2c400 0)
test-2126 0.... 5104687us+: trace_page (cf20ebd8 1 0)
pdflush-2006 0.... 6432909us+: trace_page (cf20ebd8 b6a2c000 4200420)
pdflush-2006 0.... 6432952us+: trace_page (cf20ebd8 ffffffff 4200420)
pdflush-2006 0.... 6432986us+: trace_page (cf20ebd8 1 4200420)
pdflush-2006 0.... 6433022us+: trace_page (cf20ebd8 4096 4200420)
pdflush-2006 0.... 6433061us+: trace_page (cf20ebd8 0 4200420)
pdflush-2006 0.... 6433112us+: trace_page (cf20ebd8 0 4200420)
pdflush-2006 0.... 6433154us+: trace_page (cf20ebd8 0 4200420)
pdflush-2006 0.... 6433303us+: trace_page (cf20ebd8 11 4200420)
pdflush-2006 0.... 6433343us+: trace_page (cf20ebd8 13 4200420)
pdflush-2006 0.... 6433382us+: trace_page (cf20ebd8 14 4200420)
pdflush-2006 0.... 6433421us+: trace_page (cf20ebd8 15 4200420)
pdflush-2006 0.... 6433460us+: trace_page (cf20ebd8 ffffffff 4200420)
pdflush-2006 0.... 6433504us+: trace_page (cf20ebd8 ffffffff 4200420)
test-2126 0.... 8135596us+: trace_page (cf20ebd8 b6a2c000 4200420)
in particular timestamp 6433421us is interesting:
pdflush-2006 0.... 6433504us+: trace_page (cf20ebd8 ffffffff 4200420)
pdflush-2006 0.... 6433526us : trace_page()<-test_clear_page_writeback()<-end_page_writeback()<-__block_write_full_page()
pdflush-2006 0.... 6433526us+: block_write_full_page()<-ext3_ordered_writepage()<-generic_writepages()<-(-1)()
i.e. the page got its pending writeback cancelled in
block_write_full_page(), without any IRQ#14 activity whatsoever! That
looks quite suspect. It is this piece of code in
__block_write_full_page():
/*
* The page was marked dirty, but the buffers were
* clean. Someone wrote them back by hand with
* ll_rw_block/submit_bh. A rare case.
*/
....
if (uptodate)
SetPageUptodate(page);
end_page_writeback(page);
A 'rare case' ... hm. So i tried a quick workaround below, just to keep
us from marking the page clean, to see whether the corruption goes away
- and i was unable to trigger the corruption after half an hour of
testing, while before it triggered within 10 seconds!
now this patch is only an ugly hack, but the bug definitely seems to be
related to buffer management, as you suspected.
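For readers who want to poke at the "rare case" outside the kernel, here is a
minimal userspace sketch of it. It is a toy model and not kernel code: the
names toy_page, toy_set_page_dirty() and toy_write_full_page() are made up for
illustration, and the redirty flag only approximates what the one-line hack
below does (it assumes set_page_dirty() dirties the page and all its buffers).

```c
#include <stdbool.h>

#define TOY_BUFFERS 4

/* Toy stand-in for a page and its buffer_heads; not kernel code. */
struct toy_page {
    bool dirty;                     /* models PG_dirty */
    bool buffer_dirty[TOY_BUFFERS]; /* models the per-buffer dirty bits */
};

/* Assumed behaviour of set_page_dirty() on a buffer-backed page. */
static void toy_set_page_dirty(struct toy_page *page)
{
    for (int i = 0; i < TOY_BUFFERS; i++)
        page->buffer_dirty[i] = true;
    page->dirty = true;
}

/*
 * Toy write path: only dirty buffers are submitted. If none are dirty,
 * writeback ends without any I/O. With redirty_workaround set, that case
 * marks the page dirty again so that a later pass retries the write.
 * Returns the number of buffers submitted.
 */
static int toy_write_full_page(struct toy_page *page, bool redirty_workaround)
{
    int submitted = 0;

    page->dirty = false; /* the caller cleared PG_dirty before writepage */
    for (int i = 0; i < TOY_BUFFERS; i++) {
        if (page->buffer_dirty[i]) {
            page->buffer_dirty[i] = false;
            submitted++;
        }
    }
    if (submitted == 0 && redirty_workaround)
        toy_set_page_dirty(page);
    return submitted;
}
```

In this model a page that is dirty while all of its buffers are clean is
dropped without I/O unless the workaround flag is set, in which case the next
call submits every buffer.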
Ingo
---
fs/buffer.c | 1 +
1 file changed, 1 insertion(+)
Index: linux/fs/buffer.c
===================================================================
--- linux.orig/fs/buffer.c
+++ linux/fs/buffer.c
@@ -1702,6 +1702,7 @@ done:
} while (bh != head);
if (uptodate)
SetPageUptodate(page);
+ set_page_dirty(page);
end_page_writeback(page);
/*
* The page and buffer_heads can be released at any time from
* Linus Torvalds <[email protected]> wrote:
> > Hmm? I'd love it if somebody else wrote the patch and tested it,
> > because I'm getting sick and tired of this bug ;)
>
> Who the hell am I kidding? I haven't been able to sleep right for the
> last few days over this bug. It was really getting to me.
>
> And putting on the thinking cap, there's actually a fairly simple and
> nonintrusive patch. [...]
ok, your patch seems to fix the testcase here too on -rc2-rt.
[ Damn, i should have slept a bit more, that would have saved me a ~4
hour debug and tracing session today to analyze your testcase, just to
find your patch and your explanation on lkml, right after i sent my
analysis and workaround patch ;-) At least now we know it from two
independent tracing results that the suspect code is the same. ]
Ingo
Linus Torvalds wrote:
>[...]
> The patch is mostly a comment. The "real" meat of it is actually just a
> few lines.
>
> Can anybody get corruption with this thing applied? It goes on top of
> plain v2.6.20-rc2.
No corruption with the testcase here. Will check with rtorrent too later
today but I suppose it will work just fine.
Nice work! It has been interesting (and educating) to follow this
bug-hunt :)
/Martin
* Linus Torvalds <[email protected]> [2006-12-29 02:48]:
> Can anybody get corruption with this thing applied? It goes on top
> of plain v2.6.20-rc2.
It works for me now, both your testcase as well as an installation of
Debian on this ARM device. I manually applied the patch to 2.6.19.
Thanks.
--
Martin Michlmayr
http://www.cyrius.com/
On Fri, Dec 29, 2006 at 10:23:14AM +0100, maximilian attems wrote:
> > On Thu, Dec 28, 2006 at 11:21:21AM -0800, Linus Torvalds wrote:
> > >
> > >
> > > On Thu, 28 Dec 2006, Petri Kaukasoina wrote:
> > > > > me up), and that seems to show the corruption going way way back (ie going
> > > > > back to Linux-2.6.5 at least, according to one tester).
> > > >
> > > > That was a Fedora kernel. Has anyone seen the corruption in vanilla 2.6.18
> > > > (or older)?
> > >
> > > Well, that was a really _old_ fedora kernel. I guarantee you it didn't
> > > have the page throttling patches in it, those were written this summer. So
> > > it would either have to be Fedora carrying around another patch that just
> > > happens to result in the same corruption for _years_, or it's the same
> > > bug.
> >
> > The only notable VM patch in Fedora kernels of that vintage that I recall
> > was Ingo's 4g/4g thing.
>
> no the fedora 2.6.18 kernel is affected.
I wasn't denying that, but Linus was talking about a 2.6.5 Fedora kernel.
> it carries the same -mm patches that Debian backported
> for LSB 3.1 compliance.
The only -mm stuff I recall being in the Fedora 2.6.18 is
the inode-diet stuff which ended up in 2.6.19, though the xmas
break has left my head somewhat empty so I may be forgetting something.
What patch in particular are you talking about?
Dave
--
http://www.codemonkey.org.uk
Martin Michlmayr wrote:
>* Linus Torvalds <[email protected]> [2006-12-29 02:48]:
>
>
>>Can anybody get corruption with this thing applied? It goes on top
>>of plain v2.6.20-rc2.
>>
>>
>
>It works for me now, both your testcase as well as an installation of
>Debian on this ARM device. I manually applied the patch to 2.6.19.
>
>Thanks.
>
>
Hi Martin,
Can you post a diff against 2.6.19?
Thanks,
Steve
--
"They that give up essential liberty to obtain temporary safety,
deserve neither liberty nor safety." (Ben Franklin)
"The course of history shows that as a government grows, liberty
decreases." (Thomas Jefferson)
On Fri, Dec 29, 2006 at 12:58:12AM -0800, Linus Torvalds wrote:
> Because what "__set_page_dirty_buffers()" does is that AT THE TIME THE
> "set_page_dirty()" IS CALLED, it will mark all the buffers on that page as
> dirty. That may _sound_ like what we want, but it really isn't. Because by
> the time "writepage()" is actually called (which can be MUCH MUCH later),
> some internal filesystem activity may actually have cleaned one or more of
> those buffers in the meantime, and now we call "writepage()" (which really
> wants to write them _all_), and it will write only part of them, or none
> at all.
I'm confused. Does this mean that if "fs blocksize"=="VM pagesize"
this bug can't trigger? But I thought at least one of people
reporting corruption was using a filesystem with a 4k block size on an
i386?
- Ted
* Stephen Clark <[email protected]> [2006-12-29 10:17]:
> >It works for me now, both your testcase as well as an installation of
> >Debian on this ARM device. I manually applied the patch to 2.6.19.
>
> Can you post a diff against 2.6.19?
--- a/mm/page-writeback.c 2006-11-29 21:57:37.000000000 +0000
+++ b/mm/page-writeback.c 2006-12-29 11:02:55.555147896 +0000
@@ -893,16 +893,45 @@
{
struct address_space *mapping = page_mapping(page);
- if (mapping) {
+ if (mapping && mapping_cap_account_dirty(mapping)) {
+ /*
+ * Yes, Virginia, this is indeed insane.
+ *
+ * We use this sequence to make sure that
+ * (a) we account for dirty stats properly
+ * (b) we tell the low-level filesystem to
+ * mark the whole page dirty if it was
+ * dirty in a pagetable. Only to then
+ * (c) clean the page again and return 1 to
+ * cause the writeback.
+ *
+ * This way we avoid all nasty races with the
+ * dirty bit in multiple places and clearing
+ * them concurrently from different threads.
+ *
+ * Note! Normally the "set_page_dirty(page)"
+ * has no effect on the actual dirty bit - since
+ * that will already usually be set. But we
+ * need the side effects, and it can help us
+ * avoid races.
+ *
+ * We basically use the page "master dirty bit"
+ * as a serialization point for all the different
+ * threds doing their things.
+ *
+ * FIXME! We still have a race here: if somebody
+ * adds the page back to the page tables in
+ * between the "page_mkclean()" and the "TestClearPageDirty()",
+ * we might have it mapped without the dirty bit set.
+ */
+ if (page_mkclean(page))
+ set_page_dirty(page);
if (TestClearPageDirty(page)) {
- if (mapping_cap_account_dirty(mapping)) {
- page_mkclean(page);
- dec_zone_page_state(page, NR_FILE_DIRTY);
- }
+ dec_zone_page_state(page, NR_FILE_DIRTY);
return 1;
}
return 0;
- }
+ }
return TestClearPageDirty(page);
}
EXPORT_SYMBOL(clear_page_dirty_for_io);
--
Martin Michlmayr
http://www.cyrius.com/
On Fri, 29 Dec 2006, Nick Piggin wrote:
>
> > It still has a tiny tiny race (see the comment), but I bet nobody can really
> > hit it in real life anyway, and I know several ways to fix it, so I'm not
> > really _that_ worried about it.
>
> Well the race isn't a data loss one, is it? Just a case where the
> pte may be dirty but the page dirty state not accounted for.
Right. We should be picking it up eventually, since it's still in the page
tables, but if we've lost sight of the page dirtiness we won't react
correctly to msync() and/or fdatasync(). So we don't _lose_ the data, we
just might not write it out in a timely manner if we ever hit the race.
> Can we fix it by just putting the page_mkclean back inside the
> TestClearPageDirty check, and re-clearing PG_dirty after redoing
> the set_page_dirty?
I considered it, but quite frankly, if we did it that way, I'd really like
to just fix the whole insane "set_page_dirty()" instead.
I think set_page_dirty() should be split up. One thing that confused me
mentally was that almost all of the dirty handling was actually done only
if PG_dirty wasn't already set, so the _bulk_ of set_page_dirty() really
ends up being
if (!TestSetPageDirty(page)) {
.. we just marked the page dirty, it was clean before,
so we need to add it to the queues etc ..
}
and that's the part that I (and probably others) always really thought
about.
But then we have the _one_ thing that runs outside of that "do only once
per dirty bit" logic, and it's the buffer dirtying. If we had had two
separate operations for this all: "set_dirty_every_time()" and the regular
"set_dirty()", I don't think this would have been nearly as confusing.
(And then the difference between "__set_page_dirty_nobuffers()" and
"__set_page_dirty_buffers()" really boils down to one doing the
"everytime" _and_ the "once per dirty" checks and the other one doing just
the "once per dirty bit" act - and we could rename the damn things to
something saner too).
If we split it up that way, then the whole clear_page_dirty_for_io() logic
would boil down to
if (TestClearPageDirty(page)) {
if (page_mkclean(page))
set_dirty_every_time();
return 1;
}
return 0;
and we wouldn't even need to do any of the "clear dirty again" kind of
idiocy, because the "set_dirty_every_time()" stuff is the one that doesn't
even care about the state of the PG_dirty bit - it's done regardless, and
doesn't really touch it.
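A rough userspace sketch of that split, to make the idea concrete. This is a
toy model only: set_dirty_every_time() does not exist in the kernel, and
toy_page, toy_set_dirty_once() and the times_queued counter are hypothetical
names invented for the illustration.

```c
#include <stdbool.h>

#define TOY_BUFFERS 4

/* Toy stand-in for a page and its buffer_heads; not kernel code. */
struct toy_page {
    bool pg_dirty;                  /* models PG_dirty */
    bool pte_dirty;                 /* models a dirty pte mapping the page */
    bool buffer_dirty[TOY_BUFFERS]; /* models the per-buffer dirty bits */
    int  times_queued;              /* how often the page was queued as dirty */
};

/* The part that runs on every call: dirty all buffers, ignore PG_dirty. */
static void toy_set_dirty_every_time(struct toy_page *p)
{
    for (int i = 0; i < TOY_BUFFERS; i++)
        p->buffer_dirty[i] = true;
}

/* The "once per dirty bit" part: only acts on a clean-to-dirty transition. */
static void toy_set_dirty_once(struct toy_page *p)
{
    if (!p->pg_dirty) {
        p->pg_dirty = true;
        p->times_queued++;
    }
}

/* Today's combined operation, expressed through the two halves. */
static void toy_set_page_dirty(struct toy_page *p)
{
    toy_set_dirty_every_time(p);
    toy_set_dirty_once(p);
}

/* Models page_mkclean(): clears the pte dirty bit, reports the old value. */
static bool toy_page_mkclean(struct toy_page *p)
{
    bool was_dirty = p->pte_dirty;
    p->pte_dirty = false;
    return was_dirty;
}

/* The simplified logic from the mail: no "set dirty, then clear again". */
static bool toy_clear_page_dirty_for_io(struct toy_page *p)
{
    if (!p->pg_dirty)
        return false;
    p->pg_dirty = false;
    if (toy_page_mkclean(p))
        toy_set_dirty_every_time(p);
    return true;
}
```

With the halves separated, a dirty pte found at writeback time re-dirties the
buffers without touching PG_dirty or the queue accounting.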
That's what I wanted to do, but with the current "set_page_dirty()" setup,
I think my patch makes reasonable sense.
Linus
Linus Torvalds wrote:
> going back to Linux-2.6.5 at least, according to one tester).
>
I apologize for the confusion, but it just occurred to me that I was
actually experiencing a totally different problem: I set a root filesystem
of 3 MiB for qemu, so the test program just didn't have enough space for
its file.
--
Guillaume
On Fri, 29 Dec 2006, Theodore Tso wrote:
>
> I'm confused. Does this mean that if "fs blocksize"=="VM pagesize"
> this bug can't trigger?
No. Even if there is just a single buffer-head, if the filesystem ever
writes out that _single_ buffer-head out of turn (ie before the VM
actually asks it to, with "->writepage()"), then the same issue will
happen.
In fact, a bigger fs blocksize will likely just make this easier to
trigger (although I doubt it makes a big difference), since any
out-of-order buffer flushback will happen for the whole page, rather than
just a part of the page.
So the "problem" really ends up being that the filesystem does flushing
that the VM isn't aware of, so when the VM did "set_page_dirty()" at an
earlier time, the VM _expected_ the "->writepages()" call that happened
much later to write the whole page - but because the FS had flushed things
behind its back even _before_ the "->writepage" happens, by the time the
VM actually asks for the page to be written out, the FS layer won't
actually write it all out any more.
Blocksize doesn't matter, the only thing that matters is whether something
writes out data on a buffer-cache level, not on a "page cache" level. Ext3
apparently does this in "ordered" data mode at least (and hey, I suspect
that the code that tries to release buffer head data might try to do it on
its own too).
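The sequence described above can be replayed in a small userspace toy model.
It is a sketch under assumptions, not kernel code: all toy_* names are
invented, the mapping_cap_account_dirty() check and the dirty-page accounting
are left out, and the page is modelled as four buffers with one pte.

```c
#include <stdbool.h>

#define TOY_BUFFERS 4

/* Toy stand-in for a page and its buffer_heads; not kernel code. */
struct toy_page {
    bool pg_dirty;                  /* models PG_dirty */
    bool pte_dirty;                 /* models a dirty pte mapping the page */
    bool buffer_dirty[TOY_BUFFERS]; /* models the per-buffer dirty bits */
};

/* Like __set_page_dirty_buffers(): dirties every buffer on every call. */
static void toy_set_page_dirty(struct toy_page *p)
{
    for (int i = 0; i < TOY_BUFFERS; i++)
        p->buffer_dirty[i] = true;
    p->pg_dirty = true;
}

/* Like page_mkclean(): clears the pte dirty bit, reports the old value. */
static bool toy_page_mkclean(struct toy_page *p)
{
    bool was_dirty = p->pte_dirty;
    p->pte_dirty = false;
    return was_dirty;
}

/* Pre-patch ordering: the pte dirty information is thrown away. */
static bool toy_clear_dirty_old(struct toy_page *p)
{
    if (!p->pg_dirty)
        return false;
    p->pg_dirty = false;
    toy_page_mkclean(p);
    return true;
}

/* Patched ordering: a dirty pte re-dirties all buffers before the clear. */
static bool toy_clear_dirty_new(struct toy_page *p)
{
    if (toy_page_mkclean(p))
        toy_set_page_dirty(p);
    if (!p->pg_dirty)
        return false;
    p->pg_dirty = false;
    return true;
}

/* Buffer-based writepage: writes, and counts, only the dirty buffers. */
static int toy_writepage(struct toy_page *p)
{
    int written = 0;

    for (int i = 0; i < TOY_BUFFERS; i++) {
        if (p->buffer_dirty[i]) {
            p->buffer_dirty[i] = false;
            written++;
        }
    }
    return written;
}

/* Returns how many buffers get written in the problem scenario. */
static int toy_scenario(bool patched)
{
    struct toy_page p = { 0 };

    toy_set_page_dirty(&p);               /* first write fault via mmap */
    p.pte_dirty = true;
    for (int i = 0; i < TOY_BUFFERS; i++) /* fs flushes buffers on its own */
        p.buffer_dirty[i] = false;
    p.pte_dirty = true;                   /* later stores show up only in the pte */

    bool need_write = patched ? toy_clear_dirty_new(&p)
                              : toy_clear_dirty_old(&p);
    return need_write ? toy_writepage(&p) : 0;
}
```

In this model the pre-patch ordering writes nothing for the late stores, while
the patched ordering writes all four buffers.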
Linus
On Fri, Dec 29, 2006 at 10:02:53AM -0500, Dave Jones wrote:
> On Fri, Dec 29, 2006 at 10:23:14AM +0100, maximilian attems wrote:
> > > On Thu, Dec 28, 2006 at 11:21:21AM -0800, Linus Torvalds wrote:
<snipp>
> > > > > That was a Fedora kernel. Has anyone seen the corruption in vanilla 2.6.18
> > > > > (or older)?
> > > >
> > > > Well, that was a really _old_ fedora kernel. I guarantee you it didn't
> > > > have the page throttling patches in it, those were written this summer. So
> > > > it would either have to be Fedora carrying around another patch that just
> > > > happens to result in the same corruption for _years_, or it's the same
> > > > bug.
> > >
> > > The only notable VM patch in Fedora kernels of that vintage that I recall
> > > was Ingo's 4g/4g thing.
> >
> > no the fedora 2.6.18 kernel is affected.
>
> I wasn't denying that, but Linus was talking about a 2.6.5 Fedora kernel.
>
> > it carries the same -mm patches that Debian backported
> > for LSB 3.1 compliance.
>
> The only -mm stuff I recall being in the Fedora 2.6.18 is
> the inode-diet stuff which ended up in 2.6.19, though the xmas
> break has left my head somewhat empty so I may be forgetting something.
> What patch in particular are you talking about?
it's no longer visible in the FC6 cvs, due to rebase
but its name was linux-2.6-mm-tracking-dirty-pages.patch
it is an earlier amalgam of the merged patch series:
- mm: tracking shared dirty pages
- mm: balance dirty pages
- mm: optimize the new mprotect() code a bit
- mm: small cleanup of install_page()
- mm: fixup do_wp_page()
- mm: msync() cleanup (closes: #394392)
--
maks
On Fri, Dec 29, 2006 at 07:52:15PM +0100, maximilian attems wrote:
> > The only -mm stuff I recall being in the Fedora 2.6.18 is
> > the inode-diet stuff which ended up in 2.6.19, though the xmas
> > break has left my head somewhat empty so I may be forgetting something.
> > What patch in particular are you talking about?
>
> it's no longer visible in the FC6 cvs, due to rebase
> but its name was linux-2.6-mm-tracking-dirty-pages.patch
> it is an earlier amalgam of the merged patch series:
> - mm: tracking shared dirty pages
> - mm: balance dirty pages
> - mm: optimize the new mprotect() code a bit
> - mm: small cleanup of install_page()
> - mm: fixup do_wp_page()
> - mm: msync() cleanup (closes: #394392)
Ohh, that. Yes. I had forgotten all about that.
I've been hitting the nog a little too hard :)
Dave
--
http://www.codemonkey.org.uk
On Fri, 29 Dec 2006 02:48:35 -0800 (PST)
Linus Torvalds <[email protected]> wrote:
> + if (mapping && mapping_cap_account_dirty(mapping)) {
> + /*
> + * Yes, Virginia, this is indeed insane.
> + *
> + * We use this sequence to make sure that
> + * (a) we account for dirty stats properly
> + * (b) we tell the low-level filesystem to
> + * mark the whole page dirty if it was
> + * dirty in a pagetable. Only to then
> + * (c) clean the page again and return 1 to
> + * cause the writeback.
> + *
> + * This way we avoid all nasty races with the
> + * dirty bit in multiple places and clearing
> + * them concurrently from different threads.
> + *
> + * Note! Normally the "set_page_dirty(page)"
> + * has no effect on the actual dirty bit - since
> + * that will already usually be set. But we
> + * need the side effects, and it can help us
> + * avoid races.
> + *
> + * We basically use the page "master dirty bit"
> + * as a serialization point for all the different
> + * threds doing their things.
> + *
> + * FIXME! We still have a race here: if somebody
> + * adds the page back to the page tables in
> + * between the "page_mkclean()" and the "TestClearPageDirty()",
> + * we might have it mapped without the dirty bit set.
> + */
> + if (page_mkclean(page))
> + set_page_dirty(page);
> + if (TestClearPageDirty(page)) {
> dec_zone_page_state(page, NR_FILE_DIRTY);
> + return 1;
> }
- Presumably reiser3's ordered-data mode has the same problem. And ext4,
of course. Dunno about other filesystems.
- The above change means that we do extra writeout. If a page is dirtied
once, kjournald will write it and then pdflush will come along and
needlessly write it again.
But otoh, if a mapping is being repeatedly dirtied, kjournald will
write the page once per 30 seconds (dirty_expire_centisecs) and pdflush
will write the page once per 30 seconds as well. But we _should_ be
writing it once per five seconds (kjournald commit interval). So we're
still ahead ;)
- Poor old IO accounting broke again.
- People were saying that ext2 and ext3,data=writeback were also showing
corruption. What's up with that?
- For a long time I've wanted to nuke the current ext3/jbd ordered-data
implementation altogether, and just make kjournald call into the
standard writeback code to do a standard superblock->inodes->pages walk.
I think it'd be fairly straightforward to do. We'd need to teach the
writeback code to be able to skip dirty pages which don't have a disk
mapping, so that kjournald doesn't end up waiting for kjournald to free
up journal space..
Would need to avoid possible deadlocks where someone calls
ext3_force_commit() or otherwise does a synchronous commit while holding
VFS locks.
reiser3 and ext4 could be converted too.
Not a short-term project, but this would avoid the problem.
- It's pretty obnoxious that the VM now sets a clean page "dirty" and
then proceeds to modify its contents. It would be nice to stop doing
that.
We could stop marking the page dirty in do_wp_page() and create a new
VM counter "NR_PTE_DIRTY", which means
"number of mapping_cap_account_dirty() pages which have a dirty pte
pointing at them".
Or, perhaps
"number of dirty ptes which point at mapping_cap_account_dirty() pages".
Which can be larger, but the writeout code will probably cope.
Then we take NR_PTE_DIRTY into account in the dirty-page balancing act.
So
- do_wp_page() will still run balance_dirty_pages()
- but it would no longer run set_page_dirty().
- But it needs to run mark_inode_dirty() so the fs-writeback code
notices the file.
- And mapping_tagged(mapping, PAGECACHE_TAG_DIRTY) becomes insufficient.
The tricky part here is "how do we do the writeback"? The
pte-dirty,!PageDirty pages aren't tagged as dirty in the radix-tree and
writeback needs to find them so that it can effectively do an msync() on
them. Walking all the mm's and vma's would be insane. Visiting all the
pages in the file would also probably be insane.
Perhaps this can be solved by adding a new radix-tree tag which means
"this page might have dirty ptes pointing at it". For each file
writeback would do a radix-tree walk of these pages,
cleaning-and-write-protecting ptes, marking the corresponding pages
dirty and clearing their PAGECACHE_TAG_PTE_DIRTY tags.
Then we can fix the mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)
problem by doing
mapping_tagged(mapping, PAGECACHE_TAG_DIRTY) ||
mapping_tagged(mapping, PAGECACHE_TAG_PTE_DIRTY)
or, better,
mapping_tagged(mapping,
(1<<PAGECACHE_TAG_DIRTY)|(1<<PAGECACHE_TAG_PTE_DIRTY))
perhaps.
The msync() code would need to be taught to call the
PAGECACHE_TAG_PTE_DIRTY walker for the appropriate page range.
This is also not a quick-fix.
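To illustrate the combined tag check suggested above, here is a small
userspace sketch. It is only a model: PAGECACHE_TAG_PTE_DIRTY is the proposed
tag and does not exist, the real mapping_tagged() takes a single tag rather
than a mask, and the toy_* names and the flat per-mapping bitmask (instead of
radix-tree tags) are assumptions made for the example.

```c
#include <stdbool.h>

/* Toy tag numbers; TOY_TAG_PTE_DIRTY models the proposed, hypothetical tag. */
enum toy_tag {
    TOY_TAG_DIRTY     = 0,
    TOY_TAG_WRITEBACK = 1,
    TOY_TAG_PTE_DIRTY = 2
};

/* Toy mapping: bit n is set when at least one page carries tag n. */
struct toy_mapping {
    unsigned int tags;
};

static void toy_tag_set(struct toy_mapping *m, enum toy_tag tag)
{
    m->tags |= 1u << tag;
}

static void toy_tag_clear(struct toy_mapping *m, enum toy_tag tag)
{
    m->tags &= ~(1u << tag);
}

/* Mask-based variant of mapping_tagged(): true if any tag in mask is set. */
static bool toy_mapping_tagged_any(const struct toy_mapping *m,
                                   unsigned int mask)
{
    return (m->tags & mask) != 0;
}

/* Writeback must look at the file if either kind of dirtiness is tagged. */
static bool toy_mapping_needs_writeback(const struct toy_mapping *m)
{
    return toy_mapping_tagged_any(m, (1u << TOY_TAG_DIRTY) |
                                     (1u << TOY_TAG_PTE_DIRTY));
}
```

A single mask test answers "dirty or pte-dirty" in one lookup, which is the
point of the second form over the two separate mapping_tagged() calls.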
On Fri, 29 Dec 2006 14:16:32 -0800
Andrew Morton <[email protected]> wrote:
> - Poor old IO accounting broke again.
No it didn't - we're relying upon the behaviour of __set_page_dirty_buffers()
against an already-dirty page.
On Fri, 29 Dec 2006, Andrew Morton wrote:
>
> - The above change means that we do extra writeout. If a page is dirtied
> once, kjournald will write it and then pdflush will come along and
> needlessly write it again.
There's zero extra writeout for any flushing that flushes BY PAGES.
Only broken flushers that flush by buffer heads (which really really
really shouldn't be done any more: welcome to the 21st century) will cause
extra writeouts. And those extra writeouts are obviously required for all
the dirty state to actually hit the disk - which is the point of the
patch.
So they're not "extra" - they are "required for correct working".
But I can't stress the fact enough that people SHOULD NOT do writeback by
buffer heads. The buffer head has been purely an "IO entity" for the last
several years now, and it's not a cache entity. Anybody who does writeback
by buffer heads is basically bypassing the real cache (the page cache),
and that's why all the problems happen.
I think ext3 is terminally crap by now. It still uses buffer heads in
places where it really really shouldn't, and as a result, things like
directory accesses are simply slower than they should be. Sadly, I don't
think ext4 is going to fix any of this, either.
It's all just too inherently wrongly designed around the buffer head
(which was correct in 1995, but hasn't been correct for a long time in the
kernel any more).
> - Poor old IO accounting broke again.
No. That's why I used "set_page_dirty()" and did it that strange ugly way
("set page dirty, even though it's already dirty, and even though the very
next thing we will do is TestClearPageDirty???").
That code looks strange as a result, which is why it now has more comments
on it than actual code ;)
> - People were saying that ext2 and ext3,data=writeback were also showing
> corruption. What's up with that?
I thought the "ext3,data=writeback" case was reported to be fine by
several people?
I'm not sure about ext2. I didn't look at what it did based on buffer
heads. I would have expected it to be ok.
That said, at least one report was later shown to be bogus (errors due to
out of disk, not due to actual errors ;).
> - For a long time I've wanted to nuke the current ext3/jbd ordered-data
> implementation altogether, and just make kjournald call into the
> standard writeback code to do a standard superblock->inodes->pages walk.
I really would like to see less of the buffer-head-based stuff, and yes,
more of the normal inode page walking. I don't think you can "order"
accesses within a page anyway, exactly because of memory mapping issues,
so any page ordering is not about buffer heads on the page itself, it
should be purely about metadata.
> - It's pretty obnoxious that the VM now sets a clean page "dirty" and
> then proceeds to modify its contents. It would be nice to stop doing
> that.
No. I think this really is the fundamental confusion people had. People
thought that setting the page dirty meant that it was no longer being
modified. It hasn't meant that in a LONG time - ever since the whole
DIRTY_TAG thing, the most important part of the PG_dirty thing has really
been that it's now efficiently findable by the writeout logic.
And that is very much what the whole page accounting _depends_ on. When we
mmap a page, we need to mark it "findable" as dirty _before_ people
actually start writing to it, because it's too late afterwards.
> We could stop marking the page dirty in do_wp_page() and create a new
> VM counter "NR_PTE_DIRTY", which means
>
> "number of mapping_cap_account_dirty() pages which have a dirty pte
> pointing at them".
Well, then you need to change what PAGE_MAPPING_TAG_DIRTY means too.
That's very fundamental. That DIRTY _tag_ is now even more important than
the PG_dirty bit itself, since that's what we actually use to _access_
those things.
Linus
On Fri, Dec 29, 2006 at 02:42:51PM -0800, Linus Torvalds wrote:
> I think ext3 is terminally crap by now. It still uses buffer heads in
> places where it really really shouldn't, and as a result, things like
> directory accesses are simply slower than they should be. Sadly, I don't
> think ext4 is going to fix any of this, either.
Not just ext3; ocfs2 is using the jbd layer as well. I think we're
going to have to put this (a rework of jbd2 to use the page cache) on
the ext4 todo list, and work with the ocfs2 folks to try to come up
with something that suits their needs as well. Fortunately we have
this filesystem/storage summit thing coming up in the next few months,
and we can try to get some discussion going on the linux-ext4 mailing
list in the meantime. Unfortunately, I don't think this is going to
be trivial.
If we do get this fixed for ext4, one interesting question is whether
people would accept a patch to backport the fixes to ext3, given the
grief this is causing the page I/O and VM routines. OTOH, reiser3
probably has the same problems, and I suspect the changes to ext3 to
cause it to avoid buffer heads, especially in order to support
filesystem blocksizes < pagesize, are going to be sufficiently risky
in terms of introducing regressions to ext3 that they would probably
be rejected on those grounds. So unfortunately, we probably are going
to have to support flushes via buffer heads for the foreseeable
future.
- Ted
On Fri, 29 Dec 2006 14:42:51 -0800 (PST)
Linus Torvalds <[email protected]> wrote:
>
>
> On Fri, 29 Dec 2006, Andrew Morton wrote:
> >
> > - The above change means that we do extra writeout. If a page is dirtied
> > once, kjournald will write it and then pdflush will come along and
> > needlessly write it again.
>
> There's zero extra writeout for any flushing that flushes BY PAGES.
>
> Only broken flushers that flush by buffer heads (which really really
> really shouldn't be done any more: welcome to the 21st century) will cause
> extra writeouts. And those extra writeouts are obviously required for all
> the dirty state to actually hit the disk - which is the point of the
> patch.
>
> So they're not "extra" - they are "required for correct working".
They're extra. As in "can be optimised away".
> But I can't stress the fact enough that people SHOULD NOT do writeback by
> buffer heads. The buffer head has been purely an "IO entity" for the last
> several years now, and it's not a cache entity.
The buffer_head is not an IO container. It is the kernel's core
representation of a disk block. Usually (but not always) it is backed by
some memory which is in pagecache. We can feed buffer_heads into IO
containers via submit_bh(), but that's far from the only thing we use
buffer_heads for. We should have done s/buffer_head/block/g years ago.
JBD implements physical block-based journalling, so it is 100% appropriate
that JBD deal with these disk blocks using their buffer_head
representation.
That being said, ordered-data mode isn't really part of the JBD journalling
system at all (the data doesn't get journalled!) - ordered-mode is an
add-on to the JBD journal to make the metadata which we're about to journal
point at more-likely-to-be-correct data.
JBD's ordered-mode writeback is just a sync and I see no conceptual
problems with killing its old buffer_head based sync and moving it into the
21st century.
> Anybody who does writeback
> by buffer heads is basically bypassing the real cache (the page cache),
> and that's why all the problems happen.
>
> I think ext3 is terminally crap by now. It still uses buffer heads in
> places where it really really shouldn't,
The ordered-data mode flush: sure. The rest of JBD's use of buffer_heads
is quite appropriate.
> and as a result, things like
> directory accesses are simply slower than they should be. Sadly, I don't
> think ext4 is going to fix any of this, either.
I thought I fixed the performance problem?
Somewhat nastily, but as ext3 directories are metadata it is appropriate
that modifications to them be done in terms of buffer_heads (ie: blocks).
> It's all just too inherently wrongly designed around the buffer head
> (which was correct in 1995, but hasn't been correct for a long time in the
> kernel any more).
>
> > - Poor old IO accounting broke again.
>
> No. That's why I used "set_page_dirty()" and did it that strange ugly way
> ("set page dirty, even though it's already dirty, and even though the very
> next thing we will do is TestClearPageDirty???").
nfs_set_page_dirty() and reiserfs_set_page_dirty() should now bail if
PageDirty() to avoid needless work.
> > - For a long time I've wanted to nuke the current ext3/jbd ordered-data
> > implementation altogether, and just make kjournald call into the
> > standard writeback code to do a standard superblock->inodes->pages walk.
>
> I really would like to see less of the buffer-head-based stuff, and yes,
> more of the normal inode page walking. I don't think you can "order"
> accesses within a page anyway, exactly because of memory mapping issues,
> so any page ordering is not about buffer heads on the page itself, it
> should be purely about metadata.
In this context ext3's "ordered" mode means "sync the file contents before
journalling the metadata which points at it".
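As a rough illustration of that ordering (a toy userspace model, not JBD code; every `toy_` name here is hypothetical), an ordered-mode commit can be pictured as writing out the dirty file data first and only then journalling the metadata that points at it:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operations a commit may issue, recorded in issue order. */
enum toy_op { TOY_OP_WRITE_DATA, TOY_OP_JOURNAL_METADATA };

#define TOY_LOG_MAX 16

struct toy_log {
	enum toy_op ops[TOY_LOG_MAX];
	size_t n;
};

void toy_log_add(struct toy_log *l, enum toy_op op)
{
	assert(l->n < TOY_LOG_MAX);
	l->ops[l->n++] = op;
}

/*
 * "Ordered" commit in the sense used above: sync the file contents,
 * then journal the metadata. The data itself is never journalled.
 */
void toy_ordered_commit(struct toy_log *l, size_t ndirty_data)
{
	for (size_t i = 0; i < ndirty_data; i++)
		toy_log_add(l, TOY_OP_WRITE_DATA);
	toy_log_add(l, TOY_OP_JOURNAL_METADATA);
}
```

The only property the sketch captures is the ordering; how the data blocks are found (by buffer_head or by walking pages) is exactly what the thread is arguing about.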
> > - It's pretty obnoxious that the VM now sets a clean page "dirty" and
> > then proceeds to modify its contents. It would be nice to stop doing
> > that.
>
> No. I think this really is the fundamental confusion people had. People
> thought that setting the page dirty meant that it was no longer being
> modified.
No. Setting a page (or bh, or inode) dirty means "this is known to have
been modified". ie: this cached entity is now out of sync with backing
store.
Ho hum. I don't care much, really. But then, I understand how all this
stuff works. Try explaining to someone the relationship between
pte-dirtiness, page-dirtiness, radix-tree-dirtiness and
buffer_head-dirtiness.
On Fri, 29 Dec 2006, Theodore Tso wrote:
>
> If we do get this fixed for ext4, one interesting question is whether
> people would accept a patch to backport the fixes to ext3, given the
> grief this is causing the page I/O and VM routines.
I don't think backporting is the smartest option (unless it's done _way_
later), but the real problem with it isn't actually the VM behaviour, but
simply the fact that cached performance absolutely _sucks_ with the buffer
cache.
With the physically indexed buffer cache thing, you end up always having
to do these complicated translations into block numbers for every single
access, and at some point when I benchmarked it, it was a huge overhead
for doing simple things like readdir.
It's also a major pain for read-ahead, exactly partly due to the high cost
of translation - because you can't cheaply check whether the next block is
there, the cost of even asking the question "should I try to read ahead?"
is much much higher. As a result, read-ahead is seriously limited, because
it's so expensive for the cached case (which is still hopefully the
_common_ case).
So because read-ahead is limited, the non-cached case then _really_ sucks.
It was somewhat fixed in a really god-awful fashion by having
ext3_readdir() actually do _readahead_ through the page cache, even though
it does everything else through the buffer cache. And that just happens to
work because we hopefully have physically contiguous blocks, but when that
isn't true, the readahead doesn't do squat.
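To make the translation cost concrete, here is a small userspace sketch (not kernel code; the `toy_` types and the block map are made up for illustration). With a physically indexed cache, even asking "is the next directory block cached?" needs a logical-to-physical lookup first; with a virtually indexed cache it is a plain index check:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_DIR_BLOCKS  8	/* logical blocks in the toy directory */
#define TOY_DISK_BLOCKS 64	/* physical blocks on the toy disk */

/* Hypothetical directory: logical block i lives at physical block map[i]. */
struct toy_dir {
	unsigned long map[TOY_DIR_BLOCKS];
	unsigned long translations;	/* counts logical->physical lookups */
};

/* Stand-in for the filesystem-specific block mapping step. */
unsigned long toy_bmap(struct toy_dir *d, size_t logical)
{
	assert(logical < TOY_DIR_BLOCKS);
	assert(d->map[logical] < TOY_DISK_BLOCKS);
	d->translations++;
	return d->map[logical];
}

struct toy_phys_cache { bool present[TOY_DISK_BLOCKS]; };	/* keyed by physical block */
struct toy_virt_cache { bool present[TOY_DIR_BLOCKS]; };	/* keyed by logical index */

/* Physically indexed: must translate before the cache can be asked. */
bool toy_next_cached_phys(struct toy_dir *d, const struct toy_phys_cache *c,
			  size_t logical)
{
	if (logical + 1 >= TOY_DIR_BLOCKS)
		return false;
	return c->present[toy_bmap(d, logical + 1)];
}

/* Virtually indexed: the next index is known without any translation. */
bool toy_next_cached_virt(const struct toy_virt_cache *c, size_t logical)
{
	if (logical + 1 >= TOY_DIR_BLOCKS)
		return false;
	return c->present[logical + 1];
}
```

In the toy, `translations` only grows on the physical path, which is the overhead being complained about for the fully cached case.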
It's really quite fundamentally broken. But none of that causes any
problems for the VM, since directories cannot be mmap'ed anyway. But it's
really pitiful, and it really doesn't work very well. Of course, other
filesystems _also_ suck at this, and other operating systems have even
MORE problems, so people don't always seem to realize how horribly
horribly broken this all is.
I really wish somebody would write a filesystem that did large cold-cache
directories well. Open some horrible file manager on /usr/bin with cold
caches, and weep. The biggest problem is the inode indirection, but at
some point when I looked at why it sucked, it was doing basically
synchronous single-buffer reads on the directory too, because readahead
didn't work properly.
I was hoping that something like SpadFS would actually take off, because
it seemed to do a lot of good design choices (having inodes in-line in the
directory for when there are no hardlinks is probably a requirement for a
good filesystem these days. The separate inode table had its uses, but
indirection in a filesystem really does suck, and stat information is too
important to be indirect unless it absolutely has to).
But I suspect it needs more than somebody who just wants to get his thesis
written ;)
Linus
On Fri, 29 Dec 2006 18:32:07 -0500
Theodore Tso <[email protected]> wrote:
> On Fri, Dec 29, 2006 at 02:42:51PM -0800, Linus Torvalds wrote:
> > I think ext3 is terminally crap by now. It still uses buffer heads in
> > places where it really really shouldn't, and as a result, things like
> > directory accesses are simply slower than they should be. Sadly, I don't
> > think ext4 is going to fix any of this, either.
>
> Not just ext3; ocfs2 is using the jbd layer as well. I think we're
> going to have to put this (a rework of jbd2 to use the page cache) on
> the ext4 todo list, and work with the ocfs2 folks to try to come up
> with something that suits their needs as well. Fortunately we have
> this filesystem/storage summit thing coming up in the next few months,
> and we can try to get some discussion going on the linux-ext4 mailing
> list in the meantime. Unfortunately, I don't think this is going to
> be trivial.
I suspect it would be insane to move any part of JBD (apart from the
ordered-data flush) to use pagecache. The whole thing is fundamentally
block-based. But only for metadata - there's no strong reason why ext3/4
needs to manipulate file data via buffer_heads if data=journal and chattr
+j aren't in use.
We could possibly move ext3/4 directories out of the blockdev pagecache and
into per-directory pagecache, but that wouldn't change anything - the
journalling would still be block-based.
Adam Richter spent considerable time a few years ago trying to make the
mpage code go direct-to-BIO in all cases and we eventually gave up. The
conceptual layering of page<->blocks<->bio is pretty clean, and it is hard
and ugly to fully optimise away the "block" bit in the middle.
buffer_heads become more important with large PAGE_CACHE_SIZE. I'd expect
nobh mode to be quite inefficient with some workloads on 64k pages. We
need that representation of the state (and location) of the block-sized
hunks which make up the page.
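One plausible reading of that concern, sketched as plain arithmetic (this is an interpretation, not something the mail spells out; the `toy_` helpers are hypothetical): without per-block state, a single dirty block in a 64k page means writing the whole page, whereas per-block state lets writeback touch only the dirty hunks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of block-sized hunks in one page. */
size_t toy_blocks_per_page(size_t page_size, size_t block_size)
{
	assert(block_size > 0 && page_size % block_size == 0);
	return page_size / block_size;
}

/* Bytes written when each block's dirty state is tracked (bit i = block i). */
size_t toy_bytes_with_block_state(uint64_t dirty_mask, size_t page_size,
				  size_t block_size)
{
	size_t nblocks = toy_blocks_per_page(page_size, block_size);
	size_t dirty = 0;

	assert(nblocks <= 64);	/* the mask only has 64 bits */
	for (size_t i = 0; i < nblocks; i++)
		if (dirty_mask & (UINT64_C(1) << i))
			dirty++;
	return dirty * block_size;
}

/* Bytes written when only a single per-page dirty bit exists. */
size_t toy_bytes_without_block_state(uint64_t dirty_mask, size_t page_size)
{
	return dirty_mask ? page_size : 0;
}
```

With a 64k page and 4k blocks there are 16 hunks per page, so one dirty block costs 4096 bytes of writeout with per-block state and 65536 bytes without it.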
> If we do get this fixed for ext4, one interesting question is whether
> people would accept a patch to backport the fixes to ext3, given the
> grief this is causing the page I/O and VM routines. OTOH, reiser3
> probably has the same problems, and I suspect the changes to ext3 to
> cause it to avoid buffer heads, especially in order to support
> filesystem blocksizes < pagesize, are going to be sufficiently risky
> in terms of introducing regressions to ext3 that they would probably
> be rejected on those grounds. So unfortunately, we probably are going
> to have to support flushes via buffer heads for the foreseeable
> future.
We'll see.
On Fri, 29 Dec 2006, Andrew Morton wrote:
>
> They're extra. As in "can be optimised away".
Sure. Don't use buffer heads.
> The buffer_head is not an IO container. It is the kernel's core
> representation of a disk block.
Please come back from the 90's.
The buffer heads are nothing but a mapping of where the hardware block is.
If you use it for anything else, you're basically screwed.
> JBD implements physical block-based journalling, so it is 100% appropriate
> that JBD deal with these disk blocks using their buffer_head
> representation.
And as long as it does that, you just have to face the fact that it's
going to perform like crap, including what you call "extra" writes, and
what I call "deal with it".
Btw, you can make pages be physically indexed too, but they obviously
(a) won't be coherent with any virtual mapping laid on top of it
(b) will be _physical_, so any readahead etc will be based on physical
addresses too.
> I thought I fixed the performance problem?
No, you papered over it, for the reasonably common case where things were
physically contiguous - exactly by using a physical page cache, so now it
can do read-ahead based on that. Then, because the pages contain buffer
heads, the directory accesses can look up buffers, and if it was all
physically contiguous, it all works fine.
But if you actually want virtually indexed caching (and all _users_ want
it), it really doesn't work.
> Somewhat nastily, but as ext3 directories are metadata it is appropriate
> that modifications to them be done in terms of buffer_heads (ie: blocks).
No. There is nothing "appropriate" about using buffer_heads for metadata.
It's quite proper - and a hell of a lot more efficient - to use virtual
page-caching for metadata too.
Look at the ext2 readdir() implementation, and compare it to the crapola
horror that is ext3. Guess what? ext2 uses virtually indexed metadata, and
as a result it is both simpler, smaller and a LOT faster than ext3 in
accessing that metadata.
Face it, Andrew, you're wrong on this one. Really. Just take a look at
ext2_readdir().
[ I'm not saying that ext2_readdir() is _beautiful_. If it had been
written with the page cache in mind, it would probably have been done
very differently. And it doesn't do any readahead, probably because
nobody cared enough, but it should be trivial to add, and it would
automatically "do the right thing" just because it's much easier at the
page cache level.
But I _am_ saying that compared to ext3, the ext2 readdir is a work of
art. ]
"metadata" has _zero_ to do with "physically indexed". There is no
correlation what-so-ever. If you think there is a correlation, it's all in
your mind.
Linus
On Fri, 29 Dec 2006 16:11:44 -0800 (PST)
Linus Torvalds <[email protected]> wrote:
>
>
> > JBD implements physical block-based journalling, so it is 100% appropriate
> > that JBD deal with these disk blocks using their buffer_head
> > representation.
>
> And as long as it does that, you just have to face the fact that it's
> going to perform like crap, including what you call "extra" writes, and
> what I call "deal with it".
It is quite tiresome to delete things which your interlocutor said and to
then restate them as if it were some sort of revelation.
> > Somewhat nastily, but as ext3 directories are metadata it is appropriate
> > that modifications to them be done in terms of buffer_heads (ie: blocks).
>
> No. There is nothing "appropriate" about using buffer_heads for metadata.
I said "modification".
> [stuff about directory reads elided]
On Fri, 29 Dec 2006, Andrew Morton wrote:
>
> Adam Richter spent considerable time a few years ago trying to make the
> mpage code go direct-to-BIO in all cases and we eventually gave up. The
> conceptual layering of page<->blocks<->bio is pretty clean, and it is hard
> and ugly to fully optimise away the "block" bit in the middle.
Using the buffer cache as a translation layer to the physical address is
fine. That's what _any_ block device will do.
I'm not at all saying that "buffer heads must go away". They work fine.
What I'm saying is that
- if you index by buffer heads, you're screwed.
- if you do IO by starting at buffer heads, you're screwed.
Both indexing and writeback decisions should be done at the page cache
layer. Then, when you actually need to do IO, you look at the buffers. But
you start from the "page". YOU SHOULD NEVER LOOK UP a buffer on its own
merits, and YOU SHOULD NEVER DO IO on a buffer head on its own cognizance.
So by all means keep the buffer heads as a way to keep the
"virtual->physical" translation. It's what they were designed for. But
they were _originally_ also designed for "lookup" and "driving the start
of IO", and that is wrong, and has been wrong for a long time now, because
- lookup based on physical address is fundamentally slow and inefficient.
You have to look up the virtual->physical translation somewhere else,
so it's by design an unnecessary indirection _and_ that "somewhere
else" is also by definition filesystem-specific, so you can't do any
of these things at the VFS layer.
Ergo: anything that needs to look up the physical address in order to
find the buffer head is BROKEN in this day and age. We look up the
_virtual_ page cache page, and then we can trivially find the buffer
heads within that page thanks to page->buffers.
Example: ext2 vs ext3 readdir. One of them sucks, the other doesn't.
- starting IO based on the physical entity is insane. It's insane exactly
_because_ the VM doesn't actually think in physical addresses, or in
buffer-sized blocks. The VM only really knows about whole pages, and
all the VM decisions fundamentally have to be page-based. We don't ever
"free a buffer". We free a whole page, and as such, doing writeback
based on buffers is pointless, because it doesn't actually say anything
about the "page state" which is what the VM tracks.
But neither of these means that "buffer_head" itself has to go away. They
both really boil down to the same thing: you should never KEY things by
the buffer head. All actions should be based on virtual indexes as far as
at all humanly possible.
Once you do lookup and locking and writeback _starting_ from the page,
it's then easy to look up the actual buffer head within the page, and use
that as a way to do the actual _IO_ on the physical address. So the buffer
heads still exist in ext2, for example, but they don't drive the show
quite as much.
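The "start from the page, then look at the buffers" rule can be sketched in userspace C roughly as follows. This is a toy model under stated assumptions, not kernel code: the `toy_` structures are hypothetical stand-ins for a mapping, its pages, and the per-page buffers that hold the virtual->physical translation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_BUFS_PER_PAGE 4	/* e.g. a 4k page made of 1k blocks */
#define TOY_NPAGES        4
#define TOY_MAX_IO        (TOY_BUFS_PER_PAGE * TOY_NPAGES)

struct toy_buffer {
	unsigned long blocknr;	/* cached virtual->physical translation */
	bool dirty;
};

struct toy_page {
	bool dirty;		/* the state writeback decisions key on */
	struct toy_buffer bufs[TOY_BUFS_PER_PAGE];
};

struct toy_mapping {
	struct toy_page pages[TOY_NPAGES];	/* indexed by virtual page index */
};

/* Dirtying: find the page by virtual index, then the buffer inside it. */
void toy_dirty_block(struct toy_mapping *m, size_t file_block)
{
	size_t index = file_block / TOY_BUFS_PER_PAGE;

	assert(index < TOY_NPAGES);
	m->pages[index].bufs[file_block % TOY_BUFS_PER_PAGE].dirty = true;
	m->pages[index].dirty = true;
}

/*
 * Writeback: the decision is made per page; buffers are consulted only
 * to learn which physical blocks to submit. Returns the number of
 * blocks "submitted" and stores their physical numbers in out[].
 */
size_t toy_writeback(struct toy_mapping *m, unsigned long out[TOY_MAX_IO])
{
	size_t n = 0;

	for (size_t i = 0; i < TOY_NPAGES; i++) {
		struct toy_page *p = &m->pages[i];

		if (!p->dirty)
			continue;
		p->dirty = false;
		for (size_t b = 0; b < TOY_BUFS_PER_PAGE; b++) {
			if (p->bufs[b].dirty) {
				p->bufs[b].dirty = false;
				out[n++] = p->bufs[b].blocknr;
			}
		}
	}
	return n;
}
```

Nothing in the toy ever looks a buffer up by its physical block number; the physical address only appears at the point where IO is issued.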
(They still do in some areas: the allocation bitmaps, the xattr code etc.
But as long as none of those have big VM footprints, and as long as no
_common_ operations really care deeply, and as long as those data
structures never need to be touched by the VM or VFS layer, nobody will
ever really care).
The directory case comes up just because "readdir()" actually is very
common, and sometimes very slow. And it can have a big VM working set
footprint ("find"), so trying to be page-based actually really helps,
because it all drives things like writeback on the _right_ issues, and we
can do things like LRU's and writeback decisions on the level that really
matters.
I actually suspect that the inode tables could benefit from being in the
page cache too (although I think that the inode buffer address is actually
"physical", so there's no indirection for inode tables, which means that
the virtual vs physical addressing doesn't matter). For directories, there
definitely is a big cost to continually doing the virtual->physical
translation all the time.
Linus
On Fri, 29 Dec 2006, Andrew Morton wrote:
>
> > > Somewhat nastily, but as ext3 directories are metadata it is appropriate
> > > that modifications to them be done in terms of buffer_heads (ie: blocks).
> >
> > No. There is nothing "appropriate" about using buffer_heads for metadata.
>
> I said "modification".
You said "metadata".
Why do you think directories are any different from files? Yes, they are
metadata. So what? What does that have to do with anything?
They should still use virtual indexes, the way files do. That doesn't
preclude them from using buffer-heads to mark their (partial-page)
modifications and for keeping the virtual->physical translations cached.
I mean, really. Look at ext2. It does exactly that. It keeps the
directories in the page cache - virtually indexed. And it even modifies
them there. Exactly the same way it modifies regular file data.
It all works exactly the same way it works for regular files. It uses
	page->mapping->a_ops->prepare_write(NULL, page, from, to);
	... do modification ...
	ext2_commit_chunk(page, from, to);
exactly the way regular file data works.
That's why I'm saying there is absolutely _zero_ thing about "metadata"
here, or even about "modifications". It all works better in a virtual
cache, because you get all the support that we give to page caches.
So I really don't understand why you make excuses for ext3 and talk about
"modifications" and "metadata". It was a fine design ten years ago. It's
not really very good any longer.
I suspect we're stuck with the design, but that doesn't make it any
_better_.
Linus
On Fri, 29 Dec 2006 16:58:41 -0800 (PST)
Linus Torvalds <[email protected]> wrote:
>
>
> On Fri, 29 Dec 2006, Andrew Morton wrote:
> >
> > > > Somewhat nastily, but as ext3 directories are metadata it is appropriate
> > > > that modifications to them be done in terms of buffer_heads (ie: blocks).
> > >
> > > No. There is nothing "appropriate" about using buffer_heads for metadata.
> >
> > I said "modification".
>
> You said "metadata".
>
> Why do you think directories are any different from files? Yes, they are
> metadata. So what? What does that have to do with anything?
We journal the contents of directories. Fully. So we handle their dirty
data at the block (ie: buffer_head) level. When someone tries to dirty
part of a directory we need to cheat and not mark that part of the page as
dirty and we need to then write the block to the journal and then mark the
block as really dirty for checkpointing (but still attached to the journal)
and all that goop.
The regular page-based writeback doesn't apply until the block has been
written to the journal. At that stage the block is considered dirty
against its real position on disk. It will then be written back by pdflush
via the blockdev inode -> blkdev_writepage(). Unless kjournald needs to do
an early flush to reclaim the journal space, in which case kjournald will
write the block itself.
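The lifecycle described in the two paragraphs above can be summarised as a small state machine. This is a deliberately simplified sketch (the states, events and `toy_blk_step()` are hypothetical and do not correspond to JBD's actual data structures); in particular, re-modifying a block that is still awaiting checkpoint is collapsed into a single state here.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified states of a journalled directory block. */
enum toy_blk_state {
	TOY_CLEAN,		/* in sync with its real on-disk position */
	TOY_IN_TRANSACTION,	/* modified, but not yet written to the journal */
	TOY_CHECKPOINT_DIRTY	/* journalled; now dirty against its real position */
};

enum toy_blk_event {
	TOY_EV_MODIFY,		/* someone dirties part of the directory */
	TOY_EV_JOURNAL_WRITE,	/* the block is written to the journal */
	TOY_EV_WRITEBACK	/* pdflush, or an early kjournald flush */
};

/* Returns 0 if the event is allowed in the current state, -1 otherwise. */
int toy_blk_step(enum toy_blk_state *s, enum toy_blk_event ev)
{
	assert(s != NULL);

	switch (*s) {
	case TOY_CLEAN:
		if (ev == TOY_EV_MODIFY) {
			*s = TOY_IN_TRANSACTION;
			return 0;
		}
		return -1;
	case TOY_IN_TRANSACTION:
		if (ev == TOY_EV_MODIFY)
			return 0;
		if (ev == TOY_EV_JOURNAL_WRITE) {
			*s = TOY_CHECKPOINT_DIRTY;
			return 0;
		}
		/* Regular writeback must not see the block yet. */
		return -1;
	case TOY_CHECKPOINT_DIRTY:
		if (ev == TOY_EV_WRITEBACK) {
			*s = TOY_CLEAN;
			return 0;
		}
		if (ev == TOY_EV_MODIFY) {
			*s = TOY_IN_TRANSACTION;	/* simplification, see above */
			return 0;
		}
		return -1;
	}
	return -1;
}
```

The rejected `TOY_EV_WRITEBACK` in the middle state is the "cheat" mentioned above: the block is modified, yet it is kept out of ordinary page-based writeback until the journal has it.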
>
> So I really don't understand why you make excuses for ext3 and talk about
> "modifications" and "metadata". It was a fine design ten years ago. It's
> not really very good any longer.
>
As I said in another apparently-neglected email:
: We could possibly move ext3/4 directories out of the blockdev pagecache and
: into per-directory pagecache, but that wouldn't change anything - the
: journalling would still be block-based.
We already have all the code in place to journal blocks which are cached in
an address_space other than the blockdev inode's: ext3_journalled_aops.
On Fri, Dec 29, 2006 at 01:19:46PM +0100, Ingo Molnar wrote:
> i've extended the tracer in -rt to trace all relevant pagetable,
> pagecache, buffer-cache and IO events and coupled the tracer to your
> test.c code. The corruption happens here:
>
> test-2126 0.... 3756170us+: trace_page (cf20ebd8 b6a2c000 0)
> pdflush-2006 0.... 6432909us+: trace_page (cf20ebd8 b6a2c000 4200420)
> test-2126 0.... 8135596us+: trace_page (cf20ebd8 b6a2c000 4200420)
> test-2126 0D... 9012933us+: do_page_fault (8048900 4 b6a2c000)
> test-2126 0.... 9023278us+: trace_page (cf262f24 b6a2c000 0)
> test-2126 0.... 9023305us > sys_prctl (000000d8 b6a2c000 000000ac)
This tracer definitely looks interesting. Could you send a splitout
patch with it to lkml for review?
[Cc:-ed lkml]
* Christoph Hellwig <[email protected]> wrote:
> On Fri, Dec 29, 2006 at 01:19:46PM +0100, Ingo Molnar wrote:
> > i've extended the tracer in -rt to trace all relevant pagetable,
> > pagecache, buffer-cache and IO events and coupled the tracer to your
> > test.c code. The corruption happens here:
> >
> > test-2126 0.... 3756170us+: trace_page (cf20ebd8 b6a2c000 0)
> > pdflush-2006 0.... 6432909us+: trace_page (cf20ebd8 b6a2c000 4200420)
> > test-2126 0.... 8135596us+: trace_page (cf20ebd8 b6a2c000 4200420)
> > test-2126 0D... 9012933us+: do_page_fault (8048900 4 b6a2c000)
> > test-2126 0.... 9023278us+: trace_page (cf262f24 b6a2c000 0)
> > test-2126 0.... 9023305us > sys_prctl (000000d8 b6a2c000 000000ac)
>
> This tracer definitely looks interesting. Could you send a splitout
> patch with it to lkml for review?
Find it below - it's on top of the tracer included in 2.6.20-rc2-rt3.
it's very ad-hoc, based on Linus' test utility. I can write such a
tracer in 30 minutes so i usually throw them away. I literally wrote
dozens of tracer variants for specific bugs in the past few years.
Note: this particular one tracks page contents as well from
kernel-space, that's how i was able to see where the corruption
happened. That assumes that there's no highmem on the box. Also, the pte
value tracking portion is only for i386 - etc. etc. Note: for the bug to
be visible i didnt need the per-page tracking portion of the tracer -
the key was to track page contents, and to track how virtual addresses
map to physical pages, and how their IO happens.
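For readers who only want the gist of the per-page tracking, the patch below keeps a small ring of events in each struct page, advances an index before every write, and prints from the slot after the current index, wrapping around. A minimal userspace sketch of that index arithmetic (the `toy_` names and the ring size of 4 are assumptions for illustration; the patch uses PAGE_TRACE_NR entries and also stores a stack trace):

```c
#include <assert.h>
#include <stddef.h>

#define TOY_TRACE_NR 4	/* the patch uses PAGE_TRACE_NR (20) */

struct toy_entry {
	unsigned long info;	/* caller-supplied tag, e.g. a return value */
	unsigned long content;	/* first word of the page at trace time */
};

struct toy_ring {
	int idx;		/* slot of the most recent entry */
	struct toy_entry e[TOY_TRACE_NR];
};

/* Record one event, overwriting the oldest slot once the ring is full. */
void toy_trace(struct toy_ring *r, unsigned long info, unsigned long content)
{
	assert(r->idx >= 0 && r->idx < TOY_TRACE_NR);
	r->idx = (r->idx + 1) % TOY_TRACE_NR;
	r->e[r->idx].info = info;
	r->e[r->idx].content = content;
}

/* Copy all slots into out[], oldest first, newest last. */
void toy_dump(const struct toy_ring *r, struct toy_entry out[TOY_TRACE_NR])
{
	int i = r->idx;

	for (size_t n = 0; n < TOY_TRACE_NR; n++) {
		i = (i + 1) % TOY_TRACE_NR;
		out[n] = r->e[i];
	}
}
```

Recording the page's first word on every event is what makes a content change visible between two otherwise unremarkable events in the ring.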
This patch is /not/ for merging: this patch too underscores my years-long
experience that static tracepoints included in the generic kernel are
just pointless in general - i dont want to see such cruft in the kernel,
and they amass with time. The union of all ad-hoc tracing hacks i had in
the past would be thousands of static tracepoints - and that's just
/me/. If we pick only a handful they wont help us find the most
difficult bugs and they'll only create additional 'demand' for 'more' -
leading to an endless fight.
The best method i think is to use the source code itself (Linus used
printks) - or if any infrastructure is to be used then ad-hoc
"scriptlets" via SystemTap can find the really difficult bugs - and in
the long run systemtap suits that purpose best. If systemtap were
ubiquitous we could have sent scriptlets to users who experienced the
bugs, for them to install them dynamically. Systemtap makes it plain
obvious that tracepoints are 1) detached from the source code and 2)
are temporary and ad-hoc in nature. It doesnt create undue pressure to
include more and more static tracepoints.
Ingo
----------->
fs/buffer.c | 1
include/asm-i386/pgtable-2level.h | 4 +
include/linux/mm_types.h | 22 +++++++++
kernel/sys.c | 15 ++++++
mm/Makefile | 2
mm/memory.c | 2
mm/page-writeback.c | 33 ++++++++++++--
mm/page_alloc.c | 3 +
mm/page_trace.c | 84 ++++++++++++++++++++++++++++++++++++++
mm/rmap.c | 2
10 files changed, 159 insertions(+), 9 deletions(-)
Index: linux/fs/buffer.c
===================================================================
--- linux.orig/fs/buffer.c
+++ linux/fs/buffer.c
@@ -1590,6 +1590,7 @@ static int __block_write_full_page(struc
int nr_underway = 0;
BUG_ON(!PageLocked(page));
+ trace_page(page, blocksize);
last_block = (i_size_read(inode) - 1) >> inode->i_blkbits;
Index: linux/include/asm-i386/pgtable-2level.h
===================================================================
--- linux.orig/include/asm-i386/pgtable-2level.h
+++ linux/include/asm-i386/pgtable-2level.h
@@ -13,7 +13,9 @@
*/
#ifndef CONFIG_PARAVIRT
#define set_pte(pteptr, pteval) (*(pteptr) = pteval)
-#define set_pte_at(mm,addr,ptep,pteval) set_pte(ptep,pteval)
+struct mm_struct;
+extern void trace_set_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pte_val);
+#define set_pte_at(mm,addr,ptep,pteval) trace_set_pte_at(mm,addr,ptep,pteval)
#define set_pmd(pmdptr, pmdval) (*(pmdptr) = (pmdval))
#endif
Index: linux/include/linux/mm_types.h
===================================================================
--- linux.orig/include/linux/mm_types.h
+++ linux/include/linux/mm_types.h
@@ -5,9 +5,29 @@
#include <linux/threads.h>
#include <linux/list.h>
#include <linux/spinlock.h>
+#include <linux/stacktrace.h>
struct address_space;
+struct page;
+struct seq_file;
+
+#define PAGE_TRACE_DEPTH 16
+#define PAGE_TRACE_NR 20
+
+struct page_trace_entry {
+ unsigned long timestamp;
+ char comm[17];
+ int pid;
+ int nr_entries;
+ unsigned long info;
+ unsigned long content;
+ unsigned long entries[PAGE_TRACE_DEPTH];
+};
+
+extern void trace_page(struct page *page, unsigned long info);
+extern void print_page_trace(struct seq_file *m, struct page *page);
+
/*
* Each physical page in the system has a struct page associated with
* it to keep track of whatever it is we are using the page for at the
@@ -62,6 +82,8 @@ struct page {
void *virtual; /* Kernel virtual address (NULL if
not kmapped, ie. highmem) */
#endif /* WANT_PAGE_VIRTUAL */
+ int trace_idx;
+ struct page_trace_entry trace[PAGE_TRACE_NR];
};
#endif /* _LINUX_MM_TYPES_H */
Index: linux/kernel/sys.c
===================================================================
--- linux.orig/kernel/sys.c
+++ linux/kernel/sys.c
@@ -2067,6 +2067,21 @@ asmlinkage long sys_prctl(int option, un
{
long error;
+ if (option == 999) {
+ unsigned long addr = arg2;
+ struct vm_area_struct *vma = find_vma(current->mm, addr);
+ struct page *page = NULL;
+
+ printk("page trace, got addr %08lx, vma %p\n", addr, vma);
+ if (vma) {
+ page = follow_page(vma, addr, FOLL_GET);
+ if (page) {
+ print_page_trace(NULL, page);
+ put_page(page);
+ }
+ }
+ return 0;
+ }
#ifdef CONFIG_EVENT_TRACE
if (option == PR_SET_TRACING) {
if (arg2)
Index: linux/mm/Makefile
===================================================================
--- linux.orig/mm/Makefile
+++ linux/mm/Makefile
@@ -9,7 +9,7 @@ mmu-$(CONFIG_MMU) := fremap.o highmem.o
obj-y := bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \
page_alloc.o page-writeback.o pdflush.o \
- readahead.o swap.o truncate.o vmscan.o \
+ readahead.o swap.o truncate.o vmscan.o page_trace.o \
prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \
$(mmu-y)
Index: linux/mm/memory.c
===================================================================
--- linux.orig/mm/memory.c
+++ linux/mm/memory.c
@@ -451,6 +451,8 @@ struct page *vm_normal_page(struct vm_ar
* The PAGE_ZERO() pages and various VDSO mappings can
* cause them to exist.
*/
+
+ trace_page(pfn_to_page(pfn), addr);
return pfn_to_page(pfn);
}
Index: linux/mm/page-writeback.c
===================================================================
--- linux.orig/mm/page-writeback.c
+++ linux/mm/page-writeback.c
@@ -762,8 +762,10 @@ int __set_page_dirty_nobuffers(struct pa
struct address_space *mapping = page_mapping(page);
struct address_space *mapping2;
- if (!mapping)
+ if (!mapping) {
+ trace_page(page, 1);
return 1;
+ }
write_lock_irq(&mapping->tree_lock);
mapping2 = page_mapping(page);
@@ -781,8 +783,10 @@ int __set_page_dirty_nobuffers(struct pa
/* !PageAnon && !swapper_space */
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
}
+ trace_page(page, 1);
return 1;
}
+ trace_page(page, 0);
return 0;
}
EXPORT_SYMBOL(__set_page_dirty_nobuffers);
@@ -806,6 +810,7 @@ EXPORT_SYMBOL(redirty_page_for_writepage
int fastcall set_page_dirty(struct page *page)
{
struct address_space *mapping = page_mapping(page);
+ int ret;
if (likely(mapping)) {
int (*spd)(struct page *) = mapping->a_ops->set_page_dirty;
@@ -813,12 +818,17 @@ int fastcall set_page_dirty(struct page
if (!spd)
spd = __set_page_dirty_buffers;
#endif
- return (*spd)(page);
+ ret = (*spd)(page);
+ trace_page(page, ret);
+ return ret;
}
if (!PageDirty(page)) {
- if (!TestSetPageDirty(page))
+ if (!TestSetPageDirty(page)) {
+ trace_page(page, 1);
return 1;
+ }
}
+ trace_page(page, 0);
return 0;
}
EXPORT_SYMBOL(set_page_dirty);
@@ -840,6 +850,7 @@ int set_page_dirty_lock(struct page *pag
lock_page_nosync(page);
ret = set_page_dirty(page);
unlock_page(page);
+ trace_page(page, ret);
return ret;
}
EXPORT_SYMBOL(set_page_dirty_lock);
@@ -915,13 +926,17 @@ int test_clear_page_writeback(struct pag
write_lock_irqsave(&mapping->tree_lock, flags);
ret = TestClearPageWriteback(page);
- if (ret)
+ trace_page(page, ret);
+ if (ret) {
radix_tree_tag_clear(&mapping->page_tree,
page_index(page),
PAGECACHE_TAG_WRITEBACK);
+ trace_page(page, ret);
+ }
write_unlock_irqrestore(&mapping->tree_lock, flags);
} else {
ret = TestClearPageWriteback(page);
+ trace_page(page, ret);
}
return ret;
}
@@ -936,17 +951,23 @@ int test_set_page_writeback(struct page
write_lock_irqsave(&mapping->tree_lock, flags);
ret = TestSetPageWriteback(page);
- if (!ret)
+ trace_page(page, ret);
+ if (!ret) {
radix_tree_tag_set(&mapping->page_tree,
page_index(page),
PAGECACHE_TAG_WRITEBACK);
- if (!PageDirty(page))
+ trace_page(page, ret);
+ }
+ if (!PageDirty(page)) {
radix_tree_tag_clear(&mapping->page_tree,
page_index(page),
PAGECACHE_TAG_DIRTY);
+ trace_page(page, ret);
+ }
write_unlock_irqrestore(&mapping->tree_lock, flags);
} else {
ret = TestSetPageWriteback(page);
+ trace_page(page, ret);
}
return ret;
Index: linux/mm/page_alloc.c
===================================================================
--- linux.orig/mm/page_alloc.c
+++ linux/mm/page_alloc.c
@@ -1420,6 +1420,8 @@ nopage:
show_mem();
}
got_pg:
+ if (page)
+ trace_page(page, order);
return page;
}
@@ -1468,6 +1470,7 @@ void __pagevec_free(struct pagevec *pvec
fastcall void __free_pages(struct page *page, unsigned int order)
{
if (put_page_testzero(page)) {
+ trace_page(page, order);
if (order == 0)
free_hot_page(page);
else
Index: linux/mm/page_trace.c
===================================================================
--- /dev/null
+++ linux/mm/page_trace.c
@@ -0,0 +1,84 @@
+
+#include <linux/seq_file.h>
+#include <linux/mm.h>
+#include <linux/sched.h>
+
+void trace_page(struct page *page, unsigned long info)
+{
+ struct page_trace_entry *entry;
+ struct stack_trace trace;
+ unsigned long flags, content;
+ unsigned long *addr;
+
+ addr = (unsigned long *)page_address(page);
+ if (addr)
+ content = *addr;
+ else
+ content = 0x12344321;
+
+ trace_special((unsigned long)page, info, content);
+ trace_special_sym();
+
+ local_irq_save(flags);
+ page->trace_idx = (page->trace_idx + 1) % PAGE_TRACE_NR;
+ entry = page->trace + page->trace_idx;
+ trace.nr_entries = 0;
+ trace.max_entries = PAGE_TRACE_DEPTH;
+ trace.entries = entry->entries;
+ trace.skip = 3;
+ trace.all_contexts = 0;
+ save_stack_trace(&trace, NULL);
+ entry->nr_entries = trace.nr_entries;
+ entry->timestamp = jiffies - INITIAL_JIFFIES;
+ entry->pid = current->pid;
+ entry->info = info;
+ entry->content = content;
+ memcpy(entry->comm, current->comm, TASK_COMM_LEN);
+ local_irq_restore(flags);
+}
+
+static void print_page_trace_entry(struct seq_file *m,
+ struct page_trace_entry *entry, int idx)
+{
+ struct stack_trace trace;
+ SEQ_printf(m, "#%02d, %06ld.%03ld, %-16s:%d, (#%d): content: %08lx, info: %08lx\n",
+ idx, entry->timestamp / HZ, entry->timestamp % HZ, entry->comm, entry->pid,
+ entry->nr_entries, entry->content, entry->info);
+
+ trace.nr_entries = entry->nr_entries;
+ trace.entries = entry->entries;
+ print_stack_trace(&trace, 2);
+ SEQ_printf(m, "\n");
+}
+
+void print_page_trace(struct seq_file *m, struct page *page)
+{
+ int i, i0;
+
+ SEQ_printf(m, "printing page %p's events:\n", page);
+
+ i0 = i = page->trace_idx;
+ do {
+ i = (i + 1) % PAGE_TRACE_NR;
+ print_page_trace_entry(m, page->trace + i, i);
+ } while (i != i0);
+}
+
+static void trace_pte(pte_t pte, unsigned long addr)
+{
+ unsigned long pfn;
+
+ if (pte_present(pte)) {
+ pfn = pte_pfn(pte);
+ if (pfn_valid(pfn))
+ trace_page(pfn_to_page(pfn), addr);
+ }
+}
+
+void trace_set_pte_at(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, pte_t pteval)
+{
+ trace_pte(*ptep, addr);
+ set_pte(ptep, pteval);
+ trace_pte(pteval, addr);
+}
Index: linux/mm/rmap.c
===================================================================
--- linux.orig/mm/rmap.c
+++ linux/mm/rmap.c
@@ -452,7 +452,7 @@ static int page_mkclean_one(struct page
entry = ptep_clear_flush(vma, address, pte);
entry = pte_wrprotect(entry);
entry = pte_mkclean(entry);
- set_pte_at(vma, address, pte, entry);
+ set_pte_at(mm, address, pte, entry);
lazy_mmu_prot_update(entry);
ret = 1;
}
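
For readers following the patch, here is a minimal user-space sketch of the ring-buffer indexing that trace_page() and print_page_trace() rely on: the index is advanced before the slot is written, so trace_idx always points at the newest entry and the dump starts one slot past it, at the oldest. Everything named here (TRACE_NR, struct trace_entry, struct traced_obj, record(), dump()) is a hypothetical stand-in and not part of the patch. The `used` flag is also an assumption added for the sketch; the patch itself prints every slot, filled or not.

```c
#include <assert.h>
#include <stddef.h>

#define TRACE_NR 4              /* hypothetical stand-in for PAGE_TRACE_NR */

struct trace_entry {            /* hypothetical, far smaller than page_trace_entry */
    unsigned long info;
    int used;                   /* sketch-only: marks slots that were written */
};

struct traced_obj {             /* stands in for struct page; zero-initialise before use */
    int trace_idx;              /* index of the most recently written slot */
    struct trace_entry trace[TRACE_NR];
};

/* Record one event: advance the index first, then overwrite that slot. */
void record(struct traced_obj *obj, unsigned long info)
{
    assert(obj != NULL);
    obj->trace_idx = (obj->trace_idx + 1) % TRACE_NR;
    obj->trace[obj->trace_idx].info = info;
    obj->trace[obj->trace_idx].used = 1;
}

/*
 * Walk the ring from the oldest slot to the newest, copying the info
 * values of written slots into out[] (which must hold TRACE_NR items).
 * Returns the number of values copied.
 */
int dump(const struct traced_obj *obj, unsigned long *out)
{
    int n = 0;
    int i0, i;

    assert(obj != NULL && out != NULL);
    i0 = i = obj->trace_idx;
    do {
        i = (i + 1) % TRACE_NR;         /* first iteration lands on the oldest slot */
        if (obj->trace[i].used)
            out[n++] = obj->trace[i].info;
    } while (i != i0);                  /* last slot visited is the newest */
    return n;
}
```

Once more than TRACE_NR events have been recorded, the oldest ones are silently overwritten, which is why the per-page history in the patch only ever shows the last PAGE_TRACE_NR events.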
On Tue, Jan 02, 2007 at 01:06:34PM +0100, Ingo Molnar wrote:
> Find it below - it's on top of the tracer included in 2.6.20-rc2-rt3.
> it's very ad-hoc, based on Linus' test utility. I can write such a
> tracer in 30 minutes so i usually throw them away. I literally wrote
> dozens of tracer variants for specific bugs in the past few years.
Ah, I thought this was a general purpose tracer. Question solved, thanks :)
I was just tired of writing my own special purpose tracers all the time
as well.
On 12/27/06, Linus Torvalds <[email protected]> wrote:
> What would also actually be interesting is whether somebody can reproduce
> this on Reiserfs, for example. I _think_ all the reports I've seen are on
> ext2 or ext3, and if this is somehow writeback-related, it could be some
> bug that is just shared between the two by virtue of them still having a
> lot of stuff in common.
>
> Linus
I've been following this thread for a while now as I started
experiencing file corruption in rtorrent when I upgraded to 2.6.19. I
am using reiserfs.
--
Tom Lanyon
On 1/7/07, Tom Lanyon <[email protected]> wrote:
> I've been following this thread for a while now as I started
> experiencing file corruption in rtorrent when I upgraded to 2.6.19. I
> am using reiserfs.
However, moving to 2.6.20-rc3 does indeed seem to fix the issue thus far...
--
Tom Lanyon
On Sun, 7 Jan 2007 12:36:18 +1030
"Tom Lanyon" <[email protected]> wrote:
> On 12/27/06, Linus Torvalds <[email protected]> wrote:
> > What would also actually be interesting is whether somebody can reproduce
> > this on Reiserfs, for example. I _think_ all the reports I've seen are on
> > ext2 or ext3, and if this is somehow writeback-related, it could be some
> > bug that is just shared between the two by virtue of them still having a
> > lot of stuff in common.
> >
> > Linus
>
> I've been following this thread for a while now as I started
> experiencing file corruption in rtorrent when I upgraded to 2.6.19. I
> am using reiserfs.
reiserfs defaults to data=ordered, so it's quite possibly the same bug.