2008-06-17 06:16:17

by Bron Gondwana

[permalink] [raw]
Subject: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)

Background: we recently upgraded one of our 64bit kernel
machines to 2.6.25 and discovered that Cyrus skiplist
files were becoming randomly corrupted. This is a machine
with a 32bit Debian Etch userland but a 64bit kernel with
32bit support.

We run this way so we can actually use the 12Gb memory as
cache without running out of inode space, but don't have
to support two different sets of userland across our
servers.


The symptom - 16 to 24 bytes of zero appearing "randomly"
within the file after a "checkpoint" (file rewrite,
skipping stale copies of records).

Further investigation by retaining the pre-checkpoint files
showed that those bytes were the last ones of a page in the
original file.


Attached is a small C program which recreates the actions
that Cyrus takes, and uses record lengths identical to a
known-broken skiplist file on our systems.

Using this I was able to bisect the kernel to find the
commit which caused the problem:


08291429cfa6258c4cd95d8833beb40f828b194e is first bad commit
commit 08291429cfa6258c4cd95d8833beb40f828b194e
Author: Nick Piggin <[email protected]>
Date: Tue Oct 16 01:24:59 2007 -0700

mm: fix pagecache write deadlocks

Modify the core write() code so that it won't take a pagefault while holding a
lock on the pagecache page. There are a number of different deadlocks possible
if we try to do such a thing:

[...]

Signed-off-by: Nick Piggin <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>


For Cyrus users, this is a really serious bug - occasionally
the zeros will hit a "navigational component" of the file,
causing crashes and being noticeable. Most of the time
(including this example) it will just cause silent corruption
and data loss.

I suspect this will be visible to users as large swathes of
messages becoming unread, and if it hits the mailboxes.db,
large swathes of mailboxes just disappearing. Not good.


I apologise for the length of the attached C program. I tried
to make it shorter, but kept not tickling the bug. There's also
another advantage of keeping it like this - it closely mirrors the
Cyrus behaviour, to the point where the output is a valid skiplist
file.

It also has a "magic" mode, just pass a second parameter. It will
read through the mapped memory in order before the checkpoint.
This makes the bug disappear.

Let me know if there's anything else I can do to make this report
clearer. I've had a quick glance around the code, but especially
since it's 64 bit kernel only bug (I tested by rebooting the test
machine with a 2.6.25.3 32 bit kernel I had lying around and the
bug was not visible there. I've also repeated the test on my Ubuntu
desktop machine with the shipped kernel vs my hand-compiled
known-bad kernel, and tested with a 64bit userland in a chroot as
well).

Regards,

Bron.

--
Bron Gondwana
[email protected]


Attachments:
(No filename) (2.89 kB)
maptest.c (9.59 kB)
Download all attachments

2008-06-17 06:26:22

by Bron Gondwana

[permalink] [raw]
Subject: Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)


On Tue, 17 Jun 2008 16:00:10 +1000, "Bron Gondwana" <[email protected]> said:
> I apologise for the length of the attached C program. I tried
> to make it shorter, but kept not tickling the bug. There's also
> another advantage of keeping it like this - it closely mirrors the
> Cyrus behaviour, to the point where the output is a valid skiplist
> file.

And I appear to have sent the one without the usage comments at the top.
Here they are:

* USAGE:
* $ gcc -g maptest.c
* $ ./a.out z
* $ hexdump -C z.NEW
*
* Notice the block of zero byte records appearing within
* the output.
*
* Run with a second argument:
*
* $ ./a.out z 1
* $ hexdump -C z.NEW
*
* No more zeros!

Bron.
--
Bron Gondwana
[email protected]

2008-06-17 17:09:19

by Linus Torvalds

[permalink] [raw]
Subject: Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)



On Tue, 17 Jun 2008, Bron Gondwana wrote:
>
> And I appear to have sent the one without the usage comments at the top.
> Here they are:

Very interesting. There's certainly something there.

That said, there's a distracting bug which is visible when doing an strace

lseek(4, 140333890921392, SEEK_SET) = -1 EINVAL (Invalid argument)
write(4, "\0\0\0\0", 4) = 4

which is from that

lseek(newfd, mapbase + offset + size - 8, 0);
write(newfd, (char *) &zero, 4);

where the addition of "mapbase" is insane. So that will write zeroes to
the wrong part of the file (offset 64, to be exact). And that will get
overwritten by the next write, making it all look entirely insane.

That said, that bug may be distracting, but it seems to have nothign at
all to do with the actual problem. The bug seems to happen only when the
file is not pre-paged in.

Nick?

Linus

2008-06-17 17:45:26

by Linus Torvalds

[permalink] [raw]
Subject: Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)



On Tue, 17 Jun 2008, Linus Torvalds wrote:
>
> That said, that bug may be distracting, but it seems to have nothign at
> all to do with the actual problem. The bug seems to happen only when the
> file is not pre-paged in.

Bron, does this untested patch hide the bug?

> Nick?

I don't think this patch is correct, because it doesn't really fix the
basic issue (the code should do the right thing even if a page isn't
there), but it might hide it by faulting in the whole "bytes" range rather
than just the first iov.

So Nick, it's still over to you, but if this does hide it, then that's an
interesting detail in itself.

Linus

---
mm/filemap.c | 17 ++++++++++++++---
1 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 1e6a7d3..0080a27 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1808,9 +1808,20 @@ EXPORT_SYMBOL(iov_iter_advance);
*/
int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes)
{
- char __user *buf = i->iov->iov_base + i->iov_offset;
- bytes = min(bytes, i->iov->iov_len - i->iov_offset);
- return fault_in_pages_readable(buf, bytes);
+ unsigned long offset = i->iov_offset;
+ const struct iovec *iov = i->iov;
+
+ while (bytes) {
+ char __user *buf = iov->iov_base + offset;
+ size_t n = min(bytes, iov->iov_len - offset);
+
+ if (fault_in_pages_readable(buf, n))
+ return -EFAULT;
+ bytes -= n;
+ offset = 0;
+ iov++;
+ }
+ return 0;
}
EXPORT_SYMBOL(iov_iter_fault_in_readable);

2008-06-17 20:17:51

by Linus Torvalds

[permalink] [raw]
Subject: Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)



On Tue, 17 Jun 2008, Linus Torvalds wrote:
>
> Bron, does this untested patch hide the bug?

Ok, it does hide the bug.

> I don't think this patch is correct, because it doesn't really fix the
> basic issue (the code should do the right thing even if a page isn't
> there), but it might hide it by faulting in the whole "bytes" range rather
> than just the first iov.
>
> So Nick, it's still over to you, but if this does hide it, then that's an
> interesting detail in itself.

I actually am starting to think that the bug is in
__copy_to_user_inatomic_nocache().

The return value of that thing in the case of a failure seems to be
totally wrong. In particular, since that function does an unrolled loop of
accesses like so:

.Ls1: movq (%rsi),%r11
.Ls2: movq 1*8(%rsi),%r8
.Ls3: movq 2*8(%rsi),%r9
.Ls4: movq 3*8(%rsi),%r10
.Ld1: movnti %r11,(%rdi)
.Ld2: movnti %r8,1*8(%rdi)
.Ld3: movnti %r9,2*8(%rdi)
.Ld4: movnti %r10,3*8(%rdi)

it may have done three of the 64-bit loads, and then trap on the fouth:
but since it hasn't done a _single_ of the stores, it shouldn't count as
any different whether it traps on any of .Ls[1-4]. But that code
definitely seems to think it makes a difference whether the exception
happened at Ls1 or at Ls4, even though both points have copied _exactly_
as many bytes when the exception happens.

So I'm starting to think the bug is all in there, not in the VM itself.
See arch/x86/lib/copy_user_nocache.S.

In fact, even the comment there implies that the author didn't know or
care about returning the correct value.

Andi and Ingo added to Cc.

Linus

2008-06-17 20:42:20

by Linus Torvalds

[permalink] [raw]
Subject: Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)



On Tue, 17 Jun 2008, Linus Torvalds wrote:
>
> I actually am starting to think that the bug is in
> __copy_to_user_inatomic_nocache().

Confirmed.

The uncached user copies are totally broken. The number of bytes left
uncopied is simply wrong, because of how it does that unrolled loop and
doesn't account for the fact that just doing loads does not actually
increase the number of bytes copied at all.

So because the "copy_to_user_inatomic()" logic cares _deeply_ about how
many bytes were actually copied, when the copy count is wrong, the code
ends up thinking that it copied more bytes than it actually did, resulting
in the corruption in the page cache.

Nasty.

That whole file is a mess. Sadly, so is the regular "copy_user_64.S" too
(it has the same totally broken comment, too!), this is not just the
uncached version.

And the only reason that it only shows up with the uncached version in
_practice_ is that the routine that uses the x86 string instructions (ie
the "rep movsq" in copy_user_generic_string) actually gets this all right.

So the bug is hidden in that case - which is most CPU's out there (all
CPU's that have X86_FEATURE_REP_GOOD set).

I don't think that code is reasonably salvageable. Damn.

Linus

2008-06-17 21:00:18

by Andi Kleen

[permalink] [raw]
Subject: Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)

Linus Torvalds <[email protected]> writes:

>
> So I'm starting to think the bug is all in there, not in the VM itself.
> See arch/x86/lib/copy_user_nocache.S.

The x86-64 copy_*_user functions were always designed to return errors
both ways (as in both for load and for store). That's needed because
the loops are shared for copy_to_user and copy_from_user. That's normally
ok because when you do _to_user you shouldn't fault on the loads
and vice versa. If a caller does that it's buggy.

-Andi

2008-06-17 21:07:23

by Linus Torvalds

[permalink] [raw]
Subject: Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)



On Tue, 17 Jun 2008, Linus Torvalds wrote:
>
> I don't think that code is reasonably salvageable. Damn.

Hmm. Something like this *may* salvage it.

Untested, so far (I'll reboot and test soon enough), but even if it fixes
things, it's not really very good.

Why?

Because of the unrolling, we may be doing 32 bytes worth of reads from
user space before doing a *single* write, so if we have a user copy from
the end of a page (say, 16 bytes from the end), we'd take the fault on the
third 8-byte load, and not copy anything at all.

So instead of copying 16 bytes and then returning "I copied 16 bytes", it
will copy _zero_ bytes, and now return "I copied 0 bytes" (it used to
incorrectly return that it copied 16 bytes because it could _read_ 16
bytes even though it never wrote them! Totally buggy!).

But I think the mm/filemap.c routines handle this case correctly (because
of how it probes at least the first iovec), so at least copied will
generally not be zero. But as mentioned, this isn't tested yet.

It would be better to not do the unrolling, or at least do the faulting
behaviour correctly so that we fall back on a byte-for-byte copy when it
faults. But not even the "rep movs" version has ever been _that_ careful,
so I guess that's ok.

Somebody should _really_ double-check this, and it would be wonderful if
somebody can come up with something better than this patch.

Linus
---
arch/x86/lib/copy_user_64.S | 25 +++++++++++--------------
arch/x86/lib/copy_user_nocache_64.S | 25 +++++++++++--------------
2 files changed, 22 insertions(+), 28 deletions(-)

diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
index 70bebd3..ee1c3f6 100644
--- a/arch/x86/lib/copy_user_64.S
+++ b/arch/x86/lib/copy_user_64.S
@@ -217,19 +217,19 @@ ENTRY(copy_user_generic_unrolled)
/* table sorted by exception address */
.section __ex_table,"a"
.align 8
- .quad .Ls1,.Ls1e
- .quad .Ls2,.Ls2e
- .quad .Ls3,.Ls3e
- .quad .Ls4,.Ls4e
- .quad .Ld1,.Ls1e
+ .quad .Ls1,.Ls1e /* Ls1-Ls4 have copied zero bytes */
+ .quad .Ls2,.Ls1e
+ .quad .Ls3,.Ls1e
+ .quad .Ls4,.Ls1e
+ .quad .Ld1,.Ls1e /* Ld1-Ld4 have copied 0-24 bytes */
.quad .Ld2,.Ls2e
.quad .Ld3,.Ls3e
.quad .Ld4,.Ls4e
- .quad .Ls5,.Ls5e
- .quad .Ls6,.Ls6e
- .quad .Ls7,.Ls7e
- .quad .Ls8,.Ls8e
- .quad .Ld5,.Ls5e
+ .quad .Ls5,.Ls5e /* Ls5-Ls8 have copied 32 bytes */
+ .quad .Ls6,.Ls5e
+ .quad .Ls7,.Ls5e
+ .quad .Ls8,.Ls5e
+ .quad .Ld5,.Ls5e /* Ld5-Ld8 have copied 32-56 bytes */
.quad .Ld6,.Ls6e
.quad .Ld7,.Ls7e
.quad .Ld8,.Ls8e
@@ -244,11 +244,8 @@ ENTRY(copy_user_generic_unrolled)
.quad .Le5,.Le_zero
.previous

- /* compute 64-offset for main loop. 8 bytes accuracy with error on the
- pessimistic side. this is gross. it would be better to fix the
- interface. */
/* eax: zero, ebx: 64 */
-.Ls1e: addl $8,%eax
+.Ls1e: addl $8,%eax /* eax is bytes left uncopied within the loop (Ls1e: 64 .. Ls8e: 8) */
.Ls2e: addl $8,%eax
.Ls3e: addl $8,%eax
.Ls4e: addl $8,%eax
diff --git a/arch/x86/lib/copy_user_nocache_64.S b/arch/x86/lib/copy_user_nocache_64.S
index 5196762..9d3d1ab 100644
--- a/arch/x86/lib/copy_user_nocache_64.S
+++ b/arch/x86/lib/copy_user_nocache_64.S
@@ -145,19 +145,19 @@ ENTRY(__copy_user_nocache)
/* table sorted by exception address */
.section __ex_table,"a"
.align 8
- .quad .Ls1,.Ls1e
- .quad .Ls2,.Ls2e
- .quad .Ls3,.Ls3e
- .quad .Ls4,.Ls4e
- .quad .Ld1,.Ls1e
+ .quad .Ls1,.Ls1e /* .Ls[1-4] - 0 bytes copied */
+ .quad .Ls2,.Ls1e
+ .quad .Ls3,.Ls1e
+ .quad .Ls4,.Ls1e
+ .quad .Ld1,.Ls1e /* .Ld[1-4] - 0..24 bytes coped */
.quad .Ld2,.Ls2e
.quad .Ld3,.Ls3e
.quad .Ld4,.Ls4e
- .quad .Ls5,.Ls5e
- .quad .Ls6,.Ls6e
- .quad .Ls7,.Ls7e
- .quad .Ls8,.Ls8e
- .quad .Ld5,.Ls5e
+ .quad .Ls5,.Ls5e /* .Ls[5-8] - 32 bytes copied */
+ .quad .Ls6,.Ls5e
+ .quad .Ls7,.Ls5e
+ .quad .Ls8,.Ls5e
+ .quad .Ld5,.Ls5e /* .Ld[5-8] - 32..56 bytes copied */
.quad .Ld6,.Ls6e
.quad .Ld7,.Ls7e
.quad .Ld8,.Ls8e
@@ -172,11 +172,8 @@ ENTRY(__copy_user_nocache)
.quad .Le5,.Le_zero
.previous

- /* compute 64-offset for main loop. 8 bytes accuracy with error on the
- pessimistic side. this is gross. it would be better to fix the
- interface. */
/* eax: zero, ebx: 64 */
-.Ls1e: addl $8,%eax
+.Ls1e: addl $8,%eax /* eax: bytes left uncopied: Ls1e: 64 .. Ls8e: 8 */
.Ls2e: addl $8,%eax
.Ls3e: addl $8,%eax
.Ls4e: addl $8,%eax



2008-06-17 21:15:15

by Linus Torvalds

[permalink] [raw]
Subject: Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)



On Tue, 17 Jun 2008, Andi Kleen wrote:
>
> The x86-64 copy_*_user functions were always designed to return errors
> both ways (as in both for load and for store).

That's not the problem, Andi.

The problem is that it returns THE WRONG VALUE!

If the fault happened on the second load, but the first load was never
actually paired up with a store (because of unrolling the loop), then YOU
MUST NOT CLAIM THAT YOU DID A 8-BYTE COPY! Because you have copied exactly
_zero_ bytes, even though you _loaded_ 8 bytes successfully!

See?

Claiming that you copied 8 bytes when you didn't do anything at all is
WRONG. It's so incredibly wrong that it is scary.

Linus

2008-06-17 21:15:40

by Andi Kleen

[permalink] [raw]
Subject: Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)

Linus Torvalds wrote:
>
> On Tue, 17 Jun 2008, Linus Torvalds wrote:
>> I actually am starting to think that the bug is in
>> __copy_to_user_inatomic_nocache().
>
> Confirmed.
>
> The uncached user copies are totally broken. The number of bytes left
> uncopied is simply wrong, because of how it does that unrolled loop and
> doesn't account for the fact that just doing loads does not actually
> increase the number of bytes copied at all.

How can a load fault legitimately in copy_to_user?

-Andi

2008-06-17 21:16:27

by Andi Kleen

[permalink] [raw]
Subject: Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)

Linus Torvalds wrote:
>
> On Tue, 17 Jun 2008, Linus Torvalds wrote:
>> I don't think that code is reasonably salvageable. Damn.
>
> Hmm. Something like this *may* salvage it.
>
> Untested, so far (I'll reboot and test soon enough), but even if it fixes
> things, it's not really very good.

If that fixes anything:
- The caller is broken because it shouldn't pass a faulting source to copy_to_user()
- And you broken copy_from_user error reporting which shares the same code

-Andi

2008-06-17 21:21:28

by Linus Torvalds

[permalink] [raw]
Subject: Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)



On Tue, 17 Jun 2008, Linus Torvalds wrote:
>
> Hmm. Something like this *may* salvage it.
>
> Untested, so far (I'll reboot and test soon enough), but even if it fixes
> things, it's not really very good.

Ok, so I just rebooted with this, and it does indeed fix the bug.

I'd be happier with a more complete fix (ie being byte-accurate and
actually doing the partial copy when it hits a fault in the middle), but
this seems to be the minimal fix, and at least fixes the totally bogus
return values from the x86-64 __copy_user*() functions.

Not that I checked that I got _all_ cases correct (and maybe there are
other versions of __copy_user that I missed entirely), but Bron's
test-case at least seems to work properly for me now.

Bron? If you have a more complete test-suite (ie the real-world case that
made you find this), it would be good to verify the whole thing.

And again, this is the kind of thing that really wants others looking at
it. I personally think that this thing should likely be written as C code,
with just the inner loops done as asm (ie the inner "rep movsq" and the
inner 64-byte unrolled cached/uncached copies done as inline asms, and
then the outer layers could be C).

Linus

2008-06-17 21:25:16

by Linus Torvalds

[permalink] [raw]
Subject: Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)



On Tue, 17 Jun 2008, Andi Kleen wrote:
>
> If that fixes anything:
> - The caller is broken because it shouldn't pass a faulting source to copy_to_user()
> - And you broken copy_from_user error reporting which shares the same code

Andi, I'm sorry I cc'd you. You are the author of that crap, but the bug
seems to be that you never even understood what copy_from_user() is
supposed to do.

The whole *and*only* reason for copy_to/from_user() existing AT ALL is
exactly the fact that the source or destination access can fault.

I don't really see why you continually start arguing about things that are
OBVIOUSLY BUGGY, as if they weren't buggy. Once somebody has debugged a
buggy routine, you shouldn't argue against it.

So here's a hint: next time I claim some code of yours is buggy, either
just acknowledge the bug, or stay silent. You'll look smarter that way.

Linus

2008-06-17 21:27:03

by Andi Kleen

[permalink] [raw]
Subject: Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)

Linus Torvalds wrote:
>
> On Tue, 17 Jun 2008, Andi Kleen wrote:
>> The x86-64 copy_*_user functions were always designed to return errors
>> both ways (as in both for load and for store).
>
> That's not the problem, Andi.
>
> The problem is that it returns THE WRONG VALUE!
>
> If the fault happened on the second load,

Loads are not supposed to fault in copy_to_user(). Only stores are.

The way it works is that it assumes that either loads fault (when used
as copy_from_user) or stores (copy_to_user), but never both.

> but the first load was never
> actually paired up with a store (because of unrolling the loop), then YOU
> MUST NOT CLAIM THAT YOU DID A 8-BYTE COPY! Because you have copied exactly
> _zero_ bytes, even though you _loaded_ 8 bytes successfully!
>
> See?
>
> Claiming that you copied 8 bytes when you didn't do anything at all is
> WRONG. It's so incredibly wrong that it is scary.

If your patch fixes something then the main wrong thing is the caller
who passes a faulting source address.

And again it always breaks the other case.

-Andi

2008-06-17 21:30:54

by Andi Kleen

[permalink] [raw]
Subject: Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)

Linus Torvalds wrote:
>
> On Tue, 17 Jun 2008, Andi Kleen wrote:
>> If that fixes anything:
>> - The caller is broken because it shouldn't pass a faulting source to copy_to_user()
>> - And you broken copy_from_user error reporting which shares the same code
>
> Andi, I'm sorry I cc'd you. You are the author of that crap, but the bug
> seems to be that you never even understood what copy_from_user() is
> supposed to do.
>
> The whole *and*only* reason for copy_to/from_user() existing AT ALL is
> exactly the fact that the source or destination access can fault.

yes, but only one of them (destination for copy_to_user and source for
copy_from_user)

Or are you're describing copy_in_user()?

> I don't really see why you continually start arguing about things that are
> OBVIOUSLY BUGGY, as if they weren't buggy. Once somebody has debugged a
> buggy routine, you shouldn't argue against it.
>
> So here's a hint: next time I claim some code of yours is buggy, either
> just acknowledge the bug, or stay silent. You'll look smarter that way.

Ok if I'm really wrong on this (but frankly I don't see the mistake, sorry)
for my person edification: what's a legitimate case for copy_to_user()
where the source can fault?

-Andi

2008-06-17 21:32:31

by Linus Torvalds

[permalink] [raw]
Subject: Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)



On Tue, 17 Jun 2008, Andi Kleen wrote:
>
> Loads are not supposed to fault in copy_to_user(). Only stores are.

Andi, just shut up already.

It is "copy_user". Notice the lack of "from" or "to". That code handles
*both* copy_to_user and copy_from_user.

> If your patch fixes something then the main wrong thing is the caller
> who passes a faulting source address.

Andi, SHUT THE F*CK UP. Read the code. Read the patch. Sorry for ever
involving you. I don't want to hear your idiotic whining.

Linus

2008-06-17 21:35:05

by Linus Torvalds

[permalink] [raw]
Subject: Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)



On Tue, 17 Jun 2008, Linus Torvalds wrote:
>
> It is "copy_user". Notice the lack of "from" or "to". That code handles
> *both* copy_to_user and copy_from_user.

Oh, and yes, I said "copy_to_user_inatomic()" by mistake. The fact is that
"write()" obviously does copy_from_user, but more importantly that on x86
it's THE EXACT SAME ROUTINE.

Linus

2008-06-17 21:35:28

by Andi Kleen

[permalink] [raw]
Subject: Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)

Linus Torvalds wrote:
>
> On Tue, 17 Jun 2008, Andi Kleen wrote:
>> Loads are not supposed to fault in copy_to_user(). Only stores are.
>
> Andi, just shut up already.
>
> It is "copy_user". Notice the lack of "from" or "to". That code handles
> *both* copy_to_user and copy_from_user.

Yes, but it assumes only one can fault at a time.

>
>> If your patch fixes something then the main wrong thing is the caller
>> who passes a faulting source address.
>
> Andi, SHUT THE F*CK UP. Read the code. Read the patch.

I did exactly the same patch first on your first email and then I realized
it was wrong before posting ;-)

> Sorry for ever
> involving you. I don't want to hear your idiotic whining.

The patch is wrong because it'll break the other case (in this case
copy_from_user)

-Andi

2008-06-17 21:36:32

by Al Viro

[permalink] [raw]
Subject: Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)

On Tue, Jun 17, 2008 at 02:24:39PM -0700, Linus Torvalds wrote:

> The whole *and*only* reason for copy_to/from_user() existing AT ALL is
> exactly the fact that the source or destination access can fault.

Erm... The reason for copy_to_user() existing is that dereferencing
a user-supplied address in the kernel does not have to access any
memory reachable in user mode, regardless of any faults. WTF are you
guys talking about?

2008-06-17 21:38:37

by Linus Torvalds

[permalink] [raw]
Subject: Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)



On Tue, 17 Jun 2008, Andi Kleen wrote:
>
> Ok if I'm really wrong on this (but frankly I don't see the mistake, sorry)
> for my person edification: what's a legitimate case for copy_to_user()
> where the source can fault?

Andi, read the patch already. And then shut up. I'm so tired of you always
trying to argue against fixes, just because you wrote the code.

The fact is, that code was buggy. I've explained _why_ it was buggy. I've
sent a patch. I've even talked about how I _tested_ the patch. You've read
none of that, you just fixated on the fact that I said "to" instead of
"from" by mistake, even though it's the exact same code.

At no point did you at all bother to read the patch, nor did you try to
understand the problem, nor did you follow the original report or try it
out, or think about it.

So tell me why I shouldn't just put you in my "idiots" filter? The
aggravation just isn't worth it for me.

When you grow up and can admit that your code was buggy, or at least
*look* at the patches, feel free to start sending me email again.

Until then, just don't bother.

Linus

2008-06-17 21:42:18

by Andi Kleen

[permalink] [raw]
Subject: Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)

Al Viro wrote:
> On Tue, Jun 17, 2008 at 02:24:39PM -0700, Linus Torvalds wrote:
>
>> The whole *and*only* reason for copy_to/from_user() existing AT ALL is
>> exactly the fact that the source or destination access can fault.
>
> Erm... The reason for copy_to_user() existing is that dereferencing
> a user-supplied address in the kernel does not have to access any
> memory reachable in user mode, regardless of any faults.

Well on x86 it is reachable, so it only handles faults.

> WTF are you
> guys talking about?

Linus seems to think that copy_to_user() should have
copy_in_user semantics(). It happens to be in some cases (when string instructions
are used), but not for the unrolled case.

What seems also confusing him is that x86-64 copy_from/to_user use a shared
subfunction. The trick that this subfunction uses is to assume that
either the destination faults or the source, but never both. It's legal
because the caller should never pass in a faulting source for copy to
or a faulting destination for copy from.

Actually they handle it, but the return value is not correct.

Now he "fixed" copy_to_user to return a kind of correct return value
for source faults, but it'll of course break copy_from_user()'s return value.

It's still unclear why his patch fixes the test case. The caller should
be using copy_in_user perhaps? Or is it just buggy by passing something
unmapped to copy_to_user?

-Andi

2008-06-17 21:47:55

by Linus Torvalds

[permalink] [raw]
Subject: Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)



On Tue, 17 Jun 2008, Andi Kleen wrote:
>
> The patch is wrong because it'll break the other case (in this case
> copy_from_user)

Ok, I'm now putting you in my idiots filter.

No, it will not break the other case. You're an idiot. Loading a value
without using it will not break anything, quite the reverse. What the
patch does is to _fix_ copy_from_user(), because if the second load traps,
then the fact that we did the first load IS IMMATERIAL, because its result
was never stored!

So the patch _fixes_ copy_from_user(), exactly because it says that even
if you've loaded 24 bytes, but you faulted on the fourth load, you've
still _copied_ exactly zero bytes, because you didn't actually store the
24 bytes you loaded.

And it is a no-op for copy_to_user, since copy_to_user will never fault on
the load (not the first one, not the second one, not _any_ load), so the
exception table entries for the loads are unusued.

IOW:
- the patch fixes a bug
- you refuse to acknowledge this
- I'll put you in my "flamers" filter that goes into another mailbox,
because it's not worth my time even arguing with you any more.

Sorry for ever adding you to the cc. I thought it might be a good idea,
since you were the author of the code. But clearly the bug was not because
you made a mistake, but because you simply don't seem to understand what
the function is supposed to return, and you're not even interested in
learning.

Linus

2008-06-17 21:50:57

by Linus Torvalds

[permalink] [raw]
Subject: Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)



On Tue, 17 Jun 2008, Andi Kleen wrote:
>
> Linus seems to think that copy_to_user() should have
> copy_in_user semantics().

No.

I *know* that copy_to/from_user() is supposed to return the number of
bytes not copied.

And if you loaded a value but didn't store it to the destination, then BY
DEFINITION you didn't copy it.

Comprende?

Obviously you don't.

Gaah. Why do I keep bothering to even _try_ to educate you? Can't you
accept that I just fixed a bug, which had a test-case an everything?

Andi, really. Take it from me. If I tell you something, I'm usually right.
Think about that for a moment, instead of living in your own little
dream-world where you think you understand everything.

Linus

2008-06-17 22:11:31

by Al Viro

[permalink] [raw]
Subject: Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)

On Tue, Jun 17, 2008 at 11:42:03PM +0200, Andi Kleen wrote:

> What seems also confusing him is that x86-64 copy_from/to_user use a shared
> subfunction. The trick that this subfunction uses is to assume that
> either the destination faults or the source, but never both. It's legal
> because the caller should never pass in a faulting source for copy to
> or a faulting destination for copy from.
>
> Actually they handle it, but the return value is not correct.
>
> Now he "fixed" copy_to_user to return a kind of correct return value
> for source faults, but it'll of course break copy_from_user()'s return value.
>
> It's still unclear why his patch fixes the test case. The caller should
> be using copy_in_user perhaps? Or is it just buggy by passing something
> unmapped to copy_to_user?

AFAICS, what happened is that b0rken copy_*FROM*_user() had been discussed
with references to copy_*TO*_user(). With proposed patch indeed not affecting
any legitimate calls of the latter. Does affect the former and that, from
my reading of the code in question, correctly.

IOW, s/copy_to_user/copy_from_user/ in Linus' postings upthread and they
make sense.

2008-06-17 22:22:07

by Andi Kleen

[permalink] [raw]
Subject: Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)


> AFAICS, what happened is that b0rken copy_*FROM*_user() had been discussed
> with references to copy_*TO*_user(). With proposed patch indeed not affecting
> any legitimate calls of the latter. Does affect the former and that, from
> my reading of the code in question, correctly.
>
> IOW, s/copy_to_user/copy_from_user/ in Linus' postings upthread and they
> make sense.

Yes, it makes some more sense, but I'm not completely happy with the fix
because it makes the fault point reporting very unreliable (maximum error
will be 63 instead of 7 now). iirc especially mount was sensitive to that.

Unfortunately fixing the accuracy is a little tricky.

-Andi

2008-06-18 02:27:33

by Bron Gondwana

[permalink] [raw]
Subject: Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)

On Tue, Jun 17, 2008 at 02:20:49PM -0700, Linus Torvalds wrote:
>
>
> On Tue, 17 Jun 2008, Linus Torvalds wrote:
> >
> > Hmm. Something like this *may* salvage it.
> >
> > Untested, so far (I'll reboot and test soon enough), but even if it fixes
> > things, it's not really very good.
>
> Ok, so I just rebooted with this, and it does indeed fix the bug.
>
> I'd be happier with a more complete fix (ie being byte-accurate and
> actually doing the partial copy when it hits a fault in the middle), but
> this seems to be the minimal fix, and at least fixes the totally bogus
> return values from the x86-64 __copy_user*() functions.
>
> Not that I checked that I got _all_ cases correct (and maybe there are
> other versions of __copy_user that I missed entirely), but Bron's
> test-case at least seems to work properly for me now.
>
> Bron? If you have a more complete test-suite (ie the real-world case that
> made you find this), it would be good to verify the whole thing.

I have a real world test case using "cyr_dbtool" from Cyrus 2.3.12 on a
known-bug-inducing piece of data (the key and value sizes in the example
code were taken from that. Indeed, the example code started off
building byte-for-byte identical files, then I changed it to write only
a single character across the entire key and value so the hexdump was
shorter)

I'll give it a go.

Bron.

2008-06-18 02:36:16

by Bron Gondwana

[permalink] [raw]
Subject: Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)

On Tue, Jun 17, 2008 at 10:08:24AM -0700, Linus Torvalds wrote:
> On Tue, 17 Jun 2008, Bron Gondwana wrote:
> >
> > And I appear to have sent the one without the usage comments at the top.
> > Here they are:
>
> Very interesting. There's certainly something there.
>
> That said, there's a distracting bug which is visible when doing an strace
>
> lseek(4, 140333890921392, SEEK_SET) = -1 EINVAL (Invalid argument)
> write(4, "\0\0\0\0", 4) = 4

Bah - crap. Sorry about that. I sent you the wrong version of the
code. That's what I get for not putting it in revision control.

That bug exists because I did an htonl on a value that I shouldn't
have.

Bron ( now to read the rest of the thread! )

2008-06-18 03:14:27

by Bron Gondwana

[permalink] [raw]
Subject: Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)

On Tue, Jun 17, 2008 at 02:20:49PM -0700, Linus Torvalds wrote:
> On Tue, 17 Jun 2008, Linus Torvalds wrote:
> >
> > Hmm. Something like this *may* salvage it.
> >
> > Untested, so far (I'll reboot and test soon enough), but even if it fixes
> > things, it's not really very good.
>
> Ok, so I just rebooted with this, and it does indeed fix the bug.
>
> I'd be happier with a more complete fix (ie being byte-accurate and
> actually doing the partial copy when it hits a fault in the middle), but
> this seems to be the minimal fix, and at least fixes the totally bogus
> return values from the x86-64 __copy_user*() functions.
>
> Not that I checked that I got _all_ cases correct (and maybe there are
> other versions of __copy_user that I missed entirely), but Bron's
> test-case at least seems to work properly for me now.
>
> Bron? If you have a more complete test-suite (ie the real-world case that
> made you find this), it would be good to verify the whole thing.

Ok - I pulled the latest linus-2.6 git, and discovered
the patch was already in there, so I just built and
rebooted (git 952f4a0a9b27e6dbd5d32e330b3f609ebfa0b061).

Confirmed - fixed in both the test code and the cyr_dbtool
test case I was using previously (I would have posted that
instead, but building cyrus is a bit of pain. You need
bdb and sasl and all sorts of extraneous crap - and
cyrusdb_skiplist.c depends on about half of Cyrus'
infrastructure, so I couldn't just pull it out by itself)

For my sins, I appear to be becoming the world expert on
that particular file. I've debugged skiplist bugs many
times over, and completely rewritten the locking code.
It really does some pretty evil things - the memory accesses
look something like this:

[file...................]
[mmap^....^.^........^^..................................]
[file...................++++++++++++]
[mmap^....^.^........^^.^^ ^ ^^.....................]

Where (^) is the bits that get accessed. All reads are via
the mmap, all writes are done with retry_write or
retry_writev (Cyrus library functions that keep hammering
until all the bytes are written)

I was suspecting as early as Friday night (we've been
debugging this one for a few days now!) that it was page
break related, because the bug only seemed to be appearing
on seen databases with really long seen lists (they're in
ranged integer format like 1:5,7:9,12,14:22,24:...).

It didn't help that at first we were only finding out about
cases where the corruption hit exactly on the "navigational
components", hence breaking the skiplist logic. And then
the backpointer writes would scribble all over the corrupt
area as well, so that made it even stranger to debug!


OK - so I'll report this issue to the Cyrus mailing list.
Warn people not to run on kernels 2.6.23 -> 2.6.25.7 with
x86_64 kernels. At least not without the skanky little
patch that I'm planning to post:

int magic = 0;
for (i = 0; i < maplen; i++) magic ^= mapbase[i];

Since I've tested that as a viable workaround!

Bron.

2008-06-18 04:04:33

by Linus Torvalds

[permalink] [raw]
Subject: Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)



On Wed, 18 Jun 2008, Bron Gondwana wrote:
>
> For my sins, I appear to be becoming the world expert on
> that particular file.

Heh. Congrats ;)

> I've debugged skiplist bugs many times over, and completely rewritten
> the locking code. It really does some pretty evil things - the memory
> accesses look something like this:
>
> [file...................]
> [mmap^....^.^........^^..................................]
> [file...................++++++++++++]
> [mmap^....^.^........^^.^^ ^ ^^.....................]
>
> Where (^) is the bits that get accessed. All reads are via
> the mmap, all writes are done with retry_write or
> retry_writev (Cyrus library functions that keep hammering
> until all the bytes are written)

Is there any reason it doesn't use mmap(MAP_SHARED) and make the
modifications that way too?

Because quite frankly, the mixture of doing mmap() and write() system
calls is quite fragile - and I'm not saying that just because of this
particular bug, but because there are all kinds of nasty cache aliasing
issues with virtually indexed caches etc that just fundamentally mean that
it's often a mistake to mix mmap with read/write at the same time.

(For the same reason it's not a good idea to mix writing through an mmap()
and then using read() to read it - again, you can have some nasty aliasing
going on there).

So this particular issue was definitely a kernel bug (and big thanks for
making such a good test-case), but in general, it does sound like Cyrus is
actively trying to dig itself into a nasty hole there.

Linus

2008-06-18 05:11:41

by Bron Gondwana

[permalink] [raw]
Subject: Cyrus mmap vs lseek/write usage - (WAS: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable))

On Tue, Jun 17, 2008 at 09:03:17PM -0700, Linus Torvalds wrote:
> On Wed, 18 Jun 2008, Bron Gondwana wrote:
> >
> > For my sins, I appear to be becoming the world expert on
> > that particular file.
>
> Heh. Congrats ;)
>
> > I've debugged skiplist bugs many times over, and completely rewritten
> > the locking code. It really does some pretty evil things - the memory
> > accesses look something like this:
> >
> > [file...................]
> > [mmap^....^.^........^^..................................]
> > [file...................++++++++++++]
> > [mmap^....^.^........^^.^^ ^ ^^.....................]
> >
> > Where (^) is the bits that get accessed. All reads are via
> > the mmap, all writes are done with retry_write or
> > retry_writev (Cyrus library functions that keep hammering
> > until all the bytes are written)
>
> Is there any reason it doesn't use mmap(MAP_SHARED) and make the
> modifications that way too?

Portability[tm].

It actually does use MAP_SHARED already, but only for reading.
Writing is all done with seeks and writes, followed by a map
"refresh", which is really just an unmmap/mmap if the file has
extended past the "SLOP" (between 8 and 16 k after the end of
the file length at last mapping).

I can't just right now find the place where the reasoning behind
this was explained to me. The theory I think was that mmap write
support is unreliable across systems, but read support is pretty
good (except HPUX for which there is map_stupidshared.c)

> Because quite frankly, the mixture of doing mmap() and write() system
> calls is quite fragile - and I'm not saying that just because of this
> particular bug, but because there are all kinds of nasty cache aliasing
> issues with virtually indexed caches etc that just fundamentally mean that
> it's often a mistake to mix mmap with read/write at the same time.

I'm not actually a maintainer for Cyrus, I just write patches to
keep our mail servers working. I'll pass this on.

> (For the same reason it's not a good idea to mix writing through an mmap()
> and then using read() to read it - again, you can have some nasty aliasing
> going on there).
>
> So this particular issue was definitely a kernel bug (and big thanks for
> making such a good test-case), but in general, it does sound like Cyrus is
> actively trying to dig itself into a nasty hole there.

Yeah, indeed.

I suspect the response from the Cyrus side might include a
small dose of "POSIX says it's valid to do this, and making
it work is the kernel programmers' lookout".

Ahh - I found the explaination in doc/internal/hacking in
the Cyrus source tree. While 'ack' is a nice tool, it
doesn't check files with no extention by default. Ho hum:

- map_refresh and map_free

- In many cases, it is far more effective to read a file via the operating
system's mmap facility than it is to via the traditional read() and
lseek system calls. To this end, Cyrus provides an operating system
independent wrapper around the mmap() services (or lack thereof) of the
operating system.

- Cyrus currently only supports read-only memory maps, all writes back
to a file need to be done via the more traditional facilities. This is
to enable very low-performance support for operating systems which do not
provide an mmap() facility via a fake userspace mmap.

- To create a map, simply call map_refresh on the map (details are in
lib/map.h). To free it, call map_free on the same map.

- Despite the fact that the maps are read-only, it is often useful to open
the file descriptors O_RDWR, especially if the file decriptors could
possibly be used for writing elsewhere in the code. Some operating
systems REQUIRE file descriptors that are mmap()ed to be opened
O_RDWR, so just do it.

If I was God in the Cyrus world (woot) I suspect I'd
provide some sort of OS independent wrapper around the
various write functions, using mmap where appropriate,
while still working via lseek/write for those systems
without mmap support.

(Added CC: Ken Murchison, on the grounds that he actually
is God in the Cyrus world)

Thanks for the good explaination. I'll have a look at the
Cyrus code and see just how tricky that would actually be
(even just doing the skiplist, index and cache code would
hit 99% of the cases where files are both mmaped and
written concurrently)

Bron.

2008-06-18 06:10:40

by Nick Piggin

[permalink] [raw]
Subject: Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)

On Wednesday 18 June 2008 07:46, Linus Torvalds wrote:
> On Tue, 17 Jun 2008, Andi Kleen wrote:

> So the patch _fixes_ copy_from_user(), exactly because it says that even
> if you've loaded 24 bytes, but you faulted on the fourth load, you've
> still _copied_ exactly zero bytes, because you didn't actually store the
> 24 bytes you loaded.

Yes, the new filemap.c code does not require an exact byte count, but
it won't work if there is an under-estimation of the number of bytes
left to copy.

The old filemap.c code actually also relies on the byte count in some
cases, I can't remember off the top of my head, but I *think* it was a
security measure to prevent uninitialized data leak.

Most other cases of course only care about complete success or not, but
there are others. filemap_xip, splice are a couple that pop up.

Thanks for working that all out before I even read my email :)

2008-06-18 06:23:03

by Nick Piggin

[permalink] [raw]
Subject: Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected , reproducable)

On Wednesday 18 June 2008 08:21, Andi Kleen wrote:
> > AFAICS, what happened is that b0rken copy_*FROM*_user() had been
> > discussed with references to copy_*TO*_user(). With proposed patch
> > indeed not affecting any legitimate calls of the latter. Does affect the
> > former and that, from my reading of the code in question, correctly.
> >
> > IOW, s/copy_to_user/copy_from_user/ in Linus' postings upthread and they
> > make sense.
>
> Yes, it makes some more sense, but I'm not completely happy with the fix
> because it makes the fault point reporting very unreliable (maximum error
> will be 63 instead of 7 now). iirc especially mount was sensitive to that.

It looks like mount does need an exact copy, so they've rolled their own
(exact_copy_from_user). I guess if you need an exact copy, then it doesn't
really matter how inexact an inexact one is, it's still unusable :)

All else being equal, a smaller maximum error is preferable, but surely
that is outweighed by the correctness issue of returning a valid number of
bytes left to operate on.

BTW. we already have lots (although steadily declining number) of corner
case issues around this whole area, but if we want to get really strict,
even an inexact report may be wrong for filemap.

Suppose we copy 10 bytes into the pagecache, but report that 5 were copied.
That means, we'll subsequently re-copy the delta. Between these two copies,
a 2nd writer might come in and write something over those 5 bytes. Then a
reader might see the following sequence of those 10 bytes
"0000000000"
"1111111111"
"2222222222"
"2222211111"

2008-06-18 16:23:31

by Linus Torvalds

[permalink] [raw]
Subject: Re: Cyrus mmap vs lseek/write usage - (WAS: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable))



On Wed, 18 Jun 2008, Bron Gondwana wrote:
> On Tue, Jun 17, 2008 at 09:03:17PM -0700, Linus Torvalds wrote:
> >
> > Is there any reason it doesn't use mmap(MAP_SHARED) and make the
> > modifications that way too?
>
> Portability[tm].

Hmm.. I'm pretty sure that using MAP_SHARED for writing is _more_ portable
than mixing mmap() and "write()" - or at least more _consistent_.

That said, it's probably six one way, and half a dozen the other. The
shared writable mmap() doesn't work well on unix-lookalikes (ie "not real
unix"). That does include really really old Linux versions (ie 1.x
series), but more relevantly probably includes things like QNX etc.

On the other hand, the mmap()+write(), as mentioned, doesn't work well on
various hardware platforms where theer can be cache aliases, and that
includes HP-UX (as you apparently have noticed), but I'm pretty certain
there are other cases too.

The cache alias issue can actually be really thorny, because it's going to
be very hard to see and essentially random: if your working set is big
enough (or the cache is small enough) that the cache basically gets
flushed between the write and the access through the mmap (and vice
versa), you'll never see any problems.

But then, _occasionally_, you'll have really hard-to-replicate corruption
due to cache aliases (ie you read something from the mmap() after the
write, but you don't actually see the newly written data, because it's
cached at a different virtual address).

Linux tries really hard to be coherent between mmap and read/write even on
those kinds of platforms, but I would definitely not call it "portable".
It really is a fundamentally nasty thing, and depends deeply on the CPU
architecture, not just the OS.

> It actually does use MAP_SHARED already, but only for reading.
> Writing is all done with seeks and writes, followed by a map
> "refresh", which is really just an unmmap/mmap if the file has
> extended past the "SLOP" (between 8 and 16 k after the end of
> the file length at last mapping).

Yeah, I can certainly see that working. That said, I can also see it
failing, partly because of the CPU virtual indexing cache issues, but
partly because it's such an unusual thing to do (partly because it simply
is known not to work on some systems, ie HP-UX). And that will mean that
it is probably not a well-tested path.. As you found out.

(Side note: I mention HP-UX just because it is known to historically have
totally and utterly brain-damaged and useless mmap support. It _may_ be
that they've fixed it in more modern versions. It literally used to be a
mix of horrible hardware problems - the virtual cache issue - _and_ a VM
system that was based on some really old BSD code).

So the more traditional way would be to do an all-mmap thing, and extend
the file with ftruncate(), not write. That's somethign that programs like
nntpd have been doing for decades, so it's a very "traditional" model and
thus much more likely to be safe. It also avoids all the aliasing issues,
if all accesses are done the same way.

That said, you _would_ need to have alternate strategies to access things,
but apparently Cyrus already has such strategies at least for HP-UX.

> Ahh - I found the explaination in doc/internal/hacking in
> the Cyrus source tree. While 'ack' is a nice tool, it
> doesn't check files with no extention by default. Ho hum:
>
> - map_refresh and map_free
>
> - In many cases, it is far more effective to read a file via the operating
> system's mmap facility than it is to via the traditional read() and
> lseek system calls. To this end, Cyrus provides an operating system
> independent wrapper around the mmap() services (or lack thereof) of the
> operating system.

One of the issues here is that in order to give coherency for mmap +
read/write access, the OS may need to map the area uncached or at least
flush caches when writing. So from a pure performance standpoint, it can
also cause problems.

Of course, even a uncached mmap() _can_ certainly be faster than using
just read()/write(), depending on the access patterns. So maybe Cyrus is
doing the rigth thing, it just sounds rather fragile and prone to
unexpected and hard-to-debug problems.

Linus

2008-06-19 00:06:17

by Robert Mueller

[permalink] [raw]
Subject: Re: Cyrus mmap vs lseek/write usage - (WAS: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable))


>> It actually does use MAP_SHARED already, but only for reading.
>> Writing is all done with seeks and writes, followed by a map
>> "refresh", which is really just an unmmap/mmap if the file has
>> extended past the "SLOP" (between 8 and 16 k after the end of the
>> file length at last mapping).
>
> So the more traditional way would be to do an all-mmap thing, and
> extend the file with ftruncate(), not write. That's somethign that
> programs like nntpd have been doing for decades, so it's a very
> "traditional" model and thus much more likely to be safe. It also
> avoids all the aliasing issues, if all accesses are done the same way.

As noted above, one thing cyrus does which does seem to be plain "wrong"
is that it mmaps a region greater the file size (rounds to an 8k
boundary, but 8k-16k past the current end of the file) and then assumes
that when it writes to the end of the file (but less than the end of the
mmap region) that there's no need to remmap and that data is immediately
available within the previous mmaped region.

Apparently that works on most OS's (but is what this bug actually
exposed), but according to the mmap docs:

---
If the size of the mapped file changes after the call to mmap() as a
result of some other operation on the mapped file, the effect of
references to portions of the mapped region that correspond to added or
removed portions of the file is unspecified.
---

The way I read that, even if you mmap a file with a size past the end of
the file currently, if you subsequently write to the end of that file,
you shouldn't assume that written data is available in the region you
previously mmaped, which cyrus definitely does do.

Amazingly (apart from HP/UX) no OS actually seems to have a problem with
this since there would be massive cyrus bug reports otherwise.

Rob

----------
[email protected]
Sign up at http://fastmail.fm for fast, ad free, IMAP accessible email

2008-06-19 00:21:18

by Linus Torvalds

[permalink] [raw]
Subject: Re: Cyrus mmap vs lseek/write usage - (WAS: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable))



On Thu, 19 Jun 2008, Robert Mueller wrote:
>
> As noted above, one thing cyrus does which does seem to be plain "wrong"
> is that it mmaps a region greater the file size (rounds to an 8k
> boundary, but 8k-16k past the current end of the file) and then assumes
> that when it writes to the end of the file (but less than the end of the
> mmap region) that there's no need to remmap and that data is immediately
> available within the previous mmaped region.

Pretty much any OS that tries to be make mmap() coherent with regular
read/write accesses will automatically also have to be coherent wrt file
size updates.

IOW, I don't think that cyrus is real any more "wrong" in this than in
assuming that you can mix read/write and mmap() accesses. In fact, I
suspect that Cyrus is probably _more_ conservative than most, in that it
would not be totally unheard of to just do a much bigger mmap(), and not
even bother to re-do it until the file grows past that size (ie no 8k/16k
granularity, but make it arbitrarily non-granular).

> Apparently that works on most OS's (but is what this bug actually
> exposed), but according to the mmap docs:
>
> ---
> If the size of the mapped file changes after the call to mmap() as a
> result of some other operation on the mapped file, the effect of
> references to portions of the mapped region that correspond to added or
> removed portions of the file is unspecified.

Note that if you really want to be portable, you simply must not mix
mmap() with *any* other operations without sprinking in a healthy amount
of "msync()" or unmapping/remapping entirely.

So _in_practice_ - because everybody tries to do a good job - you can
actually expect to have mmap() be coherent, even though there are no real
guarantees.

> Amazingly (apart from HP/UX) no OS actually seems to have a problem with
> this since there would be massive cyrus bug reports otherwise.

Yeah. Over the years, the pain from having a non-coherent mmap() generally
has pushed everybody into just making mmap() easy to use. Which means that
mixing things generally works fine, even if it is not at all _guaranteed_.

So I'd expect mmap+write to work and be coherent almost always. But it's
still a fairly unusual combination, and I would personally think that
using MAP_SHARED and writing through the mmap() would be the less
surprising model.

Linus

2008-10-03 12:06:08

by Bron Gondwana

[permalink] [raw]
Subject: BUG: mmapfile/writev spurious zero bytes still in the wild

On Tue, Jun 17, 2008 at 02:20:49PM -0700, Linus Torvalds wrote:

[reminder from way back: this bug was caused by writev containing
mmaped pages that weren't paged in, it's 64 bit only. It
particularly affects Cyrus Imapd's database formats]

> On Tue, 17 Jun 2008, Linus Torvalds wrote:
> >
> > Hmm. Something like this *may* salvage it.
> >
> > Untested, so far (I'll reboot and test soon enough), but even if it fixes
> > things, it's not really very good.
>
> Ok, so I just rebooted with this, and it does indeed fix the bug.
>
> I'd be happier with a more complete fix (ie being byte-accurate and
> actually doing the partial copy when it hits a fault in the middle), but
> this seems to be the minimal fix, and at least fixes the totally bogus
> return values from the x86-64 __copy_user*() functions.

Has this been revisited since? I haven't noticed, but I really only
skim LKML - have to save some time in the day for my real job[tm] of
keeping an email service running!

> Not that I checked that I got _all_ cases correct (and maybe there are
> other versions of __copy_user that I missed entirely), but Bron's
> test-case at least seems to work properly for me now.
>
> Bron? If you have a more complete test-suite (ie the real-world case that
> made you find this), it would be good to verify the whole thing.

It's been fine for us since, but unfortunately most of the world is
still running distribution "stable" kernels. I've just been helping a
user who's getting corrupted flat file databases on Ubuntu's stable 64
bit xen kernels, and it looks like it's the same issue.

Is there a standard way to tell backporters "you really need to add this
patch for your users' sanity"?

Bron ( I tried reporting it in Launchpad, but kept getting timeout
errors, so I figured reposting here might get noticed. Besides,
I can follow up on the "more complete fix" at the same time! )

2008-10-03 13:08:12

by Andrew Morton

[permalink] [raw]
Subject: Re: BUG: mmapfile/writev spurious zero bytes still in the wild

On Fri, 3 Oct 2008 21:44:14 +1000 Bron Gondwana <[email protected]> wrote:

> On Tue, Jun 17, 2008 at 02:20:49PM -0700, Linus Torvalds wrote:
>
> [reminder from way back: this bug was caused by writev containing
> mmaped pages that weren't paged in, it's 64 bit only. It
> particularly affects Cyrus Imapd's database formats]
>
> > On Tue, 17 Jun 2008, Linus Torvalds wrote:
> > >
> > > Hmm. Something like this *may* salvage it.
> > >
> > > Untested, so far (I'll reboot and test soon enough), but even if it fixes
> > > things, it's not really very good.
> >
> > Ok, so I just rebooted with this, and it does indeed fix the bug.
> >
> > I'd be happier with a more complete fix (ie being byte-accurate and
> > actually doing the partial copy when it hits a fault in the middle), but
> > this seems to be the minimal fix, and at least fixes the totally bogus
> > return values from the x86-64 __copy_user*() functions.
>
> Has this been revisited since? I haven't noticed, but I really only
> skim LKML - have to save some time in the day for my real job[tm] of
> keeping an email service running!
>
> > Not that I checked that I got _all_ cases correct (and maybe there are
> > other versions of __copy_user that I missed entirely), but Bron's
> > test-case at least seems to work properly for me now.
> >
> > Bron? If you have a more complete test-suite (ie the real-world case that
> > made you find this), it would be good to verify the whole thing.
>
> It's been fine for us since, but unfortunately most of the world is
> still running distribution "stable" kernels. I've just been helping a
> user who's getting corrupted flat file databases on Ubuntu's stable 64
> bit xen kernels, and it looks like it's the same issue.
>
> Is there a standard way to tell backporters "you really need to add this
> patch for your users' sanity"?

Yes, there is. We backport the patch into earlier kernel releases and
that action _should_ wake the distros up to take a look at the fix.

This particular fix (42a886af728c089df8da1b0017b0e7e6c81b5335) was
included in 2.6.26 and also is present in 2.6.25.17, but not 2.6.25.
So we did backport it into 2.6.25.x. Maybe distros were slow or errant
in picking up the patch.

2008-10-04 00:36:10

by Bron Gondwana

[permalink] [raw]
Subject: Re: BUG: mmapfile/writev spurious zero bytes still in the wild


On Fri, 3 Oct 2008 06:07:23 -0700, "Andrew Morton" <[email protected]> said:
> On Fri, 3 Oct 2008 21:44:14 +1000 Bron Gondwana <[email protected]>
> wrote:
> > Is there a standard way to tell backporters "you really need to add this
> > patch for your users' sanity"?
>
> Yes, there is. We backport the patch into earlier kernel releases and
> that action _should_ wake the distros up to take a look at the fix.
>
> This particular fix (42a886af728c089df8da1b0017b0e7e6c81b5335) was
> included in 2.6.26 and also is present in 2.6.25.17, but not 2.6.25.
> So we did backport it into 2.6.25.x. Maybe distros were slow or errant
> in picking up the patch.

It went into 2.6.25.8, which was the stable line at the time. It
didn't get backported to 2.6.24. Ubuntu's stable line is called
2.6.24-19.41, and I'm guessing they just didn't realise it was
a fix that actually needs to be applied to everything back to at
least 2.6.23.

I'll go try launchpad again and see if it's unbroken enough to
accept my bug report.

Bron ( really not wanting to keep fielding Cyrus questions on this
for the 4 1/2 years support remaining on that LTS distro! )
--
Bron Gondwana
[email protected]