2004-09-15 12:31:25

by Andrea Arcangeli

[permalink] [raw]
Subject: truncate shows non zero data beyond the end of the inode with MAP_SHARED

Hello,

I've been told we're not posix compliant the way we handle MAP_SHARED
on the last page of the inode. Basically after we map the page into
userspace people can make the data beyond the i_size non-zero and we
should clear it in the transition from page_mapcount 1 -> 0. The bug
is that if you truncate-extend, the new data will not be guaranteed to
be zero.

msync + power outage and writing to the page with sys_write at the same
time it's being mapped (and in turn queueing it for pdflush writeout)
are the two worst offeners. To fix those we'd need to mark the pte
readonly, flush the tlb with a worst-case IPI broadcast, writepage, then
mark the pte read-write and flush the tlb again with another IPI
broadcast.

That is going to have a significant cost methinks. So maybe we shouldn't
fix it after all...


2004-09-15 13:53:26

by Alan

[permalink] [raw]
Subject: Re: truncate shows non zero data beyond the end of the inode with MAP_SHARED

On Mer, 2004-09-15 at 13:29, Andrea Arcangeli wrote:
> I've been told we're not posix compliant the way we handle MAP_SHARED
> on the last page of the inode. Basically after we map the page into
> userspace people can make the data beyond the i_size non-zero and we
> should clear it in the transition from page_mapcount 1 -> 0. The bug
> is that if you truncate-extend, the new data will not be guaranteed to
> be zero.

I've heard this a couple of times but in fact SuS v3 says

--
If the size of the mapped file changes after the call to mmap() as a
result of some other operation on the mapped file, the effect of
references to portions of the mapped region that correspond to added or
removed portions of the file is unspecified.
--

Note it says "after the call to mmap" not after the file size change.

The guarantees it does make are:
- After the mmap completes the bytes in the end area after EOF are zero
- The bytes in question are not written back to the file
[although a file size change hits the first quoted rule]

Essentially the rule is "don't extend the file on writes to the gap"

BTW: there is no such thing as a useful SuSv3 implementation of mmap
because the documentation contains an error which requires every
reference to the mmapped object cause a bus error 8)



2004-09-15 21:05:29

by William Lee Irwin III

[permalink] [raw]
Subject: Re: truncate shows non zero data beyond the end of the inode with MAP_SHARED

On Wed, Sep 15, 2004 at 02:29:20PM +0200, Andrea Arcangeli wrote:
> I've been told we're not posix compliant the way we handle MAP_SHARED
> on the last page of the inode. Basically after we map the page into
> userspace people can make the data beyond the i_size non-zero and we
> should clear it in the transition from page_mapcount 1 -> 0. The bug
> is that if you truncate-extend, the new data will not be guaranteed to
> be zero.
> msync + power outage and writing to the page with sys_write at the same
> time it's being mapped (and in turn queueing it for pdflush writeout)
> are the two worst offeners. To fix those we'd need to mark the pte
> readonly, flush the tlb with a worst-case IPI broadcast, writepage, then
> mark the pte read-write and flush the tlb again with another IPI
> broadcast.
> That is going to have a significant cost methinks. So maybe we shouldn't
> fix it after all...

Zeroing the final partial page during expanding truncate (flushing TLB)
sounds like a reasonable half measure; we don't do anything at the moment.
We could even, say, do a pass of pte shootdown given try_to_unmap() and
force minor faults to make all this less likely.


-- wli

2004-09-15 21:53:45

by Andrew Morton

[permalink] [raw]
Subject: Re: truncate shows non zero data beyond the end of the inode with MAP_SHARED

William Lee Irwin III <[email protected]> wrote:
>
> Zeroing the final partial page during expanding truncate (flushing TLB)
> sounds like a reasonable half measure; we don't do anything at the moment.

Sure about that? block_truncate_page() gets called.

2004-09-15 22:01:37

by William Lee Irwin III

[permalink] [raw]
Subject: Re: truncate shows non zero data beyond the end of the inode with MAP_SHARED

William Lee Irwin III <[email protected]> wrote:
>> Zeroing the final partial page during expanding truncate (flushing TLB)
>> sounds like a reasonable half measure; we don't do anything at the moment.

On Wed, Sep 15, 2004 at 02:55:24PM -0700, Andrew Morton wrote:
> Sure about that? block_truncate_page() gets called.

So it does; then the hard parts are what's biting aa.


-- wli

2004-09-15 22:08:23

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: truncate shows non zero data beyond the end of the inode with MAP_SHARED

On Wed, Sep 15, 2004 at 02:55:24PM -0700, Andrew Morton wrote:
> William Lee Irwin III <[email protected]> wrote:
> >
> > Zeroing the final partial page during expanding truncate (flushing TLB)
> > sounds like a reasonable half measure; we don't do anything at the moment.
>
> Sure about that? block_truncate_page() gets called.

block_truncate_page is executed on the _new_ partial page, what we're
talking about here is the _old_ partial page, the partial page before
calling truncate. That one isn't zeroed out, and zeroing it out would
require marking it dirty too since the garbage could be flushed to disk
already. That's why I was sticking with a solution that would leave
truncate a no-I/O operation as far as a data is concerned (and in
general a solution that would never generate any further I/O compared to
current kernel).

Probably we can ignore this thanks to Alan's feedback.

(also note, we should talk about partial blocks here, not partial pages,
partial pages isn't the issue if the i_size is softblocksize aligned,
but let's assume 4k softblocksize on a x86 or x86-64 for clarity so that
pages is the same as softblocksize ;)

2004-09-15 22:13:31

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: truncate shows non zero data beyond the end of the inode with MAP_SHARED

On Wed, Sep 15, 2004 at 02:01:06PM -0700, William Lee Irwin III wrote:
> Zeroing the final partial page during expanding truncate (flushing TLB)
> sounds like a reasonable half measure; we don't do anything at the moment.

I was only investingating solutions that would guarantee to never write
non-zero data on-disk in the last partial block of the inode, and in
turn solutions that would never trigger suprious I/O on disk.

Your suggestion is the other way around, that is we keep writing
non-zero in the half block, but exactly because of that, your "Zeroing
the final partial page", will have to mark the page dirty after that
(and probably trigger a COW if it was a MAP_PRIVATE). Which means
_I/O_. So I think it'd be worse than my solution I suggested that
doesn't risk to write anything zero beyond the end of the file, and that
guarantees no suprious I/O will ever happen. IMHO I/O can be a lot more
costly than a double objrmap-driven pagetable walk + tlb flush.

However even doing the double pagetable walk + tlbflush around
writepages of partial pages, isn't nice at all. So I'm not sure if we've
to fix this at all.

I understood from Alan we probably shouldn't care to be posix compliant
(he nicely pointed out we're already SuSv3 compliant, which sounds good
enough) and we can keep going fast.

2004-09-15 22:13:32

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: truncate shows non zero data beyond the end of the inode with MAP_SHARED

On Wed, Sep 15, 2004 at 03:00:16PM -0700, William Lee Irwin III wrote:
> William Lee Irwin III <[email protected]> wrote:
> >> Zeroing the final partial page during expanding truncate (flushing TLB)
> >> sounds like a reasonable half measure; we don't do anything at the moment.
>
> On Wed, Sep 15, 2004 at 02:55:24PM -0700, Andrew Morton wrote:
> > Sure about that? block_truncate_page() gets called.
>
> So it does; then the hard parts are what's biting aa.

block_truncate_page is unrelated with this issue, it's called on the
_new_ partial page generated by truncate, not on the _old_ partial page
that is being extended to be a _full_ page (with garbage inside between
the old_i_size and PAGE_ALIGN(old_i_size)).

(again s/page/softblock/)

2004-09-16 08:44:51

by Helge Hafting

[permalink] [raw]
Subject: Re: truncate shows non zero data beyond the end of the inode with MAP_SHARED

Andrea Arcangeli wrote:

>On Wed, Sep 15, 2004 at 03:00:16PM -0700, William Lee Irwin III wrote:
>
>
>>William Lee Irwin III <[email protected]> wrote:
>>
>>
>>>>Zeroing the final partial page during expanding truncate (flushing TLB)
>>>>sounds like a reasonable half measure; we don't do anything at the moment.
>>>>
>>>>
>>On Wed, Sep 15, 2004 at 02:55:24PM -0700, Andrew Morton wrote:
>>
>>
>>>Sure about that? block_truncate_page() gets called.
>>>
>>>
>>So it does; then the hard parts are what's biting aa.
>>
>>
>
>block_truncate_page is unrelated with this issue, it's called on the
>_new_ partial page generated by truncate, not on the _old_ partial page
>that is being extended to be a _full_ page (with garbage inside between
>the old_i_size and PAGE_ALIGN(old_i_size)).
>
>
Could this "garbage" possibly be confidential data?
I.e. one user repeatedly makes and mmaps a 1-byte file,
extends it to 4k, and looks at the 4095 bytes of "garbage".
Maybe he finds some "interesting stuff" when someone else's
confidential file just got dropped from pagecache
so he could mmap this 1-byte file?

Helge Hafting

2004-09-16 14:37:38

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: truncate shows non zero data beyond the end of the inode with MAP_SHARED

On Thu, Sep 16, 2004 at 10:49:33AM +0200, Helge Hafting wrote:
> Could this "garbage" possibly be confidential data?

I don't buy much in this theory.

> I.e. one user repeatedly makes and mmaps a 1-byte file,
> extends it to 4k, and looks at the 4095 bytes of "garbage".
> Maybe he finds some "interesting stuff" when someone else's
> confidential file just got dropped from pagecache
> so he could mmap this 1-byte file?

the old data got flushed below the i_size anyways, it sounds very
strange that confidential data is present only over the i_size and not
below the i_size, and if this guy has confidential data below the i_size
then it'd better memset the whole page. And in theory nobody should touch
the data over the i_size even if mmap allows to map it.

2004-09-17 13:45:24

by Helge Hafting

[permalink] [raw]
Subject: Re: truncate shows non zero data beyond the end of the inode with MAP_SHARED

Andrea Arcangeli wrote:

>On Thu, Sep 16, 2004 at 10:49:33AM +0200, Helge Hafting wrote:
>
>
>>Could this "garbage" possibly be confidential data?
>>
>>
>
>I don't buy much in this theory.
>
>
>
>>I.e. one user repeatedly makes and mmaps a 1-byte file,
>>extends it to 4k, and looks at the 4095 bytes of "garbage".
>>Maybe he finds some "interesting stuff" when someone else's
>>confidential file just got dropped from pagecache
>>so he could mmap this 1-byte file?
>>
>>
>
>the old data got flushed below the i_size anyways, it sounds very
>strange that confidential data is present only over the i_size and not
>below the i_size, and if this guy has confidential data below the i_size
>then it'd better memset the whole page. And in theory nobody should touch
>the data over the i_size even if mmap allows to map it.
>
>
I am not talking about someone accidentally stumbling onto
something. I was worried about someone deliberately
trying to exploit this - such people look at data above i_size
_because they can_, hoping to find something interesting there.
Something they cannot get at normally.

I am assuming that the "garbage" between i_size and the
page boundary is stuff left over from whatever that
memory page was used for earlier? If so, it could be
4095 bytes out of the 4096 that was used to cache some
other file earlier. Possibly someone else's confidential file.
Or a piece of some network package that was processed a while ago.

Helge Hafting


2004-09-17 13:53:42

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: truncate shows non zero data beyond the end of the inode with MAP_SHARED

On Fri, Sep 17, 2004 at 03:49:18PM +0200, Helge Hafting wrote:
> I am assuming that the "garbage" between i_size and the
> page boundary is stuff left over from whatever that
> memory page was used for earlier? If so, it could be
> 4095 bytes out of the 4096 that was used to cache some
> other file earlier. Possibly someone else's confidential file.
> Or a piece of some network package that was processed a while ago.

I see what you mean now, but this is not the case, the new page is
cleared by block_truncate_page, as Andrew pointed out. So when we get a
new partial page we zero out the content beyond the end of the inode.
It's when we extend again that we don't do anything on such partial page
previously processed and zeroed-out by block_truncate_page. So there
can't be some network data processed a while ago, or other random memory
content. It's just up to the application to be correct and not to write
its own data beyond the end of the i_size if it doesn't want this data
to hit disk (and other data will be written anyways below the i_size,
which has the same issue). So basically I don't see any security issue.

2004-09-17 13:54:31

by William Lee Irwin III

[permalink] [raw]
Subject: Re: truncate shows non zero data beyond the end of the inode with MAP_SHARED

On Fri, Sep 17, 2004 at 03:49:18PM +0200, Helge Hafting wrote:
> I am not talking about someone accidentally stumbling onto
> something. I was worried about someone deliberately
> trying to exploit this - such people look at data above i_size
> _because they can_, hoping to find something interesting there.
> Something they cannot get at normally.
> I am assuming that the "garbage" between i_size and the
> page boundary is stuff left over from whatever that
> memory page was used for earlier? If so, it could be
> 4095 bytes out of the 4096 that was used to cache some
> other file earlier. Possibly someone else's confidential file.
> Or a piece of some network package that was processed a while ago.

This issue is only of userspace data leaking into pagecache at file
offsets just beyond where the end of a file formerly was (no further
than the former last page, but beyond the former end of the file), not
kernel data leaking to userspace.


-- wli