LinuxLists.cc - 2.4.18 no timestamp update on modified mmapped files

2002-06-11 05:33:57

Subject: 2.4.18 no timestamp update on modified mmapped files

fd = open("foo", O_RDWR);
map = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
... modify the mapped pages ...
munmap(map, size);
close(fd);

The timestamp on foo is not updated, even though the contents have
changed. Adding msync(map, size, MS_[A]SYNC) before munmap makes no
difference. 2.4.19-pre10 has no obvious fixes for this problem.

I was tearing my hair out wondering why some files were not being
rsynced. No change on size or timestamp tells rsync that the file is
"unchanged". I had to add a dummy write(fd, map, 1) to force a
timestamp update.
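
A minimal sketch of the sequence above as a self-contained check, assuming a POSIX.1-2008 system. The function name and the backdated timestamp are illustrative and not part of the original report; futimens() is newer than the kernels discussed here and is used only to make any mtime change unambiguous.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/*
 * Store one byte through a MAP_SHARED mapping and report whether st_mtime
 * moved: 1 = updated, 0 = unchanged (the behaviour reported in this thread),
 * -1 = setup error. Hypothetical helper, for illustration only.
 */
int mtime_moves_after_mmap_store(const char *path)
{
	const size_t size = 4096;
	/* Backdate atime and mtime so that any update is unambiguous. */
	const struct timespec old[2] = {
		{ .tv_sec = 1000000000, .tv_nsec = 0 },
		{ .tv_sec = 1000000000, .tv_nsec = 0 },
	};
	struct stat before, after;
	char *map;
	int fd, ok;

	assert(path != NULL);
	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;
	ok = ftruncate(fd, (off_t)size) == 0 &&
	     futimens(fd, old) == 0 &&
	     fstat(fd, &before) == 0;
	map = ok ? mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0)
		 : MAP_FAILED;
	if (map == MAP_FAILED) {
		close(fd);
		return -1;
	}
	map[0] = 'x';			/* the only modification: a store via the mapping */
	ok = msync(map, size, MS_SYNC) == 0;
	ok = munmap(map, size) == 0 && ok;
	ok = fstat(fd, &after) == 0 && ok;
	close(fd);
	if (!ok)
		return -1;
	return after.st_mtime != before.st_mtime;
}
```

A return value of 0 would correspond to the symptom described in this message; what a given kernel and filesystem actually return has to be observed, not assumed.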

2002-06-16 16:35:39

by Hugh Dickins

Subject: Re: 2.4.18 no timestamp update on modified mmapped files

On Sun, 16 Jun 2002, Kevin Easton wrote:
>
> So... the difference on i386 is just the definitions of the protection_map
> entries that are used.. specifically that PAGE_SHARED in asm-i386/pgtable.h
> includes _PAGE_RW? Changing this definition to be the same as the PAGE_COPY
> definition would be one fix?

It _could_ be _part_ of a fix (to the "problem" of dirty unbacked
pages arriving unheralded at the filesystem, too late to find
space for them; but our reluctance to have read faults allocate).

I say "could" because it would tend to cause double (read then write)
faulting more widely than necessary; I say "part" because at present
do_wp_page expects to be handling private Copy-On-Write faults rather
than shared mappings (please correct me if I'm wrong, Russell); and
we would still need to implement a callout down to the filesystem
(e.g. "wppage" method I suggested) to allocate the space (though,
doing it on the cheap, that method could be "nopage" revisited).

And I put quotes around "problem" because I'm uncertain how seriously
to take it, and we've had no chorus of anxious developers and users.

Hugh

2002-06-11 06:13:59

by Andrew Morton

Subject: Re: 2.4.18 no timestamp update on modified mmapped files

Keith Owens wrote:
>
> fd = open("foo", O_RDWR);
> map = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
> ... modify the mapped pages ...
> munmap(map, size);
> close(fd);
>
> The timestamp on foo is not updated, even though the contents have
> changed. Adding msync(map, size, MS_[A]SYNC) before munmap makes no
> difference. 2.4.19-pre10 has no obvious fixes for this problem.
>
> I was tearing my hair out wondering why some files were not being
> rsynced. No change on size or timestamp tells rsync that the file is
> "unchanged". I had to add a dummy write(fd, map, 1) to force a
> timestamp update.

Well it's very easy to fix. And this (untested) patch will
actually speed the kernel up (lots) because it has the smarter
mtime update logic. But note that the mtime update won't
occur for a (potentially infinitely) long time after your
program has exited. Unless you used msync.

What do the standards say?

http://www.opengroup.org/onlinepubs/007904975/functions/mmap.html

The st_ctime and st_mtime fields of a file that is mapped with MAP_SHARED
and PROT_WRITE shall be marked for update at some point in the interval
between a write reference to the mapped region and the next call to msync() with
MS_ASYNC or MS_SYNC for that portion of the file by any process. If there is
no such call and if the underlying file is modified as a result of a write reference,
then these fields shall be marked for update at some time after the write reference.

This patch does exactly that. Guess I should test it.

--- 2.4.19-pre10/mm/vmscan.c~mmap-mtime Mon Jun 10 23:00:28 2002
+++ 2.4.19-pre10-akpm/mm/vmscan.c Mon Jun 10 23:04:34 2002
@@ -405,6 +405,9 @@ static int shrink_cache(int nr_pages, zo
 				page_cache_get(page);
 				spin_unlock(&pagemap_lru_lock);
 
+				if (!PageSwapCache(page))
+					update_mtime(page->mapping->host);
+
 				writepage(page);
 				page_cache_release(page);
 
--- 2.4.19-pre10/mm/filemap.c~mmap-mtime Mon Jun 10 23:00:31 2002
+++ 2.4.19-pre10-akpm/mm/filemap.c Mon Jun 10 23:05:17 2002
@@ -551,6 +551,9 @@ int filemap_fdatasync(struct address_spa
 	int ret = 0;
 	int (*writepage)(struct page *) = mapping->a_ops->writepage;
 
+	if (!list_empty(&mapping->dirty_pages))
+		update_mtime(mapping->host);
+
 	spin_lock(&pagecache_lock);
 
 	while (!list_empty(&mapping->dirty_pages)) {
@@ -3026,8 +3029,7 @@ generic_file_write(struct file *file,con
 		goto out;
 
 	remove_suid(inode);
-	inode->i_ctime = inode->i_mtime = CURRENT_TIME;
-	mark_inode_dirty_sync(inode);
+	update_mtime(inode);
 
 	if (file->f_flags & O_DIRECT)
 		goto o_direct;
--- 2.4.19-pre10/fs/inode.c~mmap-mtime Mon Jun 10 23:02:18 2002
+++ 2.4.19-pre10-akpm/fs/inode.c Mon Jun 10 23:03:18 2002
@@ -1195,6 +1195,16 @@ void update_atime (struct inode *inode)
 	mark_inode_dirty_sync (inode);
 } /* End Function update_atime */
 
+void update_mtime(struct inode *inode)
+{
+	time_t now = CURRENT_TIME;
+
+	if (inode->i_ctime != now || inode->i_mtime != now) {
+		inode->i_ctime = now;
+		inode->i_mtime = now;
+		mark_inode_dirty_sync(inode);
+	}
+}
 
 /*
  * Quota functions that want to walk the inode lists..
--- 2.4.19-pre10/include/linux/fs.h~mmap-mtime Mon Jun 10 23:02:21 2002
+++ 2.4.19-pre10-akpm/include/linux/fs.h Mon Jun 10 23:03:41 2002
@@ -200,7 +200,8 @@ extern int leases_enable, dir_notify_ena
 #include <asm/semaphore.h>
 #include <asm/byteorder.h>
 
-extern void update_atime (struct inode *);
+void update_atime(struct inode *);
+void update_mtime(struct inode *);
 #define UPDATE_ATIME(inode) update_atime (inode)
 
 extern void buffer_init(unsigned long);
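
The "smarter mtime update logic" in update_mtime() above can be modelled in user space. This is only a sketch: toy_inode, toy_mark_inode_dirty_sync and toy_update_mtime are hypothetical stand-ins for the kernel structures, and the current time is passed in explicitly so the behaviour is deterministic.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical stand-in for the few struct inode fields the patch touches. */
struct toy_inode {
	time_t i_ctime;
	time_t i_mtime;
	unsigned dirtied;	/* counts calls to the mark-dirty stand-in */
};

static void toy_mark_inode_dirty_sync(struct toy_inode *inode)
{
	inode->dirtied++;
}

/*
 * Same shape as update_mtime() in the patch: the inode is only dirtied
 * when the one-second timestamp would actually change.
 */
void toy_update_mtime(struct toy_inode *inode, time_t now)
{
	assert(inode != NULL);
	if (inode->i_ctime != now || inode->i_mtime != now) {
		inode->i_ctime = now;
		inode->i_mtime = now;
		toy_mark_inode_dirty_sync(inode);
	}
}
```

With one-second granularity, any number of calls within the same second dirty the inode once, whereas the two lines removed from generic_file_write() dirtied it on every call.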

2002-06-11 06:30:04

by Keith Owens

Subject: Re: 2.4.18 no timestamp update on modified mmapped files

On Mon, 10 Jun 2002 23:17:27 -0700,
Andrew Morton <[email protected]> wrote:
>Keith Owens wrote:
>>
>> fd = open("foo", O_RDWR);
>> map = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
>> ... modify the mapped pages ...
>> munmap(map, size);
>> close(fd);
>>
>> The timestamp on foo is not updated, even though the contents have
>> changed. Adding msync(map, size, MS_[A]SYNC) before munmap makes no
>> difference. 2.4.19-pre10 has no obvious fixes for this problem.

>What do the standards say?
>
>http://www.opengroup.org/onlinepubs/007904975/functions/mmap.html
>
> The st_ctime and st_mtime fields of a file that is mapped with MAP_SHARED
> and PROT_WRITE shall be marked for update at some point in the interval
> between a write reference to the mapped region and the next call to msync() with
> MS_ASYNC or MS_SYNC for that portion of the file by any process. If there is
> no such call and if the underlying file is modified as a result of a write reference,
> then these fields shall be marked for update at some time after the write reference.

That says nothing about a file where the only updates are via mmap. My
file had grown to its final size so there were no more writes, only
pages being dirtied via mmap.

2002-06-11 06:36:10

by Andrew Morton

Subject: Re: 2.4.18 no timestamp update on modified mmapped files

Andrew Morton wrote:
>
> ...
> --- 2.4.19-pre10/mm/vmscan.c~mmap-mtime Mon Jun 10 23:00:28 2002
> +++ 2.4.19-pre10-akpm/mm/vmscan.c Mon Jun 10 23:04:34 2002
> @@ -405,6 +405,9 @@ static int shrink_cache(int nr_pages, zo
>  				page_cache_get(page);
>  				spin_unlock(&pagemap_lru_lock);
>
> +				if (!PageSwapCache(page))
> +					update_mtime(page->mapping->host);
> +
>  				writepage(page);
>  				page_cache_release(page);
>

Actually, calling mark_inode_dirty here could well cause
the complex filesystems to explode under high VM pressure.

Probably, we should only touch the file inside msync()
(in 2.4 kernels, at least).

2002-06-11 06:45:26

by Andrew Morton

Subject: Re: 2.4.18 no timestamp update on modified mmapped files

Keith Owens wrote:
>
> On Mon, 10 Jun 2002 23:17:27 -0700,
> Andrew Morton <[email protected]> wrote:
> >Keith Owens wrote:
> >>
> >> fd = open("foo", O_RDWR);
> >> map = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
> >> ... modify the mapped pages ...
> >> munmap(map, size);
> >> close(fd);
> >>
> >> The timestamp on foo is not updated, even though the contents have
> >> changed. Adding msync(map, size, MS_[A]SYNC) before munmap makes no
> >> difference. 2.4.19-pre10 has no obvious fixes for this problem.
>
> >What do the standards say?
> >
> >http://www.opengroup.org/onlinepubs/007904975/functions/mmap.html
> >
> > The st_ctime and st_mtime fields of a file that is mapped with MAP_SHARED
> > and PROT_WRITE shall be marked for update at some point in the interval
> > between a write reference to the mapped region and the next call to msync() with
> > MS_ASYNC or MS_SYNC for that portion of the file by any process. If there is
> > no such call and if the underlying file is modified as a result of a write reference,
> > then these fields shall be marked for update at some time after the write reference.
>
> That says nothing about a file where the only updates are via mmap. My
> file had grown to its final size so there were no more writes, only
> pages being dirtied via mmap.

It is specifically referring to updates via mmap! "a write reference
to the mapped region". This is the mmap documentation.

What you want is what the standard says we should do. And we aren't
doing it. Your application should perform msync(MS_ASYNC) and the
mtime should be updated.
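
For applications that have to run on kernels where msync() does not update the timestamps, one possible user-space workaround is to set them explicitly after flushing. The sketch below assumes POSIX.1-2008 futimens(), which is newer than the 2.4 kernels discussed here; flush_and_touch is a hypothetical helper name, and the approach plays the same role as the dummy write() mentioned earlier in the thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Flush a shared mapping, then set the file's timestamps explicitly.
 * Returns 0 on success, -1 with errno set on failure.
 */
int flush_and_touch(int fd, void *map, size_t len)
{
	assert(map != NULL && len > 0);
	if (msync(map, len, MS_SYNC) < 0)
		return -1;
	/* A NULL times argument sets both atime and mtime to the current time. */
	return futimens(fd, NULL);
}
```

Call it before munmap() and close(); it does not depend on whether the kernel itself marks the timestamps for update.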

2002-06-11 07:07:45

by Keith Owens

Subject: Re: 2.4.18 no timestamp update on modified mmapped files

On Mon, 10 Jun 2002 23:49:02 -0700,
Andrew Morton <[email protected]> wrote:
>Keith Owens wrote:
>> On Mon, 10 Jun 2002 23:17:27 -0700,
>> Andrew Morton <[email protected]> wrote:
>> > The st_ctime and st_mtime fields of a file that is mapped with MAP_SHARED
>> > and PROT_WRITE shall be marked for update at some point in the interval
>> > between a write reference to the mapped region and the next call to msync() with
>> > MS_ASYNC or MS_SYNC for that portion of the file by any process. If there is
>> > no such call and if the underlying file is modified as a result of a write reference,
>> > then these fields shall be marked for update at some time after the write reference.
>>
>> That says nothing about a file where the only updates are via mmap. My
>> file had grown to its final size so there were no more writes, only
>> pages being dirtied via mmap.
>
>It is specifically referring to updates via mmap! "a write reference
>to the mapped region". This is the mmap documentation.

I saw "write reference" and my brain translated that to "write()". I
blame the long weekend.

2002-06-11 07:24:57

by Andrew Morton

Subject: Re: 2.4.18 no timestamp update on modified mmapped files

Keith Owens wrote:
>
> On Mon, 10 Jun 2002 23:49:02 -0700,
> Andrew Morton <[email protected]> wrote:
> >Keith Owens wrote:
> >> On Mon, 10 Jun 2002 23:17:27 -0700,
> >> Andrew Morton <[email protected]> wrote:
> >> > The st_ctime and st_mtime fields of a file that is mapped with MAP_SHARED
> >> > and PROT_WRITE shall be marked for update at some point in the interval
> >> > between a write reference to the mapped region and the next call to msync() with
> >> > MS_ASYNC or MS_SYNC for that portion of the file by any process. If there is
> >> > no such call and if the underlying file is modified as a result of a write reference,
> >> > then these fields shall be marked for update at some time after the write reference.
> >>
> >> That says nothing about a file where the only updates are via mmap. My
> >> file had grown to its final size so there were no more writes, only
> >> pages being dirtied via mmap.
> >
> >It is specifically referring to updates via mmap! "a write reference
> >to the mapped region". This is the mmap documentation.
>
> I saw "write reference" and my brain translated that to "write()". I
> blame the long weekend.

That'll be a left-brain/write-brain thing.

I think it's too late to fix this in 2.4. If we did, a person
could develop and test an application on 2.4.21, ship it, then
find that it fails on millions of 2.4.17 machines.

2002-06-11 09:12:03

by Hugh Dickins

Subject: Re: 2.4.18 no timestamp update on modified mmapped files

On Tue, 11 Jun 2002, Andrew Morton wrote:
>
> I think it's too late to fix this in 2.4. If we did, a person
> could develop and test an application on 2.4.21, ship it, then
> find that it fails on millions of 2.4.17 machines.

Oh, please reconsider that! Doesn't loss of modification time
approach data loss? Surely we'll continue to fix any data loss
issues in 2.4, and be grateful if you fixed this mmap modtime loss.

Hugh

2002-06-11 09:30:19

by Andrew Morton

Subject: Re: 2.4.18 no timestamp update on modified mmapped files

Hugh Dickins wrote:
>
> On Tue, 11 Jun 2002, Andrew Morton wrote:
> >
> > I think it's too late to fix this in 2.4. If we did, a person
> > could develop and test an application on 2.4.21, ship it, then
> > find that it fails on millions of 2.4.17 machines.
>
> Oh, please reconsider that! Doesn't loss of modification time
> approach data loss? Surely we'll continue to fix any data loss
> issues in 2.4, and be grateful if you fixed this mmap modtime loss.
>

Oh that's easy - I'll complete the patch and let Marcelo worry
about it.

2002-06-11 09:51:58

by Hugh Dickins

Subject: Re: 2.4.18 no timestamp update on modified mmapped files

On Tue, 11 Jun 2002, Andrew Morton wrote:
>
> Oh that's easy - I'll complete the patch and let Marcelo worry
> about it.

Thanks!
Hugh

2002-06-11 18:09:32

by Robert Love

Subject: Re: 2.4.18 no timestamp update on modified mmapped files

On Tue, 2002-06-11 at 00:28, Andrew Morton wrote:

> That'll be a left-brain/write-brain thing.

Illegal instruction!

I hope everyone else saw this and just refrained from comment. Thanks
for making my morning ;-)

Robert Love

2002-06-12 07:20:39

by Alan Cox

Subject: Re: 2.4.18 no timestamp update on modified mmapped files

> On Tue, 11 Jun 2002, Andrew Morton wrote:
> >
> > I think it's too late to fix this in 2.4. If we did, a person
> > could develop and test an application on 2.4.21, ship it, then
> > find that it fails on millions of 2.4.17 machines.
>
> Oh, please reconsider that! Doesn't loss of modification time
> approach data loss? Surely we'll continue to fix any data loss
> issues in 2.4, and be grateful if you fixed this mmap modtime loss.

It doesn't approach data loss; when doing incremental backups it
*is* data loss. Ditto with rsync --newer.

2002-06-12 07:52:40

by Andrew Morton

Subject: Re: 2.4.18 no timestamp update on modified mmapped files

Alan Cox wrote:
>
> > On Tue, 11 Jun 2002, Andrew Morton wrote:
> > >
> > > I think it's too late to fix this in 2.4. If we did, a person
> > > could develop and test an application on 2.4.21, ship it, then
> > > find that it fails on millions of 2.4.17 machines.
> >
> > Oh, please reconsider that! Doesn't loss of modification time
> > approach data loss? Surely we'll continue to fix any data loss
> > issues in 2.4, and be grateful if you fixed this mmap modtime loss.
>
> It doesn't approach data loss; when doing incremental backups it
> *is* data loss. Ditto with rsync --newer.

A more serious form of data loss occurs when an application has a shared
mapping over a sparse file. If the filesystem is out of space when
the VM decides to write back some pages, your data simply gets dropped
on the floor. Even a subsequent msync() won't tell you that you have
a shiny new bunch of zeroes in your file.

It's not simple to fix. Approaches might be:

1: Map the page to disk at fault time, generate SIGBUS on
ENOSPC (the standards don't seem to address this issue, and
this is a non-standard overload of SIGBUS).

2: Resurrect delayed-allocation patches, use their reservation
API to generate the SIGBUS.

3: Record the fact that there has been a data loss in the mapping
and return that information to a subsequent msync(). (I have
most-of-a-patch for this. It's fairly murky).

4: Dirty the page again if writepage() failed. This fills the machine
up with unfreeable pages, but emitting ENOSPC messages into the
logs may be acceptable - the operator makes some space, the data
gets written and the messages stop.
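
None of the four approaches exists in the kernels under discussion, so an application that cannot afford the silent loss has to avoid sparse mappings itself. One possible sketch, assuming posix_fallocate() is available: allocate the blocks before mapping, so that ENOSPC is reported up front instead of at writeback time. map_preallocated is a hypothetical helper name.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create `path` with `size` bytes of allocated (non-sparse) storage and map
 * it shared. Returns 0 and stores the mapping in *out, or returns an errno
 * value such as ENOSPC.
 */
int map_preallocated(const char *path, size_t size, void **out)
{
	void *map;
	int fd, err;

	assert(path != NULL && out != NULL && size > 0);
	fd = open(path, O_RDWR | O_CREAT, 0600);
	if (fd < 0)
		return errno;
	/* posix_fallocate() returns the error number directly, not via errno. */
	err = posix_fallocate(fd, 0, (off_t)size);
	if (err == 0) {
		map = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
		if (map == MAP_FAILED)
			err = errno;
		else
			*out = map;
	}
	close(fd);	/* the mapping stays valid after close() */
	return err;
}
```

This trades disk space for certainty: the file is never sparse, so writeback of a dirtied page does not need a new block allocation.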

2002-06-12 14:53:20

by Hugh Dickins

Subject: Re: 2.4.18 no timestamp update on modified mmapped files

On Wed, 12 Jun 2002, Andrew Morton wrote:
>
> A more serious form of data loss occurs when an application has a shared
> mapping over a sparse file. If the filesystem is out of space when
> the VM decides to write back some pages, your data simply gets dropped
> on the floor. Even a subsequent msync() won't tell you that you have
> a shiny new bunch of zeroes in your file.
>
> It's not simple to fix. Approaches might be:
>
> 1: Map the page to disk at fault time, generate SIGBUS on
> ENOSPC (the standards don't seem to address this issue, and
> this is a non-standard overload of SIGBUS).

I've looked at this issue in the past: it's a familiar problem
for various filesystems on various flavours of UNIX. Some of the
strangeness in tmpfs (shmem_recalc_inode, or ac's shmem_removepage)
can be traced to this issue, I believe. The filesystem does not
know when a clean page is dirtied, and somehow has to cope afterwards.

I believe your option 1 is closest to the right direction; and SIGBUS
is entirely appropriate, I don't see it as a non-standard overload.

But you didn't spell out the worst news on that option: read faults
into a read-only shared mapping of a file which the application had
open for read-write when it mmapped: the page must be mapped to disk
at read fault time (because the mapping just might be mprotected for
read-write later on, and the page then dirtied).

Most apps would have opened the file read-only anyway, and no
problem then. Perhaps it's acceptable to penalize those that don't;
but it does seem distasteful to have to desparsify a file when
accessing it through a read-only mapping.

What I wanted was a "wppage" method (I seem to recall stealing the
name from a different now defunct method) in vm_operations_struct:
int (*wppage)(struct vm_area_struct * area, struct page * page, int write_now);

This was called by do_no_page after calling FS nopage, when mapping
shared writable, the write_now flag true if write fault. If write_now,
FS wppage would return success if page already backed, or hole now backed,
and do_no_page would give write permission to the pte, failure SIGBUS;
if not write_now, FS wppage could decide for itself whether to back
hole now (success) or defer (failure, no SIGBUS, but write permission
withheld from pte). Code in do_wp_page to call FS wppage again if
shared mapping, to allocate if deferred. Code in mprotect_fixup to
withhold write permission from shared mapping if FS has wppage method.

Hugh

2002-06-13 02:25:35

by jw schultz

Subject: Re: 2.4.18 no timestamp update on modified mmapped files

On Wed, Jun 12, 2002 at 03:52:34PM +0100, Hugh Dickins wrote:
> On Wed, 12 Jun 2002, Andrew Morton wrote:
> >
> > A more serious form of data loss occurs when an application has a shared
> > mapping over a sparse file. If the filesystem is out of space when
> > the VM decides to write back some pages, your data simply gets dropped
> > on the floor. Even a subsequent msync() won't tell you that you have
> > a shiny new bunch of zeroes in your file.
> >
> > It's not simple to fix. Approaches might be:
> >
> > 1: Map the page to disk at fault time, generate SIGBUS on
> > ENOSPC (the standards don't seem to address this issue, and
> > this is a non-standard overload of SIGBUS).
>
> I believe your option 1 is closest to the right direction; and SIGBUS
> is entirely appropriate, I don't see it as a non-standard overload.

I concur that #1 is closest. I'd prefer it to happen on a
write fault rather than a read fault, but the frequency with which
this should occur is low enough that I wouldn't sweat it.

It is a non-standard overload of SIGBUS. SIGBUS is to
indicate an unaligned memory access or otherwise malformed
address. Many confuse SIGBUS with SIGSEGV because they are
usually symptoms of the same problems but a SIGSEGV is to
indicate memory protection violation (unresolvable page
fault) which is not the same as a malformed address. I
believe Linux, at least on x86, maps both errors to SIGSEGV.
I would think SIGXFSZ might be a better fit.

>
> But you didn't spell out the worst news on that option: read faults
> into a read-only shared mapping of a file which the application had
> open for read-write when it mmapped: the page must be mapped to disk
> at read fault time (because the mapping just might be mprotected for
> read-write later on, and the page then dirtied).
>
>

--
________________________________________________________________
J.W. Schultz Pegasystems Technologies
email address: [email protected]

Remember Cernan and Schmitt

2002-06-13 09:59:00

by Hugh Dickins

Subject: Re: 2.4.18 no timestamp update on modified mmapped files

On Wed, 12 Jun 2002, jw schultz wrote:
> On Wed, Jun 12, 2002 at 03:52:34PM +0100, Hugh Dickins wrote:
> > On Wed, 12 Jun 2002, Andrew Morton wrote:
> > >
> > > 1: Map the page to disk at fault time, generate SIGBUS on
> > > ENOSPC (the standards don't seem to address this issue, and
> > > this is a non-standard overload of SIGBUS).
> >
> > I believe your option 1 is closest to the right direction; and SIGBUS
> > is entirely appropriate, I don't see it as a non-standard overload.
>
> I concur that #1 is closest. I'd prefer it to happen on a
> write fault rather than a read fault, but the frequency with which
> this should occur is low enough that I wouldn't sweat it.
>
> It is a non-standard overload of SIGBUS. SIGBUS is to
> indicate an unaligned memory access or otherwise malformed
> address. Many confuse SIGBUS with SIGSEGV because they are
> usually symptoms of the same problems but a SIGSEGV is to
> indicate memory protection violation (unresolvable page
> fault) which is not the same as a malformed address. I
> believe Linux, at least on x86 maps both errors to SIGSEGV.
> I would think SIGXFSZ might be a better fit.

No. I think you're looking back too far in UNIX history.
I imagine SIGBUS was originally defined as you describe,
but got hijacked by the inventors of the mmap system call
(only a limited number of signals available). That overload
has been enshrined in standards for ten(?) years.

SIGSEGV is used where mapping itself cannot be accessed (no mapping
or insufficient permission); SIGBUS where mapped object cannot be
accessed - I/O error or, more usually, beyond end of (last page of)
file. Linux just follows the standards on those.

It would be inappropriate to use anything but SIGBUS for no space.

Hugh
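
The SIGSEGV/SIGBUS distinction described above can be observed from user space. The following is a sketch for Linux, assuming that recovering from a synchronous fault with siglongjmp() is acceptable for a demonstration; probe and probe_mapping are hypothetical helper names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf fault_jmp;
static volatile sig_atomic_t fault_signo;

static void on_fault(int signo)
{
	fault_signo = signo;
	siglongjmp(fault_jmp, 1);
}

/* Load from or store to *addr; return the signal raised, or 0 if none. */
static int probe(volatile char *addr, int store)
{
	struct sigaction sa, old_bus, old_segv;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_fault;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGBUS, &sa, &old_bus);
	sigaction(SIGSEGV, &sa, &old_segv);

	fault_signo = 0;
	if (sigsetjmp(fault_jmp, 1) == 0) {
		if (store)
			*addr = 1;
		else
			(void)*addr;
	}
	sigaction(SIGBUS, &old_bus, NULL);
	sigaction(SIGSEGV, &old_segv, NULL);
	return fault_signo;
}

/*
 * Map two pages read-only over a one-page file, then touch either the
 * first page or the page past end-of-file. Returns the signal number
 * seen, 0 for no fault, or -1 on setup error.
 */
int probe_mapping(const char *path, int past_eof, int store)
{
	long page = sysconf(_SC_PAGESIZE);
	char *map;
	int fd, sig;

	assert(path != NULL && page > 0);
	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t)page) < 0) {
		close(fd);
		return -1;
	}
	map = mmap(NULL, 2 * (size_t)page, PROT_READ, MAP_SHARED, fd, 0);
	close(fd);
	if (map == MAP_FAILED)
		return -1;
	sig = probe(map + (past_eof ? page : 0), store);
	munmap(map, 2 * (size_t)page);
	return sig;
}
```

A load from the page beyond end-of-file hits the "mapped object cannot be accessed" case, while a store to the first page of this PROT_READ mapping hits the "insufficient permission" case.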

2002-06-13 10:40:11

by jw schultz

Subject: Re: 2.4.18 no timestamp update on modified mmapped files

On Thu, Jun 13, 2002 at 10:58:39AM +0100, Hugh Dickins wrote:
> On Wed, 12 Jun 2002, jw schultz wrote:
> > On Wed, Jun 12, 2002 at 03:52:34PM +0100, Hugh Dickins wrote:
> > > On Wed, 12 Jun 2002, Andrew Morton wrote:
> > > >
> > > > 1: Map the page to disk at fault time, generate SIGBUS on
> > > > ENOSPC (the standards don't seem to address this issue, and
> > > > this is a non-standard overload of SIGBUS).
> > >
> > > I believe your option 1 is closest to the right direction; and SIGBUS
> > > is entirely appropriate, I don't see it as a non-standard overload.
> >
> > I concur that #1 is closest. I'd prefer it to happen on a
> > write fault rather than a read fault, but the frequency with which
> > this should occur is low enough that I wouldn't sweat it.
> >
> > It is a non-standard overload of SIGBUS. SIGBUS is to
> > indicate an unaligned memory access or otherwise malformed
> > address. Many confuse SIGBUS with SIGSEGV because they are
> > usually symptoms of the same problems but a SIGSEGV is to
> > indicate memory protection violation (unresolvable page
> > fault) which is not the same as a malformed address. I
> > believe Linux, at least on x86 maps both errors to SIGSEGV.
> > I would think SIGXFSZ might be a better fit.
>
> No. I think you're looking back too far in UNIX history.
> I imagine SIGBUS was originally defined as you describe,
> but got hijacked by the inventors of the mmap system call
> (only a limited number of signals available). That overload
> has been enshrined in standards for ten(?) years.

Perhaps. I guess I'm showing my age, but I remember when the
hardware MMU would generate a "bus error", and more than
once the distinction between bus error and segmentation
violation actually pointed to the programming error. If so,
it is way past time to alias SIGBUS and deprecate the old
name. We're also overdue for fixing the manpages, where
signal(7) defines SIGBUS as "Bus error (bad memory access)",
which has nothing to do with space availability.

> SIGSEGV is used where mapping itself cannot be accessed (no mapping
> or insufficient permission); SIGBUS where mapped object cannot be
> accessed - I/O error or, more usually, beyond end of (last page of)
> file. Linux just follows the standards on those.
>
> It would be inappropriate to use anything but SIGBUS for no space.

I would dispute that, as the only signal that even hints at
out-of-space is SIGXFSZ, which is why I mentioned it. Since
we are going beyond POSIX and SUS we could use an altogether
new signal, but that is a much bigger discussion. SIGFULL,
anyone? I seem to recall a discussion regarding a mechanism
that would allow notifying processes that memory is tight,
but that signal should be ignored by default rather than
terminate the process, so it should not be shared with this.

The main thing is the signal should be catchable and by
default should terminate without core. As long as a corrupt
pointer doesn't cause the same signal as running out of
space, I'm OK with it.

I've said my piece and I'll shut up now.

--
________________________________________________________________
J.W. Schultz Pegasystems Technologies
email address: [email protected]

Remember Cernan and Schmitt

2002-06-13 19:12:09

by Pavel Machek

Subject: Re: 2.4.18 no timestamp update on modified mmapped files

Hi!

> > >It is specifically referring to updates via mmap! "a write reference
> > >to the mapped region". This is the mmap documentation.
> >
> > I saw "write reference" and my brain translated that to "write()". I
> > blame the long weekend.
>
> That'll be a left-brain/write-brain thing.
>
> I think it's too late to fix this in 2.4. If we did, a person
> could develop and test an application on 2.4.21, ship it, then
> find that it fails on millions of 2.4.17 machines.

It is a bug, so it should be fixed. If someone develops on FreeBSD then
he'll be surprised it does not work on Linux... Stable should mean "only
bugfixes" and this certainly looks like a bug.

--
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.

2002-06-15 05:25:04

by Kevin Easton

Subject: Re: 2.4.18 no timestamp update on modified mmapped files

> On Wed, 12 Jun 2002, Andrew Morton wrote:
> >
> > A more serious form of data loss occurs when an application has a shared
> > mapping over a sparse file. If the filesystem is out of space when
> > the VM decides to write back some pages, your data simply gets dropped
> > on the floor. Even a subsequent msync() won't tell you that you have
> > a shiny new bunch of zeroes in your file.
> >
> > It's not simple to fix. Approaches might be:
> >
> > 1: Map the page to disk at fault time, generate SIGBUS on
> > ENOSPC (the standards don't seem to address this issue, and
> > this is a non-standard overload of SIGBUS).
>
>
> I've looked at this issue in the past: it's a familiar problem
> for various filesystems on various flavours of UNIX. Some of the
> strangeness in tmpfs (shmem_recalc_inode, or ac's shmem_removepage)
> can be traced to this issue, I believe. The filesystem does not
> know when a clean page is dirtied, and somehow has to cope afterwards.
>
>
> I believe your option 1 is closest to the right direction; and SIGBUS
> is entirely appropriate, I don't see it as a non-standard overload.
>
>
> But you didn't spell out the worst news on that option: read faults
> into a read-only shared mapping of a file which the application had
> open for read-write when it mmapped: the page must be mapped to disk
> at read fault time (because the mapping just might be mprotected for
> read-write later on, and the page then dirtied).
>

Can't the page be mapped to disk at the page-dirtying-fault time? I
was under the impression that even after the mapping has been mprotected
for read-write, the first write to each page will still cause a page
fault that results in the page being marked dirty.

- Kevin.

2002-06-15 08:10:34

by Hugh Dickins

Subject: Re: 2.4.18 no timestamp update on modified mmapped files

On Sat, 15 Jun 2002, Kevin Easton wrote:
> On Wed, 12 Jun 2002, Hugh Dickins wrote:
> >
> > But you didn't spell out the worst news on that option: read faults
> > into a read-only shared mapping of a file which the application had
> > open for read-write when it mmapped: the page must be mapped to disk
> > at read fault time (because the mapping just might be mprotected for
> > read-write later on, and the page then dirtied).
>
> Can't the page be mapped to disk at the page-dirtying-fault time? I
> was under the impression that even after the mapping has been mprotected
> for read-write, the first write to each page will still cause a page
> fault that results in the page being marked dirty.

It depends on the history of the mapping. mprotect() does not fault in
any new pages, it just changes permissions on page table entries already
present. So, if you're talking about a fresh mapping, or an area of a
mapping which has not yet been accessed, you're correct. And you're
correct if you're talking about a private mapping (which needs write
protection to do copy-on-write). But those aren't cases of concern here.

In general, there will already be some page table entries present,
and mprotect() from shared readonly to readwrite currently adds write
permission to those entries, and no write fault will then occur on
first write to those pages. I was suggesting that we'd need to change
that (to the behaviour you expect) if we were trying to guarantee disk
space for unbacked dirty pages (without allocating on read fault).

(I'm referring above to the implementation in Linux 2.4 or 2.5:
I've not checked other releases or OSes, which could indeed arrange
permissions so that there's always a page-dirtying fault.)

Hugh

2002-06-15 09:12:39

by Kevin Easton

Subject: Re: 2.4.18 no timestamp update on modified mmapped files

On Sat, 15 Jun 2002, Hugh Dickins wrote:
> On Sat, 15 Jun 2002, Kevin Easton wrote:
> > On Wed, 12 Jun 2002, Hugh Dickins wrote:
> > >
> > > But you didn't spell out the worst news on that option: read faults
> > > into a read-only shared mapping of a file which the application had
> > > open for read-write when it mmapped: the page must be mapped to disk
> > > at read fault time (because the mapping just might be mprotected for
> > > read-write later on, and the page then dirtied).
> >
> > Can't the page be mapped to disk at the page-dirtying-fault time? I
> > was under the impression that even after the mapping has been mprotected
> > for read-write, the first write to each page will still cause a page
> > fault that results in the page being marked dirty.
>
> It depends on the history of the mapping. mprotect() does not fault in
> any new pages, it just changes permissions on page table entries already
> present. So, if you're talking about a fresh mapping, or an area of a
> mapping which has not yet been accessed, you're correct. And you're
> correct if you're talking about a private mapping (which needs write
> protection to do copy-on-write). But those aren't cases of concern here.
>
> In general, there will already be some page table entries present,
> and mprotect() from shared readonly to readwrite currently adds write
> permission to those entries, and no write fault will then occur on
> first write to those pages. I was suggesting that we'd need to change
> that (to the behaviour you expect) if we were trying to guarantee disk
> space for unbacked dirty pages (without allocating on read fault).
>
> (I'm referring above to the implementation in Linux 2.4 or 2.5:
> I've not checked other releases or OSes, which could indeed arrange
> permissions so that there's always a page-dirtying fault.)
>
> Hugh
>

Hmm.. so how do such pages get marked dirty on architectures that don't
do it in hardware ("most RISC architectures" according to a comment in
memory.c)? Is the entire mapping made dirty when the write permissions
are added?

- Kevin

2002-06-15 09:23:56

by Russell King

Subject: Re: 2.4.18 no timestamp update on modified mmapped files

On Sat, Jun 15, 2002 at 07:12:30PM +1000, Kevin Easton wrote:
> Hmm.. so how do such pages get marked dirty on architectures that don't
> do it in hardware ("most RISC architectures" according to a comment in
> memory.c)? Is the entire mapping made dirty when the write permissions
> are added?

No. You only give user space write access when the write access _and_
"Linux dirty bit" are set. This means you fault when user space tries
to write to the page, which means you can set the dirty bit. This is
what the following code is doing (if write_access is required and the
pte already has write permission, then set the dirty bit):

	if (write_access) {
		if (!pte_write(entry))
			return do_wp_page(mm, vma, address, pte, pmd, entry);

		entry = pte_mkdirty(entry);
	}

--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html

2002-06-16 04:35:19

by Kevin Easton

Subject: Re: 2.4.18 no timestamp update on modified mmapped files

On Sat, 15 Jun 2002, Hugh Dickins wrote:
> On Sat, 15 Jun 2002, Russell King wrote:
> > On Sat, Jun 15, 2002 at 07:12:30PM +1000, Kevin Easton wrote:
> > > Hmm.. so how do such pages get marked dirty on architectures that don't
> > > do it in hardware ("most RISC architectures" according to a comment in
> > > memory.c)? Is the entire mapping made dirty when the write permissions
> > > are added?
> >
> > No. You only give user space write access when the write access _and_
> > "Linux dirty bit" are set. This means you fault when user space tries
> > to write to the page, which means you can set the dirty bit. This is
> > what the following code is doing (if write_access is required and the
> > pte already has write permission, then set the dirty bit):
> >
> > 	if (write_access) {
> > 		if (!pte_write(entry))
> > 			return do_wp_page(mm, vma, address, pte, pmd, entry);
> >
> > 		entry = pte_mkdirty(entry);
> > 	}
>
> Thanks, Russell. Sorry, Kevin: I wasn't even thinking about
> non-i86 cases, add that to the list of disclaimers I put in.
>
> Hugh

So... the difference on i386 is just the definitions of the protection_map
entries that are used.. specifically that PAGE_SHARED in asm-i386/pgtable.h
includes _PAGE_RW? Changing this definition to be the same as the PAGE_COPY
definition would be one fix?

- Kevin.