Hello kernel hackers!
The current implementation of mmap() in the kernel is very convenient.
It allows one to mmap(fd) a very large amount of memory with a small file as the back-end.
So one can mmap() 100 MiB on an empty file, use the first 10 KiB of memory, munmap() and have
only 10 KiB of file at the end. And while working with the memory, the file will automatically be
grown by read/write memory requests.
The question is: can a user-space application rely on this behavior (I failed to find any
documentation about this)?
TIA and please CC me in replies.
--
Eugene V. Lyubimkin aka JackYF
On Mon, 2008-11-03 at 23:57 +0200, Eugene V. Lyubimkin wrote:
> Hello kernel hackers!
>
> The current implementation of mmap() in the kernel is very convenient.
> It allows one to mmap(fd) a very large amount of memory with a small file as the back-end.
> So one can mmap() 100 MiB on an empty file, use the first 10 KiB of memory, munmap() and have
> only 10 KiB of file at the end. And while working with the memory, the file will automatically be
> grown by read/write memory requests.
>
> The question is: can a user-space application rely on this behavior (I failed to find any
> documentation about this)?
>
> TIA and please CC me in replies.
mmap() writes past the end of the file should not grow the file if I
understand things right, but produce a SIGBUS (after the first page size
alignment).
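A minimal sketch of how this could be observed from user space, assuming Linux and a writable /tmp. The helper name probe_past_eof() and the temporary file name are made up for illustration. It maps two pages over a 10-byte file, stores once inside the page that holds EOF and once into the next page, and records what happened.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static sigjmp_buf sigbus_env;

/* Jump back into the probe instead of retrying the faulting store. */
static void on_sigbus(int signo)
{
	(void)signo;
	siglongjmp(sigbus_env, 1);
}

/*
 * Hypothetical helper. Returns a bit mask, or -1 on setup failure:
 *   bit 0: a store past EOF but inside the page holding EOF succeeded
 *   bit 1: a store to a page wholly past EOF raised SIGBUS
 *   bit 2: the file was still 10 bytes long afterwards
 */
int probe_past_eof(void)
{
	char path[] = "/tmp/mmap-eof-XXXXXX";
	long page = sysconf(_SC_PAGESIZE);
	struct sigaction sa, old;
	struct stat st;
	volatile int result = 0;
	volatile char *p;
	void *map;
	int fd;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);		/* the file lives only as long as fd */
	if (write(fd, "0123456789", 10) != 10) {
		close(fd);
		return -1;
	}
	map = mmap(NULL, 2 * (size_t)page, PROT_READ | PROT_WRITE,
		   MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		close(fd);
		return -1;
	}
	p = map;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_sigbus;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGBUS, &sa, &old);

	if (sigsetjmp(sigbus_env, 1) == 0) {
		p[100] = 'x';	/* past EOF, but inside the page holding EOF */
		result |= 1;
		p[page] = 'x';	/* first byte of a page wholly past EOF */
	} else if (result & 1) {
		result |= 2;	/* the second store raised SIGBUS */
	}

	sigaction(SIGBUS, &old, NULL);
	munmap(map, 2 * (size_t)page);
	if (fstat(fd, &st) == 0 && st.st_size == 10)
		result |= 4;	/* the mapping did not grow the file */
	close(fd);
	return result;
}
```

If the behaviour is as described above, the helper should return 7: the first store is silently accepted, the second one raises SIGBUS, and the file keeps its original size.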
The exact interaction of mmap() and truncate() I'm not exactly clear on.
The safe way to do things is to first create your file of at least the
size you mmap, using truncate. This will create a sparse file, and will
on any sane filesystem not take more space than its metadata.
Thereafter you can fill it with writes to the mmap.
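The recipe above could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper name write_through_mapping() is hypothetical, ftruncate() is used on the open descriptor in place of truncate(), and error handling is kept minimal.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create `path`, extend it to map_len bytes, map all
 * of it, store `msg` at offset 0 through the mapping and read it back with
 * pread(). Returns the resulting file size, or -1 on any failure.
 */
off_t write_through_mapping(const char *path, size_t map_len, const char *msg)
{
	size_t msg_len = strlen(msg);
	char check[64];
	struct stat st;
	void *map;
	int fd;

	if (msg_len == 0 || msg_len > sizeof(check) || msg_len > map_len)
		return -1;
	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;
	/* Extend the empty file first so every mapped page is below EOF. */
	if (ftruncate(fd, (off_t)map_len) != 0)
		goto fail;
	map = mmap(NULL, map_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto fail;
	memcpy(map, msg, msg_len);	/* only the first page is touched */
	if (msync(map, map_len, MS_SYNC) != 0) {
		munmap(map, map_len);
		goto fail;
	}
	munmap(map, map_len);
	/* Read back through the descriptor to confirm the store landed. */
	if (pread(fd, check, msg_len, 0) != (ssize_t)msg_len ||
	    memcmp(check, msg, msg_len) != 0 || fstat(fd, &st) != 0)
		goto fail;
	close(fd);
	return st.st_size;
fail:
	close(fd);
	return -1;
}
```

Called with a 100 MiB length, the file size reported afterwards is the full 100 MiB, since the size is set by ftruncate() and not by how much of the mapping was written.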
Peter Zijlstra wrote:
> On Mon, 2008-11-03 at 23:57 +0200, Eugene V. Lyubimkin wrote:
>> Hello kernel hackers!
>>
>> The current implementation of mmap() in the kernel is very convenient.
>> It allows one to mmap(fd) a very large amount of memory with a small file as the back-end.
>> So one can mmap() 100 MiB on an empty file, use the first 10 KiB of memory, munmap() and have
>> only 10 KiB of file at the end. And while working with the memory, the file will automatically be
>> grown by read/write memory requests.
>>
>> The question is: can a user-space application rely on this behavior (I failed to find any
>> documentation about this)?
>>
>> TIA and please CC me in replies.
>
> mmap() writes past the end of the file should not grow the file if I
> understand things right, but produce a SIGBUS (after the first page size
> alignment).
Indeed, faulting beyond the end of file returns a SIGBUS,
see these lines in mm/filemap.c:filemap_fault():
	size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
	if (vmf->pgoff >= size)
		return VM_FAULT_SIGBUS;
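The rounding in that check can be worked through in user space. The two functions below are illustrative stand-ins, not kernel code, and the 4 KiB page size (shift of 12) used in the example values is an assumption.

```c
#include <stdint.h>

/*
 * Number of pages needed to cover file_size bytes, rounding up.
 * page_shift is log2 of the page size (12 for 4 KiB pages).
 */
uint64_t pages_covering(uint64_t file_size, unsigned page_shift)
{
	uint64_t page_size = (uint64_t)1 << page_shift;

	return (file_size + page_size - 1) >> page_shift;
}

/*
 * Same comparison as the quoted snippet: a fault at page index pgoff lies
 * beyond EOF when pgoff is not below the rounded-up page count.
 */
int fault_is_beyond_eof(uint64_t pgoff, uint64_t file_size,
			unsigned page_shift)
{
	return pgoff >= pages_covering(file_size, page_shift);
}
```

For a 10 KiB file with 4 KiB pages this gives 3 pages, so page indexes 0 to 2 are valid (including the partly filled last page) and index 3 is the first one that would fault.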
> The exact interaction of mmap() and truncate() I'm not exactly clear on.
Truncate will reduce the valid part of the mmaps on the file to
match the new file size, so processes accessing beyond the
end of file will get a bus error (SIGBUS).
> The safe way to do things is to first create your file of at least the
> size you mmap, using truncate. This will create a sparse file, and will
> on any sane filesystem not take more space than its metadata.
>
> Thereafter you can fill it with writes to the mmap.
Agreed.
--
All Rights Reversed
Rik van Riel wrote:
> Peter Zijlstra wrote:
>> The exact interaction of mmap() and truncate() I'm not exactly clear on.
>
> Truncate will reduce the valid part of the mmaps on the file to
> match the new file size, so processes accessing beyond the
> end of file will get a bus error (SIGBUS).
I suspect Peter was talking about using truncate() to set the initial
file size, effectively increasing rather than reducing it.
Chris
On Tue, 2008-11-04 at 09:56 -0600, Chris Friesen wrote:
> Rik van Riel wrote:
> > Peter Zijlstra wrote:
>
> >> The exact interaction of mmap() and truncate() I'm not exactly clear on.
> >
> > Truncate will reduce the valid part of the mmaps on the file to
> > match the new file size, so processes accessing beyond the
> > end of file will get a bus error (SIGBUS).
>
> I suspect Peter was talking about using truncate() to set the initial
> file size, effectively increasing rather than reducing it.
I was thinking of truncate() on an already mmap()'ed region, either
increasing or decreasing the size so that part of the mmap becomes
(in)valid.
I'm not sure how POSIX speaks of this.
I think Linux does the expected thing.
On Tue, 04 Nov 2008 17:07:00 +0100
Peter Zijlstra wrote:
> On Tue, 2008-11-04 at 09:56 -0600, Chris Friesen wrote:
> > Rik van Riel wrote:
> > > Peter Zijlstra wrote:
> >
> > >> The exact interaction of mmap() and truncate() I'm not exactly clear on.
> > >
> > > Truncate will reduce the valid part of the mmaps on the file to
> > > match the new file size, so processes accessing beyond the
> > > end of file will get a bus error (SIGBUS).
> >
> > I suspect Peter was talking about using truncate() to set the initial
> > file size, effectively increasing rather than reducing it.
>
> I was thinking of truncate() on an already mmap()'ed region, either
> increasing or decreasing the size so that part of the mmap becomes
> (in)valid.
>
> I'm not sure how POSIX speaks of this.
>
> I think Linux does the expected thing.
I believe our behaviour is correct for mmap/munmap/truncate and it
certainly used to be and was tested.
At the point you do anything involving mremap (which is non-POSIX) our
behaviour becomes rather bizarre.
Alan
Alan Cox wrote:
> On Tue, 04 Nov 2008 17:07:00 +0100
> Peter Zijlstra wrote:
>> [snip]
>> I'm not sure how POSIX speaks of this.
>>
>> I think Linux does the expected thing.
>
> I believe our behaviour is correct for mmap/munmap/truncate and it
> certainly used to be and was tested.
>
> At the point you do anything involving mremap (which is non-POSIX) our
> behaviour becomes rather bizarre.
Thanks to all for the answers. I have come to the conclusion that doing "open() new
file, truncate(<big size>), mmap(<the same big size>), write/read some memory
pages" should not populate the other pages, those untouched by write/read (unless
MAP_POPULATE is given), right?
--
Eugene V. Lyubimkin aka JackYF
On Tue, 4 Nov 2008, Eugene V. Lyubimkin wrote:
> Alan Cox wrote:
> >
> > I believe our behaviour is correct for mmap/munmap/truncate and it
> > certainly used to be and was tested.
Agreed.
> >
> > At the point you do anything involving mremap (which is non-POSIX) our
> > behaviour becomes rather bizarre.
Certainly mremap is non-POSIX, but I can't think of any way in which
it would interfere with Eugene's assumptions about population.
(Every year or so we do wonder whether to change an extending mremap
of a MAP_SHARED|MAP_ANONYMOUS object to extend the object itself instead
of just SIGBUSing on the extension: but I've so far remained conservative
about that, and Eugene appears to be thinking of more ordinary files.)
>
> Thanks to all for the answers. I have come to the conclusion that doing "open() new
> file, truncate(<big size>), mmap(<the same big size>), write/read some memory
> pages" should not populate the other pages, those untouched by write/read (unless
> MAP_POPULATE is given), right?
That is a reasonable description of how the kernel tries and will always
try to handle it, approximately; but I don't think you can rely upon it
absolutely.
For a start, it depends on the filesystem: I believe that vfat, for
example, does not support the concept of sparse files (files with holes
in), so its truncate(<big size>) will allocate the whole of that big
size initially.
I'm not sure what you mean by "populate": in mm, as in MAP_POPULATE,
we're thinking of prefaulting pages into the user address space; but
you're probably thinking of whether the blocks are allocated on disk?
Prefaulting hole pages into the user address space may imply allocating
blocks on disk, or it may not: likely to depend on filesystem again.
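Whether a given filesystem really leaves the hole unallocated can be checked from user space by comparing the apparent size with the block count. A rough sketch, assuming Linux (where st_blocks is counted in 512-byte units) and a writable /tmp; the helper name measure_truncated_file() is made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a temporary file, extend it to `len` bytes
 * with ftruncate() and report its apparent size and the bytes actually
 * allocated for it. Returns 0 on success, -1 on failure.
 */
int measure_truncated_file(off_t len, off_t *apparent, off_t *allocated)
{
	char path[] = "/tmp/sparse-check-XXXXXX";
	struct stat st;
	int fd = mkstemp(path);
	int rc = -1;

	if (fd < 0)
		return -1;
	unlink(path);		/* the file lives only as long as fd */
	if (ftruncate(fd, len) == 0 && fstat(fd, &st) == 0) {
		*apparent = st.st_size;
		/* On Linux st_blocks counts 512-byte units. */
		*allocated = (off_t)st.st_blocks * 512;
		rc = 0;
	}
	close(fd);
	return rc;
}
```

On a filesystem with sparse-file support the allocated figure should stay near zero; on one without it, it would be expected to approach the apparent size.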
From time to time we toy with prefaulting adjacent pages when a fault
occurs (though IIRC tests have proved disappointing in the past): we'd
like to keep that option open, but it would go against your guidelines
above to some extent.
Hugh
> (Every year or so we do wonder whether to change an extending mremap
> of a MAP_SHARED|MAP_ANONYMOUS object to extend the object itself instead
> of just SIGBUSing on the extension: but I've so far remained conservative
> about that, and Eugene appears to be thinking of more ordinary files.)
Try an mremap of a VM_GROWS* mapping and all the other things of this
nature. I would say our current behaviour is not what might be expected
by users. The extending an object case is just one example of weird
behaviour.
Alan
Hugh Dickins wrote:
>> Thanks to all for the answers. I have come to the conclusion that doing "open() new
>> file, truncate(<big size>), mmap(<the same big size>), write/read some memory
>> pages" should not populate the other pages, those untouched by write/read (unless
>> MAP_POPULATE is given), right?
[snip]
> For a start, it depends on the filesystem: I believe that vfat, for
> example, does not support the concept of sparse files (files with holes
> in), so its truncate(<big size>) will allocate the whole of that big
> size initially.
Fortunately, vfat is not an option in my case.
> I'm not sure what you mean by "populate": in mm, as in MAP_POPULATE,
> we're thinking of prefaulting pages into the user address space; but
> you're probably thinking of whether the blocks are allocated on disk?
Yes.
> From time to time we toy with prefaulting adjacent pages when a fault
> occurs (though IIRC tests have proved disappointing in the past): we'd
> like to keep that option open, but it would go against your guidelines
> above to some extent.
It depends on how "adjacent" would be counted :) If several pages, probably not. If
millions or similar, that would be a problem. It's very convenient to use such
"open+truncate+mmap+write/read" behavior to make a self-growing-on-demand cache
in memory with disk as the back-end, without remaps.
Thanks for the descriptive answer.
--
Eugene V. Lyubimkin aka JackYF
On Wed, 5 Nov 2008, Eugene V. Lyubimkin wrote:
> Hugh Dickins wrote:
>
> > From time to time we toy with prefaulting adjacent pages when a fault
> > occurs (though IIRC tests have proved disappointing in the past): we'd
> > like to keep that option open, but it would go against your guidelines
> > above to some extent.
> It depends on how "adjacent" would be counted :) If several pages, probably not.
> If millions or similar, that would be a problem.
That's fine, you'll be safe: you can be sure that it would never be
in the kernel's interest to prefault more than "several" extra pages.
Well, bearing in mind that famous "640K enough for all" remark, let's
not say "never"; but it won't prefault millions until memory is so abundant
and I/O so fast that you'd be happy with it prefaulting millions yourself.
> It's very convenient to use such
> "open+truncate+mmap+write/read" behavior to make a self-growing-on-demand cache
> in memory with disk as the back-end, without remaps.
Yes. Though one thing to beware of is running out of disk space:
whereas a write system call should be good at reporting -ENOSPC,
the filesystem may not be able to handle running out of disk space
when writing back dirty mmaped pages - it may quietly lose the data.
Hugh
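One possible way to reduce that risk, not something proposed in the thread itself, is to reserve the blocks with posix_fallocate() before storing through the mapping, so that a shortage of space is reported up front, and to flush with msync(MS_SYNC) so that writeback errors have a chance to be returned. This gives up the sparse-file space saving for the reserved range, and copy-on-write filesystems may still not give a hard guarantee. The helper name preallocated_mapping_write() is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a temporary file, reserve `len` bytes of
 * disk space for it, then fill it through a shared mapping and flush.
 * Returns 0 on success or an errno-style error number.
 */
int preallocated_mapping_write(size_t len)
{
	char path[] = "/tmp/prealloc-XXXXXX";
	void *map;
	int err;
	int fd = mkstemp(path);

	if (fd < 0)
		return errno;
	unlink(path);		/* the file lives only as long as fd */
	/* posix_fallocate() returns the error number directly, not via errno. */
	err = posix_fallocate(fd, 0, (off_t)len);
	if (err != 0)
		goto out;	/* e.g. ENOSPC, before any store happens */
	map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		err = errno;
		goto out;
	}
	memset(map, 0xA5, len);
	/* MS_SYNC waits for writeback, so an I/O error can surface here. */
	if (msync(map, len, MS_SYNC) != 0)
		err = errno;
	munmap(map, len);
out:
	close(fd);
	return err;
}
```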