2004-06-16 18:13:59

by Stas Sergeev

[permalink] [raw]
Subject: [patch][rfc] expandable anonymous shared mappings


Scanned by evaluation version of Dr.Web antivirus Daemon
http://drweb.ru/unix/


Attachments:
anon_nopg.diff (1.74 kB)
shar_grow.c (998.00 B)
Download all attachments

2004-06-18 20:52:49

by Hugh Dickins

[permalink] [raw]
Subject: Re: [patch][rfc] expandable anonymous shared mappings

On Wed, 16 Jun 2004, Stas Sergeev wrote:
>
> It seems right now there is no way to
> expand the anonymous shared mappings
> (or am I missing something?)

You're right, not missing something.

> This makes this mechanism practically
> useless for many tasks, it otherwise could
> suit very well, I beleive.

You may well be right, though I've no idea of those tasks myself.

Your test case doesn't fork at all: didn't need to to demonstrate
the behaviour you object to, but it doesn't give us any hints of
the use for the change you suggest - shared anonymous only becomes
interesting when you fork.

But I do rather like your idea. At first I was sceptical, thinking
the same could be achieved in other ways. But once forked, it's too
late to extend with further shared anonymous areas, they won't be
shared with your children by then. You can use shared maps of actual
or tmpfs files, or SysV shm, but both are more cumbersome.

Your point, which I now accept, is that we've got this odd thing of
shared anonymous mappings, and this odd thing of mremap extending
maps, can we please put them together in a way that's natural and
useful, rather than in a way that's useless. (And Linux is free
to specify mremap as it wishes, so long as reasonable compatibility
is retained - and I agree we don't need to reproduce every SIGBUS
of yesteryear. I would feel uncomfortable about letting mremap
extend SysV shm objects, but that doesn't happen with your patch.)

> The attached patch implements a simple
> nopage vm op for anonymous shared mappings,
> which allows to expand them with mremap().

I've appended a slightly different patch below: much like yours,
but I've a few issues with your shmem_zero_nopage implementation
and so reworked it e.g. it's scary to see an i_sem taken within
mmap_sem (though I don't think there's really any possibility of
deadlock in this case, given the special nature of this object);
we actually want to avoid vmtruncate's RLIMIT_FSIZE check (it's
an implementation detail that we're using something like a file
to back the shared anonymous mapping, the relevant rlimit is
RLIMIT_AS already checked by mremap); must be careful about
two nopages extending at the same time; might as well extend
object to end of vma in one go rather than fault by fault.

Is this patch good for you?

But... I've a couple of reservations, which make me reluctant
to push this patch further without encouragement from others.

One reservation: is there any security aspect to such a change?
Previously the parent (or ancestor) determined the maximum size
of the anonymous shared object, now children can extend it (and
it will all remain allocated until the very last mapping of any
part of it is unmapped). I don't see an actual problem with that,
just see that it leads me into an area where I'm a little uneasy
(might we need a further MAP_ flag to allow such extension?),
and had better rely on the judgements of others.

Other reservation: it's nicely only a few lines of code, but
it really is a bit of a hack to be extending the object in the
nopage, with SIGBUS on failure. Isn't the right approach (more
trouble than it's worth?) to extend vm_operations_struct with
an mremap method (perhaps NULL .mremap would be equivalent to
VM_DONTEXPAND, and replace that flag in due course)? Which in
the shmem_zero case would extend the underlying object (or fail
with a proper mremap failure), and in other cases do nothing but
allow the extension to succeed. My own inclination is towards
the quick hack in this instance (add the method only when more
uses appear), but again I'd defer to the judgements of others.

Comments?

Hugh

--- 2.6.7/mm/shmem.c 2004-06-16 06:20:39.000000000 +0100
+++ linux/mm/shmem.c 2004-06-17 20:26:26.567507056 +0100
@@ -169,6 +169,7 @@ static struct file_operations shmem_file
static struct inode_operations shmem_inode_operations;
static struct inode_operations shmem_dir_inode_operations;
static struct vm_operations_struct shmem_vm_ops;
+static struct vm_operations_struct shmem_zero_vm_ops;

static struct backing_dev_info shmem_backing_dev_info = {
.ra_pages = 0, /* No readahead */
@@ -1076,7 +1077,8 @@ failed:
return error;
}

-struct page *shmem_nopage(struct vm_area_struct *vma, unsigned long address, int *type)
+struct page *shmem_nopage(struct vm_area_struct *vma,
+ unsigned long address, int *type)
{
struct inode *inode = vma->vm_file->f_dentry->d_inode;
struct page *page = NULL;
@@ -1095,6 +1097,36 @@ struct page *shmem_nopage(struct vm_area
return page;
}

+struct page *shmem_zero_nopage(struct vm_area_struct *vma,
+ unsigned long address, int *type)
+{
+ struct page *page;
+
+ page = shmem_nopage(vma, address, type);
+ if (unlikely(page == NOPAGE_SIGBUS)) {
+ struct inode *inode = vma->vm_file->f_dentry->d_inode;
+ struct shmem_inode_info *info = SHMEM_I(inode);
+ loff_t vm_size, i_size;
+
+ /*
+ * If mremap has been used to extend the vma,
+ * now extend the underlying object to match it.
+ */
+ vm_size = (vma->vm_end - vma->vm_start) +
+ ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
+ spin_lock(&info->lock);
+ i_size = i_size_read(inode);
+ if (i_size < vm_size && vm_size <= SHMEM_MAX_BYTES &&
+ !shmem_acct_size(info->flags, vm_size - i_size)) {
+ i_size_write(inode, vm_size);
+ inode->i_ctime = inode->i_mtime = CURRENT_TIME;
+ }
+ spin_unlock(&info->lock);
+ page = shmem_nopage(vma, address, type);
+ }
+ return page;
+}
+
static int shmem_populate(struct vm_area_struct *vma,
unsigned long addr, unsigned long len,
pgprot_t prot, unsigned long pgoff, int nonblock)
@@ -1976,6 +2008,15 @@ static struct vm_operations_struct shmem
#endif
};

+static struct vm_operations_struct shmem_zero_vm_ops = {
+ .nopage = shmem_zero_nopage,
+ .populate = shmem_populate,
+#ifdef CONFIG_NUMA
+ .set_policy = shmem_set_policy,
+ .get_policy = shmem_get_policy,
+#endif
+};
+
static struct super_block *shmem_get_sb(struct file_system_type *fs_type,
int flags, const char *dev_name, void *data)
{
@@ -2098,7 +2139,8 @@ put_memory:
int shmem_zero_setup(struct vm_area_struct *vma)
{
struct file *file;
- loff_t size = vma->vm_end - vma->vm_start;
+ loff_t size = (vma->vm_end - vma->vm_start) +
+ ((loff_t)vma->vm_pgoff << PAGE_SHIFT);

file = shmem_file_setup("dev/zero", size, vma->vm_flags);
if (IS_ERR(file))
@@ -2107,7 +2149,7 @@ int shmem_zero_setup(struct vm_area_stru
if (vma->vm_file)
fput(vma->vm_file);
vma->vm_file = file;
- vma->vm_ops = &shmem_vm_ops;
+ vma->vm_ops = &shmem_zero_vm_ops;
return 0;
}


2004-06-20 00:20:37

by Hugh Dickins

[permalink] [raw]
Subject: Re: [patch][rfc] expandable anonymous shared mappings

On Sat, 19 Jun 2004, Stas Sergeev wrote:
> Hugh Dickins wrote:
> > shared anonymous only becomes interesting when you fork.

> I disagree with this. The way I am using it may look horrible,
> but yes, I do use it without the fork().

Then I think you have no reason to use MAP_SHARED: use MAP_PRIVATE
and you should get the behaviour you require, without kernel change.

Shared anonymous is peculiar: although mapping is anonymous (nothing
shared with unrelated mms), modifications are shared between parent
and children. It's half-way between anonymous and file-backed.

We agree that it might be nice to let the object used to support that
be extended if mremap extends the mapping. But it might instead just
be needless feature creep. Sorry, your case does not persuade me yet.

Hugh

2004-06-20 09:46:56

by Stas Sergeev

[permalink] [raw]
Subject: Re: [patch][rfc] expandable anonymous shared mappings

Hi.

Hugh Dickins wrote:
>> I disagree with this. The way I am using it may look horrible,
>> but yes, I do use it without the fork().
> Then I think you have no reason to use MAP_SHARED: use MAP_PRIVATE
> and you should get the behaviour you require, without kernel change.
Hugh, I think you misunderstood me because
once again I wrote something, said nothing -
happens to me sometimes.
The trick is that I am setting the old_len arg
of mremap() to 0. This means that the new mapping
is created while the old one is *not* being
destroyed. So I get multiple virtual memory
areas referencing the same shared memory region,
lets call them "aliases".
You propose to share the same backing-store
across the multiple processes. Instead, I am
sharing it across the multiple memory areas
of a single process. I can't use MAP_PRIVATE
for that, really.
Then I want to expand my initial pool of shared
mem while preserving the already created "aliases",
and there seems to be no way of doing that:
creating the larger pool and mremap'ing old
one to the beginning of the new one, leaves the
VMAs unmerged; creating the larger pool and
mamcpy() the content of the old one to it,
doesn't preserve the already created "aliases".
The only thing I can do, is to expand the
initial pool with mremap(), which doesn't work.
So I have to resort to the more heavyweight
things like shm_open(), while otherwise the
expandable anonymous shared mapping can suit
very well for my needs.

> Shared anonymous is peculiar: although mapping is anonymous (nothing
> shared with unrelated mms), modifications are shared between parent
> and children. It's half-way between anonymous and file-backed.
> We agree that it might be nice to let the object used to support that
> be extended if mremap extends the mapping. But it might instead just
> be needless feature creep.
But then I beleive the entire idea of anonymous
shared mapping is also a crap. But since it is
already there, I would like to have it fully
functional, so that I can avoid the things like
shm_open() when possible. I just don't see the
reason of keeping something partially implemented.

> Sorry, your case does not persuade me yet.
Well, I beleive perhaps you missed the fact
that I was setting the old_len to 0 in mremap(),
which doesn't work as I want to, when you use
MAP_PRIVATE.
You'll probably call it a horrible hack, but
here's where that technique comes from:
http://www.ussg.iu.edu/hypermail/linux/kernel/0401.1/1351.html
And since it comes from here, I beleive it
must be fully supported.

2004-06-23 11:24:07

by Christoph Rohland

[permalink] [raw]
Subject: RE: [patch][rfc] expandable anonymous shared mappings

Hi Stas, Hugh,

> The trick is that I am setting the old_len arg
> of mremap() to 0. This means that the new mapping
> is created while the old one is *not* being
> destroyed. So I get multiple virtual memory
> areas referencing the same shared memory region,
> lets call them "aliases".

I would propose you use the posix shm API for what you want to do and leave
anonymous shared memory as a special case of this like it is...

Keep it simple and stupid!

Just my 2cts
Christoph

2004-06-23 18:45:05

by Stas Sergeev

[permalink] [raw]
Subject: Re: [patch][rfc] expandable anonymous shared mappings

Hi Christoph.

Christoph Rohland wrote:
>> The trick is that I am setting the old_len arg
>> of mremap() to 0. This means that the new mapping
>> is created while the old one is *not* being
>> destroyed. So I get multiple virtual memory
>> areas referencing the same shared memory region,
>> lets call them "aliases".
> I would propose you use the posix shm API for what you want to do and leave
> anonymous shared memory as a special case of this like it is...
> Keep it simple and stupid!
Yes, I can use the posix shm API for that. But no,
this will not be simple and stupid. Anon shared
mapping - that's the thing that could really be
simple and stupid, if you ask me.
Posix shm API have some disadvantages in my eyes:
1. If you need many small SHM objects, in order
to manage them vrt resizing, you'll have to track
the addresses to the corresponding FDs, so that you
can ftruncate(). Managing multiple FDs can be avoided
by using the custom allocator on a single large pool.
But in both ways I have to duplicate some functionality
which is already in the kernel, but that I just can't
use effectively.
With the anon mapping I was hoping to avoid this
duplication. The fact that some functionality cannot
be used in an effective way, also doesn't appeal to
me.
2. posix shm API (if we are talking about shm_open()
and friends) is, AFAIK, implemented in librt. This,
in turn, will get the libpthread.so to be linked to
your program. I have problems with libpthreads. For
ages it has the severe bugs, for example it was not
possible to use sigaltstack() in the programs which
are linked with libpthreads. And I am using the
sigaltstack()... It is fixed in the very recent
glibc, but upgrading glibc can be sometimes a bit
problematic.

But no, my prog won't die without the expandable
anon shared mapping thing. I can resort to the other
matters. I can (as I do right now) to just open a
tmp file in /dev/shm and use the custom allocator.
Yes, that works, but I wanted to keep it "simple
and stupid". The anon shared mapping looked like
the good candidate to try out.
And then I thought it may be nice if the kernel
to provide this functionality. If you suggest me
to use the other mechanisms, you probably imply
that doing such a things in the kernel is not a
good idea. If this functionality is considered
bad or useless by the kernel developers, then of
course it shouldn't be integrated and I'll use other
things. But really it looks like a bug to me, as
both the file-backed shared, and the anon-private
mappings are expandable.
I also accept the Hugh's reservations and I see
integrating this may be problematic. But since he
also said he liked the idea itself, I'll wait for
a while before dropping the use of the anon shared
mapping in my prog.

2004-06-28 14:22:46

by Hugh Dickins

[permalink] [raw]
Subject: Re: [patch][rfc] expandable anonymous shared mappings

I hope my lightning responses don't catch you unawares ;)

On Sun, 20 Jun 2004, Stas Sergeev wrote:
> Hugh Dickins wrote:
> >> I disagree with this. The way I am using it may look horrible,
> >> but yes, I do use it without the fork().
> > Then I think you have no reason to use MAP_SHARED: use MAP_PRIVATE
> > and you should get the behaviour you require, without kernel change.
> Hugh, I think you misunderstood me because
> once again I wrote something, said nothing -
> happens to me sometimes.
> The trick is that I am setting the old_len arg
> of mremap() to 0.

Yes, I did indeed overlook that, sorry. You're quite right that you
need MAP_SHARED for this use, and my dismissal was wrong. Thanks
for forcing me to understand at last how mremap old_len 0 behaves!

>From time to time in the last week I've reconsidered your feature
(faulting on mremap-expanded area of shared anonymous object expands
the object itself instead of SIGBUS): I still _like_ it, but cannot
shake off my unease with it.

I like the few lines of code in shmem.c, and I don't mind that it's
a bit of a hack that ought really to be done at mremap time -
I'm happier with the fewness of lines as you have it.

But I'm still uneasy how children (not in your usage) could expand
the object without the parent knowing, all the object remaining in
use until all mms have unmapped all portions. Plus I've realized
we've no complementary interface to shrink the object (we cannot
redefine the behaviour of mremap with new_len less than old_len).

Both those could be dealt with, by adding a new MAP_flag
and a new MREMAP_flag - but just for you?

I'm wary of a feature without its complement; and I'm wary of
adding rather odd neat new features which just _might_ cause
difficulty down the line. mremap is itself a rather odd neat
feature, and shared anonymous is another rather odd neat feature.

You're only trying to get them to play better together, but
extending a file (at fault time or within mremap) is something
new, which could be troublesome to go on supporting if we change
other things around it (odd cases often end up being a nuisance
when it comes to lock ordering, for example).

> and there seems to be no way of doing that:
> creating the larger pool and mremap'ing old
> one to the beginning of the new one, leaves the
> VMAs unmerged; creating the larger pool and
> mamcpy() the content of the old one to it,
> doesn't preserve the already created "aliases".
> The only thing I can do, is to expand the
> initial pool with mremap(), which doesn't work.
> So I have to resort to the more heavyweight
> things like shm_open(), while otherwise the
> expandable anonymous shared mapping can suit
> very well for my needs.

I don't see how shm helps you, that's fixed-size objects too, isn't it?

Can't you use a tmpfs file? Create it, unlink it, mmap a page of it,
ftruncate and mremap together whenever you need? That's then using
standard interfaces without any kernel change - provided CONFIG_TMPFS=y.

> > Shared anonymous is peculiar: although mapping is anonymous (nothing
> > shared with unrelated mms), modifications are shared between parent
> > and children. It's half-way between anonymous and file-backed.
> > We agree that it might be nice to let the object used to support that
> > be extended if mremap extends the mapping. But it might instead just
> > be needless feature creep.
> But then I beleive the entire idea of anonymous
> shared mapping is also a crap. But since it is
> already there, I would like to have it fully
> functional, so that I can avoid the things like
> shm_open() when possible. I just don't see the
> reason of keeping something partially implemented.

Shared anonymous is fully implemented; you have a natural
and useful extension to its behaviour with mremap, I agree it
seems reasonable, but I think we're safer not to make the change.

> > Sorry, your case does not persuade me yet.
> Well, I beleive perhaps you missed the fact
> that I was setting the old_len to 0 in mremap(),
> which doesn't work as I want to, when you use
> MAP_PRIVATE.
> You'll probably call it a horrible hack, but
> here's where that technique comes from:
> http://www.ussg.iu.edu/hypermail/linux/kernel/0401.1/1351.html
> And since it comes from here, I beleive it
> must be fully supported.

Linus was arguing for back-compatibility, that we must continue
to support old_len 0: he wasn't asking for new features here.

I don't oppose your patch - except in the details, please stick
with my version if you continue with it. Your second version
of shmem_zero_nopage remained incorrect (accounting for a size
change without holding the lock against another making a change
at the same time); and in removing the double shmem_nopage you
were optimizing for the very rare (newly allowed) case, at the
expense of every normal use (which go through the size tests
twice with yours).

If you want to persuade Andrew or Linus directly with the patch,
that's fine by me, but my own opinion is to let caution rule.

Hugh

2004-06-28 14:25:30

by Christoph Rohland

[permalink] [raw]
Subject: RE: [patch][rfc] expandable anonymous shared mappings

Hi Stas,

> But no, my prog won't die without the expandable
> anon shared mapping thing. I can resort to the other
> matters. I can (as I do right now) to just open a
> tmp file in /dev/shm and use the custom allocator.

BTW that's what librt does: look into /dev/shm scheck if it is tmpfs and do
an open...

> Yes, that works, but I wanted to keep it "simple
> and stupid". The anon shared mapping looked like
> the good candidate to try out.

Yes, I understand. For your use case anon shared mem with the small addition
is perfect.

> And then I thought it may be nice if the kernel
> to provide this functionality.

For me the question is: The functionality you propose is very specific for
your problem. The kernel should provide general solutions. So is there a
bigger demand for this?

Greetings
Christoph

2004-06-28 18:49:00

by Stas Sergeev

[permalink] [raw]
Subject: Re: [patch][rfc] expandable anonymous shared mappings

Hi Hugh.

Hugh Dickins wrote:
> I like the few lines of code in shmem.c, and I don't mind that it's
> a bit of a hack that ought really to be done at mremap time -
> I'm happier with the fewness of lines as you have it.
But you also mentioned the per-vma mremap handler.
Have you considered it an overkill after all?

> Plus I've realized
> we've no complementary interface to shrink the object (we cannot
> redefine the behaviour of mremap with new_len less than old_len).
What I was thinking about, is that this inode truncation
mess should be resolved by the kernel completely internally,
and the app should see only the anonymous mapping. From
that point of view I think there is no such a problem.
Program will shrink the mapping with the usual mremap(),
inode will not be trunceted, but so what? The mapping
will be shrunk, the virtual address space will be freed,
so what is to worry about? I might be missing something.

> Both those could be dealt with, by adding a new MAP_flag
> and a new MREMAP_flag - but just for you?
But I think the per-vma mremap() handler, as you
suggested before, can solve everything, isn't it?
And I beleive this will not be usefull only for me
then, while the new flags may indeed be usefull for
me only (which is not good).

> You're only trying to get them to play better together, but
> extending a file (at fault time or within mremap) is something
> new, which could be troublesome to go on supporting if we change
> other things around it
I agree with this completely, but my point is that
we are already deeply in that mess anyway. In particular,
I see basically the same things in shmem_zero_setup()/
shmem_file_setup(). It is absolutely similar as far
as I can tell, so I don't agree that I proposed really
something new to it. The only difference I can see is that
it is being invoked not by the fault handler...

>> So I have to resort to the more heavyweight
>> things like shm_open(), while otherwise the
>> expandable anonymous shared mapping can suit
>> very well for my needs.
> I don't see how shm helps you, that's fixed-size objects too, isn't it?
AFAIK - no, but actually I haven't tried, used a
custom allocator on a single large object instead.
AFAIK you can expand the posix shm objects with
mremap() perfectly well, you just need to
ftruncate() first. No? If no, I may reconsider
the sanity of this my proposal entirely. I thought
mremap() works to expand any kind of shared mapping,
that's why I thought the inability to expand one
particular type of shared mapping is a misbehavoir.
How about this:
http://www.ussg.iu.edu/hypermail/linux/kernel/9603.2/0692.html

> Can't you use a tmpfs file? Create it, unlink it, mmap a page of it,
> ftruncate and mremap together whenever you need? That's then using
> standard interfaces without any kernel change - provided CONFIG_TMPFS=y.
The problem is that I need multiple independantly
shared allocations. I can either use multiple tmp
files and track their mapping addresses to their
descriptors by hands if I want to ftruncate(), or
I have to use the custom memory allocator on a single
pool - thats what I do now, but I wanted to make it
simpler. And the shared anon mapping could make it
simpler for me I beleive.
But I am not insisting in case you think it is really
not worth an efforts to do in the kernel. I just
want to point out the potential advantages because
I beleive if it is implemented, it can be usefull
not only for me.

>> shm_open() when possible. I just don't see the
>> reason of keeping something partially implemented.
> Shared anonymous is fully implemented; you have a natural
> and useful extension to its behaviour with mremap
I thought any kinds of mapping should be expandable
with mremap(), esp. the shared ones. This impression
comes to me from the above URL. From that point of
view this is not an extension, but rather the
unimplemented feature. If it is only an extension,
then perhaps it is really not worth the troubles.

> Linus was arguing for back-compatibility, that we must continue
> to support old_len 0: he wasn't asking for new features here.
I feel an inconsistency here. Consider you mmap'ed
10 pages and duplicated 1 page with mremap(,0,).
Then you get this area of one page, which you
actually *can* expand with mremap(). But as soon
as you expanded it more than to original 10 pages -
SIGBUS. And not immediately, but rather when you
get around to access it, i.e. in a completely
unpredictable place:(
And what's worse, is that mremap() perfectly
succeeds, so you don't even expect the failure.
What's even worse, is that your /proc/self/maps
will show that expanded region as being present
and one will never expect the SIGBUS after all this.
If at least /proc/self/maps to reveal the truth,
or mremap() to fail - then yes, the program might
be able to handle the failures. But everything
suggests that the mapping was expanded, but you
just give the SIGBUS in your face at one point.
For some reasons this looks extremely insane and
unreliable to me. My program is rather complex
and it runs the "foreign" code sometimes (say,
third-party plugins), so I have some difficulties
to control everything by hands, when I can't get
the reliable result from the kernel.

> I don't oppose your patch - except in the details, please stick
> with my version if you continue with it.
Sure, no problems.

> If you want to persuade Andrew or Linus directly with the patch,
No, I won't of course, and besides, this may not
be even remotely possible:)

> that's fine by me, but my own opinion is to let caution rule.
OK, I tried my best to make up my point.
I was not trying to convince you that I need
it, rather I was trying point out that without
this patch, mremap() behaves not the way it is
expected to behave, furthermore, it behaves
unreliably and there is no way to find out
whether it is really succeeded or not.
If you still feel it is only a nice extension
to the otherwise perfectly functional interface,
then there might be no point to include it only
for me. My main point was to avoid the mremap's
unreliability. There are cases where nopage
handler is missing. You get the zero-page mapped
in, in these cases. This is bad, but at least you
don't get a SIGBUS. Here we have a completely
different thing - nopage handler is present, but
doesn't really work, and you get the really
unexpected and unpredictable results. I wanted
to make them both expected and predictable.
With this in mind, I would even feel safer if
this functionality is not available at all, rather
then letting it go the way it is now.

Btw, this small patch about VM_DONTEXPAND in my
previous mail, what do you think about that one?

2004-06-28 21:45:35

by Hugh Dickins

[permalink] [raw]
Subject: Re: [patch][rfc] expandable anonymous shared mappings

On Mon, 28 Jun 2004, Stas Sergeev wrote:
> Hugh Dickins wrote:
> > I like the few lines of code in shmem.c, and I don't mind that it's
> > a bit of a hack that ought really to be done at mremap time -
> > I'm happier with the fewness of lines as you have it.

> But you also mentioned the per-vma mremap handler.
> Have you considered it an overkill after all?

Yes, that's what I meant: on balance, in the absence of further uses
for a new .mremap vm_operations method, I do consider it overkill,
and prefer your little addition of shmem_zero_nopage.

> > Plus I've realized
> > we've no complementary interface to shrink the object (we cannot
> > redefine the behaviour of mremap with new_len less than old_len).

> What I was thinking about, is that this inode truncation
> mess should be resolved by the kernel completely internally,
> and the app should see only the anonymous mapping. From
> that point of view I think there is no such a problem.
> Program will shrink the mapping with the usual mremap(),
> inode will not be trunceted, but so what? The mapping
> will be shrunk, the virtual address space will be freed,
> so what is to worry about? I might be missing something.

The underlying "object" is not shrunk, /proc/meminfo Committed_AS
is not shrunk, /proc/sys/vm/overcommit_memory 2 will still consider
the memory in reserved, it'll still be taking up memory+swap - and
must be, in case an mremap expand follows to make it accessible again.

None of which is an issue within a single mm, it's all normal.
And someone with a more reliable security instinct than I may be
able to assure us that it's not an issue within a group of mms,
forked from each other.

But to me it does look like a new situation, that before the change
the parent of all the forked anonymous shared mappings imposed the
upper limit on their memory usage, but with the change any child
could change that irreversibly (irreversibly until all have unmapped
all or exited). Danger of memory going missing mysteriously (perhaps
equivalent to existing danger, but I've not been convinced of that).

> > Both those could be dealt with, by adding a new MAP_flag
> > and a new MREMAP_flag - but just for you?

> But I think the per-vma mremap() handler, as you
> suggested before, can solve everything, isn't it?
> And I beleive this will not be usefull only for me
> then, while the new flags may indeed be usefull for
> me only (which is not good).

Either way, without someone else speaking up, it does seem just for you.

> > You're only trying to get them to play better together, but
> > extending a file (at fault time or within mremap) is something
> > new, which could be troublesome to go on supporting if we change
> > other things around it

> I agree with this completely, but my point is that
> we are already deeply in that mess anyway. In particular,
> I see basically the same things in shmem_zero_setup()/
> shmem_file_setup(). It is absolutely similar as far
> as I can tell, so I don't agree that I proposed really
> something new to it. The only difference I can see is that
> it is being invoked not by the fault handler...

Shared anonymous is (and has to be) implemented quite differently
from private anonymous. You rightly resent that the implementation
is imposing a surprising limitation on the user-visible semantics
(surprising compared with private; expected compared with shared
file, but then we hit that there's no ftruncate to extend anon).

I sympathise. But I've never heard that complained of before,
you're exaggerating to say we're deeply in a mess here. I'd
like to change it, but pathetic caution tells me to beware,
the change might cause trouble down the line.

> >> So I have to resort to the more heavyweight
> >> things like shm_open(), while otherwise the
> >> expandable anonymous shared mapping can suit
> >> very well for my needs.
> > I don't see how shm helps you, that's fixed-size objects too, isn't it?

> AFAIK - no, but actually I haven't tried, used a
> custom allocator on a single large object instead.
> AFAIK you can expand the posix shm objects with
> mremap() perfectly well, you just need to
> ftruncate() first. No? If no, I may reconsider

Sorry, I was talking SysV shm there. POSIX shm, as I understand it,
simply works with the tmpfs mounted at /dev/shm: I never think of
POSIX shm, just of tmpfs. Yes, you can use ftruncate on an fd
created with shm_open.

> the sanity of this my proposal entirely. I thought
> mremap() works to expand any kind of shared mapping,
> that's why I thought the inability to expand one
> particular type of shared mapping is a misbehavoir.
> How about this:
> http://www.ussg.iu.edu/hypermail/linux/kernel/9603.2/0692.html

Hey, (no offence!) are you a lawyer? You're very good at looking
up past history to support your case. But I'm afraid this does
not actually add support to your case (certainly doesn't subtract
from it either) - it's about using mremap to track extensions
to a file extended by other means.

> > Can't you use a tmpfs file? Create it, unlink it, mmap a page of it,
> > ftruncate and mremap together whenever you need? That's then using
> > standard interfaces without any kernel change - provided CONFIG_TMPFS=y.

> The problem is that I need multiple independantly
> shared allocations. I can either use multiple tmp
> files and track their mapping addresses to their
> descriptors by hands if I want to ftruncate(), or
> I have to use the custom memory allocator on a single
> pool - thats what I do now, but I wanted to make it
> simpler. And the shared anon mapping could make it
> simpler for me I beleive.

Yes, it would make it simpler, but not significantly simpler.
You could ask instead for an mtruncate system call (similarly
to save having to keep track of fds with addresses), but the
need does not seem compelling.

> But I am not insisting in case you think it is really
> not worth an efforts to do in the kernel. I just
> want to point out the potential advantages because
> I beleive if it is implemented, it can be usefull
> not only for me.

Nobody else has spoken up yet.

> >> shm_open() when possible. I just don't see the
> >> reason of keeping something partially implemented.
> > Shared anonymous is fully implemented; you have a natural
> > and useful extension to its behaviour with mremap

> I thought any kinds of mapping should be expandable
> with mremap(), esp. the shared ones. This impression
> comes to me from the above URL. From that point of
> view this is not an extension, but rather the
> unimplemented feature. If it is only an extension,
> then perhaps it is really not worth the troubles.

It's an extension, a natural extension in principle (seen
from the private anon side), but in implementation less natural.

> > Linus was arguing for back-compatibility, that we must continue
> > to support old_len 0: he wasn't asking for new features here.
> I feel an inconsistency here. Consider you mmap'ed
> 10 pages and duplicated 1 page with mremap(,0,).

[snip surprise at SIGBUS beyond end of shared object]

We're just repeating our acknowledged positions.

> Btw, this small patch about VM_DONTEXPAND in my
> previous mail, what do you think about that one?

This one? I overlooked it before: it seems to be a slight
redefinition of VM_DONTEXPAND, and I don't see what bug it fixes.

--- linux-2.6.6/mm/mremap.c.old 2004-06-14 19:48:36.000000000 +0400
+++ linux-2.6.6/mm/mremap.c 2004-06-19 11:54:28.508681472 +0400
@@ -320,7 +320,7 @@
if (old_len > vma->vm_end - addr)
goto out;
if (vma->vm_flags & VM_DONTEXPAND) {
- if (new_len > old_len)
+ if (new_len > vma->vm_end - addr)
goto out;
}
if (vma->vm_flags & VM_LOCKED) {

Hugh

2004-06-29 17:06:31

by Stas Sergeev

[permalink] [raw]
Subject: Re: [patch][rfc] expandable anonymous shared mappings

Hi Hugh.

Hugh Dickins wrote:
> But I've never heard that complained of before,
> you're exaggerating to say we're deeply in a mess here. I'd
What I wanted to say is only that shmem_file_setup()
does exactly the same thing - creates and truncates
the file itself. But this is no longer important
as I finally realized your points.

> Sorry, I was talking SysV shm there.
Excellent point btw! I verified that, mremap() on
the SysV shm gives me the same nice SIGBUS. I feel
much better right now:) I was wrong assuming that
expanding the shared mapping with mremap() should
be valid in any case, SysV shm is an obvious example,
and it resembles the anon-shm the way they both do
not have an FD accessible.
So yes, I finally realized that my proposal is
nothing more than an extension to the otherwise
functional interface, and should not be traded as
a bugfix. And yes, the inability to shrink it back
is also not friendly. As for not allowing children
to expand, I'd vote for MAP_DONTEXPAND flag, but
this already have nothing to do with my original
proposal, so no need to worry about it.
Thanks for you time and explanations, after all they
worked out right:)

>> http://www.ussg.iu.edu/hypermail/linux/kernel/9603.2/0692.html
> up past history to support your case. But I'm afraid this does
> not actually add support to your case (certainly doesn't subtract
> from it either) - it's about using mremap to track extensions
> to a file extended by other means.
Well, what I meant pointing to this URL, was this:
---
In most
cases (at least for the malloc case) this will be a anonymous mapping,
but it's by no means an error to extend any mapping you have created.
---
Either I read it the completely wrong way, or it
is no longer valid, but in the light of the above,
it seems like indeed this doesn't add the support
to my case.

> Yes, it would make it simpler, but not significantly simpler.
You are right.

> You could ask instead for an mtruncate system call (similarly
Good idea, why not? :)

> [snip surprise at SIGBUS beyond end of shared object]
But you snipped the most important part:)
Now I understand, however, the problems I had with
mremap(), have nothing special to do with the anon-shm,
it just seems to be the usual thing with that syscall.
It is full of surprises, it will map a zero-page to you,
it will give you a SIGBUS, it will do everything
but not what you really want from it:) I always
wanted to have the more reliable mremap(). Not the
one that can expand everything in the world, but
the one that returns the value you can rely upon,
as all the other syscalls do. But the way I wanted
to achieve it (by implementing the "proper" nopage handler
for anon-shm) is definitely not the right one. That's
why I was very happy when you proposed the .mremap
handlers per vma, but since this have nothing to do
with the current subject, I should probably go
complaining about mremap() elsewhere.

Thank you and Christoph for the very usefull
discussion. It was usefull maybe again only
for me, but I hope you weren't bored with it too
much either.

Thanks!

2004-06-29 17:41:21

by Hugh Dickins

[permalink] [raw]
Subject: Re: [patch][rfc] expandable anonymous shared mappings

On Tue, 29 Jun 2004, Stas Sergeev wrote:
> Now I understand, however, the problems I had with
> mremap(), have nothing special to do with the anon-shm,
> it just seems to be the usual thing with that syscall.
> It is full of surprises, it will map a zero-page to you,
> it will give you a SIGBUS, it will do everything
> but not what you really want from it:) I always
> wanted to have the more reliable mremap(). Not the
> one that can expand everything in the world, but
> the one that returns the value you can rely upon,
> as all the other syscalls do.

>From this attack on poor little mremap(), I think perhaps there's
something else you didn't realize, that I'd assumed you did realize.

If you have a file of size, say, 2 pages; and you mmap 3 pages of
it (from offset 0 for simplicity); then you try to access the third
page of your mapping... you get SIGBUS. That's so whether it's a
shared or a private mapping. It's even so if it's a private mapping,
you write your own data into the second page, you truncate the file
to 1 page, you try to access your own data in the second page again.
That's how mmap is specified to behave, not just on Linux.

So mremap() is entirely consistent to allow you to extend a mapping
beyond the end of the object, such that you'll get SIGBUS if you
try to access the end of your mapping.

The deficiency with shared anonymous is that there's no way to expand
or shrink the underlying object to match the mapping, whereas you can
ftruncate a real file.

Hugh

2004-06-29 19:03:53

by Stas Sergeev

[permalink] [raw]
Subject: Re: [patch][rfc] expandable anonymous shared mappings

Hi.

Hugh Dickins wrote:
>> It is full of surprises, it will map a zero-page to you,
>> it will give you a SIGBUS, it will do everything
> From this attack on poor little mremap(), I think perhaps there's
> something else you didn't realize, that I'd assumed you did realize.
OK, the last time I was messing around the
mremap, it was mapping the anonymous pages
to me when I was trying to expand the /dev/mem
mapping (missing nopage handler apparently,
having the failure could be much better in that
case I think).

> So mremap() is entirely consistent to allow you to extend a mapping
> beyond the end of the object, such that you'll get SIGBUS if you
> try to access the end of your mapping.
Yes, as for SIGBUS - now I see why it works that
way (not necessary accept its correctness, but
what you said, sounds perfectly valid to me, so
the complete retirement on my side:)

> The deficiency with shared anonymous is that there's no way to expand
> or shrink the underlying object to match the mapping, whereas you can
> ftruncate a real file.
... and/or expand the anonymous private mapping without
the truncation (nothing to truncate). So yes, after all,
it was specific to the anon-shm, which is a special case.
And introducing the per-vma mremap handlers to fix only
this special case (by doing the truncation immediately)
is an overkill and is not required by anyone except myself.
And this:
---
In most
cases (at least for the malloc case) this will be a anonymous mapping,
but it's by no means an error to extend any mapping you have created.
---
may just mean that it is never an error, but you have
to deal with the consequences yourself if you expanded
only the mapping and not the object (except for the
case of anon-shm, where you can't deal with the
consequences anyhow since you can't expand the object).
Have I got everything properly this time, or failed the
homework agan? :)