2017-11-07 17:10:28

by Nicholas Piggin

[permalink] [raw]
Subject: Re: POWER: Unexpected fault when writing to brk-allocated memory

C'ing everyone who was on the x86 56-bit user virtual address patch.

I think we need more time to discuss this behaviour, in light of the
regression Florian uncovered. I would propose we turn off the 56-bit
user virtual address support for x86 for 4.14, and powerpc would
follow and turn off its 512T support until we can get a better handle
on the problems. (Actually Florian initially hit a couple of bugs in
powerpc implementation, but pulling that string uncovers a whole lot
of difficulties.)

The bi-modal behavior switched based on a combination of mmap address
hint and MAP_FIXED just sucks. It's segregating our VA space with
some non-standard heuristics, and it doesn't seem to work very well.

What are we trying to do? Allow SAP HANA etc use huge address spaces
by coding to these specific mmap heuristics we're going to add,
rather than solving it properly in a way that requires adding a new
syscall or personality or prctl or sysctl. Okay, but the cost is that
despite best efforts, it still changes ABI behaviour for existing
applications and these heuristics will become baked into the ABI that
we will have to support. Not a good tradeoff IMO.

First of all, using addr and MAP_FIXED to develop our heuristic can
never really give unchanged ABI. It's an in-band signal. brk() is a
good example that steadily keeps incrementing address, so depending
on malloc usage and address space randomization, you will get a brk()
that ends exactly at 128T, then the next one will be >
DEFAULT_MAP_WINDOW, and it will switch you to 56 bit address space.

Second, the kernel can never completely solve the problem this way.
How do we know a malloc library will not ask for > 128TB addresses
and pass them to an unknowing application?

And lastly, there are a fair few bugs and places where description
in changelogs and mailing lists does not match code. You don't want
to know the mess in powerpc, but even x86 has two I can see:
MAP_FIXED succeeds even when crossing 128TB addresses (where changelog
indicated it should not), arch_get_unmapped_area_topdown() with an
address hint is checking against TASK_SIZE rather than the limited
128TB address, so it looks like it won't follow the heuristics.

So unless everyone else thinks I'm crazy and disagrees, I'd ask for
a bit more time to make sure we get this interface right. I would
hope for something like prctl PR_SET_MM which can be used to set
our user virtual address bits on a fine grained basis. Maybe a
sysctl, maybe a personality. Something out-of-band. I don't wan to
get too far into that discussion yet. First we need to agree whether
or not the code in the tree today is a problem.

Thanks,
Nick

On Mon, 6 Nov 2017 09:32:25 +0100
Florian Weimer <[email protected]> wrote:

> On 11/06/2017 09:30 AM, Aneesh Kumar K.V wrote:
> > On 11/06/2017 01:55 PM, Nicholas Piggin wrote:
> >> On Mon, 6 Nov 2017 09:11:37 +0100
> >> Florian Weimer <[email protected]> wrote:
> >>
> >>> On 11/06/2017 07:47 AM, Nicholas Piggin wrote:
> >>>> "You get < 128TB unless explicitly requested."
> >>>>
> >>>> Simple, reasonable, obvious rule. Avoids breaking apps that store
> >>>> some bits in the top of pointers (provided that memory allocator
> >>>> userspace libraries also do the right thing).
> >>>
> >>> So brk would simplify fail instead of crossing the 128 TiB threshold?
> >>
> >> Yes, that was the intention and that's what x86 seems to do.
> >>
> >>>
> >>> glibc malloc should cope with that and switch to malloc, but this code
> >>> path is obviously less well-tested than the regular way.
> >>
> >> Switch to mmap() I guess you meant?
>
> Yes, sorry.
>
> >> powerpc has a couple of bugs in corner cases, so those should be fixed
> >> according to intended policy for stable kernels I think.
> >>
> >> But I question the policy. Just seems like an ugly and ineffective wart.
> >> Exactly for such cases as this -- behaviour would change from run to run
> >> depending on your address space randomization for example! In case your
> >> brk happens to land nicely on 128TB then the next one would succeed.
> >
> > Why ? It should not change between run to run. We limit the free
> > area search range based on hint address. So we should get consistent
> > results across run. even if we changed the context.addr_limit.
>
> The size of the gap to the 128 TiB limit varies between runs because of
> ASLR. So some runs would use brk alone, others would use brk + malloc.
> That's not really desirable IMHO.


From 1583417043601202476@xxx Tue Nov 07 14:16:06 +0000 2017
X-GM-THRID: 1583404961130869946
X-Gmail-Labels: Inbox,Category Forums,HistoricalUnread


2017-11-07 12:51:23

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: POWER: Unexpected fault when writing to brk-allocated memory

On Tue, Nov 07, 2017 at 04:07:05PM +1100, Nicholas Piggin wrote:
> C'ing everyone who was on the x86 56-bit user virtual address patch.
>
> I think we need more time to discuss this behaviour, in light of the
> regression Florian uncovered. I would propose we turn off the 56-bit
> user virtual address support for x86 for 4.14, and powerpc would
> follow and turn off its 512T support until we can get a better handle
> on the problems. (Actually Florian initially hit a couple of bugs in
> powerpc implementation, but pulling that string uncovers a whole lot
> of difficulties.)
>
> The bi-modal behavior switched based on a combination of mmap address
> hint and MAP_FIXED just sucks. It's segregating our VA space with
> some non-standard heuristics, and it doesn't seem to work very well.
>
> What are we trying to do? Allow SAP HANA etc use huge address spaces
> by coding to these specific mmap heuristics we're going to add,
> rather than solving it properly in a way that requires adding a new
> syscall or personality or prctl or sysctl. Okay, but the cost is that
> despite best efforts, it still changes ABI behaviour for existing
> applications and these heuristics will become baked into the ABI that
> we will have to support. Not a good tradeoff IMO.
>
> First of all, using addr and MAP_FIXED to develop our heuristic can
> never really give unchanged ABI. It's an in-band signal. brk() is a
> good example that steadily keeps incrementing address, so depending
> on malloc usage and address space randomization, you will get a brk()
> that ends exactly at 128T, then the next one will be >
> DEFAULT_MAP_WINDOW, and it will switch you to 56 bit address space.

No, it won't. You will hit stack first.

> Second, the kernel can never completely solve the problem this way.
> How do we know a malloc library will not ask for > 128TB addresses
> and pass them to an unknowing application?

The idea is that an application can provide hint (mallopt() ?) to malloc
implementation that it's ready to full address space. In this case, malloc
can use mmap((void *) -1,...) for its allocations and get full address
space this way.

> And lastly, there are a fair few bugs and places where description
> in changelogs and mailing lists does not match code. You don't want
> to know the mess in powerpc, but even x86 has two I can see:
> MAP_FIXED succeeds even when crossing 128TB addresses (where changelog
> indicated it should not),

Hm. I don't see where the changelog indicated that MAP_FIXED across 128TB
shouldn't work. My intention was that it should, although I haven't stated
it in the changelog.

The idea was we shouldn't allow to slip above 47-bits by accidentally.

Correctly functioning program would never request addr+len above 47-bit
with MAP_FIXED, unless it's ready to handle such addresses. Otherwise the
request would simply fail on machine that doesn't support large VA.

In contrast, addr+len above 47-bit without MAP_FIXED will not fail on
machine that doesn't support large VA, kernel will find another place
under 47-bit. And I can imagine a reasonable application that does
something like this.

So we cannot rely that application is ready to handle large
addresses if we see addr+len without MAP_FIXED.

> arch_get_unmapped_area_topdown() with an address hint is checking
> against TASK_SIZE rather than the limited 128TB address, so it looks
> like it won't follow the heuristics.

You are right. This is broken. If user would request mapping above vdso,
but below DEFAULT_MAP_WINDOW it will succeed.

I'll send patch to fix this. But it doesn't look as a show-stopper to me.

Re-checking things for this reply I found actual bug, see:

http://lkml.kernel.org/r/[email protected]

> So unless everyone else thinks I'm crazy and disagrees, I'd ask for
> a bit more time to make sure we get this interface right. I would
> hope for something like prctl PR_SET_MM which can be used to set
> our user virtual address bits on a fine grained basis. Maybe a
> sysctl, maybe a personality. Something out-of-band. I don't wan to
> get too far into that discussion yet. First we need to agree whether
> or not the code in the tree today is a problem.

Well, we've discussed before all options you are proposing.
Linus wanted a minimalistic interface, so we took this path for now.
We can always add more ways to get access to full address space later.

--
Kirill A. Shutemov

From 1583411654279734486@xxx Tue Nov 07 12:50:26 +0000 2017
X-GM-THRID: 1583404961130869946
X-Gmail-Labels: Inbox,Category Forums,HistoricalUnread

2017-11-07 11:04:04

by Florian Weimer

[permalink] [raw]
Subject: Re: POWER: Unexpected fault when writing to brk-allocated memory

On 11/07/2017 06:07 AM, Nicholas Piggin wrote:

> First of all, using addr and MAP_FIXED to develop our heuristic can
> never really give unchanged ABI. It's an in-band signal. brk() is a
> good example that steadily keeps incrementing address, so depending
> on malloc usage and address space randomization, you will get a brk()
> that ends exactly at 128T, then the next one will be >
> DEFAULT_MAP_WINDOW, and it will switch you to 56 bit address space.

Note that this brk phenomenon is only a concern for some currently
obscure process memory layouts where the heap ends up at the top of the
address space. Usually, there is something above it which eliminates
the possibility that it can cross into the 128 TiB wilderness. So the
brk problem only happens on some architectures (e.g., not x86-64), and
only with strange ways of running programs (explicitly ld.so invocation
and likely static PIE, too).

> So unless everyone else thinks I'm crazy and disagrees, I'd ask for
> a bit more time to make sure we get this interface right. I would
> hope for something like prctl PR_SET_MM which can be used to set
> our user virtual address bits on a fine grained basis. Maybe a
> sysctl, maybe a personality. Something out-of-band. I don't wan to
> get too far into that discussion yet. First we need to agree whether
> or not the code in the tree today is a problem.

There is certainly more demand for similar functionality, like creating
mappings below 2 GB/4 GB/32 GB, and probably other bit patterns.
Hotspot would use this to place the heap with compressed oops, instead
of manually hunting for a suitable place for the mapping. (Essentially,
32-bit pointers on 64-bit architectures for sufficiently small heap
sizes.) It would perhaps be possible to use the hints address as a
source of the bit count, for full flexibility. And the mapping should
be placed into the upper half of the selected window if possible.

MAP_FIXED is near-impossible to use correctly. I hope you don't expect
applications to do that. If you want address-based opt in, it should
work without MAP_FIXED. Sure, in obscure cases, applications might
still see out-of-range addresses, but I expected a full opt-out based on
RLIMIT_AS would be sufficient for them.

Thanks,
Florian

From 1583717217542378576@xxx Fri Nov 10 21:47:14 +0000 2017
X-GM-THRID: 1583486832070932068
X-Gmail-Labels: Inbox,Category Forums,HistoricalUnread