Date: Wed, 5 Jul 2017 17:59:45 +0200
From: Michal Hocko
To: Ben Hutchings
Cc: Linus Torvalds, Willy Tarreau, Hugh Dickins, Oleg Nesterov,
 "Jason A. Donenfeld", Rik van Riel, Larry Woodman,
 "Kirill A. Shutemov", Tony Luck, "James E.J. Bottomley",
 Helge Diller, James Hogan, Laura Abbott, Greg KH,
 "security@kernel.org", Qualys Security Advisory, LKML, Ximin Luo
Subject: Re: [PATCH] mm: larger stack guard gap, between vmas
Message-ID: <20170705155944.GC21220@dhcp22.suse.cz>
In-Reply-To: <1499268300.2707.41.camel@decadent.org.uk>
References: <20170704084122.GC14722@dhcp22.suse.cz>
 <20170704093538.GF14722@dhcp22.suse.cz>
 <20170704094728.GB22013@1wt.eu>
 <20170704104211.GG14722@dhcp22.suse.cz>
 <20170704113611.GA4732@decadent.org.uk>
 <1499209315.2707.29.camel@decadent.org.uk>
 <1499257180.2707.34.camel@decadent.org.uk>
 <20170705142354.GB21220@dhcp22.suse.cz>
 <1499268300.2707.41.camel@decadent.org.uk>

On Wed 05-07-17 16:25:00, Ben Hutchings wrote:
> On Wed, 2017-07-05 at 16:23 +0200, Michal Hocko wrote:
> > On Wed 05-07-17 13:19:40, Ben Hutchings wrote:
> > > On Tue, 2017-07-04 at 16:31 -0700, Linus Torvalds wrote:
> > > > On Tue, Jul 4, 2017 at 4:01 PM, Ben Hutchings wrote:
> > > > >
> > > > > We have:
> > > > >
> > > > > bottom = 0xff803fff
> > > > > sp =     0xffffb178
> > > > >
> > > > > The relevant mappings are:
> > > > >
> > > > > ff7fc000-ff7fd000 rwxp 00000000 00:00 0
> > > > > fffdd000-ffffe000 rw-p 00000000 00:00 0          [stack]
> > > >
> > > > Ugh. So that stack is actually 8MB in size, but the alloca() is
> > > > about to use up almost all of it, and there's only about 28kB left
> > > > between "bottom" and that 'rwx' mapping.
> > > >
> > > > Still, that rwx mapping is interesting: it is a single page, and it
> > > > really is almost exactly 8MB below the stack.
> > > >
> > > > In fact, the top of stack (at 0xffffe000) is *exactly* 8MB+4kB from
> > > > the top of that odd one-page allocation (0xff7fd000).
> > > >
> > > > Can you find out where that is allocated? Perhaps a breakpoint on
> > > > mmap, with a condition to catch that particular one?
> > >
> > > [...]
> > >
> > > Found it, and it's now clear why only i386 is affected:
> > > http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/os/linux/vm/os_linux.cpp#l4852
> > > http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/os_cpu/linux_x86/vm/os_linux_x86.cpp#l881
> >
> > This is really worrying. This doesn't look like a gap at all. It is a
> > mapping which actually contains code, and so we should absolutely not
> > allow anything to scribble over it. So I am afraid the only way forward
> > is to allow a per-process stack gap and run this particular program
> > with a smaller gap. We basically have two ways: either /proc/<pid>/$file
> > or a prctl inherited on exec. The latter is less code. What do you
> > think?
>
> Distributions can do that, but what about all the other apps out there
> using JNI and private copies of the JRE?

Yes, this sucks. I was thinking about something like a run_legacy_stack
wrapper which would do

	prctl(PR_SET_STACK_GAP, 1, 0, 0, 0);
	execve(argv[1], argv+1, environ);

so we would have a way to start applications that crash with the new
setup without changing the default for all other applications (a fuller
sketch of such a wrapper is appended below the signature). The question
is what to do if the exec'd task is setuid, because we definitely do not
want to allow anybody to be tricked into running with a smaller gap. Or
maybe just start java with an increased stack rlimit?

> Something I noticed is that Java doesn't immediately use MAP_FIXED.
> Look at os::pd_attempt_reserve_memory_at(). If the first, hinted,
> mmap() doesn't return the hinted address it then attempts to allocate
> huge areas (I'm not sure how intentional this is) and unmaps the
> unwanted parts. Then os::workaround_expand_exec_shield_cs_limit()
> re-mmap()s the wanted part with MAP_FIXED. If this fails at any point
> it is not a fatal error.
>
> So if we change vm_start_gap() to take the stack limit into account
> (when it's finite) that should neutralise
> os::workaround_expand_exec_shield_cs_limit(). I'll try this.

I was already thinking about doing something like that to get better
support for MAP_GROWSDOWN, but then I gave up because it would require
capping RLIMIT_STACK for large values in order not to break userspace
again, and the right maximum value is not clear to me.
-- 
Michal Hocko
SUSE Labs
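
A minimal sketch of the run_legacy_stack wrapper described above, assuming
the PR_SET_STACK_GAP prctl proposed in this mail actually existed. Both the
constant's value and the meaning of its arguments are invented here for
illustration only (no such prctl is allocated in any kernel), so on a real
system the prctl() call simply fails and the failure is reported before the
exec goes ahead.

	/* run_legacy_stack.c - hypothetical wrapper, not a real kernel ABI */
	#include <stdio.h>
	#include <unistd.h>
	#include <sys/prctl.h>

	#ifndef PR_SET_STACK_GAP
	#define PR_SET_STACK_GAP 0x53544147	/* placeholder, not an allocated prctl number */
	#endif

	extern char **environ;

	int main(int argc, char **argv)
	{
		if (argc < 2) {
			fprintf(stderr, "usage: %s <program> [args...]\n", argv[0]);
			return 1;
		}

		/* Request the legacy (small) gap for this task.  The '1' just
		 * follows the sketch in the mail above; its exact meaning would
		 * be defined by the proposal.  The setting is meant to be
		 * inherited across the execve() below. */
		if (prctl(PR_SET_STACK_GAP, 1, 0, 0, 0) != 0)
			perror("prctl(PR_SET_STACK_GAP)");

		execve(argv[1], argv + 1, environ);
		perror("execve");
		return 127;
	}

The alternative mentioned above, raising the stack rlimit instead, would fit
into the same wrapper as a setrlimit(RLIMIT_STACK, ...) call before the
execve().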
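
For reference, a rough userspace illustration (not the HotSpot code itself)
of the hinted-mmap step Ben describes in os::pd_attempt_reserve_memory_at():
ask for one rwx page at roughly stack_top - RLIMIT_STACK without MAP_FIXED
and report whether the kernel honoured the hint. The stack-top computation
below is only a crude stand-in for HotSpot's os::current_stack_base() and is
an assumption of this sketch; under Ben's proposed vm_start_gap() change the
kernel would be expected to move such a request elsewhere, which is what
would neutralise os::workaround_expand_exec_shield_cs_limit().

	#include <stdio.h>
	#include <stdint.h>
	#include <unistd.h>
	#include <sys/mman.h>
	#include <sys/resource.h>

	int main(void)
	{
		struct rlimit rl;
		long page = sysconf(_SC_PAGESIZE);
		volatile char marker;		/* lives on the main thread's stack */
		uintptr_t stack_top, hint;
		void *p;

		if (getrlimit(RLIMIT_STACK, &rl) || rl.rlim_cur == RLIM_INFINITY)
			return 1;

		/* Round a stack address up to a page boundary as a crude guess
		 * at the top of the stack VMA. */
		stack_top = ((uintptr_t)&marker + page - 1) & ~((uintptr_t)page - 1);
		hint = stack_top - (uintptr_t)rl.rlim_cur;

		/* Hinted, non-MAP_FIXED request, as in the first step of
		 * os::pd_attempt_reserve_memory_at(). */
		p = mmap((void *)hint, page, PROT_READ | PROT_WRITE | PROT_EXEC,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		printf("hint %#lx, got %#lx (%s)\n", (unsigned long)hint,
		       (unsigned long)p,
		       (uintptr_t)p == hint ? "honoured" : "refused");
		munmap(p, page);
		return 0;
	}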