Date: Fri, 12 Jun 2015 10:04:25 +0200
From: Ingo Molnar <mingo@kernel.org>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Waiman Long <Waiman.Long@hp.com>, Thomas Gleixner <tglx@linutronix.de>,
        Denys Vlasenko <dvlasenk@redhat.com>, Borislav Petkov <bp@alien8.de>,
        Andrew Morton <akpm@linux-foundation.org>,
        Oleg Nesterov <oleg@redhat.com>, Andy Lutomirski <luto@amacapital.net>,
        linux-mml@vger.kernel.org,
        Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
        Brian Gerst <brgerst@gmail.com>, "H. Peter Anvin" <hpa@zytor.com>,
        Peter Zijlstra <peterz@infradead.org>
Subject: Re: [PATCH 07/12] x86/virt/guest/xen: Remove use of pgd_list from
 the Xen guest code
Message-ID: <20150612080425.GC8759@gmail.com>
References: <1434031637-9091-1-git-send-email-mingo@kernel.org>
 <1434031637-9091-8-git-send-email-mingo@kernel.org>
 <CA+55aFzONQOsVXTECmynSAcDW_WnwOpLHMVDNUp0nsqrkYnw3Q@mail.gmail.com>
 <CA+55aFyDp_B9q4ReiMsngzQYTWRukrNBYy7uoGfFQ5nVredT9Q@mail.gmail.com>
 <20150612072302.GA7509@gmail.com>
 <CA+55aFxXM=DN32JqNmJ=JoMum5OPnsRohCry-=2T=LabX2hzVQ@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <CA+55aFxXM=DN32JqNmJ=JoMum5OPnsRohCry-=2T=LabX2hzVQ@mail.gmail.com>
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 1868
Lines: 42


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Jun 12, 2015 00:23, "Ingo Molnar" <mingo@kernel.org> wrote:
> >
> > We might make it so: but that would mean restricting certain clone_flags 
> > variants - not sure that's possible with our current ABI usage?
> 
> We already do that. You can't share signal info unless you share the mm. And a 
> shared signal state is what defines a thread group.
> 
> So I think the only issue is that ->mm can become NULL when the thread group 
> leader dies - a non-NULL mm should always be shared among all threads.

Indeed, we do that in exit_mm().

So we could add tsk->mm_leader or so, which does not get cleared and which the 
scheduler does not look at, but I'm not sure it's entirely safe that way: we don't 
have a refcount, and when the last thread exits it becomes bogus for a small 
window until the zombie leader is unlinked from the task list.

To close that race we'd have __mmdrop() or so clear out tsk->mm_leader - but the 
task doing the mmdrop() might be a lazy thread totally unrelated to the original 
thread group so we don't know which tsk->mm_leader to clear out.

To solve that we'd have to track the leader owning an MM in mm_struct - which gets 
interesting for the exec() case where the thread group gets a new leader, so we'd 
have to re-link the mm's leader pointer there.

So unless I missed some simpler solution, there are a good number of steps where
this could go wrong, in small-looking race windows. How about we just live with
iterating through all tasks instead of just all processes, once per 512 GB of
memory mapped?
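
To illustrate why walking only the group leaders is not enough, here is a
minimal userspace sketch. It assumes simplified stand-in structures: struct mm,
struct task, struct process and both helper functions are hypothetical names
for this example, not the kernel's types or iterators. A group whose leader has
already gone through exit_mm() is only found by the walk over all threads:

```c
#include <stddef.h>

/* Simplified userspace stand-ins; these are NOT the kernel's structures. */
struct mm { int id; };

struct task {
	struct mm *mm;             /* NULL after the task has run exit_mm() */
	struct task *next_thread;  /* next thread in the group, NULL at the end */
};

struct process {
	struct task *leader;       /* thread group leader, may be a zombie */
	struct process *next;      /* next process, NULL at the end */
};

/* Look only at each group leader: misses groups whose leader already exited. */
size_t count_mms_via_leaders(const struct process *p)
{
	size_t found = 0;

	for (; p; p = p->next)
		if (p->leader && p->leader->mm)
			found++;
	return found;
}

/* Look at every thread and take the first non-NULL mm in each group. */
size_t count_mms_via_all_threads(const struct process *p)
{
	size_t found = 0;

	for (; p; p = p->next) {
		const struct task *t;

		for (t = p->leader; t; t = t->next_thread) {
			if (t->mm) {
				found++;
				break;
			}
		}
	}
	return found;
}
```

The second walk costs more, but it needs no extra pointer and therefore has no
lifetime rules that could go stale.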

Thanks,

	Ingo