Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752979AbYLIDWo (ORCPT ); Mon, 8 Dec 2008 22:22:44 -0500 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1751246AbYLIDWg (ORCPT ); Mon, 8 Dec 2008 22:22:36 -0500 Received: from e32.co.us.ibm.com ([32.97.110.150]:55876 "EHLO e32.co.us.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751040AbYLIDWf (ORCPT ); Mon, 8 Dec 2008 22:22:35 -0500 Date: Mon, 8 Dec 2008 19:22:07 -0800 From: Sukadev Bhattiprolu To: Roland McGrath Cc: oleg@redhat.com, ebiederm@xmission.com, daniel@hozac.com, xemul@openvz.org, containers@lists.osdl.org, linux-kernel@vger.kernel.org, sukadev@us.ibm.com Subject: Re: [RFC][PATCH 3/5] Determine if sender is from ancestor ns Message-ID: <20081209032207.GA30016@us.ibm.com> References: <20081126034242.GA23120@us.ibm.com> <20081126034611.GC23238@us.ibm.com> <20081204010636.A8465FC053@magilla.sf.frob.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20081204010636.A8465FC053@magilla.sf.frob.com> X-Operating-System: Linux 2.0.32 on an i486 User-Agent: Mutt/1.5.15+20070412 (2007-04-11) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 5302 Lines: 105 Roland McGrath [roland@redhat.com] wrote: Thanks for the review and your comments. > > I don't quarrel with the intent, but I don't like this approach. > I think we can do it more cleanly. I don't have it all worked out, > but a couple of thoughts come to mind. > > Firstly, it's no problem to split SIGNAL_UNKILLABLE out into multiple > flags. It's quite noninvasive, only signal.c looks at those flags. Then > we can implement the signal magic cleanly keyed on a few separate flags, > using different flag combinations for global init and container inits. Ok. Sounds good. Let say for this discussion, container-inits set SIGNAL_UNKILLABLE_CINIT, global inits continue to set SIGNAL_UNKILLABLE. > > I don't mind a feature that lets an unprivileged container init magically > ignore normal exit signals. That just does something it could do itself > with sigaction, except its /proc/pid/status "SigIgn:" line will lie. Yes, but after Bastian mentioned it recently, I was wondering if we can keep "SigIgn:" honest by tweaking procfs code - i.e have procfs check the SIGNAL_UNKILLABLE flags and include SIG_DFL exit signals in the SigIgn: line. We could include blocked SIG_DFL signals in the SigIgn: line ? Hacky ? > That's OK. Key that on its own flag and set that in both inits. Just > check that flag in prepare_signal() and in get_signal_to_deliver(). > > It's a different story for the signals that cannot be caught, blocked, or > ignored, SIGKILL and SIGSTOP. At least when sent from outside the > container, circumventing either of those would be a privilege escalation > from what's always been understood by admins and so on. So if SIGKILL or SIGSTOP are sent from inside the container, we drop them in prepare_signal() if SIGNAL_UNKILLABLE flags are set. If sent from outside the container, we queue them and get_signal_to_deliver() delivers these unless SIGNAL_UNKILLABLE is set (i.e the signals are delivered to cinits but ignored for global init). > > It's convenient for the implementation that we can treat these differently > from other signals, precisely because they can never be caught, blocked, or > ignored. That is, when we decide in prepare_signal() to queue a SIGKILL or > SIGSTOP, it can never turn out that we're later going to drop it because it > went from caught or ignored or blocked to uncaught, unignored, and > unblocked. (It's only because of that possibility that there is any need > to check for a suppressed exit signal in get_signal_to_deliver() rather > than only in prepare_signal().) That means that when the decision hinges > on the namespace correlation of the sender and receiver, you can check that > when it's handy (current vs p in prepare_signal) rather than trying to > reconstruct the answer for a queued signal. Yes. One thing I am not clear on yet is how we decide in prepare_signal() if it is safe to check the sender's namespace (since caller of send_signal() could be an interrupt handler or workqueue). As I was discussing with Bastian Blank, SI_FROMUSER() is not sufficient since SI_ASYNCIO appears as an user-space signal. Now is SI_ASYNCIO really supposed to be a user signal ? Or like SI_MESGQ and SI_POLL can it be a kernel signal ? If we fix SI_ASYNCIO, can we safely use SI_FROMUSER() for this or could there be other 'user' signals from kernel ? > > Finally, one more thought. This may be moot for the problem at hand if you > take the approach I just suggested, but probably should be fixed anyway. > It seems to me that the si_pid and si_uid fields of siginfo_t ought to be > translated to the namespaces of the receiver. I think it makes most sense > to do this on the front end, i.e. in the callers that fill in the siginfo_t > in the first place (sys_kill et al, or maybe a few layers down?). > Currently it's inconsistent, but mostly wrong. Yes, that makes a lot of sense to push that initialization to the callers. We have been going through and identifying the 'si_pid's that need to be fixed. Nadia sent out one patch for ipc/mqueue.c > do_notify_parent() and > do_notify_parent_cldstop() use the receiver's namespace to compute si_pid, > but the rest of the signal.c routines do not. A free side effect of doing > this is that si_pid for a sender whose PID is not visible to the receiver > (i.e. outside its container) would be distinctively 0 or -1 or something. > (-1 might be the best choice, since si_[pu]id=0 already arises now in case > of signal queue exhaustion and the like.) Hence, possibly one could simply > use si_pid>0 as a "sent from inside the container" check on a queued siginfo_t. Yes, its probably moot with the other approach, but the only drawback with relying on the "->si_pid > 0" check in get_signal_to_deliver() was that allocation of siginfo_t could have failed. In which case it would be unclear whether signal was from parent or same namespace. But if we fix si_pid problems and correctly pass in siginfo_t, can we use "si_pid > 0" check in prepare_signal() to decide whether or not to queue the signal ? Sukadev -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/