From: ebiederm@xmission.com (Eric W. Biederman)
To: Konstantin Khlebnikov
Cc: Linus Torvalds, "H. Peter Anvin", Andy Lutomirski, security@debian.org,
    "security@kernel.org", Al Viro, "security@ubuntu.com >> security",
    Peter Hurley, Serge Hallyn, Willy Tarreau, Aurelien Jarno,
    One Thousand Gnomes, Jann Horn, Greg KH, Linux Kernel Mailing List,
    Jiri Slaby, Florian Weimer
Subject: Re: [PATCH] devpts: Make each mount of devpts an independent filesystem.
Date: Wed, 20 Apr 2016 10:50:21 -0500
In-Reply-To: (Konstantin Khlebnikov's message of "Wed, 20 Apr 2016 18:34:31 +0300")
Message-ID: <87mvoo5lfm.fsf@x220.int.ebiederm.org>
References: <878u0s3orx.fsf_-_@x220.int.ebiederm.org>
    <87bn5ij0x1.fsf@x220.int.ebiederm.org>
    <78205895-E11D-417F-91DC-4BCA0B61A122@zytor.com>
    <570D4781.3070600@zytor.com>
    <877ffyzy1j.fsf_-_@x220.int.ebiederm.org>
    <87twixgsnq.fsf@x220.int.ebiederm.org>
    <87oa95gevf.fsf_-_@x220.int.ebiederm.org>
    <87mvoo8h3d.fsf@x220.int.ebiederm.org>
X-Mailing-List: linux-kernel@vger.kernel.org

Konstantin Khlebnikov writes:

> On Wed, Apr 20, 2016 at 5:55 PM, Eric W. Biederman
> wrote:
>> Linus Torvalds writes:
>>
>>> On Tue, Apr 19, 2016 at 9:36 PM, Konstantin Khlebnikov wrote:
>>>> On Wed, Apr 20, 2016 at 6:04 AM, Eric W. Biederman
>>>>>
>>>>> The kernel.pty.reserve sysctl is neutered with no way currently
>>>>> implemented to be able to use the reserved ptys.
>>>>
>>>> I think we could convert this into reserve for init user namespace,
>>>> ssh in host will work even if containers eaten all ptys.
>>>
>>> Yes. That's basically how it effectively worked before (ie everything
>>> but the initial non-newinstance devpts mount would be limited to the
>>> non-reserved numbers).
>>>
>>> We required the non-init namespaces to do a newinstance mount, so the
>>> whole test for "newinstance" was effectively the same thing as just
>>> checking for the init namespace from a security standpoint.
>>>
>>> And in fact, rewriting it in that form (ie checking for init_ns) would
>>> just make it much more obvious what the intent it.
>>
>> How does this sound.
>>
>> When mounting a devpts filesystem. We look at the caller (aka current)
>> and if we are in the initial mount namespace set a flag in fsi that
>> allows that instance of devpts to draw into the reserve pool.
>
> Maybe just check current user namespace when task opens /dev/ptmx?
>
> IIRR now check looks like: count < limit - (newinstance ? reserved : 0).
> So, it will be count < limit - (current_in_init_userns ? 0 : newinstance).

Looking at the current user namespace really is not enough. Lots of
container solutions, at least historically (which means deployed right
now), don't use the user namespace.

I can see an argument to make the check "capable(CAP_SYS_RESOURCE)",
although for pty applications I don't know if that is particularly
meaningful.

I am a little dubious of making it a check at allocation time rather
than at mount time. The issue is that tty allocation is an unprivileged
operation. I expect applications such as sshd (the one that really
matters) will have dropped privileges by the time they allocate a pty.

So I feel much more comfortable with a model where things are arranged
so that applications within access of the devpts filesystem can use it
(and are not limited), and applications not in range can't. Roughly the
authenticate-at-open-time model, which is also what devpts implements
today.

There is also the question of how things should work if you are running
on a system where every new daemon and every new login is in its own
mount namespace, allowing each of these to have a distinct /tmp
directory. I believe systemd systems are well on their way to doing
that today.
As such it does not seem appropriate to check the mount namespace of the
opener of the tty.

Who knows, it may not be long until the pty master lives in some very
tight bubble where it can barely do anything (as that is the program
that talks to the network) and user namespaces are used as part of the
enforcement of that.

For all of those reasons, a permission check in devpts_pty_new seems
like the wrong place.

Eric
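
For illustration only, here is a rough userspace sketch of the
mount-time model described above. It is not kernel code: the names
pts_instance, pts_mount, pts_alloc and pts_fill are hypothetical, and
PTY_LIMIT and PTY_RESERVE are small made-up values standing in for the
kernel.pty.max and kernel.pty.reserve sysctls. The point is that the
namespace of the mounter is examined once, and allocation then looks
only at the instance, never at the opener.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative values only; stand-ins for the kernel.pty.max and
 * kernel.pty.reserve sysctls. */
enum { PTY_LIMIT = 8, PTY_RESERVE = 2 };

/* Global count of allocated ptys, shared by every devpts instance. */
static int pty_count;

/* Hypothetical per-mount state, not the kernel's struct pts_fs_info. */
struct pts_instance {
	bool may_use_reserve;	/* fixed once, at mount time */
};

/* Mount time: the mounter's namespace is examined exactly once. */
static struct pts_instance pts_mount(bool mounter_in_init_mnt_ns)
{
	struct pts_instance fsi = { .may_use_reserve = mounter_in_init_mnt_ns };

	return fsi;
}

/* Allocation time: only the instance matters, not the opener's
 * credentials, so a daemon that has dropped privileges is unaffected. */
static bool pts_alloc(const struct pts_instance *fsi)
{
	int usable = PTY_LIMIT - (fsi->may_use_reserve ? 0 : PTY_RESERVE);

	assert(PTY_RESERVE <= PTY_LIMIT);
	if (pty_count >= usable)
		return false;
	pty_count++;
	return true;
}

/* Allocate from one instance until refused; returns how many were granted. */
static int pts_fill(const struct pts_instance *fsi)
{
	int granted = 0;

	while (pts_alloc(fsi))
		granted++;
	return granted;
}
```

With these made-up numbers, an instance mounted from a container can
take ptys only until the non-reserved part of the pool is gone, while an
instance mounted from the initial mount namespace can still allocate
from the reserve afterwards, whatever privileges the opening process
holds at that point.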