From: Daniel Walsh <dwalsh@redhat.com>
Organization: Red Hat
Subject: Re: [BUG]kernel softlockup due to sidtab_search_context run for long time because of too many sidtab context node
To: Stephen Smalley, yangjihong, Casey Schaufler, paul@paul-moore.com, eparis@parisplace.org, selinux@tycho.nsa.gov, Lukas Vrabec, Petr Lautrbach
Cc: linux-kernel@vger.kernel.org
Date: Fri, 15 Dec 2017 09:50:44 -0500

On 12/15/2017 08:56 AM, Stephen Smalley wrote:
> On Fri, 2017-12-15 at 03:09 +0000, yangjihong wrote:
>> On 12/15/2017 10:31 PM, yangjihong wrote:
>>> On 12/14/2017 12:42 PM, Casey Schaufler wrote:
>>>> On 12/14/2017 9:15 AM, Stephen Smalley wrote:
>>>>> On Thu, 2017-12-14 at 09:00 -0800, Casey Schaufler wrote:
>>>>>> On 12/14/2017 8:42 AM, Stephen Smalley wrote:
>>>>>>> On Thu, 2017-12-14 at 08:18 -0800, Casey Schaufler wrote:
>>>>>>>> On 12/13/2017 7:18 AM, Stephen Smalley wrote:
>>>>>>>>> On Wed, 2017-12-13 at 09:25 +0000, yangjihong wrote:
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> I am doing stress testing on the 3.10 kernel (CentOS 7.4), constantly starting a number of docker containers with SELinux enabled. After about 2 days, the kernel panics with a softlockup:
>>>>>>>>>>
>>>>>>>>>> [] sched_show_task+0xb8/0x120
>>>>>>>>>> [] show_lock_info+0x20f/0x3a0
>>>>>>>>>> [] watchdog_timer_fn+0x1da/0x2f0
>>>>>>>>>> [] ? watchdog_enable_all_cpus.part.4+0x40/0x40
>>>>>>>>>> [] __hrtimer_run_queues+0xd2/0x260
>>>>>>>>>> [] hrtimer_interrupt+0xb0/0x1e0
>>>>>>>>>> [] local_apic_timer_interrupt+0x37/0x60
>>>>>>>>>> [] smp_apic_timer_interrupt+0x50/0x140
>>>>>>>>>> [] apic_timer_interrupt+0x6d/0x80
>>>>>>>>>> [] ? sidtab_context_to_sid+0xb3/0x480
>>>>>>>>>> [] ? sidtab_context_to_sid+0x110/0x480
>>>>>>>>>> [] ? mls_setup_user_range+0x145/0x250
>>>>>>>>>> [] security_get_user_sids+0x3f7/0x550
>>>>>>>>>> [] sel_write_user+0x12b/0x210
>>>>>>>>>> [] ? sel_write_member+0x200/0x200
>>>>>>>>>> [] selinux_transaction_write+0x48/0x80
>>>>>>>>>> [] vfs_write+0xbd/0x1e0
>>>>>>>>>> [] SyS_write+0x7f/0xe0
>>>>>>>>>> [] system_call_fastpath+0x16/0x1b
>>>>>>>>>>
>>>>>>>>>> My opinion:
>>>>>>>>>> When a docker container starts, it mounts an overlay filesystem with a different SELinux context; the mount points look like:
>>>>>>>>>>
>>>>>>>>>> overlay on /var/lib/docker/overlay2/be3ef517730d92fc4530e0e952eae4f6cb0f07b4bc326cb07495ca08fc9ddb66/merged type overlay (rw,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c414,c873",lowerdir=/var/lib/docker/overlay2/l/Z4U7WY6ASNV5CFWLADPARHHWY7:/var/lib/docker/overlay2/l/V2S3HOKEFEOQLHBVAL5WLA3YLS:/var/lib/docker/overlay2/l/46YGYO474KLOULZGDSZDW2JPRI,upperdir=/var/lib/docker/overlay2/be3ef517730d92fc4530e0e952eae4f6cb0f07b4bc326cb07495ca08fc9ddb66/diff,workdir=/var/lib/docker/overlay2/be3ef517730d92fc4530e0e952eae4f6cb0f07b4bc326cb07495ca08fc9ddb66/work)
>>>>>>>>>> shm on /var/lib/docker/containers/9fd65e177d2132011d7b422755793449c91327ca577b8f5d9d6a4adf218d4876/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c414,c873",size=65536k)
>>>>>>>>>> overlay on /var/lib/docker/overlay2/38d1544d080145c7d76150530d0255991dfb7258cbca14ff6d165b94353eefab/merged type overlay (rw,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c431,c651",lowerdir=/var/lib/docker/overlay2/l/3MQQXB4UCLFB7ANVRHPAVRCRSS:/var/lib/docker/overlay2/l/46YGYO474KLOULZGDSZDW2JPRI,upperdir=/var/lib/docker/overlay2/38d1544d080145c7d76150530d0255991dfb7258cbca14ff6d165b94353eefab/diff,workdir=/var/lib/docker/overlay2/38d1544d080145c7d76150530d0255991dfb7258cbca14ff6d165b94353eefab/work)
>>>>>>>>>> shm on /var/lib/docker/containers/662e7f798fc08b09eae0f0f944537a4bcedc1dcf05a65866458523ffd4a71614/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c431,c651",size=65536k)
>>>>>>>>>>
>>>>>>>>>> sidtab_search_context checks whether the context is already in the sidtab list; if it is not found, a new node is generated and inserted into the list. As the number of containers increases, so does the number of context nodes. In our test the final number of nodes reached 300,000+, and sidtab_context_to_sid needed 100-200 ms per call, which leads to the system softlockup.
>>>>>>>>>>
>>>>>>>>>> Is this an SELinux bug? Why is the context node not deleted when the filesystem is unmounted? I cannot find a function in sidtab.c that deletes a node.
>>>>>>>>>>
>>>>>>>>>> Thanks for reading and looking forward to your reply.
>>>>>>>>> So, does docker just keep allocating a unique category set for every new container, never reusing them even if the container is destroyed? That would be a bug in docker IMHO. Or are you creating an unbounded number of containers and never destroying the older ones?
>>>>>>>> You can't reuse the security context. A process in ContainerA sends a labeled packet to MachineB. ContainerA goes away and its context is recycled in ContainerC. MachineB responds some time later, again with a labeled packet. ContainerC gets information intended for ContainerA, and uses the information to take over the Elbonian government.
>>>>>>> Docker isn't using labeled networking (nor is anything else by default; it is only enabled if explicitly configured).
>>>>>> If labeled networking weren't an issue we'd have full security module stacking by now. Yes, it's an edge case. If you want to use labeled NFS or a local filesystem that gets mounted in each container (don't tell me that nobody would do that) you've got the same problem.
>>>>> Even if someone were to configure labeled networking, Docker is not presently relying on that or SELinux network enforcement for any security properties, so it really doesn't matter.
>>>> True enough. I can imagine a use case, but as you point out, it would be a very complex configuration and coordination exercise using SELinux.
>>>>> And if they wanted to do that, they'd have to coordinate category assignments across all systems involved, for which no facility exists AFAIK. If you have two docker instances running on different hosts, I'd wager that they can hand out the same category sets today to different containers.
>>>>>
>>>>> With respect to labeled NFS, that's also not the default for nfs mounts, so again it is a custom configuration and Docker isn't relying on it for any guarantees today. For local filesystems, they would normally be context-mounted or using genfscon rather than xattrs in order to be accessible to the container, thus no persistent storage of the category sets.
>>> Well Kubernetes and OpenShift do set the labels to be the same within a project, and they can manage across nodes. But yes we are not using labeled networking at this point.
>>>> I know that is the intended configuration, but I see people do all sorts of stoopid things for what they believe are good reasons. Unfortunately, lots of people count on containers to provide isolation, but create "solutions" for data sharing that defeat it.
>>>>> Certainly docker could provide an option to not reuse category sets, but making that the default is not sane and just guarantees exhaustion of the SID and context space (just create and tear down lots of containers every day or more frequently).
>>>> It seems that Docker might have a similar issue with UIDs, but it takes longer to run out of UIDs than sidtab entries.
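
A rough picture of the behaviour described above: the reverse lookup (context string to SID) walks a flat list, so every miss scans all existing nodes before a new node is added. The sketch below is simplified userspace C with invented names, not the kernel's sidtab.c.

/*
 * Simplified userspace sketch of the pattern described above: a reverse
 * lookup (context string -> SID) over a plain linked list, where every
 * miss walks all existing nodes before a new one is added.  All names are
 * invented for illustration; this is not the kernel's sidtab.c.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct sid_node {
	unsigned int sid;
	char *context;          /* e.g. "...:svirt_sandbox_file_t:s0:c414,c873" */
	struct sid_node *next;
};

static struct sid_node *head;
static unsigned int next_sid = 1;

static unsigned int context_to_sid(const char *context)
{
	struct sid_node *n;

	for (n = head; n; n = n->next)          /* O(number of nodes) */
		if (!strcmp(n->context, context))
			return n->sid;

	n = malloc(sizeof(*n));                 /* miss: allocate a new node */
	n->sid = next_sid++;
	n->context = strdup(context);
	n->next = head;
	head = n;
	return n->sid;
}

int main(void)
{
	char ctx[128];

	/* Each new container brings a context never seen before, so every
	 * lookup is a full scan; the report above saw ~300,000 nodes and
	 * 100-200 ms per sidtab_context_to_sid call. */
	for (int i = 0; i < 30000; i++) {
		snprintf(ctx, sizeof(ctx),
			 "system_u:object_r:svirt_sandbox_file_t:s0:c%d,c%d",
			 i % 1024, i / 1024);
		context_to_sid(ctx);
	}
	printf("nodes: %u\n", next_sid - 1);
	return 0;
}

With N existing nodes a miss costs N string comparisons, so accumulating 300,000 unique contexts costs on the order of N^2/2 comparisons overall, and each new insertion only gets slower.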
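
What "reusing category sets" amounts to on the runtime side can be pictured as a small allocator like the sketch below. This is purely illustrative userspace C with invented names, not Docker's or container-selinux's actual code: a pair such as c414,c873 is drawn from a pool and returned to it when the container is removed, so the set of distinct contexts the kernel ever sees stays bounded.

/*
 * Illustrative sketch only: one way a container runtime could hand out and
 * reuse MCS category pairs (e.g. "s0:c414,c873").  Invented code, not
 * Docker's implementation.
 */
#include <stdbool.h>
#include <stdio.h>

#define NCATS 1024                      /* categories c0 .. c1023 */

static bool in_use[NCATS][NCATS];

/* Allocate the first free pair c1 < c2; returns 0 on success. */
static int alloc_mcs_pair(int *c1, int *c2)
{
	for (int a = 0; a < NCATS; a++)
		for (int b = a + 1; b < NCATS; b++)
			if (!in_use[a][b]) {
				in_use[a][b] = true;
				*c1 = a;
				*c2 = b;
				return 0;
			}
	return -1;                      /* pool exhausted */
}

/* Called when a container is removed: its pair goes back to the pool. */
static void free_mcs_pair(int c1, int c2)
{
	in_use[c1][c2] = false;
}

int main(void)
{
	int a, b;

	if (alloc_mcs_pair(&a, &b) == 0)
		printf("container file context: "
		       "system_u:object_r:svirt_sandbox_file_t:s0:c%d,c%d\n", a, b);
	free_mcs_pair(a, b);            /* reuse keeps the kernel's sidtab bounded */
	return 0;
}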
>>>>>>>>> On the selinux userspace side, we'd also like to eliminate the use of /sys/fs/selinux/user (sel_write_user -> security_get_user_sids) entirely, which is what triggered this for you.
>>>>>>>>>
>>>>>>>>> We cannot currently delete a sidtab node because we have no way of knowing if there are any lingering references to the SID. Fixing that would require reference-counted SIDs, which goes beyond just SELinux since SIDs/secids are returned by LSM hooks and cached in other kernel data structures.
>>>>>>>> You could delete a sidtab node. The code already deals with unfindable SIDs. The issue is that eventually you run out of SIDs. Then you are forced to recycle SIDs, which leads to the overthrow of the Elbonian government.
>>>>>>> We don't know when we can safely delete a sidtab node since SIDs aren't reference counted and we can't know whether it is still in use somewhere in the kernel. Doing so prematurely would lead to the SID being remapped to the unlabeled context, and then likely to undesired denials.
>>>>>> I would suggest that if you delete a sidtab node and someone comes along later and tries to use it that denial is exactly what you would desire. I don't see any other rational action.
>>>>> Yes, if we know that the SID wasn't in use at the time we tore it down. But if we're just randomly deleting sidtab entries based on age or something (since we have no reference count), we'll almost certainly encounter situations where a SID hasn't been accessed in a long time but is still being legitimately cached somewhere. Just a file that hasn't been accessed in a while might have that SID still cached in its inode security blob, or anywhere else.
>>>>>>>>> sidtab_search_context() could no doubt be optimized for the negative case; there was an earlier optimization for the positive case by adding a cache to sidtab_context_to_sid() prior to calling it. It's a reverse lookup in the sidtab.
>>>>>>>> This seems like a bad idea.
>>>>>>> Not sure what you mean, but it can certainly be changed to at least use a hash table for these reverse lookups.
>> Thanks for the reply and discussion.
>> I think the docker container is only one case. Is it possible that something similar could be triggered deliberately, through some means of attack that keeps growing the SID list until the system panics?
>>
>> I think the issue is that it takes too long to search for a SID node when the SID list is too large. If the node data structure (e.g. a tree) or the search algorithm were optimized so that traversing the nodes stays fast even when there are many of them, that might solve the problem. Alternatively, sidtab.c could provide a "delete_sidtab_node" interface: when a filesystem is unmounted its SIDs are useless, so deleting them at that point would bound the size of the SID list.
>>
>> Thanks for reading and looking forward to your reply.
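
As for what the "delete on umount" idea would require first, here is an illustrative sketch of reference-counted entries. This is invented userspace code, not something SELinux implements today; the hard part (not shown) is that every kernel holder of a secid, such as an inode security blob or an audit record, would have to take and drop a reference.

/*
 * Illustration only: the kind of reference counting the thread says would
 * be needed before sidtab entries could be deleted safely.  Invented,
 * userspace-style code.
 */
#include <stdlib.h>
#include <string.h>

struct sid_entry {
	unsigned int sid;
	char *context;
	unsigned long refcount;         /* the kernel would use refcount_t */
};

static struct sid_entry *sid_get(struct sid_entry *e)
{
	e->refcount++;
	return e;
}

/* Deletion is only safe once the last reference is dropped. */
static void sid_put(struct sid_entry *e)
{
	if (--e->refcount == 0) {
		/* real code would also unlink the entry under the sidtab lock */
		free(e->context);
		free(e);
	}
}

int main(void)
{
	struct sid_entry *e = malloc(sizeof(*e));

	e->sid = 1234;
	e->context = strdup("system_u:object_r:svirt_sandbox_file_t:s0:c414,c873");
	e->refcount = 1;                /* reference held by the sidtab itself */

	sid_get(e);                     /* e.g. an inode caches this secid */
	sid_put(e);                     /* the inode is evicted */
	sid_put(e);                     /* umount drops the last ref: now safe to free */
	return 0;
}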
> We cannot safely delete entries in the sidtab without first adding reference counting of SIDs, which goes beyond just SELinux since they are cached in other kernel data structures and returned by LSM hooks. That's a non-trivial undertaking.
>
> Far more practical in the near term would be to introduce a hash table or other mechanism for efficient reverse lookups in the sidtab. Are you offering to implement that or just requesting it?
>
> Independent of that, docker should support reuse of category sets when containers are deleted, at least as an option and probably as the default.

Docker does reuse categories of containers that are removed, by default.
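
The near-term option mentioned above, keying the reverse lookup by a hash of the context so a lookup touches a single bucket instead of the whole table, might look roughly like the following simplified userspace sketch (invented names, not the kernel implementation).

/*
 * Sketch of a hash-keyed context -> SID reverse lookup, along the lines
 * suggested above.  Userspace, invented names; not the kernel's sidtab.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 4096

struct sid_node {
	unsigned int sid;
	char *context;
	struct sid_node *next;          /* per-bucket chain */
};

static struct sid_node *buckets[NBUCKETS];
static unsigned int next_sid = 1;

static unsigned int hash_context(const char *s)
{
	unsigned int h = 5381;          /* djb2 string hash */

	while (*s)
		h = h * 33 + (unsigned char)*s++;
	return h % NBUCKETS;
}

static unsigned int context_to_sid(const char *context)
{
	unsigned int b = hash_context(context);
	struct sid_node *n;

	for (n = buckets[b]; n; n = n->next)    /* scan one bucket, not the whole table */
		if (!strcmp(n->context, context))
			return n->sid;

	n = malloc(sizeof(*n));                 /* miss: insert into this bucket */
	n->sid = next_sid++;
	n->context = strdup(context);
	n->next = buckets[b];
	buckets[b] = n;
	return n->sid;
}

int main(void)
{
	const char *ctx = "system_u:object_r:svirt_sandbox_file_t:s0:c414,c873";

	printf("sid: %u\n", context_to_sid(ctx));
	printf("sid: %u\n", context_to_sid(ctx));       /* same context, same SID */
	return 0;
}

With roughly 300,000 contexts spread over 4,096 buckets, a miss would touch on the order of 75 nodes instead of all 300,000.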