From: Daniel Walsh <dwalsh@redhat.com>
Subject: Re: [BUG] kernel softlockup due to sidtab_search_context run for long time because of too many sidtab context node
To: Casey Schaufler, Stephen Smalley, yangjihong, paul@paul-moore.com, eparis@parisplace.org, selinux@tycho.nsa.gov, Lukas Vrabec, Petr Lautrbach
Cc: linux-kernel@vger.kernel.org
Date: Thu, 14 Dec 2017 13:11:46 -0500
Message-ID: <196418a9-9c9d-fdb5-f5a1-9abc391adc83@redhat.com>
In-Reply-To: <33e2eb10-4acc-f9db-a87d-ce63b5b48c1e@schaufler-ca.com>

On 12/14/2017 12:42 PM, Casey Schaufler wrote:
> On 12/14/2017 9:15 AM, Stephen Smalley wrote:
>> On Thu, 2017-12-14 at 09:00 -0800, Casey Schaufler wrote:
>>> On 12/14/2017 8:42 AM, Stephen Smalley wrote:
>>>> On Thu, 2017-12-14 at 08:18 -0800, Casey Schaufler wrote:
>>>>> On 12/13/2017 7:18 AM, Stephen Smalley wrote:
>>>>>> On Wed, 2017-12-13 at 09:25 +0000, yangjihong wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>>> I am doing stress testing on a 3.10 kernel (CentOS 7.4),
>>>>>>> constantly starting a number of docker containers with SELinux
>>>>>>> enabled, and after about 2 days the kernel panics with a
>>>>>>> softlockup:
>>>>>>>
>>>>>>> [] sched_show_task+0xb8/0x120
>>>>>>> [] show_lock_info+0x20f/0x3a0
>>>>>>> [] watchdog_timer_fn+0x1da/0x2f0
>>>>>>> [] ? watchdog_enable_all_cpus.part.4+0x40/0x40
>>>>>>> [] __hrtimer_run_queues+0xd2/0x260
>>>>>>> [] hrtimer_interrupt+0xb0/0x1e0
>>>>>>> [] local_apic_timer_interrupt+0x37/0x60
>>>>>>> [] smp_apic_timer_interrupt+0x50/0x140
>>>>>>> [] apic_timer_interrupt+0x6d/0x80
>>>>>>> [] ? sidtab_context_to_sid+0xb3/0x480
>>>>>>> [] ? sidtab_context_to_sid+0x110/0x480
>>>>>>> [] ? mls_setup_user_range+0x145/0x250
>>>>>>> [] security_get_user_sids+0x3f7/0x550
>>>>>>> [] sel_write_user+0x12b/0x210
>>>>>>> [] ? sel_write_member+0x200/0x200
>>>>>>> [] selinux_transaction_write+0x48/0x80
>>>>>>> [] vfs_write+0xbd/0x1e0
>>>>>>> [] SyS_write+0x7f/0xe0
>>>>>>> [] system_call_fastpath+0x16/0x1b
>>>>>>>
>>>>>>> My opinion:
>>>>>>> When a docker container starts, it mounts overlay filesystems
>>>>>>> with a different SELinux context each time; the mount points
>>>>>>> look like:
>>>>>>>
>>>>>>> overlay on /var/lib/docker/overlay2/be3ef517730d92fc4530e0e952eae4f6cb0f07b4bc326cb07495ca08fc9ddb66/merged
>>>>>>> type overlay (rw,relatime,
>>>>>>> context="system_u:object_r:svirt_sandbox_file_t:s0:c414,c873",
>>>>>>> lowerdir=/var/lib/docker/overlay2/l/Z4U7WY6ASNV5CFWLADPARHHWY7:/var/lib/docker/overlay2/l/V2S3HOKEFEOQLHBVAL5WLA3YLS:/var/lib/docker/overlay2/l/46YGYO474KLOULZGDSZDW2JPRI,
>>>>>>> upperdir=/var/lib/docker/overlay2/be3ef517730d92fc4530e0e952eae4f6cb0f07b4bc326cb07495ca08fc9ddb66/diff,
>>>>>>> workdir=/var/lib/docker/overlay2/be3ef517730d92fc4530e0e952eae4f6cb0f07b4bc326cb07495ca08fc9ddb66/work)
>>>>>>>
>>>>>>> shm on /var/lib/docker/containers/9fd65e177d2132011d7b422755793449c91327ca577b8f5d9d6a4adf218d4876/shm
>>>>>>> type tmpfs (rw,nosuid,nodev,noexec,relatime,
>>>>>>> context="system_u:object_r:svirt_sandbox_file_t:s0:c414,c873",size=65536k)
>>>>>>>
>>>>>>> overlay on /var/lib/docker/overlay2/38d1544d080145c7d76150530d0255991dfb7258cbca14ff6d165b94353eefab/merged
>>>>>>> type overlay (rw,relatime,
>>>>>>> context="system_u:object_r:svirt_sandbox_file_t:s0:c431,c651",
>>>>>>> lowerdir=/var/lib/docker/overlay2/l/3MQQXB4UCLFB7ANVRHPAVRCRSS:/var/lib/docker/overlay2/l/46YGYO474KLOULZGDSZDW2JPRI,
>>>>>>> upperdir=/var/lib/docker/overlay2/38d1544d080145c7d76150530d0255991dfb7258cbca14ff6d165b94353eefab/diff,
>>>>>>> workdir=/var/lib/docker/overlay2/38d1544d080145c7d76150530d0255991dfb7258cbca14ff6d165b94353eefab/work)
>>>>>>>
>>>>>>> shm on /var/lib/docker/containers/662e7f798fc08b09eae0f0f944537a4bcedc1dcf05a65866458523ffd4a71614/shm
>>>>>>> type tmpfs (rw,nosuid,nodev,noexec,relatime,
>>>>>>> context="system_u:object_r:svirt_sandbox_file_t:s0:c431,c651",size=65536k)
>>>>>>>
>>>>>>> sidtab_search_context checks whether the context is already in
>>>>>>> the sidtab list; if it is not found, a new node is generated and
>>>>>>> inserted into the list. As the number of containers grows, so
>>>>>>> does the number of context nodes. In our testing the final node
>>>>>>> count reached 300,000+, at which point sidtab_context_to_sid
>>>>>>> needs 100-200ms per call, which leads to the system softlockup.
>>>>>>>
>>>>>>> Is this a SELinux bug? When a filesystem is unmounted, why is
>>>>>>> its context node not deleted? I cannot find the relevant
>>>>>>> function to delete nodes in sidtab.c.
>>>>>>>
>>>>>>> Thanks for reading and looking forward to your reply.
>>>>>> So, does docker just keep allocating a unique category set for
>>>>>> every new container, never reusing them even if the container is
>>>>>> destroyed? That would be a bug in docker IMHO. Or are you
>>>>>> creating an unbounded number of containers and never destroying
>>>>>> the older ones?
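As an aside for readers following along, here is a simplified,
self-contained model of the reverse lookup described above. It is only
an illustration of the cost, not the actual kernel code; the types,
table size, and string compare are placeholders. The forward map
(SID -> context) can hash on the SID, but a context -> SID lookup has
nothing to hash on, so it walks every chain in every slot:

#include <string.h>

#define SIDTAB_SIZE 128                 /* hash slots, keyed by SID */

struct context { char str[64]; };       /* stand-in for the MLS context */

struct sidtab_node {
        unsigned int sid;
        struct context context;
        struct sidtab_node *next;
};

struct sidtab { struct sidtab_node *htable[SIDTAB_SIZE]; };

/* O(total nodes): with 300,000+ contexts this is the 100-200ms walk. */
static unsigned int sidtab_search_context(struct sidtab *s,
                                          struct context *ctx)
{
        int i;

        for (i = 0; i < SIDTAB_SIZE; i++) {
                struct sidtab_node *cur;

                for (cur = s->htable[i]; cur; cur = cur->next)
                        if (strcmp(cur->context.str, ctx->str) == 0)
                                return cur->sid;  /* already mapped */
        }
        return 0;       /* not found; caller allocates a new SID */
}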
>>>>> You can't reuse the security context. A process in ContainerA
>>>>> sends a labeled packet to MachineB. ContainerA goes away and its
>>>>> context is recycled in ContainerC. MachineB responds some time
>>>>> later, again with a labeled packet. ContainerC gets information
>>>>> intended for ContainerA, and uses the information to take over
>>>>> the Elbonian government.
>>>> Docker isn't using labeled networking (nor is anything else by
>>>> default; it is only enabled if explicitly configured).
>>> If labeled networking weren't an issue we'd have full security
>>> module stacking by now. Yes, it's an edge case. If you want to
>>> use labeled NFS or a local filesystem that gets mounted in each
>>> container (don't tell me that nobody would do that) you've got
>>> the same problem.
>> Even if someone were to configure labeled networking, Docker is not
>> presently relying on that or SELinux network enforcement for any
>> security properties, so it really doesn't matter.
> True enough. I can imagine a use case, but as you point out, it
> would be a very complex configuration and coordination exercise
> using SELinux.
>
>> And if they wanted to do that, they'd have to coordinate category
>> assignments across all systems involved, for which no facility
>> exists AFAIK. If you have two docker instances running on different
>> hosts, I'd wager that they can hand out the same category sets today
>> to different containers.
>>
>> With respect to labeled NFS, that's also not the default for nfs
>> mounts, so again it is a custom configuration and Docker isn't
>> relying on it for any guarantees today. For local filesystems, they
>> would normally be context-mounted or using genfscon rather than
>> xattrs in order to be accessible to the container, thus no
>> persistent storage of the category sets.

Well, Kubernetes and OpenShift do set the labels to be the same within
a project, and they can manage that across nodes. But yes, we are not
using labeled networking at this point.

> I know that is the intended configuration, but I see people do
> all sorts of stoopid things for what they believe are good reasons.
> Unfortunately, lots of people count on containers to provide
> isolation, but create "solutions" for data sharing that defeat it.
>
>> Certainly docker could provide an option to not reuse category sets,
>> but making that the default is not sane and just guarantees
>> exhaustion of the SID and context space (just create and tear down
>> lots of containers every day or more frequently).
> It seems that Docker might have a similar issue with UIDs,
> but it takes longer to run out of UIDs than sidtab entries.
>
>>>>>> On the selinux userspace side, we'd also like to eliminate the
>>>>>> use of /sys/fs/selinux/user (sel_write_user ->
>>>>>> security_get_user_sids) entirely, which is what triggered this
>>>>>> for you.
>>>>>>
>>>>>> We cannot currently delete a sidtab node because we have no way
>>>>>> of knowing if there are any lingering references to the SID.
>>>>>> Fixing that would require reference-counted SIDs, which goes
>>>>>> beyond just SELinux since SIDs/secids are returned by LSM hooks
>>>>>> and cached in other kernel data structures.
>>>>> You could delete a sidtab node. The code already deals with
>>>>> unfindable SIDs. The issue is that eventually you run out of
>>>>> SIDs. Then you are forced to recycle SIDs, which leads to the
>>>>> overthrow of the Elbonian government.
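To put rough numbers on the exhaustion point (my arithmetic, assuming
the container-selinux defaults): each container gets an unordered pair
of distinct MCS categories drawn from c0-c1023, which allows at most
1024 * 1023 / 2 = 523,776 unique category pairs. At the churn rate
reported above (300,000+ contexts in roughly two days), never reusing
a category set would exhaust that entire space in well under a week,
which is why reuse has to be the default.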
>>>> We don't know when we can safely delete a sidtab node since SIDs
>>>> aren't reference counted and we can't know whether it is still in
>>>> use somewhere in the kernel. Doing so prematurely would lead to
>>>> the SID being remapped to the unlabeled context, and then likely
>>>> to undesired denials.
>>> I would suggest that if you delete a sidtab node and someone
>>> comes along later and tries to use it that denial is exactly
>>> what you would desire. I don't see any other rational action.
>> Yes, if we know that the SID wasn't in use at the time we tore it
>> down. But if we're just randomly deleting sidtab entries based on
>> age or something (since we have no reference count), we'll almost
>> certainly encounter situations where a SID hasn't been accessed in a
>> long time but is still being legitimately cached somewhere. Just a
>> file that hasn't been accessed in a while might have that SID still
>> cached in its inode security blob, or anywhere else.
>>
>>>>>> sidtab_search_context() could no doubt be optimized for the
>>>>>> negative case; there was an earlier optimization for the
>>>>>> positive case by adding a cache to sidtab_context_to_sid() prior
>>>>>> to calling it. It's a reverse lookup in the sidtab.
>>>>> This seems like a bad idea.
>>>> Not sure what you mean, but it can certainly be changed to at
>>>> least use a hash table for these reverse lookups.
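Sketching what that might look like, purely as an illustration (the
names, table size, and hash below are placeholders of mine, not an
actual kernel patch): keep a second table keyed by a hash of the
context, so the context -> SID direction stops being a linear scan.

#include <string.h>

#define RSIDTAB_SLOTS 1024

struct rnode {
        unsigned int sid;
        char context[64];       /* stand-in for the real context struct */
        struct rnode *next;
};

static struct rnode *rhtable[RSIDTAB_SLOTS];

static unsigned int context_hash(const char *ctx)
{
        unsigned int h = 5381;  /* djb2 string hash */

        while (*ctx)
                h = h * 33 + (unsigned char)*ctx++;
        return h % RSIDTAB_SLOTS;
}

/* Average O(1) per lookup instead of O(total contexts). */
static unsigned int reverse_lookup(const char *ctx)
{
        struct rnode *cur;

        for (cur = rhtable[context_hash(ctx)]; cur; cur = cur->next)
                if (strcmp(cur->context, ctx) == 0)
                        return cur->sid;
        return 0;       /* not found; caller inserts a new node here */
}

Insertion would have to add each new node to both the SID-keyed table
and this one under the same lock, and a real patch would hash the
structured context rather than a string, but the shape is the same.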