Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753997AbdLNRmh (ORCPT ); Thu, 14 Dec 2017 12:42:37 -0500
Received: from sonic304-17.consmr.mail.bf2.yahoo.com ([74.6.128.40]:44937
	"EHLO sonic304-17.consmr.mail.bf2.yahoo.com" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S1753689AbdLNRme (ORCPT );
	Thu, 14 Dec 2017 12:42:34 -0500
Subject: Re: [BUG]kernel softlockup due to sidtab_search_context run for long time because of too many sidtab context node
To: Stephen Smalley, yangjihong, "paul@paul-moore.com",
	"eparis@parisplace.org", "selinux@tycho.nsa.gov", Daniel J Walsh,
	Lukas Vrabec, Petr Lautrbach
Cc: "linux-kernel@vger.kernel.org"
References: <1BC3DBD98AD61A4A9B2569BC1C0B4437D5D1F3@DGGEMM506-MBS.china.huawei.com>
	<1513178296.19161.8.camel@tycho.nsa.gov>
	<23c51943-51a4-4478-760f-375d02caa39b@schaufler-ca.com>
	<1513269771.18008.6.camel@tycho.nsa.gov>
	<79e41bd9-2570-7386-d462-d242a18fb786@schaufler-ca.com>
	<1513271755.18008.11.camel@tycho.nsa.gov>
From: Casey Schaufler
Message-ID: <33e2eb10-4acc-f9db-a87d-ce63b5b48c1e@schaufler-ca.com>
Date: Thu, 14 Dec 2017 09:42:28 -0800
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Thunderbird/52.5.0
MIME-Version: 1.0
In-Reply-To: <1513271755.18008.11.camel@tycho.nsa.gov>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Content-Language: en-US
Sender: 
linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 9731
Lines: 241

On 12/14/2017 9:15 AM, Stephen Smalley wrote:
> On Thu, 2017-12-14 at 09:00 -0800, Casey Schaufler wrote:
>> On 12/14/2017 8:42 AM, Stephen Smalley wrote:
>>> On Thu, 2017-12-14 at 08:18 -0800, Casey Schaufler wrote:
>>>> On 12/13/2017 7:18 AM, Stephen Smalley wrote:
>>>>> On Wed, 2017-12-13 at 09:25 +0000, yangjihong wrote:
>>>>>> Hello,
>>>>>>
>>>>>> I am doing stressing testing on 3.10 kernel(centos 7.4), to
>>>>>> constantly starting numbers of docker ontainers with selinux
>>>>>> enabled, and after about 2 days, the kernel softlockup panic:
>>>>>>  [] sched_show_task+0xb8/0x120
>>>>>>  [] show_lock_info+0x20f/0x3a0
>>>>>>  [] watchdog_timer_fn+0x1da/0x2f0
>>>>>>  [] ? watchdog_enable_all_cpus.part.4+0x40/0x40
>>>>>>  [] __hrtimer_run_queues+0xd2/0x260
>>>>>>  [] hrtimer_interrupt+0xb0/0x1e0
>>>>>>  [] local_apic_timer_interrupt+0x37/0x60
>>>>>>  [] smp_apic_timer_interrupt+0x50/0x140
>>>>>>  [] apic_timer_interrupt+0x6d/0x80
>>>>>>  [] ? sidtab_context_to_sid+0xb3/0x480
>>>>>>  [] ? sidtab_context_to_sid+0x110/0x480
>>>>>>  [] ? mls_setup_user_range+0x145/0x250
>>>>>>  [] security_get_user_sids+0x3f7/0x550
>>>>>>  [] sel_write_user+0x12b/0x210
>>>>>>  [] ? sel_write_member+0x200/0x200
>>>>>>  [] selinux_transaction_write+0x48/0x80
>>>>>>  [] vfs_write+0xbd/0x1e0
>>>>>>  [] SyS_write+0x7f/0xe0
>>>>>>  [] system_call_fastpath+0x16/0x1b
>>>>>>
>>>>>> My opinion:
>>>>>> when the docker container starts, it would mount overlay
>>>>>> filesystem with different selinux context, mount point such as:
>>>>>> overlay on /var/lib/docker/overlay2/be3ef517730d92fc4530e0e952eae4f6cb0f07b4bc326cb07495ca08fc9ddb66/merged type overlay (rw,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c414,c873",lowerdir=/var/lib/docker/overlay2/l/Z4U7WY6ASNV5CFWLADPARHHWY7:/var/lib/docker/overlay2/l/V2S3HOKEFEOQLHBVAL5WLA3YLS:/var/lib/docker/overlay2/l/46YGYO474KLOULZGDSZDW2JPRI,upperdir=/var/lib/docker/overlay2/be3ef517730d92fc4530e0e952eae4f6cb0f07b4bc326cb07495ca08fc9ddb66/diff,workdir=/var/lib/docker/overlay2/be3ef517730d92fc4530e0e952eae4f6cb0f07b4bc326cb07495ca08fc9ddb66/work)
>>>>>> shm on /var/lib/docker/containers/9fd65e177d2132011d7b422755793449c91327ca577b8f5d9d6a4adf218d4876/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c414,c873",size=65536k)
>>>>>> overlay on /var/lib/docker/overlay2/38d1544d080145c7d76150530d0255991dfb7258cbca14ff6d165b94353eefab/merged type overlay (rw,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c431,c651",lowerdir=/var/lib/docker/overlay2/l/3MQQXB4UCLFB7ANVRHPAVRCRSS:/var/lib/docker/overlay2/l/46YGYO474KLOULZGDSZDW2JPRI,upperdir=/var/lib/docker/overlay2/38d1544d080145c7d76150530d0255991dfb7258cbca14ff6d165b94353eefab/diff,workdir=/var/lib/docker/overlay2/38d1544d080145c7d76150530d0255991dfb7258cbca14ff6d165b94353eefab/work)
>>>>>> shm on /var/lib/docker/containers/662e7f798fc08b09eae0f0f944537a4bcedc1dcf05a65866458523ffd4a71614/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,context="system_u:object_r:svirt_sandbox_file_t:s0:c431,c651",size=65536k)
>>>>>>
>>>>>> sidtab_search_context check the context whether is in the sidtab
>>>>>> list, If not found, a new node is generated and insert into the
>>>>>> list, As the number of containers is increasing, context nodes
>>>>>> are also more and more, we tested the final number of nodes
>>>>>> reached 300,000 +, sidtab_context_to_sid runtime needs 100-200ms,
>>>>>> which will lead to the system softlockup.
>>>>>>
>>>>>> Is this a selinux bug? When filesystem umount, why context node
>>>>>> is not deleted?  I cannot find the relevant function to delete
>>>>>> the node in sidtab.c
>>>>>>
>>>>>> Thanks for reading and looking forward to your reply.
>>>>> So, does docker just keep allocating a unique category set for
>>>>> every new container, never reusing them even if the container is
>>>>> destroyed?  That would be a bug in docker IMHO.  Or are you
>>>>> creating an unbounded number of containers and never destroying
>>>>> the older ones?
>>>> You can't reuse the security context. A process in ContainerA
>>>> sends a labeled packet to MachineB. ContainerA goes away and its
>>>> context is recycled in ContainerC. MachineB responds some time
>>>> later, again with a labeled packet. ContainerC gets information
>>>> intended for ContainerA, and uses the information to take over the
>>>> Elbonian government.
>>> Docker isn't using labeled networking (nor is anything else by
>>> default; it is only enabled if explicitly configured).
>> If labeled networking weren't an issue we'd have full security
>> module stacking by now. Yes, it's an edge case. If you want to
>> use labeled NFS or a local filesystem that gets mounted in each
>> container (don't tell me that nobody would do that) you've got
>> the same problem.
> Even if someone were to configure labeled networking, Docker is not
> presently relying on that or SELinux network enforcement for any
> security properties, so it really doesn't matter.

True enough. I can imagine a use case, but as you point out, it
would be a very complex configuration and coordination exercise
using SELinux.

> And if they wanted
> to do that, they'd have to coordinate category assignments across all
> systems involved, for which no facility exists AFAIK. If you have two
> docker instances running on different hosts, I'd wager that they can
> hand out the same category sets today to different containers.
>
> With respect to labeled NFS, that's also not the default for nfs
> mounts, so again it is a custom configuration and Docker isn't relying
> on it for any guarantees today. For local filesystems, they would
> normally be context-mounted or using genfscon rather than xattrs in
> order to be accessible to the container, thus no persistent storage of
> the category sets.

I know that is the intended configuration, but I see people
do all sorts of stoopid things for what they believe are good
reasons. Unfortunately, lots of people count on containers to
provide isolation, but create "solutions" for data sharing that
defeat it.

> Certainly docker could provide an option to not reuse category sets,
> but making that the default is not sane and just guarantees exhaustion
> of the SID and context space (just create and tear down lots of
> containers every day or more frequently).

It seems that Docker might have a similar issue with UIDs,
but it takes longer to run out of UIDs than sidtab entries.

>
>>>>> On the selinux userspace side, we'd also like to eliminate the use
>>>>> of /sys/fs/selinux/user (sel_write_user -> security_get_user_sids)
>>>>> entirely, which is what triggered this for you.
>>>>>
>>>>> We cannot currently delete a sidtab node because we have no way of
>>>>> knowing if there are any lingering references to the SID.  Fixing
>>>>> that would require reference-counted SIDs, which goes beyond just
>>>>> SELinux since SIDs/secids are returned by LSM hooks and cached in
>>>>> other kernel data structures.
>>>> You could delete a sidtab node. The code already deals with
>>>> unfindable SIDs. The issue is that eventually you run out of SIDs.
>>>> Then you are forced to recycle SIDs, which leads to the overthrow
>>>> of the Elbonian government.
>>> We don't know when we can safely delete a sidtab node since SIDs
>>> aren't reference counted and we can't know whether it is still in
>>> use somewhere in the kernel.  Doing so prematurely would lead to the
>>> SID being remapped to the unlabeled context, and then likely to
>>> undesired denials.
>> I would suggest that if you delete a sidtab node and someone
>> comes along later and tries to use it that denial is exactly
>> what you would desire. I don't see any other rational action.
> Yes, if we know that the SID wasn't in use at the time we tore it down.
> But if we're just randomly deleting sidtab entries based on age or
> something (since we have no reference count), we'll almost certainly
> encounter situations where a SID hasn't been accessed in a long time
> but is still being legitimately cached somewhere. Just a file that
> hasn't been accessed in a while might have that SID still cached in its
> inode security blob, or anywhere else.
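
To make the disagreement above concrete, here is a minimal userspace
sketch of what reference-counted SIDs might look like. It is not kernel
code and not the SELinux implementation: the table layout, the names
(sid_table, sid_get, sid_put, sid_to_context) and the value chosen for
SKETCH_SID_UNLABELED are all hypothetical. It only illustrates the two
behaviours discussed: an entry is dropped when its last reference goes
away, and a stale SID resolves to the unlabeled context.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constants for this sketch only. */
#define SKETCH_SID_UNLABELED 1u
#define SKETCH_MAX_SIDS      8u

struct sid_entry {
	const char *context;   /* NULL means the slot is unused */
	unsigned int refs;     /* holders caching this SID */
};

struct sid_table {
	struct sid_entry e[SKETCH_MAX_SIDS];
};

/* Resolve a SID; a deleted or unknown SID degrades to "unlabeled". */
const char *sid_to_context(const struct sid_table *t, uint32_t sid)
{
	if (sid < SKETCH_MAX_SIDS && t->e[sid].context)
		return t->e[sid].context;
	return t->e[SKETCH_SID_UNLABELED].context;
}

/* Take a reference, e.g. when a SID is cached in some other object. */
void sid_get(struct sid_table *t, uint32_t sid)
{
	assert(sid < SKETCH_MAX_SIDS && t->e[sid].context);
	t->e[sid].refs++;
}

/*
 * Drop a reference. The entry is removed only when the last holder
 * lets go. Returns 1 if the entry was removed, 0 otherwise.
 */
int sid_put(struct sid_table *t, uint32_t sid)
{
	assert(sid < SKETCH_MAX_SIDS && t->e[sid].refs > 0);
	if (--t->e[sid].refs == 0 && sid != SKETCH_SID_UNLABELED) {
		t->e[sid].context = NULL;
		return 1;
	}
	return 0;
}
```

The hard part is not this bookkeeping but the point made in the quoted
text: every place that caches a SID would have to call the equivalent
of sid_get() and sid_put(), and many of those places are outside
SELinux.
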
>
>>>>> sidtab_search_context() could no doubt be optimized for the
>>>>> negative case; there was an earlier optimization for the positive
>>>>> case by adding a cache to sidtab_context_to_sid() prior to calling
>>>>> it.  It's a reverse lookup in the sidtab.
>>>> This seems like a bad idea.
>>> Not sure what you mean, but it can certainly be changed to at least
>>> use a hash table for these reverse lookups.
>>>
>>>
>
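
For illustration, a hash table for the reverse (context to SID) lookup
could be sketched roughly as below. This is a userspace sketch under
stated assumptions, not the sidtab code: contexts are plain strings
here, and ctx_table, ctx_hash, ctx_to_sid and the bucket count are all
hypothetical names and values. The hash is FNV-1a, picked only because
it is short. The point is that a miss walks one bucket chain rather
than every node in the table.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_BUCKETS 1024u   /* power of two; arbitrary for this sketch */

struct ctx_node {
	uint32_t sid;
	char *context;             /* string form, a simplification */
	struct ctx_node *next;     /* chain of nodes sharing a bucket */
};

struct ctx_table {
	struct ctx_node *bucket[SKETCH_BUCKETS];
	uint32_t next_sid;         /* last SID handed out; 0 is never used */
};

/* FNV-1a over the context string, reduced to a bucket index. */
uint32_t ctx_hash(const char *s)
{
	uint32_t h = 2166136261u;

	while (*s) {
		h ^= (unsigned char)*s++;
		h *= 16777619u;
	}
	return h & (SKETCH_BUCKETS - 1);
}

/*
 * Return the SID for a context, allocating a new one the first time
 * the context is seen. Returns 0 if memory allocation fails.
 */
uint32_t ctx_to_sid(struct ctx_table *t, const char *context)
{
	uint32_t b;
	struct ctx_node *n;

	assert(t && context);
	b = ctx_hash(context);

	/* Only the nodes in this one bucket are compared. */
	for (n = t->bucket[b]; n; n = n->next)
		if (strcmp(n->context, context) == 0)
			return n->sid;

	n = malloc(sizeof(*n));
	if (!n)
		return 0;
	n->context = malloc(strlen(context) + 1);
	if (!n->context) {
		free(n);
		return 0;
	}
	strcpy(n->context, context);
	n->sid = ++t->next_sid;
	n->next = t->bucket[b];
	t->bucket[b] = n;
	return n->sid;
}
```

A zero-initialized struct ctx_table is an empty table. This bounds the
cost of a lookup miss but does nothing about the table growing without
limit, which is the other half of the problem in this thread.
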