Received: by 2002:a05:6a10:22f:0:0:0:0 with SMTP id 15csp3234789pxk; Mon, 21 Sep 2020 08:30:11 -0700 (PDT) X-Google-Smtp-Source: ABdhPJwueMU1qqoANVD+AtYUcGnTJLMM6QqpQ7oCHrLOdUz5pS1qsinHMhirRNvkhhLIgo5C8n9k X-Received: by 2002:a05:6402:22fc:: with SMTP id dn28mr147823edb.365.1600702211283; Mon, 21 Sep 2020 08:30:11 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1600702211; cv=none; d=google.com; s=arc-20160816; b=bSiVs7+Wx3k+w0rFxGmgs6GF1Zrj/NzunvX3pIAsIeF5QBOPSAoocgTNyKniQM0v8I 6+2vI20g1igI7VUh7RzRoTn66mESkSUkde6b5NdCb+WuJWmYqLRpEaxGBXhE1niKrptU zE7TPuXLvmxkwL0vNCz0p/Id2MWTsYTSCD7pJdAG/uzvbLCdN+vPdxob1p3opkgzgQcF L5z/LVadTPwJlCmJZd6lV9LyR/rI53M4pH3tO4e/TCPQ7JJOWRS4+Tu9uzCa+uZzfP+j /j834v96yC+pbuQQ/dNv616cRPkw1eKMZoWzjODA18RRGyzVpwhC5vJm3jmUrGCgbLi5 CYIw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:subject:message-id:date:from:in-reply-to :references:mime-version:dkim-signature; bh=kw9zFapd9SW+zuhwhTuJ3eJHqglSccYNB3MuySBWWOw=; b=v2p6+4DprdHQfx3DPhP5/VMyWdbGKq5lJwZ5xITZTVKRlhj7Z22V+TkY5o53sg8EUM c83+9M2tePtRW0JZY1SSWZP4yuxmKCXbICKTOlGtjWqI+I3G8dagw28S7l+8nkIgSB/9 mE0ZA/Ammj6jAwJju2PI2LmudTODEpZmsTJcOy3t3Viov3JESmLuIcwxDI3hn7etB0MC pCmSpEWT0v5qWjUz0HRHO75Dln2N/XCXAtHb4aeukxwgSv4ab/Ezn71/9+QpAzqZCENc 7QNAR5wjeKW5Gl7ViqYr6eoeZshRPkscK4r7sArOizftb+ff9dgz4x9OBffV2+HZP4nz PxOQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@gmail.com header.s=20161025 header.b=bk12Y9q2; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id c9si8663460edl.424.2020.09.21.08.29.47; Mon, 21 Sep 2020 08:30:11 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@gmail.com header.s=20161025 header.b=bk12Y9q2; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727754AbgIUP2J (ORCPT + 99 others); Mon, 21 Sep 2020 11:28:09 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:50538 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726537AbgIUP2I (ORCPT ); Mon, 21 Sep 2020 11:28:08 -0400 Received: from mail-pj1-x1042.google.com (mail-pj1-x1042.google.com [IPv6:2607:f8b0:4864:20::1042]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 7FB5DC061755; Mon, 21 Sep 2020 08:28:08 -0700 (PDT) Received: by mail-pj1-x1042.google.com with SMTP id bw23so4830662pjb.2; Mon, 21 Sep 2020 08:28:08 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=kw9zFapd9SW+zuhwhTuJ3eJHqglSccYNB3MuySBWWOw=; b=bk12Y9q2B0Yk1G4hAcboUYdVUipOX+HSabhDsdDoCKeQHx4hzhs8ZeG8Ky6OzIRACn N9MQ2+LUmeXhucWq64AZK3D7AcmK45BP1ww0Si+5AFxsEouRn393+QfiTsvR2zh7N+Pu gbF59YFFdg3ceSCXB/QQu3aMi8CSeF5pT88rJE9eKevU+LcoIh1l1b5gxmjxTHIWGjGm gzeKWuKpgEr63E8eDJQKWbjNCSqlY+vkU60pSygLvNRgvioJPP+WY+85MfCmv0vZCu20 /WfpqPvF1PuXAKceJGN30HmWyzb1yv7v1p/+3EKmHadog59uqEehLONBKasaC4lHF3gV iKgw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=kw9zFapd9SW+zuhwhTuJ3eJHqglSccYNB3MuySBWWOw=; b=p+XW0pkh99SU0rauz9J9E5WdkC/jcy/H6Jb3m2DPFnKCNUwMrhlVRGOCJol03Y9WJp QCu4YH4to7sAMSpIXeFL44hH3QLDgjEsKLWs3scPxArXCKdMbjc2dsqgLhv14manK+3W dtkF+x3whHx0c9PeTzw8YX7ftzfWZ+Q/CJJsGzBRjHyBufGTOWxnfRjOlR+t0jE3QtYI lrAmS9CpEh31Y2UMK/WGY4xHc7P5rvqopIrFuQmpqKu/8vPHbqipE9i3RkeJp9QLseuM YaclwzRbBgfbECYZCCtysadlIqH/Yr/ueoYt4FiHi1utmAP+JsBuxAvvuv0/C9wdjD3p ++tg== X-Gm-Message-State: AOAM5321T33X0oWcToj5OILAvBCHAMKsJ0VPCFvsjywhDu81SZrkk0c0 IEbEXPe/lQI2FegYNoNjLyqaCeszgEKSlAAiViE= X-Received: by 2002:a17:90b:4b82:: with SMTP id lr2mr287928pjb.184.1600702087931; Mon, 21 Sep 2020 08:28:07 -0700 (PDT) MIME-Version: 1.0 References: <20200921135115.GC3794348@cisco> In-Reply-To: <20200921135115.GC3794348@cisco> From: YiFei Zhu Date: Mon, 21 Sep 2020 10:27:56 -0500 Message-ID: Subject: Re: [RFC PATCH seccomp 0/2] seccomp: Add bitmap cache of arg-independent filter results that allow syscalls To: Tycho Andersen Cc: Linux Containers , Andrea Arcangeli , Giuseppe Scrivano , Kees Cook , YiFei Zhu , Tobin Feldman-Fitzthum , Dimitrios Skarlatos , Valentin Rothberg , Hubertus Franke , Jack Chen , Josep Torrellas , bpf@vger.kernel.org, Tianyin Xu , Andy Lutomirski , Will Drewry , Jann Horn , Aleksa Sarai , linux-kernel@vger.kernel.org Content-Type: text/plain; charset="UTF-8" Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Sep 21, 2020 at 8:51 AM Tycho Andersen wrote: > One problem with a kernel config setting is that it's for all tasks. > While docker and systemd may make decsisions based on syscall number, > other applications may have more nuanced filters, and this cache would > yield incorrect results. > > You could work around this by making this a filter flag instead; > filter authors would generally know whether their filter results can > be cached and probably be motivated to opt in if their users are > complaining about slow syscall execution. > > Tycho Yielding incorrect results should not be possible. The purpose of the "emulator" (for the lack of a better term) is to determine whether the filter reads any syscall arguments. A read from a syscall argument must go through the BPF_LD | BPF_ABS instruction, where the 32 bit multiuse field "k" is an offset to struct seccomp_data. struct seccomp_data contains four components [1]: syscall number, architecture number, instruction pointer at the time of syscall, and syscall arguments. The syscall number is enumerated by the emulator. The arch number is treated by the cache as 'if arch number is different from cached arch number, flush cache' (this is in seccomp_cache_check). The last two (ip and args) are treated exactly the same way in this patch: if the filter loads the arguments at all, the syscall is marked non-cacheable for any architecture number. The struct seccomp_data is the only external thing the filter may access. It is also cBPF so it cannot contain maps to store special states between runs. Therefore a seccomp filter is pure function. If we know given some inputs (the syscall number and arch number) the function will not evaluate any other inputs before returning, then we can safely cache with just the inputs in concern. As for the overhead, on my x86_64 with gcc 10.2.0, seccomp_cache_check compiles into: if (unlikely(syscall_nr < 0 || syscall_nr >= NR_syscalls)) return false; 0xffffffff8120fdb3 <+99>: movsxd rdx,DWORD PTR [r12] 0xffffffff8120fdb7 <+103>: cmp edx,0x1b7 0xffffffff8120fdbd <+109>: ja 0xffffffff8120fdf9 <__seccomp_filter+169> if (unlikely(thread_data->last_filter != sfilter || thread_data->last_arch != sd->arch)) { 0xffffffff8120fdbf <+111>: mov rdi,QWORD PTR [rbp-0xb8] 0xffffffff8120fdc6 <+118>: lea rsi,[rax+0x6f0] 0xffffffff8120fdcd <+125>: cmp rdi,QWORD PTR [rax+0x728] 0xffffffff8120fdd4 <+132>: jne 0xffffffff812101f0 <__seccomp_filter+1184> 0xffffffff8120fdda <+138>: mov ebx,DWORD PTR [r12+0x4] 0xffffffff8120fddf <+143>: cmp DWORD PTR [rax+0x730],ebx 0xffffffff8120fde5 <+149>: jne 0xffffffff812101f0 <__seccomp_filter+1184> return test_bit(syscall_nr, thread_data->syscall_ok); 0xffffffff8120fdeb <+155>: bt QWORD PTR [rax+0x6f0],rdx 0xffffffff8120fdf3 <+163>: jb 0xffffffff8120ffb7 <__seccomp_filter+615> [... unlikely path of cache flush omitted] and seccomp_cache_insert compiles into: if (unlikely(syscall_nr < 0 || syscall_nr >= NR_syscalls)) return; 0xffffffff8121021b <+1227>: movsxd rax,DWORD PTR [r12] 0xffffffff8121021f <+1231>: cmp eax,0x1b7 0xffffffff81210224 <+1236>: ja 0xffffffff8120ffb7 <__seccomp_filter+615> if (!test_bit(syscall_nr, sfilter->cache.syscall_ok)) return; 0xffffffff8121022a <+1242>: mov rbx,QWORD PTR [rbp-0xb8] 0xffffffff81210231 <+1249>: mov rdx,QWORD PTR gs:0x17000 0xffffffff8121023a <+1258>: bt QWORD PTR [rbx+0x108],rax 0xffffffff81210242 <+1266>: jae 0xffffffff8120ffb7 <__seccomp_filter+615> set_bit(syscall_nr, thread_data->syscall_ok); 0xffffffff81210248 <+1272>: lock bts QWORD PTR [rdx+0x6f0],rax 0xffffffff81210251 <+1281>: jmp 0xffffffff8120ffb7 <__seccomp_filter+615> In the circumstance of a non-cacheable syscall happening over and over, the code path would go through the syscall_nr bound check, then the filter flush check, then the test_bit, then another syscall_nr bound check and one more test_bit in seccomp_cache_insert. Considering that they are either stack variables, elements of current task_struct, and elements of the filter struct, I imagine they would well be in the CPU data cache and not incur much overhead. The CPU is also free to branch predict and reorder memory accesses (there are no hardware memory barriers here) to further increase the efficiency, whereas a normal filter execution would be impaired by things like retpoline. If one were to add an additional flag for does-userspace-want-us-to-cache, it would still be a member of the filter struct. What would be loaded into the CPU data cache originally would still be loaded. Correct me if I'm wrong, but I don't think that check will reduce any significant overhead of the seccomp cache itself. That said, I have not profiled the exact number of milliseconds this patch would incur to uncacheable syscalls, I can report back with numbers if you would like to see. Does that answer your concern? YiFei Zhu [1] https://elixir.bootlin.com/linux/latest/source/include/uapi/linux/seccomp.h#L60