Date: Thu, 8 Mar 2018 19:27:32 -0800
From: Alexei Starovoitov
To: Andy Lutomirski
Cc: Kees Cook, Alexei Starovoitov, Djalal Harouni, Al Viro,
    "David S. Miller", Daniel Borkmann, Linus Torvalds, Greg KH,
    "Luis R. Rodriguez", Network Development, LKML, kernel-team@fb.com,
    Linux API
Subject: Re: [PATCH net-next] modules: allow modprobe load regular elf binaries
Message-ID: <20180309032730.qaqsv3hc6t4wghxc@ast-mbp>
References: <20180306013457.1955486-1-ast@kernel.org> <20180309012046.6kcivmzzkap3a4xc@ast-mbp>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Mar 09, 2018 at 02:12:24AM +0000, Andy Lutomirski wrote:
> On Fri, Mar 9, 2018 at 1:20 AM, Alexei Starovoitov wrote:
> > On Fri, Mar 09, 2018 at 12:59:36AM +0000, Andy Lutomirski wrote:
> >>
> >> Alexei, can you give an example use case?  I'm sure it's upthread
> >> somewhere, but I'm having trouble finding it.
> >
> > at the time of iptables' setsockopt() the kernel will do
> > err = request_module("bpfilter");
> > once.
> > The rough POC code:
> > https://git.kernel.org/pub/scm/linux/kernel/git/ast/bpf.git/tree/net/ipv4/bpfilter/sockopt.c?h=ipt_bpf#n25
>
> Here's what I gather from reading that code: you have a new kernel
> feature (consisting of actual kernel code) that wants to defer some of
> its implementation to user mode.  I like this idea a lot.  But I have
> a suggestion for a slightly different way of accomplishing the same
> thing.  Rather than extending init_module() to accept ELF input,
> extend the call_umh code to be able to call blobs.  You'd use it
> very roughly like this:
>
> First, compile your user code and emit a static binary.  Use objdump
> fiddling or a trivial .S file to make that static binary into a
> variable.  Then write a tiny shim module like this:
>
> extern unsigned char __begin_user_code[], __end_user_code[];
>
> int __init init_shim_module(void)
> {
>   return call_umh_blob(__begin_user_code, __end_user_code - __begin_user_code);
> }
>
> By itself, this is clearly a worse solution than yours, but it has two
> benefits, one small and one big.  The small benefit is that it is
> completely invisible to userspace: the .ko file is a bona fide module.

Unfortunately that's not quite the case.
The normal .ko that does call_umh_blob is indeed seen in lsmod,
but the umh process is a separate task.
It could have been OOM-killed or killed by the admin and this normal .ko
wouldn't notice, so a health check of the umh process by the kernel is
needed regardless. Right now bpfilter has a trivial fuse-like protocol.
This part is still to be designed cleanly.
No doubt visibility and debuggability into these umh processes
is a must-have, but the lsmod/rmmod interface doesn't quite fit.
As you said, letting these privileged tasks register themselves in lsmod
is certainly a no-go.
I think if they are going to be in lsmod, the kernel has to register
them and establish a health check with the umh at the same time.

I think worrying about restarting is not necessary. This is still
kernel code with the same high standards and review process. If it
crashes, it's really a kernel bug. It just doesn't take the system down.

> I think we don't want to end up in a situation where we ship a program
> with a .ko extension that opens something in /dev, for example.

This part I don't get. What's wrong with opening /dev? I don't see
a use case for it, but technically why not?

> call_umh_blob() would create an anon_inode or similar object backed by
> the blob and exec it.

Interesting. I haven't considered such an approach.
For full context, it all started from the idea of 'unprivileged kernel
modules' or 'hardened kernel modules': something that the kernel can
easily interact with, but much safer than a traditional kernel module.
I tried a bunch of crappy ideas first:
1. have a piece of kernel .text vm_mmap-ed into the user process that is
   doing the iptables setsockopt, and on return to user space force
   handle_signal to execute that code. Sort of like a forced ld_preload
   where the parasite code is provided by the kernel but runs in user space
2. have a special set of kernel page tables in read-only mode while
   the iptables->bpf conversion is happening
3. have load_module() fork a user task and load a real kernel .ko into it

While trying to hack #3 I realized that I was mainly copy-pasting a lot
of load_elf_binary() code without the elf_interpreter bits, so I figured
it's much easier and simpler to blend sys_finit_module with
load_elf_binary via tweaking do_execveat_common and keeping that .ko
as a normal elf, which is what is implemented in this patch.
Debugging of that fake .ko is so much better. If it's done via
call_umh_blob(), the process that starts will indeed be a user mode
process, but you won't be able to attach gdb to it. Whereas in this
patch it's a normal elf and standard debugging techniques apply.
A developer can use gdb breakpoints, debug info, etc.
That is a huge advantage of keeping it as a normal elf.
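
For readers who want to see what "an anon_inode or similar object backed
by the blob and exec it" could look like, here is a rough userspace
analogy, not kernel code. call_umh_blob() is only a name proposed in this
thread; the sketch below assumes memfd_create() as a stand-in for the
anon_inode and fexecve() as the exec step. The helper names run_blob()
and load_file() are hypothetical and exist only for this illustration;
load_file() stands in for the linker-provided
__begin_user_code..__end_user_code range by reading an existing binary
from disk.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Hypothetical userspace analog of the proposed call_umh_blob():
 * copy an in-memory ELF image into an anonymous file and exec it in
 * a child. Returns the child's exit status, or -1 on any error. */
int run_blob(const unsigned char *blob, size_t len, char *const argv[])
{
    /* Anonymous, memory-backed file with no path in the filesystem. */
    int fd = memfd_create("umh_blob", MFD_CLOEXEC);
    if (fd < 0)
        return -1;

    /* Copy the image, handling short writes. */
    for (size_t off = 0; off < len; ) {
        ssize_t n = write(fd, blob + off, len - off);
        if (n <= 0) {
            close(fd);
            return -1;
        }
        off += (size_t)n;
    }

    pid_t pid = fork();
    if (pid < 0) {
        close(fd);
        return -1;
    }
    if (pid == 0) {
        fexecve(fd, argv, environ);
        _exit(127); /* reached only if the exec failed */
    }
    close(fd);

    /* The parent learns about the helper's fate only by waiting on it. */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* Hypothetical helper: read a whole file into a heap buffer so there is
 * a "blob" to run. The caller frees the returned buffer. */
unsigned char *load_file(const char *path, size_t *len)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return NULL;

    unsigned char *buf = NULL;
    size_t size = 0, cap = 0;
    for (;;) {
        if (size == cap) {
            size_t ncap = cap ? cap * 2 : 65536;
            unsigned char *nbuf = realloc(buf, ncap);
            if (!nbuf) {
                free(buf);
                fclose(f);
                return NULL;
            }
            buf = nbuf;
            cap = ncap;
        }
        size_t n = fread(buf + size, 1, cap - size, f);
        if (n == 0)
            break; /* EOF or read error: stop with what we have */
        size += n;
    }
    fclose(f);
    *len = size;
    return buf;
}
```

A caller would load some existing ELF binary with load_file(), pass the
buffer and an argv array to run_blob(), and inspect the returned exit
status. Note that the parent only finds out what happened to the child
by waiting on it, which is the userspace counterpart of the health-check
concern raised above.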