References: <20230907204256.3700336-1-gpiccoli@igalia.com> <202310091034.4F58841@keescook>
In-Reply-To: <202310091034.4F58841@keescook>
From: Ryan Houdek
Date: Wed, 11 Oct 2023 16:53:43 -0700
Subject: Re: [RFC PATCH 0/2] Introduce a way to expose the interpreted file with binfmt_misc
To: Kees Cook
Cc: "Guilherme G. Piccoli", David Hildenbrand, linux-kernel@vger.kernel.org,
    linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, kernel-dev@igalia.com,
    kernel@gpiccoli.net, ebiederm@xmission.com, oleg@redhat.com,
    yzaikin@google.com, mcgrof@kernel.org, akpm@linux-foundation.org,
    brauner@kernel.org, viro@zeniv.linux.org.uk, willy@infradead.org,
    dave@stgolabs.net, joshua@froggi.es

On Mon, Oct 9, 2023 at 10:37 AM Kees Cook wrote:
>
> On Fri, Oct 06, 2023 at 02:07:16PM +0200, David Hildenbrand wrote:
> > On 07.09.23 22:24, Guilherme G. Piccoli wrote:
> > > Currently the kernel provides a symlink to the executable binary, in the
> > > form of procfs file exe_file (/proc/self/exe_file for example). But what
> > > happens in interpreted scenarios (like binfmt_misc) is that such link
> > > always points to the *interpreter*. For cases of Linux binary emulators,
> > > like FEX [0] for example, it's then necessary to somehow mask that and
> > > emulate the true binary path.
> >
> > I'm absolutely no expert on that, but I'm wondering if, instead of modifying
> > exe_file and adding an interpreter file, you'd want to leave exe_file alone
> > and instead provide an easier way to obtain the interpreted file.
> >
> > Can you maybe describe why modifying exe_file is desired (about which
> > consumers are we worrying?) and what exactly FEX does to handle that (how
> > does it mask that?).
> >
> > So a bit more background on the challenges without this change would be
> > appreciated.
>
> Yeah, it sounds like you're dealing with a process that examines
> /proc/self/exe_file for itself only to find the binfmt_misc interpreter
> when it was run via binfmt_misc?
>
> What actually breaks? Or rather, why does the process to examine
> exe_file? I'm just trying to see if there are other solutions here that
> would avoid creating an ambiguous interface...
>
> --
> Kees Cook

Hey there, FEX-Emu developer here. I can try and explain some of the
issues.

First, we should set the stage: there is a fundamental discrepancy
between how ELF interpreters and binfmt_misc interpreters are
represented when it comes to procfs exe.

An ELF file today can be either static or dynamic, with dynamic ELF
files having a program header called PT_INTERP which tells the kernel
where its interpreter executable lives. In an x86-64 environment this is
likely to be something like /lib64/ld-linux-x86-64.so.2. Today, the
kernel doesn't put the PT_INTERP handle into procfs exe; it instead uses
the dynamic ELF that was originally launched.

In contrast, a binfmt_misc interpreter file getting launched through
execve may or may not have ELF header sections; it is left up to the
binfmt_misc handler to do whatever it may need. The kernel sets procfs
exe to the binfmt_misc interpreter instead of the executable. This is
fundamentally the contrasting behaviour that the patch series is trying
to improve.

It seems like this behaviour is an oversight of the original binfmt_misc
implementation rather than any sort of ambition to ensure there is a
difference. It's already ambiguous that the interface changes when
executing an executable through binfmt_misc.
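
As a rough illustration of the PT_INTERP lookup described above, here is
a minimal sketch that walks the program headers of an ELF image and
returns the interpreter path. It is not kernel or FEX-Emu code: the
function name `find_pt_interp` is hypothetical, and the sketch assumes a
little-endian ELF64 image with no handling of truncated or malformed
files.

```python
import struct
from typing import Optional

PT_INTERP = 3  # program header type that names the ELF interpreter


def find_pt_interp(data: bytes) -> Optional[str]:
    """Return the PT_INTERP path of a little-endian ELF64 image, or None.

    Hypothetical helper for illustration; assumes a well-formed image.
    """
    if len(data) < 64 or data[:4] != b"\x7fELF":
        raise ValueError("not an ELF image")
    # EI_CLASS must be ELFCLASS64 (2) and EI_DATA must be ELFDATA2LSB (1).
    if data[4] != 2 or data[5] != 1:
        raise ValueError("sketch only handles little-endian ELF64")

    # e_phoff, e_phentsize and e_phnum locate the program header table.
    e_phoff = struct.unpack_from("<Q", data, 0x20)[0]
    e_phentsize, e_phnum = struct.unpack_from("<HH", data, 0x36)

    for i in range(e_phnum):
        off = e_phoff + i * e_phentsize
        # Elf64_Phdr starts with p_type, p_flags, p_offset.
        p_type, _p_flags, p_offset = struct.unpack_from("<IIQ", data, off)
        p_filesz = struct.unpack_from("<Q", data, off + 0x20)[0]
        if p_type == PT_INTERP:
            raw = data[p_offset:p_offset + p_filesz]
            # The segment holds a NUL-terminated path string.
            return raw.split(b"\0", 1)[0].decode()
    return None  # no PT_INTERP header, e.g. a static binary
```

On a natively executed dynamic binary, the path returned here would be
the loader (something like /lib64/ld-linux-x86-64.so.2), while
`os.readlink("/proc/self/exe")` would still name the launched ELF. That
is the behaviour binfmt_misc does not mirror.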

Some simple ways applications break:
- Applications like Chrome tend to relaunch themselves through execve
  with `/proc/self/exe`
  - Chrome does this. I think Flatpaks or AppImage applications do this?
  - There are definitely more that do this that I have noticed.
- In the cover letter there was a link to Mesa, the OSS OpenGL/Vulkan
  drivers, using this
  - This library uses this interface to find out what application is
    running so it can apply workarounds for application bugs. Plenty of
    historical applications use the API badly or incorrectly and need
    specific driver workarounds.
- Some applications may use this path to open their own executable and
  then mmap it back in for tricky memory mirroring or dynamic linking of
  themselves.
  - Saw some old abandoned emulator software doing this.

There are likely more uses that I haven't noticed from software using
this interface.

Onward to what FEX-Emu is and how it tries working around the issue with
a fairly naive hack.

FEX-Emu is an x86 and x86-64 CPU emulator that gets installed as a
binfmt_misc interpreter. It then executes x86 and x86-64 ELF files on an
Arm64 device in effectively a multi-arch capable fashion. It's
lightweight in that all application processes and threads are just
regular Arm64 processes and threads. This is similar to how qemu-user
operates.

When processing system calls, FEX intercepts any call that consumes a
pathname. It then inspects that pathname, and if it is one of the ways
to access procfs exe, it redirects to the true x86/x86-64 executable.
This is an attempt to behave as if the ELF was executed without a
binfmt_misc handler.

Pathnames captured in FEX-Emu today:
- /proc/self/exe
- /proc//exe
- /proc/thread-self/exe

This is very fragile and doesn't cover the full range of how
applications could access procfs. Applications could end up using the
*at variants of syscalls with an FD that has /proc/self/ open.
They could do simple tricks like `/proc/self/../self/exe` and it would
side-step this check. It's a game of whack-a-mole and escalating
overhead to try and close the gap, purely due to what appears to be an
oversight in how binfmt_misc and PT_INTERP are handled.

Hopefully this explains why this is necessary and why reducing the
differences between how PT_INTERP and binfmt_misc are represented is
desired.
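
To make the whack-a-mole point concrete, here is a small sketch of a
literal pathname check next to one that normalizes the path lexically
first. This is not FEX-Emu's actual code: both function names are
hypothetical, and treating the middle list entry as a numeric PID is an
assumption made for the example.

```python
import posixpath
import re

# Hypothetical pattern for the captured spellings. The numeric-PID form
# is an assumption about the middle entry in the list above.
_EXE_RE = re.compile(r"/proc/(self|thread-self|\d+)/exe")


def naive_is_procfs_exe(path: str) -> bool:
    """Match only the literal spellings, as a plain string check would."""
    return _EXE_RE.fullmatch(path) is not None


def normalized_is_procfs_exe(path: str) -> bool:
    """Collapse '.', '..' and repeated slashes lexically, then match."""
    return _EXE_RE.fullmatch(posixpath.normpath(path)) is not None
```

The literal check misses `/proc/self/../self/exe`, and the normalizing
one catches it. Even so, lexical normalization still says nothing about
symlinks or about *at calls made relative to a directory FD, so the gap
is narrowed rather than closed.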