References: <20230508083223.GA116442@k08j02272.eu95sqa> <20230510070949.GA7127@k08j02272.eu95sqa>
In-Reply-To: <20230510070949.GA7127@k08j02272.eu95sqa>
From: Ard Biesheuvel
Date: Wed, 10 May 2023 10:15:51 +0200
Subject: Re: [PATCH RFC 31/43] x86/modules: Adapt module loading for PIE support
To: Hou Wenlong
Cc: linux-kernel@vger.kernel.org, Lai Jiangshan, Kees Cook, Thomas Gleixner,
    Ingo Molnar, Borislav Petkov, Dave Hansen, x86@kernel.org,
    "H. Peter Anvin", Peter Zijlstra, Petr Mladek, Greg Kroah-Hartman,
    "Jason A. Donenfeld", Song Liu, Julian Pidancet
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, 10 May 2023 at 09:15, Hou Wenlong wrote:
>
> On Mon, May 08, 2023 at 05:16:34PM +0800, Ard Biesheuvel wrote:
> > On Mon, 8 May 2023 at 10:38, Hou Wenlong wrote:
> > >
> > > On Sat, Apr 29, 2023 at 03:29:32AM +0800, Ard Biesheuvel wrote:
> > > > On Fri, 28 Apr 2023 at 10:53, Hou Wenlong wrote:
> > > > >
> > > > > Adapt module loading to support PIE relocations. No GOT is generated for
> > > > > modules; all GOT entries referenced by a module must exist in the
> > > > > kernel GOT. Currently, the only usable GOT reference is for
> > > > > __fentry__().
> > > > >
> > > >
> > > > I don't think this is the right approach. We should permit GOTPCREL
> > > > relocations properly, which means making them point to a location in
> > > > memory that carries the absolute address of the symbol. There are
> > > > several ways to go about that, but perhaps the simplest way is to make
> > > > the symbol address in ksymtab a 64-bit absolute value (but retain the
> > > > PC32 references for the symbol name and the symbol namespace name).
> > > > That way, you can always resolve such GOTPCREL relocations by pointing
> > > > them to the ksymtab entry. Another option would be to take inspiration
> > > > from the PLT code we have on ARM and arm64 (and other architectures,
> > > > surely) and to count the GOT based relocations, allocate some extra
> > > > r/o module space for each, and allocate slots and populate them with
> > > > the right value as you fix up the relocations.
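
[Editor's sketch] The second option quoted above (count the GOT based relocations, reserve a slot for each, and populate the slots during fixup) can be illustrated roughly as follows. This is Python rather than kernel C, and it is only a sketch under stated assumptions: `build_module_got`, the `resolve` callback, the tuple layout of `relocs` and `got_base` are all hypothetical names introduced here, not anything from the patch series. The relocation type numbers are the ones defined by the x86-64 ELF ABI.

```python
import struct

# x86-64 ELF relocation type numbers for GOT-relative references.
R_X86_64_GOTPCREL = 9
R_X86_64_GOTPCRELX = 41
R_X86_64_REX_GOTPCRELX = 42
GOT_TYPES = {R_X86_64_GOTPCREL, R_X86_64_GOTPCRELX, R_X86_64_REX_GOTPCRELX}


def build_module_got(relocs, resolve, got_base):
    """Two-pass sketch: size the GOT first, then fill slots while fixing up.

    relocs   -- list of (place_address, reloc_type, symbol_name, addend)
    resolve  -- hypothetical callback mapping symbol name -> absolute address
    got_base -- assumed address of the extra read-only module area
    Returns (got_bytes, patches); patches maps each relocated place to the
    32-bit displacement that should be written there.
    """
    # Pass 1: one 8-byte slot per distinct symbol referenced through the GOT.
    slots = {}
    for _place, rtype, sym, _addend in relocs:
        if rtype in GOT_TYPES and sym not in slots:
            slots[sym] = len(slots)
    got = bytearray(8 * len(slots))

    # Pass 2: store the absolute address in the slot and compute the
    # PC-relative displacement (slot address + addend - place) per site.
    patches = {}
    for place, rtype, sym, addend in relocs:
        if rtype not in GOT_TYPES:
            continue
        slot_addr = got_base + 8 * slots[sym]
        struct.pack_into("<Q", got, 8 * slots[sym], resolve(sym))
        patches[place] = slot_addr + addend - place
    return bytes(got), patches
```

A real loader would also have to check that each displacement fits in a signed 32-bit field and make the slot area read-only after population; both are omitted here.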
> > > >
> > > > Then, many such relocations can be relaxed at module load time if the
> > > > symbol is in range. IIUC, the module and kernel will still be inside
> > > > the same 2G window even after widening the KASLR range to 512G, so
> > > > most GOT loads can be converted into RIP relative LEA instructions.
> > > >
> > > > Note that this will also permit you to do things like
> > > >
> > > > #define PV_VCPU_PREEMPTED_ASM \
> > > > "leaq __per_cpu_offset(%rip), %rax \n\t" \
> > > > "movq (%rax,%rdi,8), %rax \n\t" \
> > > > "addq steal_time@GOTPCREL(%rip), %rax \n\t" \
> > > > "cmpb $0, " __stringify(KVM_STEAL_TIME_preempted) "(%rax) \n\t" \
> > > > "setne %al\n\t"
> > > >
> > > > or
> > > >
> > > > +#ifdef CONFIG_X86_PIE
> > > > + " pushq arch_rethook_trampoline@GOTPCREL(%rip)\n"
> > > > +#else
> > > > " pushq $arch_rethook_trampoline\n"
> > > > +#endif
> > > >
> > > > instead of having these kludgy push/pop sequences to free up temp registers.
> > > >
> > > > (FYI I looked into this PIE linking just a few weeks ago [0] so
> > > > this is all rather fresh in my memory)
> > > >
> > > > [0] https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=x86-pie
> > > >
> > > >
> > > Hi Ard,
> > > Thanks for providing the link, it has been very helpful for me as I am
> > > new to the topic of compilers.
> >
> > Happy to hear that.
> >
> > > One key difference I noticed is that you
> > > linked the kernel with "-pie" instead of "--emit-relocs". I also noticed
> > > that Thomas' initial patchset[0] used "-pie", but in RFC v3 [1], it
> > > switched to "--emit-relocs" in order to reduce dynamic relocation space
> > > on mapped memory.
> > >
> >
> > The problem with --emit-relocs is that the relocations emitted into
> > the binary may get out of sync with the actual code after the linker
> > has applied relocations.
> >
> > $ cat /tmp/a.s
> > foo: movq foo@GOTPCREL(%rip), %rax
> >
> > $ x86_64-linux-gnu-gcc -c -o /tmp/a.o /tmp/a.s
> > $ x86_64-linux-gnu-objdump -dr /tmp/a.o
> >
> > /tmp/a.o: file format elf64-x86-64
> >
> >
> > Disassembly of section .text:
> >
> > 0000000000000000 <foo>:
> >    0: 48 8b 05 00 00 00 00 mov 0x0(%rip),%rax # 7
> >       3: R_X86_64_REX_GOTPCRELX foo-0x4
> >
> > $ x86_64-linux-gnu-gcc -o /tmp/a.elf -nostartfiles
> > -Wl,-no-pie,-q,--defsym,_start=0x0 /tmp/a.s
> > $ x86_64-linux-gnu-objdump -dr /tmp/a.elf
> > 0000000000401000 <foo>:
> >   401000: 48 c7 c0 00 10 40 00 mov $0x401000,%rax
> >       401003: R_X86_64_32S foo
> >
> > $ x86_64-linux-gnu-gcc -o /tmp/a.elf -nostartfiles
> > -Wl,-q,--defsym,_start=0x0 /tmp/a.s
> > $ x86_64-linux-gnu-objdump -dr /tmp/a.elf
> > 0000000000001000 <foo>:
> >   1000: 48 8d 05 f9 ff ff ff lea -0x7(%rip),%rax # 1000
> >       1003: R_X86_64_PC32 foo-0x4
> >
> > This all looks as expected. However, when using Clang, we end up with
> >
> > $ clang -target x86_64-linux-gnu -o /tmp/a.elf -nostartfiles
> > -fuse-ld=lld -Wl,--relax,-q,--defsym,_start=0x0 /tmp/a.s
> > $ x86_64-linux-gnu-objdump -dr /tmp/a.elf
> > 00000000000012c0 <foo>:
> >   12c0: 48 8d 05 f9 ff ff ff lea -0x7(%rip),%rax # 12c0
> >       12c3: R_X86_64_REX_GOTPCRELX foo-0x4
> >
> > So in this case, what --emit-relocs gives us is not what is actually
> > in the binary. We cannot just ignore these either, given that they are
> > treated differently depending on whether the symbol is a per-CPU
> > symbol or not - in the former case, we need to perform a fixup if the
> > relaxed reference is RIP relative, and in the latter case, if the
> > relaxed reference is absolute.
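
[Editor's sketch] Because an emitted R_X86_64_REX_GOTPCRELX record may describe an instruction that the linker has already rewritten, a relocation consumer would have to look at the instruction bytes rather than trust the relocation type. The following Python sketch classifies the three 7-byte encodings that appear in the objdump output quoted above. The function name and the returned labels are hypothetical, and only the REX.W forms shown in this thread are handled.

```python
def classify_gotpcrelx_site(insn: bytes) -> str:
    """Classify the 7-byte REX.W instruction at a REX_GOTPCRELX site.

    The relocation points 3 bytes into the instruction, so the opcode sits
    2 bytes before the relocated field and the ModRM byte 1 byte before it.
    """
    rex, opcode, modrm = insn[0], insn[1], insn[2]
    if rex & 0xF8 != 0x48:                  # require a REX.W prefix (0x48..0x4f)
        return "unknown"
    rip_relative = (modrm & 0xC7) == 0x05   # mod=00, rm=101: disp32(%rip)
    if opcode == 0x8B and rip_relative:
        return "got-load"       # mov foo@GOTPCREL(%rip), %reg (not relaxed)
    if opcode == 0x8D and rip_relative:
        return "lea-relaxed"    # lea foo(%rip), %reg (RIP relative)
    if opcode == 0xC7 and (modrm & 0xF8) == 0xC0:
        return "imm-relaxed"    # mov $foo, %reg (absolute, sign-extended imm32)
    return "unknown"


# The three encodings from the objdump listings in this message.
print(classify_gotpcrelx_site(bytes.fromhex("488b0500000000")))
print(classify_gotpcrelx_site(bytes.fromhex("48c7c000104000")))
print(classify_gotpcrelx_site(bytes.fromhex("488d05f9ffffff")))
```

In the lld case above, the record says REX_GOTPCRELX while the bytes classify as the lea form, which is exactly the mismatch being described.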
> >
> > On top of that, --emit-relocs does not cover the GOT, so we'd still
> > need to process that from the code explicitly.
> >
> > In general, relying on --emit-relocs is kind of dodgy, and I think
> > combining PIE linking with --emit-relocs is a bad idea.
> >
> > > Another issue is that it requires the addition of the
> > > "-mrelax-relocations=no" option to support older compilers and linkers.
> >
> > Why? The decompressor is now linked in PIE mode so we should be able
> > to drop that. Or do you need to add it somewhere else?
> >
> Hi Ard,
>
> After removing the "-mrelax-relocations=no" option, I noticed that the
> linker was relaxing GOT references into absolute references for mov
> instructions, even if the symbol was in a high address, as long as I
> kept the compile-time base address of the kernel image in the top 2G. I
> consulted the "Optimize GOTPCRELX Relocations" chapter in x86-64 psABI,
> which stated that "When position-independent code is disabled and foo is
> defined locally in the lower 32-bit address space, memory operand in mov
> can be converted into immediate operand". However, it seemed that if the
> symbol was in the higher 32-bit address space, the memory operand in mov
> would also be converted into an immediate operand. If I decreased the
> compile-time base address of the kernel image, it would be relaxed as
> lea. Therefore, I believe that using "-mrelax-relocations=no" without
> "-pie" option is necessary.

Indeed. As you noted, the linker assumes that non-PIE linked binaries
will always appear at their link time address, and relaxations will
try to take advantage of that.

Currently, we use -pie linking only for the decompressor, and we
should be able to drop -mrelax-relocations=no from its LDFLAGS. But
position dependent linking should not use relaxations at all.

> Is there a way to force the linker to relax
> it as lea without using the "-pie" option when linking?
>

Not that I am aware of.
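
[Editor's sketch] For reference, the lea form of the relaxation discussed in this thread is a one-byte opcode change plus a recomputed RIP-relative displacement. The Python sketch below shows the arithmetic at the byte level only; it is not a linker option and not kernel code, and the function name and parameters are hypothetical. It handles just the 7-byte REX.W `mov disp32(%rip), %reg` form.

```python
import struct


def relax_got_load_to_lea(insn: bytes, insn_addr: int, sym_addr: int) -> bytes:
    """Rewrite `mov foo@GOTPCREL(%rip), %reg` as `lea foo(%rip), %reg`.

    Returns the instruction unchanged when it is not a REX.W RIP-relative
    mov load, or when the symbol is outside the signed 32-bit range.
    """
    if len(insn) != 7 or insn[0] & 0xF8 != 0x48 or insn[1] != 0x8B:
        return insn
    if insn[2] & 0xC7 != 0x05:              # ModRM must be mod=00, rm=101
        return insn
    # RIP-relative displacements are measured from the end of the instruction.
    disp = sym_addr - (insn_addr + len(insn))
    if not -(1 << 31) <= disp < (1 << 31):
        return insn                         # out of range: keep the GOT load
    # Same REX prefix and ModRM byte, opcode 0x8b (mov) becomes 0x8d (lea).
    return bytes([insn[0], 0x8D, insn[2]]) + struct.pack("<i", disp)


# foo at 0x1000 referencing itself, as in the PIE-linked example above.
print(relax_got_load_to_lea(bytes.fromhex("488b0500000000"), 0x1000, 0x1000).hex())
```

With the symbol and the instruction both at 0x1000, the displacement is 0x1000 - 0x1007 = -7, which matches the `lea -0x7(%rip),%rax` shown in the PIE-linked objdump output.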

> Since not all GOT references can be omitted, perhaps I should try linking
> the kernel with the "-pie" option.
>

That way, we will end up with two sets of relocations, the static ones
from --emit-relocs and the dynamic ones from -pie. This should be
manageable, given that the difference between those sets should
exactly cover the GOT.

However, relying on --emit-relocs and -pie at the same time seems
clumsy to me. I'd prefer to only depend on -pie at /some/ point.

--
Ard.