Received: by 10.213.65.68 with SMTP id h4csp45404imn; Mon, 19 Mar 2018 19:02:43 -0700 (PDT) X-Google-Smtp-Source: AG47ELuBreZFcZTOHC4Q8s8RSwjkY07dLV/fQO5qrHHyRW2FjYOcUCBe9kkqC2CHBDWYFm/XQjzO X-Received: by 10.99.109.72 with SMTP id i69mr10892378pgc.417.1521511363002; Mon, 19 Mar 2018 19:02:43 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1521511362; cv=none; d=google.com; s=arc-20160816; b=j82GU4dm56JAPOMOJQdkalFaCLWjJnRsus1cEr1JbHPwKahvEriQLsUBf3WOle+qsm Cyi49z+OitX/uQEW6VkxI7fuYJJ3V/tT25y3IyogxsAwZa41btgXIGGJy9UAWLVOpMbq rEOMmM5p3sanfumYS8nxnrycoscfbOUEycyxnEiAmBTYF0tNVbY6Lf3k0X1GYUP5hRIK PE7lFLYzp0aLXZ4pVu/HbknLgVx04Td5p9lX3yLVncWBoUyOJ+kdXzWKjpfsiIMLoOx/ LFbnPsb3UIaknzJrQR0B5gN0YXvcqRTmRAe0ObKQ3o5dy0YvNReK659umCWvHA8Kf2LO oxhg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:user-agent:in-reply-to :content-disposition:mime-version:references:message-id:subject:cc :to:from:date:arc-authentication-results; bh=JqvzyFIn/D8HEyJu25GH0Oqz7Qs5HePMp0uug5z8B9k=; b=hPJvfJNQCOWvVzzUr68uycCu05ST7/kQLntuHhEG7vE9k9k8QWwJ6pYYYJVLW4SzKO 3Py5MUj717Bes/RoVK5dB6fAwVnYC4CpdqSyERmCZ6hXYFeOAWj59286mSEBflsMrZsu fZFjlCnzIZSkfpf405pTGcVFZrvl/YUUk9YF8YWOqmUaoKVbkdGwEfPj/TXKNsFccFD6 ju/Ab2Rt4zde1ITuaoxvzHJ7dSZ1clja+hNyEDyUN9O4tk48RUJ7+FLxUGdxuDw9nHPG I1u1cVO5/mifSsGcA2bxYOaK4rlbkgr+MsTOOSySJ63qexQSZ+ZyJSll/LFdsZyGoEbJ jTNg== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id w128si402032pfw.415.2018.03.19.19.02.29; Mon, 19 Mar 2018 19:02:42 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S965016AbeCSXYE (ORCPT + 99 others); Mon, 19 Mar 2018 19:24:04 -0400 Received: from zeniv.linux.org.uk ([195.92.253.2]:60240 "EHLO ZenIV.linux.org.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S964824AbeCSXX5 (ORCPT ); Mon, 19 Mar 2018 19:23:57 -0400 Received: from viro by ZenIV.linux.org.uk with local (Exim 4.87 #1 (Red Hat Linux)) id 1ey47y-0002YP-Ip; Mon, 19 Mar 2018 23:23:42 +0000 Date: Mon, 19 Mar 2018 23:23:42 +0000 From: Al Viro To: Ingo Molnar Cc: Linus Torvalds , Dominik Brodowski , Linux Kernel Mailing List , Arnd Bergmann , linux-arch , Ralf Baechle , James Hogan , linux-mips , Benjamin Herrenschmidt , Paul Mackerras , Michael Ellerman , ppc-dev , Martin Schwidefsky , Heiko Carstens , linux-s390 , "David S . Miller" , sparclinux@vger.kernel.org, Ingo Molnar , Jiri Slaby , the arch/x86 maintainers Subject: Re: [RFC PATCH 4/6] mm: provide generic compat_sys_readahead() implementation Message-ID: <20180319232342.GX30522@ZenIV.linux.org.uk> References: <20180318161056.5377-1-linux@dominikbrodowski.net> <20180318161056.5377-5-linux@dominikbrodowski.net> <20180318174014.GR30522@ZenIV.linux.org.uk> <20180318181848.GU30522@ZenIV.linux.org.uk> <20180319042300.GW30522@ZenIV.linux.org.uk> <20180319092920.tbh2xwkruegshzqe@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20180319092920.tbh2xwkruegshzqe@gmail.com> User-Agent: Mutt/1.9.1 (2017-09-22) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Mar 19, 2018 at 10:29:20AM +0100, Ingo Molnar wrote: > > * Al Viro wrote: > > > On Sun, Mar 18, 2018 at 06:18:48PM +0000, Al Viro wrote: > > > > > I'd done some digging in that area, will find the notes and post. > > > > OK, found: > > Very nice writeup - IMHO this should go into Documentation/! If you want to turn that into something printable - more power to you... FWIW, I think we need to require per-architecture descriptions of ABI for all architectures. Something along the lines of alpha: C ABI: 64bit, location sequence is ($16, $17, $18, $19, $20, $21, stack) No arg padding (as for all 64bit). Stack pointer in $30, return value in $0. Syscall ABI: syscall number in $0, arg slots filled from $16, $17, $18, $19, $20, $21. Return value in $0, error is reported as 1 in $19. Saved syscall number is used as a flag for __force_successful_syscall_return() purposes - sticking 0 there inhibits the effect of negative return value. arm: C ABI: 32bit, location sequence is (r0, r1, r2, r3, stack). Arg padding for 64bit args: to even slot. Stack pointer in sp, return value in r0 Syscall ABI, EABI variant: syscall number in r7. Syscall ABI, OABI variant: syscall number encoded into insn. Syscall ABI (both variants): arg slots filled from r0, r1, r2, r3, r4, r5. Return value in r0. It's not quite a biarch (support of e.g. ioctl handling is absent, etc.; basic syscalls are handled, but that's it). etc. Ideally the information about callee-saved registers, syscall restart logics, etc. should also go there. I'm sick and tired of digging though the asm glue of unfamiliar architectures ;-/ Another relevant piece of information (especially for biarch) is how should sub-word arguments be normalized. E.g. on amd64 both int and long are passed in 64bit words and function that expects an int does *not* care about the upper 32 bits. If you have long f(int a) {return a;}, it will sign-extend the argument. On ppc, OTOH, it won't - the caller is responsible for having the bits 31..63 all equal. That used to be a source of considerable PITA - e.g. kill(2) used to require a compat wrapper on ppc. These days SYSCALL_DEFINE and COMPAT_SYSCALL_DEFINE glue takes care of normalizations. However it doesn't apply for the stuff that does *not* use ...DEFINE and for use of native syscalls on biarch we need a bit more. Consider e.g. 32bit syscall on sparc64 wanting to use the native counterpart. Arguments that are <= 32bit in both ABIs are fine - normalizations will take care of them. Anything that is 64bit in both ABIs means that we will need compat anyway - the argument needs to be recombined from two registers into one. The headache comes from * signed long * unsigned long * pointers Those are word-sized and we need to normalize. Solution before SYSCALL_DEFINE glue: have upper halves forcibly zeroed on entry (which normalizes unsigned long and pointers) and then sign-extend every signed int and signed long in per-syscall glue (that zeroing is guaranteed to denormalize int arguments). Once SYSCALL_DEFINE started to do normalization we disposed on the need to do separate wrappers for int arguments; that still leaves us with signed long ones, but * they are very rare * most of the syscalls passing them need compat for more serious reasons anyway. There are only two exceptions - bdflush(2) and pciconfig_iobase(2). The latter doesn't exist on sparc, the former ignores its signed long argument completely. So we are left with "zero upper halves of all argument-bearing registers upon the entry and have per-syscall glue take care of the rest". For s390 the situation is nastier - normalization for signed and unsigned long is the same as usual, but pointers might have junk in bit 31. IOW, for anything with pointer in arguments we can't just use the native syscall. As the result, s390 doesn't bother with zeroing upper halves in syscall dispatcher and does private mini-wrappers for native syscalls with pointer/long/ulong arguments. That kind of crap really needs to be documented - RTFS becomes somewhat painful when it involves tens of assemblers *and* missing ABI documents (try to locate one for something embedded - great motivation for expanding vocabulary, that) ;-/ FWIW, SYSCALL_DEFINE and its ilk (COMPAT_SYSCALL_DEFINE, s390 COMPAT_SYSCALL_WRAP, etc.) are all about stepping over the ABI gap - we've got some values from userland caller and we need to turn that into a valid C function call that would satisfy C ABI constraints. Some amount of normalization might've been done by syscall dispatcher; this stuff does the rest on per-function basis. > One way to implement this would be to put the argument chain types (string) and > sizes (int) into a special debug section which isn't included in the final kernel > image but which can be checked at link time. Umm... Possible, but I actually believe that we can do that without debug info. WARNING: AVERT YOUR EYES IF YOU HAVE A WEAK STOMACH. I'm _not_ saying that the trickery below is a good idea, no need to break out a straightjacket. It does have some merits, but in the current form it's ugly as hell and almost certainly not fit for inclusion. Consider that as a theoretical exercise, please. Again, don't read further if you are easily squicked. You've been warned. =============================================================================== Look: * we can generate enum {S0 = __TYPE_IS_LL(int), S1 = __TYPE_IS_LL(loff_t), S2 = __TYPE_IS_LL(size_t)}; at preprocessor level (inside the compat wrapper being built). Calculations will be done at compile time, of course, but those _are_ constants (0, 1, 0 in our case). * we can generate enum {OFF0 = 0, OFF1 = STEP(OFF0 + S0, S1), OFF2 = STEP(OFF1 + S1, S2)}; at the same time, with STEP(base, big) defined as base + 1 for something like x86, (big ? ((base) | 1) + 1 : (base) + 1) for arm and friends, base + 1 + (big && base == 3) for s390. Again, compile-time calculations. For x86 - (0, 1, 3), for arm - (0, 2, 4), etc. * we can generate C_S_moron((__force int)(S0 ? PAIR(OFF0) : ARG(OFF0)), (__force loff_t)(S1 ? PAIR(OFF1) : ARG(OFF1)), (__force size_t)(S2 ? PAIR(OFF2) : ARG(OFF2))); in the body of wrapper. With PAIR(n) defined as either ((u64)ARG(n) << 32) | ARG(n+1) or ((u64)ARG(n+1) << 32) | ARG(n) (depending upon endianness) and ARG(n) defined as (n == 0 ? a0 : n == 1 ? a1 : n == 2 ? a2 : n == 3 ? a3 : n == 4 ? a4 : n == 5 ? a5 : a6) Note that each of those suckers expands to a cascade of conditional expressions with all conditions being integer constant expressions, no more than one of those being true. In our case that crap will fold into C_S_moron((__force int)a0, (__force loff_t)(((u64)a2 << 32)|a1), (__force size_t)a3); as soon as optimizations at the level of 0 ? x : y => y and 1 ? x : y => x are done. And we have, in effect, static inline long C_S_moron(int, loff_t, size_t); long compat_SyS_moron(long a0, long a1, long a2, long a3, long a4, long a5, long a6) { return C_S_moron((__force int)a0, (__force loff_t)(((u64)a2 << 32)|a1), (__force size_t)a3); } static inline long C_S_moron(int fd, loff_t offset, size_t count) { whatever body you had for it } That - from COMPAT_SYSCALL_DEFINE3(moron, int, fd, loff_t, offset, size_t, count) { whatever body you had for it } We can use similar machinery for SYSCALL_DEFINE itself, so that SyS_moron() would be defined with (long, long, long, long, long, long) as arguments and not (long, long long, long) as we have now. It's not impossible to do. It won't be pretty, but that use of local enums allows to avoid unbearably long expansions. Benefits: * all SyS... wrappers (i.e. the thing that really ought to go into syscall tables) have the same type. * we could have SYSCALL_DEFINE produce a trivial compat wrapper, have explicit COMPAT_SYSCALL_DEFINE discard that thing and populate the compat syscall table *entirely* with compat_SyS_..., letting the linker sort it out. That way we don't need to keep track of what can use native and what needs compat in each compat table on biarch. * s390 compat wrappers would disappear with that approach. * we could even stop generating sys_... aliases - if syscall table is generated by slapping SyS_... or compat_SyS_... on the name given there, we don't need to _have_ those sys_... things at all. All SyS_... would have the same type, so the pile in syscalls.h would not be needed - we could generate the externs at the same time we generate the syscall table. And yes, it's a high-squick approach. I know and I'm not saying it's a good idea. OTOH, to quote the motto of philosophers and shell game operators, "there's something in it"... > For example this attempt at creating a new system call: > > SYSCALL_DEFINE3(moron, int, fd, loff_t, offset, size_t, count) > > ... would translate into something like: > > .name = "moron", .pattern = "WWW", .type = "int", .size = 4, > .name = NULL, .type = "loff_t", .size = 8, > .name = NULL, .type = "size_t", .size = 4, > .name = NULL, .type = NULL, .size = 0, /* end of parameter list */ > > i.e. "WDW". The build-time constraint checker could then warn about: > > # error: System call "moron" uses invalid 'WWW' argument mapping for a 'WDW' sequence > # please avoid long-long arguments or use 'SYSCALL_DEFINE3_WDW()' instead ... if you do 32bit build. > Each architecture can provide its own syscall parameter checking logic. Both > 'stack boundary' and parameter packing rules would be straightforward to express > if we had such a data structure. > > Also note that this tool could also check for optimum packing, i.e. if the new > system call is defined as: > > SYSCALL_DEFINE3_WDW(moron, int, fd, loff_t, offset, size_t, count) > Such tooling could also do other things, such as limit the C types used for system > call defines to a well-chosen set of ABI-safe types, such as: > > 3 key_t > 3 uint32_t > 4 aio_context_t > 4 mqd_t > 4 timer_t > 10 clockid_t > 10 gid_t > 10 loff_t > 10 long Uhhuh - ABI-safe is not how I would describe that. Sodding PITA, fortunately rare, is more like it... > 10 old_gid_t > 10 old_uid_t > 10 umode_t > 11 uid_t > 31 pid_t > 34 size_t > 69 unsigned int > 130 unsigned long > 226 int > > This would also allow us some cleanups as well, such as dropping the pointless > 'const' from arithmetic types in syscall definitions for example. > > etc. > > Basically this tool would be a secondary parser of the syscall arguments, with > most of the parsing and type sizing difficulties solved by the C parser already. > > I think this problem could be much more sanely solved via annotations and a bit of > tooling, than trying to trick CPP into doing this for us (which won't really work > in any case). See above. It *can* be done. Not by cpp alone, of course - you need compiler involvement.