Received: by 2002:a25:4158:0:0:0:0:0 with SMTP id o85csp3836868yba; Tue, 23 Apr 2019 10:24:03 -0700 (PDT) X-Google-Smtp-Source: APXvYqzNDfyV/LXbHD6Sus1ek2WlCMiciLIbUvkbJgs58H0kvlIql79sNYSeSMgPgc3+YT+O2nBN X-Received: by 2002:a63:f26:: with SMTP id e38mr26420808pgl.290.1556040243850; Tue, 23 Apr 2019 10:24:03 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1556040243; cv=none; d=google.com; s=arc-20160816; b=h4AIuUDpdTVoM9bCWdGE0yuOJ+ZPBQX3ExtCDpEGC0ZUc9Jz9Jfx56gl1AB5R/MlhW J+7yDMJ3JevyQPSDJz2f44rLpZPDwzquBFpO1PwlAdeKgasm95tjQVGhlMKcbI85tks1 6OhQb4Ej86TbCHk98bwF1162K5v22Xqqt6/pJ5rN6Oet3v/lKKkszdfUgcAYA0Gi0+bQ HF4oNxY/k+qUtGblpuJdrkHQ6AiHklwibsPgnr0B9PmPZzPXZKLtCpzPtUInGF251ngu hg6pB5VolIyZB0g614g+CPV51sXh+IVkfcoWI2k7ddPqyoMn0GMFVrb9ngei4UqieD7M sdnA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:dkim-signature; bh=+giLp4a8jL/GPPJVSIyywkA7ZxF/x1pU/KSuhFEmvyQ=; b=ZVCSk88ZBJt3+65a8gjcgwW/UA3d0HA0FRA5jWcyJv7G0VnZ/SZPU/7bLUxAHGMCH+ tgVA2R1wTm96AH+q5xjhzQ2+jvEJghi6UkzAoCd+MF0svHKW5ntTBaDT+SD0edZfm0Td Qwt4xsCJLCLxgTJxTgf5AvTtrRGt9EOBHZa0QwiYgVRphlbn9CTUBGcHN1FzCr3BjJbL wGiCw9z07Tf0r3aCYMQmQrYmP63bxmhIaZvQAVPs/euNeNrE17EP5NvYTt1qXyYe6ak2 CE3GFxpfsxVuNmITmw1BBmsHa7YpgcAPIS1HSrs9zlXCibakB4BRolDf5UHFTOtrp01V Eeiw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@kernel.org header.s=default header.b=TFvMj1OA; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id a23si16052528pls.188.2019.04.23.10.23.42; Tue, 23 Apr 2019 10:24:03 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@kernel.org header.s=default header.b=TFvMj1OA; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729106AbfDWRWN (ORCPT + 99 others); Tue, 23 Apr 2019 13:22:13 -0400 Received: from mail.kernel.org ([198.145.29.99]:53858 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726655AbfDWRWM (ORCPT ); Tue, 23 Apr 2019 13:22:12 -0400 Received: from mail-wr1-f53.google.com (mail-wr1-f53.google.com [209.85.221.53]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id 5BBFF20835 for ; Tue, 23 Apr 2019 17:22:11 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1556040131; bh=grcgrY3pMNmTEi4dSVR08aPt/HGdaYVJ6mWMikEzvUs=; h=References:In-Reply-To:From:Date:Subject:To:Cc:From; b=TFvMj1OASrW2PlOIHHSLsph+0jExDlsQlVW4xOTve7R36W62W+x087LP6tQSXJ4aC /Wwb3/T4y2xj8k8EVsFm8/QssePYTsviHys+teFjv6sHZ4bDhb/3FSQVVDCxYApDKW Nx4Dhk2thpuVrL7Tg4VQYkerFchuWLxT3jgohfSs= Received: by mail-wr1-f53.google.com with SMTP id c6so15542259wrm.1 for ; Tue, 23 Apr 2019 10:22:11 -0700 (PDT) X-Gm-Message-State: APjAAAXdWlcOsyyebqcwgKOGIaOzKAxpMXKF2A77bDh+xdl2cElk5XUG Av4fk2K7SkcCp+C8WOAoSAVNbEESUasZ9Wv6KByWsw== X-Received: by 2002:adf:fb4a:: with SMTP id c10mr3061569wrs.309.1556040129985; Tue, 23 Apr 2019 10:22:09 -0700 (PDT) MIME-Version: 1.0 References: <20190421160600.GA31092@avx2> <20190421182842.GD35603@gmail.com> <8B42CD57-9343-4234-A96D-80337BFFDF0E@zytor.com> <20190421211007.GA30444@avx2> <20190422103449.GA75723@gmail.com> <20190422220948.GB26031@avx2> <20190423111258.GA23410@gmail.com> In-Reply-To: <20190423111258.GA23410@gmail.com> From: Andy Lutomirski Date: Tue, 23 Apr 2019 10:21:59 -0700 X-Gmail-Original-Message-ID: Message-ID: Subject: Re: [PATCH] x86_64: uninline TASK_SIZE To: Ingo Molnar Cc: Andy Lutomirski , Alexey Dobriyan , "H. Peter Anvin" , Thomas Gleixner , Ingo Molnar , Borislav Petkov , LKML , X86 ML , Peter Zijlstra , Linus Torvalds , Al Viro Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Apr 23, 2019 at 4:13 AM Ingo Molnar wrote: > > > * Andy Lutomirski wrote: > > > > Saving 2KB on a defconfig is quite a lot. > > > > Saving 2kB of text by adding 8 bytes to thread_info seems rather > > dubious to me. You only need 256 tasks before you lose. My > > not-particularly-loaded laptop has 865 tasks right now. > > I was suggesting current->task_size or thread_info->task_size as a way to > 100% avoid the function call overhead. Worth a tiny amount of RAM - even > with 1 million tasks it's only 4MB of RAM. ;-) > > Some TASK_SIZE users are prominent syscalls: mmap(), > > > As a general principle, the mere existence of TIF_ADDR32 is a bug. The > > value of that flag is *wrong* under the 32-bit variant of CRIU. How > > about instead making some more progress toward getting rid of dubious > > TASK_SIZE users? I'm working on a little series to get rid of most of > > them. Meanwhile: it sure looks like a large fraction of the users are > > confused as to whether TASK_SIZE is the highest user address or the > > lowest non-user address. > > I really like that, replacing TASK_SIZE with *nothing* would be even > faster. > > In fact instead of just reducing its usage I'd suggest removing TASK_SIZE > altogether and renaming TASK_SIZE_MAX back to TASK_SIZE, or something > like that - the confusion from the deceptively macro-ish naming of > TASK_SIZE is real. > > The original commit of making TASK_SIZE dynamic on the task's compat flag > was done in: > > 84929801e14d: [PATCH] x86_64: TASK_SIZE fixes for compatibility mode processes > > Here's the justification given: > > Appended patch will setup compatibility mode TASK_SIZE properly. This will > fix atleast three known bugs that can be encountered while running > compatibility mode apps. > > a) A malicious 32bit app can have an elf section at 0xffffe000. During > exec of this app, we will have a memory leak as insert_vm_struct() is > not checking for return value in syscall32_setup_pages() and thus not > freeing the vma allocated for the vsyscall page. And instead of exec > failing (as it has addresses > TASK_SIZE), we were allowing it to > succeed previously. > > b) With a 32bit app, hugetlb_get_unmapped_area/arch_get_unmapped_area > may return addresses beyond 32bits, ultimately causing corruption > because of wrap-around and resulting in SEGFAULT, instead of returning > ENOMEM. > > c) 32bit app doing this below mmap will now fail. > > mmap((void *)(0xFFFFE000UL), 0x10000UL, PROT_READ|PROT_WRITE, > MAP_FIXED|MAP_PRIVATE|MAP_ANON, 0, 0); > > I believe a) is addressed by getting rid of the vsyscall page - but it > might also not be a current problem because the vsyscall page has its own > gate-vma now. > I suspect that whatever issue this is predates my involvement in Linux :) The "gate" VMA, aka vsyscall, is at 0xffffffffff600000, and the vDSO setup code shouldn't anywhere near that fragile. Also, if this really a bug, we have it for the 64-bit case, too, and TASK_SIZE isn't going to help. > b) shouldn't be an issue if the mmap allocator correctly treats the > compat bit - this doesn't require generic TASK_SIZE variations though, as > the mmap allocator is already specific to arch/x86/. This was mostly fixed by the CRIU folks a while back, I think. > > c) is a variant of a) I believe, which should be fixed by now. > > I just looked through some of the current TASK_SIZE users, and *ALL* of > them seem dubious to me, with the exception of the mmap allocators. In Even the munmap one seems to be a bug. > fact some of them seem to be active bugs: > > kernel/: > > - PR_SET_MM_MAP, PR_SET_MM_MAP_SIZE, prctl_set_mm() et al. Ugh, what a > nasty prctl()! But besides that, the TASK_SIZE restriction to the ABI > looks questionable: if we offer this CRIU functionality then why > should it be impossible for a 32-bit CRIU task to switch to 64-bit? > > - kernel/events/core.c: TASK_SIZE_MAX should be a fine filter here, in > fact it's probably *wrong* to restrict the profiling data here just > because the task happens to be in 32-bit compat mode. Yep. I have a patch fo rthis. > > - kernel/rseq.c: is this TASK_SIZE restriction even required, wouldn't > TASK_SIZE_MAX be sufficient? Presumably. > So I concur 100% that most TASK_SIZE uses are questionable. In fact think > 84929801e14d was a mistake, and we should effectively revert it > carefully, by: > > - First by moving almost all TASK_SIZE users over to TASK_SIZE_MAX, > analyzing and justifying the impact case by case. > > - Then making the mmap allocators compat compatible (ha) without relying > on TASK_SIZE. I have a somewhat hacky patch for this right now. > > - Renaming TASK_SIZE back to TASK_SIZE_MAX and getting rid of the > TASK_SIZE and TASK_SIZE_MAX differentiation. > > Or am I missing some complication? Seems like a great idea to me. BTW, what the heck is up with get_gate_page()? I'm struggling to understand what it's even trying to do. If there's an architecture that allows a user program to mremap() or otherwise position its gate VMA between TASK_SIZE and TASK_SIZE_MAX, then that code is going to explode horribly. A whole bunch of work in this direction is here: https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=x86/fixes It's almost entirely untested.