Date: Tue, 5 May 2020 13:03:58 +0200
From: Christoph Hellwig
To: Jann Horn
Cc: Andrew Morton, Linus Torvalds, Christoph Hellwig,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    linux-fsdevel@vger.kernel.org, Alexander Viro, "Eric W. Biederman",
    Oleg Nesterov, Russell King, linux-arm-kernel@lists.infradead.org,
    Mark Salter, Aurelien Jacquiot, linux-c6x-dev@linux-c6x.org,
    Yoshinori Sato, Rich Felker, linux-sh@vger.kernel.org
Subject: Re: [PATCH v2 4/5] binfmt_elf, binfmt_elf_fdpic: Use a VMA list snapshot
Message-ID: <20200505110358.GC17400@lst.de>
References: <20200429214954.44866-1-jannh@google.com> <20200429214954.44866-5-jannh@google.com>
In-Reply-To: <20200429214954.44866-5-jannh@google.com>

On Wed, Apr 29, 2020 at 11:49:53PM +0200, Jann Horn wrote:
> In both binfmt_elf and binfmt_elf_fdpic, use a new helper
> dump_vma_snapshot() to take a snapshot of the VMA list (including the gate
> VMA, if we have one) while protected by the mmap_sem, and then use that
> snapshot instead of walking the VMA list without locking.
>
> An alternative approach would be to keep the mmap_sem held across the
> entire core dumping operation; however, keeping the mmap_sem locked while
> we may be blocked for an unbounded amount of time (e.g. because we're
> dumping to a FUSE filesystem or so) isn't really optimal; the mmap_sem
> blocks things like the ->release handler of userfaultfd, and we don't
> really want critical system daemons to grind to a halt just because someone
> "gifted" them SCM_RIGHTS to an eternally-locked userfaultfd, or something
> like that.
>
> Since both the normal ELF code and the FDPIC ELF code need this
> functionality (and if any other binfmt wants to add coredump support in the
> future, they'd probably need it, too), implement this with a common helper
> in fs/coredump.c.
>
> A downside of this approach is that we now need a bigger amount of kernel
> memory per userspace VMA in the normal ELF case, and that we need O(n)
> kernel memory in the FDPIC ELF case at all; but 40 bytes per VMA shouldn't
> be terribly bad.
>
> Signed-off-by: Jann Horn
> ---
>  fs/binfmt_elf.c          | 152 +++++++++++++--------------------------
>  fs/binfmt_elf_fdpic.c    |  86 ++++++++++------------
>  fs/coredump.c            |  68 ++++++++++++++++++
>  include/linux/coredump.h |  10 +++
>  4 files changed, 168 insertions(+), 148 deletions(-)
>
> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> index fb36469848323..dffe9dc8497ca 100644
> --- a/fs/binfmt_elf.c
> +++ b/fs/binfmt_elf.c
> @@ -1292,8 +1292,12 @@ static bool always_dump_vma(struct vm_area_struct *vma)
>  	return false;
>  }
>
> +#define DUMP_SIZE_MAYBE_ELFHDR_PLACEHOLDER 1
> +
>  /*
>   * Decide what to dump of a segment, part, all or none.
> + * The result must be fixed up via vma_dump_size_fixup() once we're in a context
> + * that's allowed to sleep arbitrarily long.
>   */
>  static unsigned long vma_dump_size(struct vm_area_struct *vma,
>  				   unsigned long mm_flags)
> @@ -1348,30 +1352,15 @@ static unsigned long vma_dump_size(struct vm_area_struct *vma,
>
>  	/*
>  	 * If this looks like the beginning of a DSO or executable mapping,
> -	 * check for an ELF header. If we find one, dump the first page to
> -	 * aid in determining what was mapped here.
> +	 * we'll check for an ELF header. If we find one, we'll dump the first
> +	 * page to aid in determining what was mapped here.
> +	 * However, we shouldn't sleep on userspace reads while holding the
> +	 * mmap_sem, so we just return a placeholder for now that will be fixed
> +	 * up later in vma_dump_size_fixup().
>  	 */
>  	if (FILTER(ELF_HEADERS) &&
> -	    vma->vm_pgoff == 0 && (vma->vm_flags & VM_READ)) {
> -		u32 __user *header = (u32 __user *) vma->vm_start;
> -		u32 word;
> -		/*
> -		 * Doing it this way gets the constant folded by GCC.
> -		 */
> -		union {
> -			u32 cmp;
> -			char elfmag[SELFMAG];
> -		} magic;
> -		BUILD_BUG_ON(SELFMAG != sizeof word);
> -		magic.elfmag[EI_MAG0] = ELFMAG0;
> -		magic.elfmag[EI_MAG1] = ELFMAG1;
> -		magic.elfmag[EI_MAG2] = ELFMAG2;
> -		magic.elfmag[EI_MAG3] = ELFMAG3;
> -		if (unlikely(get_user(word, header)))
> -			word = 0;
> -		if (word == magic.cmp)
> -			return PAGE_SIZE;
> -	}
> +	    vma->vm_pgoff == 0 && (vma->vm_flags & VM_READ))
> +		return DUMP_SIZE_MAYBE_ELFHDR_PLACEHOLDER;
>
>  #undef FILTER
>
> @@ -1381,6 +1370,22 @@ static unsigned long vma_dump_size(struct vm_area_struct *vma,
>  	return vma->vm_end - vma->vm_start;
>  }
>
> +/* Fix up the result from vma_dump_size(), now that we're allowed to sleep. */
> +static void vma_dump_size_fixup(struct core_vma_metadata *meta)
> +{
> +	char elfmag[SELFMAG];
> +
> +	if (meta->dump_size != DUMP_SIZE_MAYBE_ELFHDR_PLACEHOLDER)
> +		return;
> +
> +	if (copy_from_user(elfmag, (void __user *)meta->start, SELFMAG)) {
> +		meta->dump_size = 0;
> +		return;
> +	}
> +	meta->dump_size =
> +		(memcmp(elfmag, ELFMAG, SELFMAG) == 0) ? PAGE_SIZE : 0;
> +}

While this code looks entirely correct, it took me way too long to follow.
I'd move the DUMP_SIZE_MAYBE_ELFHDR_PLACEHOLDER check into the caller, and
then have a simple function like this:

static void vma_dump_size_fixup(struct core_vma_metadata *meta)
{
	char elfmag[SELFMAG];

	if (copy_from_user(elfmag, (void __user *)meta->start, SELFMAG) ||
	    memcmp(elfmag, ELFMAG, sizeof(elfmag)) != 0)
		meta->dump_size = 0;
	else
		meta->dump_size = PAGE_SIZE;
}

Also a few comments explaining why we do this fixup would help readers of
the code.

> -		if (vma->vm_flags & VM_WRITE)
> -			phdr.p_flags |= PF_W;
> -		if (vma->vm_flags & VM_EXEC)
> -			phdr.p_flags |= PF_X;
> +		phdr.p_flags = meta->flags & VM_READ ? PF_R : 0;
> +		phdr.p_flags |= meta->flags & VM_WRITE ? PF_W : 0;
> +		phdr.p_flags |= meta->flags & VM_EXEC ? PF_X : 0;

Sorry for another nitpick, but I find the spelled-out version with the ifs
a lot easier to read.

> +int dump_vma_snapshot(struct coredump_params *cprm, int *vma_count,
> +	struct core_vma_metadata **vma_meta,
> +	unsigned long (*dump_size_cb)(struct vm_area_struct *, unsigned long))

Please don't use single tabs for indenting parameter continuations (we
actually have two styles - either two tabs or aligned after the opening
brace, pick your poison :))

> +	*vma_meta = kvmalloc_array(*vma_count, sizeof(**vma_meta), GFP_KERNEL);
> +	if (!*vma_meta) {
> +		up_read(&mm->mmap_sem);
> +		return -ENOMEM;
> +	}
> +
> +	for (i = 0, vma = first_vma(current, gate_vma); vma != NULL;
> +			vma = next_vma(vma, gate_vma)) {
> +		(*vma_meta)[i++] = (struct core_vma_metadata) {
> +			.start = vma->vm_start,
> +			.end = vma->vm_end,
> +			.flags = vma->vm_flags,
> +			.dump_size = dump_size_cb(vma, cprm->mm_flags)
> +		};

This looks a little weird.  Why not kcalloc + just initialize the four
fields we actually fill out here?
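
Something along these lines is what I had in mind - completely untested,
and assuming the surrounding first_vma()/next_vma() iteration stays as it
is in your patch (the local "m" is just for illustration):

	*vma_meta = kcalloc(*vma_count, sizeof(**vma_meta), GFP_KERNEL);
	if (!*vma_meta) {
		up_read(&mm->mmap_sem);
		return -ENOMEM;
	}

	for (i = 0, vma = first_vma(current, gate_vma); vma != NULL;
	     vma = next_vma(vma, gate_vma), i++) {
		struct core_vma_metadata *m = (*vma_meta) + i;

		/* only the fields we actually use; kcalloc zeroed the rest */
		m->start = vma->vm_start;
		m->end = vma->vm_end;
		m->flags = vma->vm_flags;
		m->dump_size = dump_size_cb(vma, cprm->mm_flags);
	}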
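
And just to spell out the earlier p_flags nitpick, the if-based version I
find easier to read would simply be (again untested):

		phdr.p_flags = 0;
		if (meta->flags & VM_READ)
			phdr.p_flags |= PF_R;
		if (meta->flags & VM_WRITE)
			phdr.p_flags |= PF_W;
		if (meta->flags & VM_EXEC)
			phdr.p_flags |= PF_X;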