Date: Fri, 4 May 2018 06:14:41 -0700
From: Matthew Wilcox
To: Linus Torvalds
Cc: Andrew Morton, Matthew Wilcox, linux-mm, Kees Cook,
	Linux Kernel Mailing List, Kernel Hardening
Subject: Re: [PATCH 2/2] mm: Add kvmalloc_ab_c and kvzalloc_struct
Message-ID: <20180504131441.GA24691@bombadil.infradead.org>
References: <20180214182618.14627-1-willy@infradead.org>
 <20180214182618.14627-3-willy@infradead.org>

On Fri, May 04, 2018 at 07:42:52AM +0000, Linus Torvalds wrote:
> On Wed, Feb 14, 2018 at 8:27 AM Matthew Wilcox wrote:
> > +static inline __must_check
> > +void *kvmalloc_ab_c(size_t n, size_t size, size_t c, gfp_t gfp)
> > +{
> > +	if (size != 0 && n > (SIZE_MAX - c) / size)
> > +		return NULL;
> > +
> > +	return kvmalloc(n * size + c, gfp);
>
> Ok, so some more bikeshedding:

I'm putting up the bikeshed against your house ... the colour is your
choice!

>  - I really don't want to encourage people to use kvmalloc().
>
>    In fact, the conversion I saw was buggy. You can *not* convert a
>    GFP_ATOMIC user of kmalloc() to use kvmalloc.

Not sure which conversion you're referring to; not one of mine, I hope?

>  - that divide is really really expensive on many architectures.

'c' and 'size' are _supposed_ to be constant and get evaluated at
compile-time, i.e. you should get something like this on x86:

   0:	48 b8 fe ff ff ff ff 	movabs $0x1ffffffffffffffe,%rax
   7:	ff ff 1f
   a:	48 39 c7             	cmp    %rax,%rdi
   d:	76 09                	jbe    18
   f:	48 c1 e7 03          	shl    $0x3,%rdi
  13:	e9 00 00 00 00       	jmpq   18	14: R_X86_64_PLT32 malloc-0x4
  18:	31 c0                	xor    %eax,%eax
  1a:	c3                   	retq

Now, if someone's an idiot, then you'll get the divide done at runtime,
and that'll be expensive.
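To make that concrete, this is the sort of caller I have in mind (the
struct names below are invented purely for illustration, and it assumes
kvmalloc_ab_c() from this patch).  With 'size' and 'c' both compile-time
constants, the divide in the overflow check folds into a comparison
against a precomputed constant:

struct foo {
	unsigned long flags;
	unsigned long data[7];
};

struct foo_table {
	unsigned int nr;
	struct foo entries[];
};

/*
 * sizeof(struct foo) and offsetof(struct foo_table, entries) are both
 * known at compile time, so (SIZE_MAX - c) / size is precomputed and
 * no divide instruction is emitted.
 */
static struct foo_table *alloc_foo_table(size_t n, gfp_t gfp)
{
	return kvmalloc_ab_c(n, sizeof(struct foo),
			     offsetof(struct foo_table, entries), gfp);
}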
> Normal kernel allocations are *limited*. It's simply not ok to
> allocate megabytes (or gigabytes) of memory in general. We have
> serious limits, and we *should* have serious limits. If people worry
> about the multiply overflowing because a user is controlling the size
> values, then dammit, such a user should not be able to do a huge
> gigabyte vmalloc() that exhausts memory and then spends time clearing
> it all!

I agree.

> So the whole notion that "overflows are dangerous, let's get rid of
> them" somehow fixes a bug is BULLSHIT. You literally introduced a
> *new* bug by removing the normal kmalloc() size limit because you
> thought that overflows are the only problem.

Rather, I replaced one bug with another. The removed bug was one where
we allocated 24 bytes and then indexed into the next slab object. The
added bug is that someone can now persuade the driver to allocate
gigabytes of memory. It's a less severe bug, but I take your point.
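To spell out the shape of the removed bug (names invented here; the
real driver is different):

struct item {
	unsigned long key;
	unsigned long value;
};

static struct item *alloc_items(size_t n)
{
	/*
	 * If n comes from userspace and is large enough, the
	 * multiplication wraps around SIZE_MAX, kmalloc() hands back a
	 * tiny object, and the subsequent items[i] writes walk into
	 * the next slab object.
	 */
	return kmalloc(n * sizeof(struct item), GFP_KERNEL);
}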
We do have _some_ limits in vmalloc -- we fail if you're trying to
allocate more memory than is in the machine, and we fail if there's
insufficient contiguous space in the virtual address space. But, yes,
this does allow people to allocate more memory than kmalloc would
allow.

> So stop doing things like this. We should not do a stupid divide,
> because we damn well know that it is NOT VALID to allocate arrays
> that have hundreds of thousands of elements, or where the size of one
> element is very big.
>
> So get rid of the stupid divide, and make the limits be much
> stricter. Like saying "the array element size had better be smaller
> than one page" (because honestly, bigger elements are not valid in
> the kernel), and "the size of the array cannot be more than
> pick-some-number-out-of-your-ass".
>
> So just make the divide go the hell away, and check the size for
> validity.
>
> Something like
>
>	if (size > PAGE_SIZE)
>		return NULL;
>	if (elem > 65535)
>		return NULL;
>	if (offset > PAGE_SIZE)
>		return NULL;
>	return kzalloc(size*elem+offset);
>
> and now you (a) guarantee it can't overflow and (b) don't make people
> use crazy vmalloc() allocations when they damn well shouldn't.

I find your faith in the size of structs in the kernel touching ;-)

struct cmp_data {			/* size: 290904, cachelines: 4546, members: 11 */
struct dec_data {			/* size: 274520, cachelines: 4290, members: 10 */
struct cpu_entry_area {			/* size: 180224, cachelines: 2816, members: 7 */
struct saved_cmdlines_buffer {		/* size: 131104, cachelines: 2049, members: 5 */
struct debug_store_buffers {		/* size: 131072, cachelines: 2048, members: 2 */
struct bunzip_data {			/* size: 42648, cachelines: 667, members: 23 */
struct inflate_workspace {		/* size: 42312, cachelines: 662, members: 2 */
struct xz_dec_lzma2 {			/* size: 28496, cachelines: 446, members: 5 */
struct lzma_dec {			/* size: 28304, cachelines: 443, members: 21 */
struct rcu_state {			/* size: 17344, cachelines: 271, members: 34 */
struct pglist_data {			/* size: 18304, cachelines: 286, members: 34 */
struct tss_struct {			/* size: 12288, cachelines: 192, members: 2 */
struct bts_ctx {			/* size: 12288, cachelines: 192, members: 3 */

Those are just the ones above 10kB. Sure, I can see some of them are
boot-time use only, or we allocate one per node, or whatever. But
people do create arrays of these things. The biggest object we have in
the slab cache today is 23488 bytes (kvm_vcpu) -- at least on my
laptop. Maybe there's some insane driver out there that's creating
larger things.

> And yeah, if somebody has a page size bigger than 64k, then the above
> can still overflow. I'm sorry, that architecture is broken shit.
>
> Are there cases where vmalloc() is ok? Yes. But they should be rare,
> and they should have a good reason for them. And honestly, even then
> the above limits really, really sound quite reasonable. There is no
> excuse for million-entry arrays in the kernel. You are doing
> something seriously wrong if you do those.

Normally, yes. But then you get people like Google who want to have a
million file descriptors open in a single process. So I'm leery of
imposing hard limits like the ones you suggest above, because I'm not
going to be the one Google comes to when they want to exceed the
limit. If you want to draw that line in the sand, then I'm happy to
respin the patch in that direction.

We really have two reasons for using vmalloc -- one is "fragmentation
currently makes it impossible to allocate enough contiguous memory to
satisfy your needs" and the other is "this request is for too much
memory to satisfy through the buddy allocator". kvmalloc is normally
(not always; see the file descriptor example above) for the first kind
of problem, but I wonder if kvmalloc() shouldn't have the same limit
as kmalloc (2048 pages), with a kvmalloc_large() added which does not
impose that limit check.
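Something like the below, perhaps. This is only a sketch: the names
are up for grabs, __kvmalloc_unbounded() merely stands in for what
kvmalloc() does today, and the 2048-page cap is just the existing
kmalloc limit mentioned above.

/* Sketch only; __kvmalloc_unbounded() is a stand-in for today's kvmalloc(). */
static inline void *kvmalloc(size_t size, gfp_t flags)
{
	/* Same cap as kmalloc: refuse anything over 2048 pages. */
	if (WARN_ON_ONCE(size > 2048 * PAGE_SIZE))
		return NULL;
	return __kvmalloc_unbounded(size, flags);
}

/* For the rare callers (e.g. giant fd tables) which really need more. */
static inline void *kvmalloc_large(size_t size, gfp_t flags)
{
	return __kvmalloc_unbounded(size, flags);
}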