Date: Thu, 18 Oct 2018 17:58:42 -0400
From: Andrea Arcangeli
To: Al Viro
Cc: Christian Brauner, keescook@chromium.org, linux-kernel@vger.kernel.org,
    ebiederm@xmission.com, mcgrof@kernel.org, akpm@linux-foundation.org,
    joe.lawrence@redhat.com, longman@redhat.com, linux@dominikbrodowski.net,
    adobriyan@gmail.com, linux-api@vger.kernel.org, Miklos Szeredi,
    Eric Dumazet
Subject: Re: [PATCH v3 2/2] sysctl: handle overflow for file-max
Message-ID: <20181018215842.GE20140@redhat.com>
References: <20181016223322.16844-1-christian@brauner.io>
    <20181016223322.16844-3-christian@brauner.io>
    <20181017003548.GA32577@ZenIV.linux.org.uk>
In-Reply-To: <20181017003548.GA32577@ZenIV.linux.org.uk>
X-Mailing-List: linux-kernel@vger.kernel.org

Hi Al,

On Wed, Oct 17, 2018 at 01:35:48AM +0100, Al Viro wrote:
> On Wed, Oct 17, 2018 at 12:33:22AM +0200, Christian Brauner wrote:
> > Currently, when writing
> >
> > echo 18446744073709551616 > /proc/sys/fs/file-max
> >
> > /proc/sys/fs/file-max will overflow and be set to 0. That quickly
> > crashes the system.
> > This commit sets the max and min value for file-max and returns -EINVAL
> > when a long int is exceeded. Any higher value cannot currently be used as
> > the percpu counters are long ints and not unsigned integers. This behavior
> > also aligns with other tuneables that return -EINVAL when their range is
> > exceeded. See e.g. [1], [2] and others.
>
> Mostly sane, but... get_max_files() users are bloody odd. The one in
> file-max limit reporting looks like a half-arsed attempt in "[PATCH] fix
> file counting". The one in af_unix.c, though... I don't remember how
> that check had come to be - IIRC that was a strange fallout of a thread
> with me, Andrea and ANK involved, circa 1999, but I don't remember details;
> Andrea, any memories? It might be worth reconsidering... The change in
> question is in 2.2.4pre6; what do we use unix_nr_socks for? We try to
> limit the number of PF_UNIX socks by 2 * max_files, but max_files can be
> huge *and* non-constant (i.e. it can decrease). What's more, unix_tot_inflight
> is unsigned int and max_files might exceed 2^31 just fine since "fs: allow
> for more than 2^31 files" back in 2010...
> Something's fishy there...

Feels like a lifetime ago :), but looking into it I remembered some of
it.

That thread was about some instability in unix sockets due to an
unrelated bug in the garbage collector. While reviewing your fix, I
probably incidentally found a resource exhaustion problem in doing a
connect();close() loop on a listening af_unix stream socket. I found
an exploit somewhere in my home directory, dated '99 according to
ls -l.

Then ANK found another resource exhaustion by sending on datagram
sockets, which I also found an exploit for in my home directory.

ANK pointed out that a connect syscall allocates two sockets: one to
be accepted by the listening process, the other for the connect
itself. That must be the explanation of the "*2".

The "max_files*2" check was probably a patch from you (it was not
overflowing back then), in an attempt to fix the garbage collector
issue, which initially looked like resource exhaustion.

I may have suggested checking sk_max_ack_backlog and failing connect()
in such a case to solve the resource exhaustion, but my proposal was
obviously broken because connect() would return an error when the
backlog was full, and I suppose I didn't implement anything like
unix_wait_for_peer.

So I guess (not 100% sure) the get_max_files()*2 check stayed, not
because of the bug in the garbage collector, which was fixed
independently, but as a stop-gap measure for the connect();close()
loop resource exhaustion.

I tried the exploit that does a connect();close() in a loop and it
gracefully hangs in unix_wait_for_peer() after sk_max_ack_backlog
connects. Out of curiosity I also tried the dgram exploit and it hangs
in sock_alloc_send_pskb() on the sk_wmem_alloc_get(sk) < sk->sk_sndbuf
check. The file*2 limit couldn't have helped that one anyway.

If I set /proc/sys/net/core/somaxconn to 1000000 the exploit works
fine again: the connect;close loop again allocates an unbounded amount
of kernel RAM on behalf of a tiny-RSS process and triggers OOM (there
was no OOM killer in v2.2 I suppose). By default somaxconn is 128.
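
To make the backlog cap described above concrete, here is a minimal
userspace sketch, not the original exploit. It assumes Linux behaviour
for AF_UNIX stream sockets; the function name and the small backlog
and attempt counts are made up for illustration. The clients are
non-blocking so that a full backlog shows up as an error instead of a
hang in unix_wait_for_peer():

```python
import os
import socket
import tempfile


def count_queued_connects(backlog: int = 4, attempts: int = 64) -> int:
    """Count connect();close() calls that succeed against a listener
    that never calls accept()."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "listener.sock")
        listener = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        try:
            listener.bind(path)
            # The kernel clamps this value to net.core.somaxconn.
            listener.listen(backlog)
            succeeded = 0
            for _ in range(attempts):
                client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
                # Non-blocking: a full backlog raises instead of sleeping.
                client.setblocking(False)
                try:
                    client.connect(path)
                    succeeded += 1
                except (BlockingIOError, ConnectionRefusedError):
                    # Accept queue is full, so stop here.
                    break
                finally:
                    # Closing the client does not dequeue the pending
                    # connection on the listener side.
                    client.close()
            return succeeded
        finally:
            listener.close()


if __name__ == "__main__":
    print("connects queued before the backlog filled:",
          count_queued_connects())
```

The count stops growing once the listener's accept queue is full,
which is the bound that somaxconn puts on the loop regardless of
unix_nr_socks.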
There's also sysctl_max_dgram_qlen for dgram, which on Android is set
to 600 (by default 10).

I tend to think these resources are now capped by other means (notably
somaxconn, sysctl_max_dgram_qlen, sk_wmem_alloc_get) and unix_nr_socks
can be dropped.

Or, if that atomic counter is still needed, it is no longer for a
trivial exploit that just does listen(); SIGSTOP() in one process and
a connect();close() loop in another. It'd require more than one
listening socket, and so either heavy forking or a large increase in
the max number of file descriptors (a privileged op) to do a ton of
listens, but forking has its own memory footprint in userland too. At
the very least it should be a per-cpu counter synced to the atomic
global after a threshold.

The other reason for dropping it is that the limit was never ideal:
the trivial exploit could still allocate max_files*2 SYN skbs with a
connect;close loop. max_files*2 is too much already, so I suppose it
was only a stop-gap measure to begin with.

Thanks,
Andrea