Subject: Re: [PATCH net-next 0/4] mm,tcp: provide mmap_hook to solve lockdep issue
To: Andy Lutomirski, Eric Dumazet, "David S . Miller"
Cc: netdev, linux-kernel, Soheil Hassas Yeganeh, Eric Dumazet, linux-mm, Linux API
References: <20180420155542.122183-1-edumazet@google.com> <9ed6083f-d731-945c-dbcd-f977c5600b03@kernel.org>
From: Eric Dumazet
Date: Mon, 23 Apr 2018 14:38:43 -0700
In-Reply-To: <9ed6083f-d731-945c-dbcd-f977c5600b03@kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

Hi Andy

On 04/23/2018 02:14 PM, Andy Lutomirski wrote:
> On 04/20/2018 08:55 AM, Eric Dumazet wrote:
>> This patch series provide a new mmap_hook to fs willing to grab
>> a mutex before mm->mmap_sem is taken, to ensure lockdep sanity.
>>
>> This hook allows us to shorten tcp_mmap() execution time (while mmap_sem
>> is held), and improve multi-threading scalability.
>>
>
> I think that the right solution is to rework mmap() on TCP sockets a bit.  The current approach in net-next is very strange for a few reasons:
>
> 1. It uses mmap() as an operation that has side effects besides just creating a mapping.  If nothing else, it's surprising, since mmap() doesn't usually do that.  But it's also causing problems like what you're seeing.
>
> 2. The performance is worse than it needs to be.  mmap() is slow, and I doubt you'll find many mm developers who consider this particular abuse of mmap() to be a valid thing to optimize for.
>
> 3. I'm not at all convinced the accounting is sane.  As far as I can tell, you're allowing unprivileged users to increment the count on network-owned pages, limited only by available virtual memory, without obviously charging it to the socket buffer limits.  It looks like a program that simply forgot to call munmap() would cause the system to run out of memory, and I see no reason to expect the OOM killer to have any real chance of killing the right task.
>
> 4. Error handling sucks.  If I try to mmap() a large range (which is the whole point -- using a small range will kill performance) and not quite all of it can be mapped, then I waste a bunch of time in the kernel and get *none* of the range mapped.
>
> I would suggest that you rework the interface a bit.  First a user would call mmap() on a TCP socket, which would create an empty VMA.  (It would set vm_ops to point to tcp_vm_ops or similar so that the TCP code could recognize it, but it would have no effect whatsoever on the TCP state machine.  Reading the VMA would get SIGBUS.)  
Then a user would call a new ioctl() or setsockopt() function and pass something like:
>
> struct tcp_zerocopy_receive {
>   void *address;
>   size_t length;
> };
>
> The kernel would verify that [address, address+length) is entirely inside a single TCP VMA and then would do the vm_insert_range magic.

I have no idea what the proper API for that would be.
Where would the TCP VMA(s) be stored? In the TCP socket, or in the MM layer?

And I am not sure why the error handling would be better (point 4), unless perhaps we can return a smaller @length than requested?

Also, how would the VMA space be accounted (point 3) when creating an empty VMA (no pages in there yet)?

> On success, length is changed to the length that actually got mapped.  The kernel could do this while holding mmap_sem for *read*, and it could get the lock ordering right.  If and when mm range locks ever get merged, it could switch to using a range lock.
>
> Then you could use MADV_DONTNEED or another ioctl/setsockopt to zap the part of the mapping that you're done with.
>
> Does this seem reasonable?  It should involve very little code change, it will run faster, it will scale better, and it is much less weird IMO.

Maybe, although I do not see the 'little code change' yet.

But at least this seems like a pretty nice idea, especially if it could allow us to fill the mmap()ed area later, when packets are received.
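
For illustration only, the semantics discussed above (a range check against a single TCP VMA, and a length that is rewritten to what actually got mapped) can be modelled in plain userspace C. This is a sketch under stated assumptions, not kernel code and not an interface from the thread: only struct tcp_zerocopy_receive comes from the quoted proposal. The names tcp_vma_model, model_zerocopy_receive and MODEL_PAGE_SIZE are hypothetical, and so is the rule that only whole pages of queued data can be mapped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Request layout as quoted in the thread. */
struct tcp_zerocopy_receive {
    void *address;
    size_t length;
};

/* Hypothetical model of one empty TCP VMA covering [start, start + size). */
struct tcp_vma_model {
    uintptr_t start;
    size_t size;
};

/* Assumed page size for this model only. */
#define MODEL_PAGE_SIZE 4096u

/*
 * Hypothetical stand-in for the proposed ioctl()/setsockopt() handler.
 * Returns -1 and leaves req untouched if [address, address + length) is
 * not entirely inside the VMA. Otherwise returns 0 and shrinks req->length
 * to the number of bytes that could be mapped, assuming only whole pages
 * of already-queued data are eligible.
 */
int model_zerocopy_receive(const struct tcp_vma_model *vma,
                           size_t bytes_queued,
                           struct tcp_zerocopy_receive *req)
{
    assert(vma != NULL && req != NULL);

    uintptr_t addr = (uintptr_t)req->address;

    /* Overflow-safe containment check. */
    if (addr < vma->start || req->length > vma->size ||
        addr - vma->start > vma->size - req->length)
        return -1;

    /* Round the queued byte count down to a whole number of pages. */
    size_t mappable = bytes_queued - bytes_queued % MODEL_PAGE_SIZE;

    if (req->length > mappable)
        req->length = mappable;   /* partial success is reported, not an error */
    return 0;
}
```

In this model a request for four pages when only a little over two pages are queued succeeds with length set to two pages, which is the "smaller @length than requested" behaviour raised in connection with point 4. The accounting question from point 3 is not modelled.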