Subject: Re: [PATCH net-next 0/4] mm,tcp: provide mmap_hook to solve lockdep issue
To: Andy Lutomirski, Eric Dumazet
Cc: Eric Dumazet, "David S . Miller", netdev, linux-kernel, Soheil Hassas Yeganeh, linux-mm, Linux API
References: <20180420155542.122183-1-edumazet@google.com> <9ed6083f-d731-945c-dbcd-f977c5600b03@kernel.org>
From: Eric Dumazet
Date: Mon, 23 Apr 2018 21:30:34 -0700
X-Mailing-List: linux-kernel@vger.kernel.org

On 04/23/2018 07:04 PM, Andy Lutomirski wrote:
> On Mon, Apr 23, 2018 at 2:38 PM, Eric Dumazet wrote:
>> Hi Andy
>>
>> On 04/23/2018 02:14 PM, Andy Lutomirski wrote:
>
>>> I would suggest that you rework the interface a bit. First a user would call mmap() on a TCP socket, which would create an empty VMA.
>>> (It would set vm_ops to point to tcp_vm_ops or similar so that the TCP code could recognize it, but it would have no effect whatsoever on the TCP state machine. Reading the VMA would get SIGBUS.) Then a user would call a new ioctl() or setsockopt() function and pass something like:
>>
>>
>>>
>>> struct tcp_zerocopy_receive {
>>>   void *address;
>>>   size_t length;
>>> };
>>>
>>> The kernel would verify that [address, address+length) is entirely inside a single TCP VMA and then would do the vm_insert_range magic.
>>
>> I have no idea what is the proper API for that.
>> Where the TCP VMA(s) would be stored ?
>> In TCP socket, or MM layer ?
>
> MM layer. I haven't tested this at all, and the error handling is
> totally wrong, but I think you'd do something like:
>
> len = get_user(...);
>
> down_read(&current->mm->mmap_sem);
>
> vma = find_vma(mm, start);
> if (!vma || vma->vm_start > start)
>   return -EFAULT;
>
> /* This is buggy. You also need to check that the file is a socket.
> This is probably trivial. */
> if (vma->vm_file->private_data != sock)
>   return -EINVAL;
>
> if (len > vma->vm_end - start)
>   return -EFAULT; /* too big a request. */
>
> and now you'd do the vm_insert_page() dance, except that you don't
> have to abort the whole procedure if you discover that something isn't
> aligned right. Instead you'd just stop and tell the caller that you
> didn't map the full requested size. You might also need to add some
> code to charge the caller for the pages that get pinned, but that's an
> orthogonal issue.
>
> You also need to provide some way for user programs to signal that
> they're done with the page in question. MADV_DONTNEED might be
> sufficient.
>
> In the mmap() helper, you might want to restrict the mapped size to
> something reasonable. And it might be nice to hook mremap() to
> prevent user code from causing too much trouble.
>
> With my x86-writer-of-TLB-code hat on, I expect the major performance
> costs to be the generic costs of mmap() and munmap() (which only
> happen once per socket instead of once per read if you like my idea),
> the cost of a TLB miss when the data gets read (really not so bad on
> modern hardware), and the cost of the TLB invalidation when user code
> is done with the buffers. The latter is awful, especially in
> multithreaded programs. In fact, it's so bad that it might be worth
> mentioning in the documentation for this code that it just shouldn't
> be used in multithreaded processes. (Also, on non-PCID hardware,
> there's an annoying situation in which a recently-migrated thread that
> removes a mapping sends an IPI to the CPU that the thread used to be
> on. I thought I had a clever idea to get rid of that IPI once, but it
> turned out to be wrong.)
>
> Architectures like ARM that have superior TLB handling primitives will
> not be hurt as badly if this is used by a multithreaded program.
>
>>
>>
>> And I am not sure why the error handling would be better (point 4), unless we can return smaller @length than requested maybe ?
>
> Exactly. If I request 10MB mapped and only the first 9MB are aligned
> right, I still want the first 9 MB.
>
>>
>> Also how the VMA space would be accounted (point 3) when creating an empty VMA (no pages in there yet)
>
> There's nothing to account. It's the same as mapping /dev/null or
> similar -- the mm core should take care of it for you.
>

Thanks Andy, I am working on all this, and initial patch looks sane enough.

 include/uapi/linux/tcp.h |    7 +
 net/ipv4/tcp.c           |  175 +++++++++++++++++++++++------------------------
 2 files changed, 93 insertions(+), 89 deletions(-)

I will test all this before sending for review asap.

( I have not done the compat code yet, this can be done later I guess)
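
For anyone following the thread, the checks Andy outlines can be modelled in plain userspace C. This is only a rough sketch, not the kernel patch: struct toy_vma, toy_check_range() and toy_mappable_prefix() are hypothetical names made up here, and the array of chunk sizes is an assumed stand-in for whatever alignment test the real code ends up doing. It reuses the errno choices from the quoted pseudocode and shows the "stop early and report a shorter length" behaviour instead of failing the whole request.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the few VMA fields the check looks at. */
struct toy_vma {
	uintptr_t vm_start;	/* first byte of the mapping */
	uintptr_t vm_end;	/* one past the last byte */
	const void *owner;	/* models vma->vm_file->private_data */
};

/*
 * Verify that [start, start + len) lies inside one VMA owned by sock.
 * Returns 0 or a negative errno, as in the quoted pseudocode.
 * The start >= vm_end test is explicit here because this toy has no
 * find_vma() to guarantee it.
 */
static int toy_check_range(const struct toy_vma *vma, const void *sock,
			   uintptr_t start, size_t len)
{
	if (!vma || vma->vm_start > start || start >= vma->vm_end)
		return -EFAULT;
	if (vma->owner != sock)
		return -EINVAL;
	if (len > vma->vm_end - start)
		return -EFAULT;	/* too big a request */
	return 0;
}

/*
 * Assumed model of the partial result: only chunks that are exactly one
 * page long can be mapped. Stop at the first one that is not, but keep
 * what was already mapped and report that length to the caller.
 */
static size_t toy_mappable_prefix(const size_t *chunk_len, size_t nchunks,
				  size_t page_size, size_t want)
{
	size_t done = 0;

	assert(page_size > 0);
	for (size_t i = 0; i < nchunks && want - done >= page_size; i++) {
		if (chunk_len[i] != page_size)
			break;
		done += page_size;
	}
	return done;
}
```

With four 4096-byte pages requested and a short third chunk, toy_mappable_prefix() reports 8192 bytes rather than an error, which is the same idea as asking for 10MB and still getting the first 9MB.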