From: Andy Lutomirski
Date: Mon, 23 Apr 2018 19:04:36 -0700
Subject: Re: [PATCH net-next 0/4] mm,tcp: provide mmap_hook to solve lockdep issue
To: Eric Dumazet
Cc: Andy Lutomirski, Eric Dumazet, "David S. Miller", netdev,
    linux-kernel, Soheil Hassas Yeganeh, linux-mm, Linux API

On Mon, Apr 23, 2018 at 2:38 PM, Eric Dumazet wrote:
> Hi Andy
>
> On 04/23/2018 02:14 PM, Andy Lutomirski wrote:
>> I would suggest that you rework the interface a bit. First a user
>> would call mmap() on a TCP socket, which would create an empty VMA.
>> (It would set vm_ops to point to tcp_vm_ops or similar so that the
>> TCP code could recognize it, but it would have no effect whatsoever
>> on the TCP state machine. Reading the VMA would get SIGBUS.) Then a
>> user would call a new ioctl() or setsockopt() function and pass
>> something like:
>>
>> struct tcp_zerocopy_receive {
>>     void *address;
>>     size_t length;
>> };
>>
>> The kernel would verify that [address, address+length) is entirely
>> inside a single TCP VMA and then would do the vm_insert_range magic.
>
> I have no idea what is the proper API for that.
> Where the TCP VMA(s) would be stored ?
> In TCP socket, or MM layer ?

MM layer. I haven't tested this at all, and the error handling is
totally wrong, but I think you'd do something like:

    len = get_user(...);

    down_read(&current->mm->mmap_sem);

    vma = find_vma(mm, start);
    if (!vma || vma->vm_start > start)
        return -EFAULT;

    /* This is buggy. You also need to check that the file is a socket.
       This is probably trivial. */
    if (vma->vm_file->private_data != sock)
        return -EINVAL;

    if (len > vma->vm_end - start)
        return -EFAULT; /* too big a request. */

and now you'd do the vm_insert_page() dance, except that you don't
have to abort the whole procedure if you discover that something isn't
aligned right. Instead you'd just stop and tell the caller that you
didn't map the full requested size. You might also need to add some
code to charge the caller for the pages that get pinned, but that's an
orthogonal issue.

You also need to provide some way for user programs to signal that
they're done with the page in question. MADV_DONTNEED might be
sufficient.

In the mmap() helper, you might want to restrict the mapped size to
something reasonable. And it might be nice to hook mremap() to
prevent user code from causing too much trouble.

With my x86-writer-of-TLB-code hat on, I expect the major performance
costs to be the generic costs of mmap() and munmap() (which only
happen once per socket instead of once per read if you like my idea),
the cost of a TLB miss when the data gets read (really not so bad on
modern hardware), and the cost of the TLB invalidation when user code
is done with the buffers. The latter is awful, especially in
multithreaded programs. In fact, it's so bad that it might be worth
mentioning in the documentation for this code that it just shouldn't
be used in multithreaded processes. (Also, on non-PCID hardware,
there's an annoying situation in which a recently-migrated thread that
removes a mapping sends an IPI to the CPU that the thread used to be
on. I thought I had a clever idea to get rid of that IPI once, but it
turned out to be wrong.)

Architectures like ARM that have superior TLB handling primitives will
not be hurt as badly if this is used by a multithreaded program.

> And I am not sure why the error handling would be better (point 4),
> unless we can return smaller @length than requested maybe ?

Exactly. If I request 10MB mapped and only the first 9MB are aligned
right, I still want the first 9 MB.
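
To make the partial-success behavior above concrete, here is a minimal
userspace model, not kernel code. The names map_aligned_prefix, PAGE_SZ
and page_ok are hypothetical and exist only for illustration; a real
implementation would attempt vm_insert_page() for each page rather than
consult an array.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical page size for this model; the kernel would use PAGE_SIZE. */
#define PAGE_SZ 4096u

/*
 * Model of "map what you can, then stop". page_ok[i] says whether the
 * i-th page of the request could be mapped (it stands in for the
 * alignment checks and the per-page insert succeeding). Returns the
 * number of bytes mapped, which may be smaller than len; a bad page
 * never causes the pages before it to be discarded.
 */
static size_t map_aligned_prefix(const bool *page_ok, size_t len)
{
    size_t npages = len / PAGE_SZ;  /* only whole pages can be mapped */
    size_t i;

    for (i = 0; i < npages; i++) {
        if (!page_ok[i])
            break;                  /* stop at the first unmappable page */
    }
    return i * PAGE_SZ;
}
```

The caller compares the return value with the length it asked for and
falls back to an ordinary copying read for whatever was not mapped.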

> Also how the VMA space would be accounted (point 3) when creating an
> empty VMA (no pages in there yet)

There's nothing to account. It's the same as mapping /dev/null or
similar -- the mm core should take care of it for you.
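
For reference, the range check sketched earlier in this mail can be
modeled in plain userspace C roughly as follows. This is only an
illustration under stated assumptions: struct model_vma and
check_zerocopy_range are hypothetical names, the owner field stands in
for vma->vm_file->private_data, and the lookup that find_vma() would
perform is assumed to have happened already.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the few VMA fields the check uses. */
struct model_vma {
    uintptr_t vm_start;  /* first address in the mapping */
    uintptr_t vm_end;    /* one past the last address */
    const void *owner;   /* stands in for vma->vm_file->private_data */
};

/*
 * Returns 0 if [start, start + len) lies entirely inside vma and vma
 * belongs to sock; otherwise a negative errno, mirroring the sketch.
 */
static int check_zerocopy_range(const struct model_vma *vma,
                                const void *sock,
                                uintptr_t start, size_t len)
{
    /* No mapping covers start. */
    if (!vma || vma->vm_start > start || start >= vma->vm_end)
        return -EFAULT;

    /* The mapping exists but was not created by this socket. */
    if (vma->owner != sock)
        return -EINVAL;

    /* Written as a subtraction so that start + len cannot overflow. */
    if (len > vma->vm_end - start)
        return -EFAULT;  /* request runs past the end of the VMA */

    return 0;
}
```

The model leaves out locking entirely; in the kernel the lookup and the
checks would have to run with mmap_sem held, and every error path would
have to release it.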