Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754692AbbBOPQq (ORCPT ); Sun, 15 Feb 2015 10:16:46 -0500 Received: from mail-wg0-f46.google.com ([74.125.82.46]:55025 "EHLO mail-wg0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754298AbbBOPQo (ORCPT ); Sun, 15 Feb 2015 10:16:44 -0500 Message-ID: <54E0B857.30902@gmail.com> Date: Sun, 15 Feb 2015 16:16:39 +0100 From: "Michael Kerrisk (man-pages)" User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.2.0 MIME-Version: 1.0 To: Omar Sandoval , Fam Zheng CC: mtk.manpages@gmail.com, linux-kernel@vger.kernel.org, Thomas Gleixner , Ingo Molnar , "H. Peter Anvin" , x86@kernel.org, Alexander Viro , Andrew Morton , Kees Cook , Andy Lutomirski , David Herrmann , Alexei Starovoitov , Miklos Szeredi , David Drysdale , Oleg Nesterov , "David S. Miller" , Vivek Goyal , Mike Frysinger , "Theodore Ts'o" , Heiko Carstens , Rasmus Villemoes , Rashika Kheria , Hugh Dickins , Mathieu Desnoyers , Peter Zijlstra , linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, Josh Triplett , Paolo Bonzini Subject: Re: [PATCH RFC v3 0/7] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1 References: <1423818243-15410-1-git-send-email-famz@redhat.com> <20150213095357.GA20668@mew> In-Reply-To: <20150213095357.GA20668@mew> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 4928 Lines: 108 On 02/13/2015 10:53 AM, Omar Sandoval wrote: > On Fri, Feb 13, 2015 at 05:03:56PM +0800, Fam Zheng wrote: >> Hi all, >> >> This is the updated series for the new epoll system calls, with the cover >> letter rewritten which includes some more explanation. Comments are very >> welcome! >> >> Original Motivation >> =================== >> >> QEMU, and probably many more select/poll based applications, will consider >> epoll as an alternative, when its event loop needs to handle a big number of >> fds. However, there are currently two concerns with epoll which prevents the >> switching from ppoll to epoll. >> >> The major one is the timeout precision. >> >> For example in QEMU, the main loop takes care of calling callbacks at a >> specific timeout - the QEMU timer API. The timeout value in ppoll depends on >> the next firing timer. epoll_pwait's millisecond timeout is so coarse that >> rounding up the timeout will hurt performance badly. >> >> The minor one is the number of system call to update fd set. >> >> While epoll can handle a large number of fds quickly, it still requires one >> epoll_ctl per fd update, compared to the one-shot call to select/poll with an >> fd array. This may as well make epoll inferior to ppoll in the cases where a >> small, but frequently changing set of fds are polled by the event loop. >> >> This series introduces two new epoll APIs to address them respectively. The >> idea of epoll_ctl_batch is suggested by Andy Lutomirski in [1], who also >> suggested clockid as a parameter in epoll_pwait1. >> >> Discussion >> ========== >> >> [Note: This is the question part regarding the interface contract of >> epoll_ctl_batch. If you don't have the context of what is epoll_ctl_batch yet, >> please skip this part and probably start with the man page style documentation. >> You can resume to this section later.] >> >> [Thanks to Omar Sandoval , who pointed out this in >> reviewing v1] >> >> We try to report status for each command in epoll_ctl_batch, by writting to >> user provided command array (pointed to cmds). The tricky thing in the >> implementation is that, copying the results back to userspace comes last, after >> the commands are executed. At this point, if the copy_to_user fails, the >> effects are done and no return - or if we add some mechanism to revert it, the >> code will be too complicated and slow. >> >> In above corner case, the return value of epoll_ctl_batch is smaller than >> ncmds, which assures our caller that last N commands failed, where N = ncmds - >> ret. But they'll also find that cmd.result is not changed to error code. >> >> I suppose this is not a big deal, because 1) it should happen very rarely. 2) >> user does know the actual number of commands that succeed. >> >> So, do we leave it as is? Or is there any way to improve? >> >> One tiny improvement (not a complete fix) in my mind is a testing copy_to_user >> before we execute the commands. If it succeeds, it's even less likely the last >> copy_to_user could fail, so that we can even probably assert it won't. The >> testing 'copy_to_user(cmds, &kcmds, ...)' will not hurt functionality if do it >> right after 'copy_from_user(&kcmds, cmds, ...)'. But I'm not sure about the >> performance impact, especially when @ncmds is big. >> > I don't think this is the right thing to do, since, for example, another > thread could unmap the memory region holding buffer between the "check" > copy_to_user and the actual one. > > The two alternatives that I see are: > > 1. If copy_to_user fails, ignore it and return the number of commands > that succeeded (i.e., what the code in your patch does). > 2. If copy_to_user fails, return -EFAULT, regardless of how many > commands succeeded. > > The problem with 1 is that it could potentially mask bugs in a user > program. You could imagine a buggy program that passes a read-only > buffer to epoll_ctl_batch and never finds out about it because they > don't get the error. (Then, when there's a real error in one of the > epoll_ctl_cmds, but .result is 0, someone will be very confused.) > > So I think 2 is the better option. Sure, the user will have no idea how > many commands were executed, but when would EFAULT on this system call > be part of normal operation, anyways? You'll find the memory bug, fix > it, and rest easy knowing that nothing is silently failing behind your > back. What Omar says makes sense to me too. Best to have the user get a clear error indication for this case. Cheers, Michael -- Michael Kerrisk Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/ Linux/UNIX System Programming Training: http://man7.org/training/ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/