Date: Tue, 26 Sep 2017 21:43:08 +0300
From: Alexey Dobriyan
To: Andy Lutomirski
Cc: Andrew Morton, "linux-kernel@vger.kernel.org", Linux API, Randy Dunlap,
    Thomas Gleixner, Djalal Harouni, Alexey Gladkov,
    Tatsiana_Brouka@epam.com, Aliaksandr_Patseyenak1@epam.com
Subject: Re: [PATCH 1/2 v2] fdmap(2)
Message-ID: <20170926184308.GB14724@avx2>
References: <20170924200620.GA24368@avx2>
X-Mailing-List: linux-kernel@vger.kernel.org

On Sun, Sep 24, 2017 at 02:31:23PM -0700, Andy Lutomirski wrote:
> On Sun, Sep 24, 2017 at 1:06 PM, Alexey Dobriyan wrote:
> > From: Aliaksandr Patseyenak
> >
> > Implement system call for bulk retrieving of opened descriptors
> > in binary form.
> >
> > Some daemons could use it to reliably close file descriptors
> > before starting. Currently they close everything up to some number
> > which formally is not reliable. Other natural users are lsof(1) and CRIU
> > (although lsof does so much in /proc that the effect is thoroughly buried).
> >
> > /proc, the only way to learn anything about file descriptors, may not be
> > available. There is unavoidable overhead associated with instantiating
> > 3 dentries and 3 inodes and converting integers to strings and back.
> >
> > Benchmark:
> >
> >	N=1<<22 times
> >	4 opened descriptors (0, 1, 2, 3)
> >	opendir+readdir+closedir /proc/self/fd vs fdmap
> >
> >	/proc 8.31 ± 0.37%
> >	fdmap 0.32 ± 0.72%
>
> This doesn't have the semantic problem that pidmap does, but I still
> wonder why this can't be accomplished by adding a new file in /proc.

It can be done in /proc, but the point of the exercise is to skip all
the overhead: in this case the dcache, 1 descriptor for readdir, and
conversion from binary to string.

The problem is much deeper: EIATF people force everyone else to cater
to Unix shells so that shells can do read() on such files, because Unix
shells can't do system calls like real programming languages.

The only way to fix this problem is to ignore Unix shells and start
introducing binary system calls, so that normal people aren't forced to
make their programs slower than necessary.

Example: lsof(1) does close() on descriptors 3 to 1023 inclusive on
startup. I don't know why, but it does. At 1 syscall = 1 us, that is
roughly 1000 syscalls = 1 ms wasted, because normally all of them
return -EBADF.

With fdmap(2), lsof would do 2 fdmap() calls (1 real + 1 to confirm
that no more descriptors are available) and 0 closes in the normal
situation. That's 2 syscalls vs 1021.

Obviously, for the binary model to work, fdmap(2) needs to be
complemented by other system calls, all of which bypass /proc for, say,
extracting /proc/$PID/fd/$i symlink content and fdinfo. Currently, if
you use fdmap(2), you still have to fish in /proc for the rest of the
data.
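
To make the syscall counts above concrete, here is a rough sketch in
Python. It is illustrative only. probe_fds() stands in for the
"touch every number up to 1023" approach, using fstat() instead of
close() so that nothing is actually closed. proc_fds() is the text
route through /proc/self/fd. emulated_fdmap() is a hypothetical
user-space stand-in for the proposed call, not the real syscall or its
signature; it assumes that an empty result means "no more descriptors".
All function names here are made up for the example.

```python
import errno
import os


def probe_fds(limit=1024):
    """Brute force: one fstat() syscall per candidate descriptor number."""
    open_fds, ebadf = [], 0
    for fd in range(limit):
        try:
            os.fstat(fd)
        except OSError as e:
            if e.errno == errno.EBADF:
                ebadf += 1      # not an open descriptor: a wasted syscall
                continue
        open_fds.append(fd)     # fstat worked (or failed for another reason)
    return open_fds, ebadf


def proc_fds():
    """Text route: list /proc/self/fd, convert each name back to an int.

    Returns None when /proc is not available. The listing also contains
    the descriptor that was opened to read the directory itself.
    """
    try:
        names = os.listdir("/proc/self/fd")
    except OSError:
        return None
    return sorted(int(name) for name in names)


def emulated_fdmap(start_fd, count, limit=1024):
    """HYPOTHETICAL stand-in for fdmap(2), built on probe_fds().

    Returns up to `count` open descriptors >= start_fd.
    """
    open_fds, _ = probe_fds(limit)
    return [fd for fd in open_fds if fd >= start_fd][:count]


def list_fds_batched(batch=64):
    """Call pattern from the mail: ask again until an empty batch returns."""
    result, calls, start = [], 0, 0
    while True:
        chunk = emulated_fdmap(start, batch)
        calls += 1
        if not chunk:
            return result, calls
        result.extend(chunk)
        start = chunk[-1] + 1


if __name__ == "__main__":
    open_fds, ebadf = probe_fds()
    print("brute force: 1024 syscalls,", ebadf, "returned EBADF")
    print("/proc/self/fd:", proc_fds())
    fds, calls = list_fds_batched()
    print("batched:", len(fds), "descriptors in", calls, "calls")
```

With only a handful of descriptors open, the batched loop finishes in
2 calls (1 with data, 1 empty), while the brute-force scan always
issues one syscall per candidate number.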