Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753346AbdI1PCr (ORCPT ); Thu, 28 Sep 2017 11:02:47 -0400 Received: from mail.kernel.org ([198.145.29.99]:33106 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753212AbdI1PCp (ORCPT ); Thu, 28 Sep 2017 11:02:45 -0400 DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org E405C218CC Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=kernel.org Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=luto@kernel.org X-Google-Smtp-Source: AOwi7QBFDO4GccdmCDCFaBDCPlUAnCZoZDU/ObAVF10Y1brsxUeYWKW7su+Lw6ENoAMykfcu2Keo+4ZmScdtTCMOb0s= MIME-Version: 1.0 In-Reply-To: References: <20170924200620.GA24368@avx2> <9bc11ace-d111-cdef-5280-8cdda027ae9a@gmail.com> <20170926190018.GA30898@avx2> From: Andy Lutomirski Date: Thu, 28 Sep 2017 08:02:23 -0700 X-Gmail-Original-Message-ID: Message-ID: Subject: Re: [PATCH 1/2 v2] fdmap(2) To: Alexey Dobriyan Cc: Michael Kerrisk , Andrew Morton , "linux-kernel@vger.kernel.org" , Linux API , Randy Dunlap , Thomas Gleixner , Djalal Harouni , Alexey Gladkov , Aliaksandr Patseyenak , Tatsiana Brouka Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 3775 Lines: 110 On Thu, Sep 28, 2017 at 3:55 AM, Alexey Dobriyan wrote: > On 9/28/17, Michael Kerrisk (man-pages) wrote: >> On 27 September 2017 at 17:03, Andy Lutomirski wrote: > >>>> The idea is to start process. In ideal world, only bynary system calls >>>> would exist and shells could emulate /proc/* same way bash implement >>>> /dev/tcp >>> >>> Then start the process by doing it for real and making it obviously >>> useful. We should not add a pair of vaguely useful, rather weak >>> syscalls just to start a process of modernizing /proc. > > Before doing it for real it would be nice to have at least a nod > from people in charge that syscalls which return binary > information are OK. Otherwise some EIATF guy will just say > "NAK /proc is fine, it always was fine". There's nothing inherently wrong with syscalls that return binary information. There is something wrong with reinventing the world with insufficient justification, though. /proc may be old, clunky, and kind of slow, but it has a lot of good things going for it. It supports access control (DAC and MAC). It handles namespacing in a way that's awkward but supported by all the existing namespace managers. It may soon support mount options, which is rather important. I feel like we've been discussing this performance issue for over a year, and I distinctly recall discussing it in Santa Fe. I suggested a two-pronged approach: 1. Add a new syscall that will, in a single call, open, read, and close a proc file and maybe even a regular non-proc file. Like this: long readfileat(int dirfd, const char *path, void *buf, size_t len, int flags); 2. Where needed, add new /proc files with lighter-weight representations. I think we discussed something that's like status but in nl_attr format. This doesn't end up with a bunch of code duplication the way that a full-blown syscall-based reimplementation would. It supports all the /proc features rather than just a subset. It fully respects access control, future mount options, and the potential lack of a /proc mount. > > Or look from another angle: sched_setaffinity exists but there is > no /proc counterpart, shells must use taskset(1) and world didn't end. sched_setaffinity() modifies the caller. /proc wouldn't have made much sense. > >> I concur. >> >> Alexey, you still have not wxplained who specifically needs this >> right now, and how, precisely, they plan to use the new system calls. >> It is all very arm-wavey so far. > > It is not if you read even example program in the original patch. > Any program which queries information about file descriptors > will benefit both in CPU and memory usage. > > void closefrom(int start) > { > int fd[1024]; > int n; > > while ((n = fdmap(0, fd, sizeof(fd)/sizeof(fd[0]), start)) > 0) { > unsigned int i; > > for (i = 0; i < n; i++) > close(fd[i]); > > start = fd[n - 1] + 1; > } > } > > CRIU naturally to know everything about descriptors of target processes: > It does: > > int predump_task_files(int pid) > { > struct dirent *de; > DIR *fd_dir; > int ret = -1; > > pr_info("Pre-dump fds for %d)\n", pid); > > fd_dir = opendir_proc(pid, "fd"); > if (!fd_dir) > return -1; > > while ((de = readdir(fd_dir))) { > if (dir_dots(de)) > continue; > > if (predump_one_fd(pid, atoi(de->d_name))) > goto out; > } > > ret = 0; > out: > closedir(fd_dir); > return ret; > } > > which is again inefficient. And /proc/PID/fds would solve this.