Subject: Re: [RFC PATCH] fs: use a sequence counter instead of file_lock in fd_install
From: Eric Dumazet
To: Mateusz Guzik
Cc: Al Viro, Andrew Morton, "Paul E. McKenney", Yann Droneaud,
    Konstantin Khlebnikov, linux-fsdevel@vger.kernel.org,
    linux-kernel@vger.kernel.org
Date: Thu, 16 Apr 2015 15:52:20 -0700
Message-ID: <1429224740.7346.225.camel@edumazet-glaptop2.roam.corp.google.com>
In-Reply-To: <20150416220002.GB20615@mguzik>

On Fri, 2015-04-17 at 00:00 +0200, Mateusz Guzik wrote:
> On Thu, Apr 16, 2015 at 01:55:39PM -0700, Eric Dumazet wrote:
> > On Thu, 2015-04-16 at 13:42 -0700, Eric Dumazet wrote:
> > > On Thu, 2015-04-16 at 19:09 +0100, Al Viro wrote:
> > > > On Thu, Apr 16, 2015 at 02:16:31PM +0200, Mateusz Guzik wrote:
> > > > > @@ -165,8 +165,10 @@ static int expand_fdtable(struct files_struct *files, int nr)
> > > > >  	cur_fdt = files_fdtable(files);
> > > > >  	if (nr >= cur_fdt->max_fds) {
> > > > >  		/* Continue as planned */
> > > > > +		write_seqcount_begin(&files->fdt_seqcount);
> > > > >  		copy_fdtable(new_fdt, cur_fdt);
> > > > >  		rcu_assign_pointer(files->fdt, new_fdt);
> > > > > +		write_seqcount_end(&files->fdt_seqcount);
> > > > >  		if (cur_fdt != &files->fdtab)
> > > > >  			call_rcu(&cur_fdt->rcu, free_fdtable_rcu);
> > > >
> > > > Interesting. AFAICS, your test doesn't step anywhere near that path,
> > > > does it? So basically you never hit the retries during that...
> > >
> > > Right, but then the table is almost never changed for a given process,
> > > as we only increase it in power-of-two steps.
> > >
> > > (So I scratch my initial comment, fdt_seqcount really is mostly read.)
> >
> > I tested Mateusz's patch with my opensock program, mimicking a bit more
> > what a server does (having a lot of sockets).
> >
> > 24 threads running, doing close(randomfd())/socket() calls like crazy.
> >
> > Before the patch:
> >
> > # time ./opensock
> >
> > real	0m10.863s
> > user	0m0.954s
> > sys	2m43.659s
> >
> > After the patch:
> >
> > # time ./opensock
> >
> > real	0m9.750s
> > user	0m0.804s
> > sys	2m18.034s
> >
> > So this is an improvement for sure, but not a massive one.
> >
> > perf record ./opensock ; perf report
> >
> >     87.80%  opensock  [kernel.kallsyms]  [k] _raw_spin_lock
> >             |--52.70%-- __close_fd
> >             |--46.41%-- __alloc_fd
>
> My crap benchmark is here: http://people.redhat.com/~mguzik/pipebench.c
> (compile with -pthread, run with -s 10 -n 16 for a 10-second test with
> 16 threads).
>
> As noted earlier it tends to go from roughly 300k ops/s to 400k.
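
A minimal stand-in for this kind of fd-churn stress loop, for anyone who
does not want to fetch the programs linked above, could look like the
sketch below. It is not the actual opensock or pipebench source; the
thread count, fd range, file name (fdchurn.c) and the choice of UDP
sockets are arbitrary.

/*
 * Toy fd-churn stress loop -- NOT the real opensock/pipebench.
 * Each thread closes a random fd in a fixed range and opens a new
 * socket, so the kernel fd allocation and close paths are exercised
 * concurrently from many threads.
 * Build: gcc -O2 -pthread fdchurn.c -o fdchurn
 */
#include <pthread.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

#define NTHREADS 16
#define FDRANGE  4096

static void *worker(void *arg)
{
	unsigned int seed = (unsigned int)(unsigned long)arg;

	for (;;) {
		/* skip 0-2 so stdin/stdout/stderr survive */
		close(3 + rand_r(&seed) % FDRANGE);
		/* EMFILE etc. are expected once the table is full; ignore */
		socket(AF_INET, SOCK_DGRAM, 0);
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];
	unsigned long i;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, worker, (void *)(i + 1));
	sleep(10);	/* run for roughly ten seconds, then exit */
	return 0;
}
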
>
> The fundamental problem here seems to be this pesky POSIX requirement of
> providing the lowest possible fd on each allocation (as a side note,
> Linux breaks this with parallel fd allocs, where one of them backs off
> the reservation, not that I believe this causes trouble).

Note that POSIX never considered multiple threads. The POSIX requirement
came from traditional Unix stdin/stdout/stderr handling and legacy
programs, before dup2() even existed.

>
> Ideally a process-wide switch could be implemented (e.g.
> prctl(SCRATCH_LOWEST_FD_REQ)) which would grant the kernel the freedom
> to return any fd it wants, so it would be possible to have fd ranges
> per thread and the like.

I played months ago with a SOCK_FD_FASTALLOC ;)

The idea was to use a random starting point instead of 0.

But the bottleneck was really the spinlock, not the bit search, unless
I used 10 million fds in the program...

>
> Having only an O_SCRATCH_POSIX flag passed to syscalls would still leave
> close() as a bottleneck.
>
> In the meantime I consider the approach taken in my patch an OK
> temporary improvement.

Yes, please formally submit this patch.

Note that adding atomic bit operations could eventually allow us to stop
holding the spinlock at close() time.
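
To make that last remark slightly more concrete, a very rough sketch of
how an atomic bitmap operation might let close() claim a slot without
taking files->file_lock is given below. This is illustrative only, not a
proposed patch: the function name is made up, and next_fd maintenance,
close_on_exec and the races with table resizing (which is exactly what
the seqcount above is about) are all ignored.

#include <linux/fdtable.h>
#include <linux/rcupdate.h>
#include <linux/bitops.h>

/*
 * Illustration only: claim an fd slot with an atomic bitmap operation
 * instead of files->file_lock. next_fd, close_on_exec and resize
 * handling are deliberately omitted.
 */
static struct file *pick_file_atomic(struct files_struct *files, unsigned int fd)
{
	struct fdtable *fdt;
	struct file *file = NULL;

	rcu_read_lock();
	fdt = files_fdtable(files);
	if (fd < fdt->max_fds && test_and_clear_bit(fd, fdt->open_fds)) {
		/* whoever clears the bit owns the slot; no spinlock taken */
		file = rcu_dereference_raw(fdt->fd[fd]);
		rcu_assign_pointer(fdt->fd[fd], NULL);
	}
	rcu_read_unlock();
	return file;	/* non-NULL: caller goes on to filp_close() it */
}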