Date: Fri, 17 Apr 2015 00:00:03 +0200
From: Mateusz Guzik <mguzik@redhat.com>
To: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>,
        Andrew Morton <akpm@linux-foundation.org>,
        "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
        Yann Droneaud <ydroneaud@opteya.com>,
        Konstantin Khlebnikov <khlebnikov@yandex-team.ru>,
        linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH] fs: use a sequence counter instead of file_lock in
 fd_install
Message-ID: <20150416220002.GB20615@mguzik>
References: <20150416121628.GA20615@mguzik>
 <20150416180932.GW889@ZenIV.linux.org.uk>
 <1429216923.7346.211.camel@edumazet-glaptop2.roam.corp.google.com>
 <1429217739.7346.218.camel@edumazet-glaptop2.roam.corp.google.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <1429217739.7346.218.camel@edumazet-glaptop2.roam.corp.google.com>
User-Agent: Mutt/1.5.23.1-rc1 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 2925
Lines: 82

On Thu, Apr 16, 2015 at 01:55:39PM -0700, Eric Dumazet wrote:
> On Thu, 2015-04-16 at 13:42 -0700, Eric Dumazet wrote:
> > On Thu, 2015-04-16 at 19:09 +0100, Al Viro wrote:
> > > On Thu, Apr 16, 2015 at 02:16:31PM +0200, Mateusz Guzik wrote:
> > > > @@ -165,8 +165,10 @@ static int expand_fdtable(struct files_struct *files, int nr)
> > > >  	cur_fdt = files_fdtable(files);
> > > >  	if (nr >= cur_fdt->max_fds) {
> > > >  		/* Continue as planned */
> > > > +		write_seqcount_begin(&files->fdt_seqcount);
> > > >  		copy_fdtable(new_fdt, cur_fdt);
> > > >  		rcu_assign_pointer(files->fdt, new_fdt);
> > > > +		write_seqcount_end(&files->fdt_seqcount);
> > > >  		if (cur_fdt != &files->fdtab)
> > > >  			call_rcu(&cur_fdt->rcu, free_fdtable_rcu);
> > > 
> > > Interesting.  AFAICS, your test doesn't step anywhere near that path,
> > > does it?  So basically you never hit the retries during that...
> > 
> > Right, but then the table is almost never changed for a given process,
> > as we only increase it by power of two steps.
> > 
> > (So I scratch my initial comment, fdt_seqcount is really mostly read)
> 
> I tested Mateusz patch with my opensock program, mimicking a bit more
> what a server does (having lot of sockets)
> 
> 24 threads running, doing close(randomfd())/socket() calls like crazy.
> 
> Before patch :
> 
> # time ./opensock 
> 
> real	0m10.863s
> user	0m0.954s
> sys	2m43.659s
> 
> 
> After patch :
> 
> # time ./opensock
> 
> real	0m9.750s
> user	0m0.804s
> sys	2m18.034s
> 
> So this is an improvement for sure, but not massive.
> 
> perf record ./opensock ; report
> 
>     87.80%  opensock  [kernel.kallsyms]  [k] _raw_spin_lock                     
>                |--52.70%-- __close_fd
>                |--46.41%-- __alloc_fd

My crap benchmark is here: http://people.redhat.com/~mguzik/pipebench.c
(compile with -pthread, run with -s 10 -n 16 for a 10-second test with
16 threads)

As noted earlier, it tends to go from roughly 300k ops/s to 400k.

The fundamental problem here seems to be this pesky POSIX requirement of
providing the lowest possible fd on each allocation (as a side note,
Linux breaks this with parallel fd allocs, where one of these backs off
the reservation; not that I believe this causes trouble).
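
To make the requirement concrete, here is a minimal single-threaded
userspace sketch. The helper name lowest_fd_reused() is made up for
illustration; it only relies on pipe(), dup() and close(). It opens a
hole below an existing fd and checks that the next allocation fills it:

```c
#include <unistd.h>

/*
 * Hypothetical helper, not from the patch.
 * Returns 1 if the next allocation reused the lowest free descriptor,
 * 0 if it did not, -1 if pipe() failed. Single-threaded on purpose:
 * with parallel allocations the result is not guaranteed.
 */
int lowest_fd_reused(void)
{
	int p[2];
	int lowest, other, fd, ok;

	if (pipe(p) < 0)
		return -1;

	/* pipe() hands out the two lowest free fds; sort them to be safe. */
	lowest = p[0] < p[1] ? p[0] : p[1];
	other = p[0] < p[1] ? p[1] : p[0];

	close(lowest);		/* leaves a hole below 'other' */
	fd = dup(other);	/* must land in that hole */
	ok = (fd == lowest);

	if (fd >= 0)
		close(fd);
	close(other);
	return ok;
}
```

Having to find that hole on every allocation is what forces all threads
through one shared bitmap and its lock.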

Ideally a process-wide switch could be implemented (e.g.
prctl(SCRATCH_LOWEST_FD_REQ)) which would grant the kernel the freedom
to return any fd it wants, so it would be possible to have fd ranges
per thread and the like.

Having only an O_SCRATCH_POSIX flag passed to syscalls would still leave
close() as a bottleneck.

In the meantime, I consider the approach taken in my patch an OK
temporary improvement.
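
For readers unfamiliar with the pattern, below is a rough userspace
sketch of a sequence counter guarding a rarely-resized table. It is not
the kernel code: struct seq_table and the table_*() helpers are names
invented here, C11 seq_cst atomics stand in for the kernel primitives,
and a single writer is assumed (in the patch the resize path is already
serialized). The writer makes the counter odd for the duration of the
update; a reader retries if it saw an odd or changed counter.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the state a resize replaces. */
struct seq_table {
	atomic_uint seq;	/* odd while a resize is in progress */
	atomic_int max_fds;
	atomic_int generation;	/* bumped on every resize */
};

void table_init(struct seq_table *t, int max_fds)
{
	atomic_init(&t->seq, 0);
	atomic_init(&t->max_fds, max_fds);
	atomic_init(&t->generation, 0);
}

/* Writer side; callers must serialize resizes among themselves. */
void table_resize(struct seq_table *t, int new_max)
{
	unsigned int s = atomic_fetch_add(&t->seq, 1);	/* begin: now odd */

	assert((s & 1) == 0);	/* a second writer would trip this */
	atomic_store(&t->max_fds, new_max);
	atomic_fetch_add(&t->generation, 1);
	atomic_fetch_add(&t->seq, 1);			/* end: even again */
}

/*
 * Reader side: takes no lock, loops until it gets a snapshot that no
 * resize overlapped. Returns the number of retries it needed.
 */
int table_read(struct seq_table *t, int *max_fds, int *generation)
{
	int retries = 0;

	for (;;) {
		unsigned int start = atomic_load(&t->seq);

		if (!(start & 1)) {
			*max_fds = atomic_load(&t->max_fds);
			*generation = atomic_load(&t->generation);
			if (atomic_load(&t->seq) == start)
				return retries;
		}
		retries++;
	}
}
```

Since the table only grows in power-of-two steps, resizes are rare and
the reader almost always gets through on the first pass.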

-- 
Mateusz Guzik