2017-06-01 11:41:27

by Jeff Layton

Subject: Re: [lkp-robot] [fs/locks] 9d21d181d0: will-it-scale.per_process_ops -14.1% regression

On Thu, 2017-06-01 at 10:05 +0800, kernel test robot wrote:
> Greeting,
>
> FYI, we noticed a -14.1% regression of will-it-scale.per_process_ops due to commit:
>
>
> commit: 9d21d181d06acab9a8e80eac2ec4eed77b656793 ("fs/locks: Set fl_nspid at file_lock allocation")
> url: https://github.com/0day-ci/linux/commits/Benjamin-Coddington/fs-locks-Alloc-file_lock-where-practical/20170527-050700
>
>

Ouch, that's a rather nasty performance hit. In hindsight, maybe we
shouldn't move those off the stack after all? Heck, if it's that
significant, maybe we should move the F_SETLK callers to allocate these
on the stack as well?

> in testcase: will-it-scale
> on test machine: 4 threads Intel(R) Core(TM) i3-3220 CPU @ 3.30GHz with 4G memory
> with following parameters:
>
> test: lock1
> cpufreq_governor: performance
>
> test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
> test-url: https://github.com/antonblanchard/will-it-scale
>
> In addition to that, the commit also has significant impact on the following tests:
>
> +------------------+-----------------------------------------------------------------+
> | testcase: change | will-it-scale: will-it-scale.per_process_ops -4.9% regression  |
> | test machine     | 16 threads Intel(R) Atom(R) CPU 3958 @ 2.00GHz with 64G memory |
> | test parameters  | cpufreq_governor=performance                                    |
> |                  | mode=process                                                    |
> |                  | nr_task=100%                                                    |
> |                  | test=lock1                                                      |
> +------------------+-----------------------------------------------------------------+
>
>
> Details are as below:
> -------------------------------------------------------------------------------------------------->
>
>
> To reproduce:
>
> git clone https://github.com/01org/lkp-tests.git
> cd lkp-tests
> bin/lkp install job.yaml # job file is attached in this email
> bin/lkp run job.yaml
>
> testcase/path_params/tbox_group/run: will-it-scale/lock1-performance/lkp-ivb-d04
>
> 09790e423b32fba4 9d21d181d06acab9a8e80eac2e
> ---------------- --------------------------
> 0.51 19% 0.60 ± 7% will-it-scale.scalability
> 2462089 -14% 2114597 will-it-scale.per_process_ops
> 2195246 -26% 1631578 will-it-scale.per_thread_ops
> 350 356 will-it-scale.time.system_time
> 28.89 -24% 22.06 will-it-scale.time.user_time
> 32.78 31.97 turbostat.PkgWatt
> 15.58 -5% 14.80 turbostat.CorWatt
> 19284 18803 vmstat.system.in
> 32208 -4% 31052 vmstat.system.cs
> 1630 ±173% 2e+04 18278 ± 27% latency_stats.avg.perf_event_alloc.SYSC_perf_event_open.SyS_perf_event_open.entry_SYSCALL_64_fastpath
> 1630 ±173% 2e+04 18278 ± 27% latency_stats.max.perf_event_alloc.SYSC_perf_event_open.SyS_perf_event_open.entry_SYSCALL_64_fastpath
> 1630 ±173% 2e+04 18278 ± 27% latency_stats.sum.perf_event_alloc.SYSC_perf_event_open.SyS_perf_event_open.entry_SYSCALL_64_fastpath
> 1.911e+09 ± 6% 163% 5.022e+09 ± 5% perf-stat.cache-references
> 27.58 ± 12% 17% 32.14 ± 7% perf-stat.iTLB-load-miss-rate%
> 9881103 -4% 9527607 perf-stat.context-switches
> 9.567e+11 ± 9% -14% 8.181e+11 ± 9% perf-stat.dTLB-loads
> 6.85e+11 ± 4% -16% 5.761e+11 ± 6% perf-stat.branch-instructions
> 3.469e+12 ± 4% -17% 2.893e+12 ± 6% perf-stat.instructions
> 1.24 ± 4% -19% 1.00 perf-stat.ipc
> 3.18 ± 8% -62% 1.19 ± 19% perf-stat.cache-miss-rate%
>
>
>
> perf-stat.cache-references
>
> 8e+09 ++------------------------------------------------------------------+
> | |
> 7e+09 ++ O O |
> | O |
> 6e+09 ++ O |
> | O |
> 5e+09 ++O O O O O O O
> O O O O O O O O O O O O O O O O O O |
> 4e+09 ++ O O O |
> | |
> 3e+09 ++ |
> | *. *.. *. |
> 2e+09 *+ + *.. .*. + *. .*. + *. .*.*. .*.*. .*..* |
> | * *.*.*.*.* * * * *.*. *.* * |
> 1e+09 ++------------------------------------------------------------------+
>
>
> will-it-scale.time.user_time
>
> 30 ++--*-------------------*-----------*----------------------------------+
> 29 *+* *.*.*.*..*.*.*.* *.*.*.*. *.*. *. .*. .*.*.* |
> | *. .. * *.*. |
> 28 ++ * |
> 27 ++ |
> | |
> 26 ++ |
> 25 ++ |
> 24 ++ |
> | |
> 23 ++ O O O O O O O O O O O O O |
> 22 O+O O O O O O O O O O O O O O |
> | O O O O
> 21 ++ O |
> 20 ++---------------------------------------------------------------------+
>
>
> will-it-scale.time.system_time
>
> 358 ++--------------------------------------------------------------------+
> 357 O+O O O O O O O O
> | O O O O O O O O O O O O O O O O O O |
> 356 ++ O O O O O O |
> 355 ++ |
> | |
> 354 ++ |
> 353 ++ |
> 352 ++ |
> | |
> 351 ++ *. .*. .*. |
> 350 *+*. .* * .*.*.*. .* *.*. .* + * *..* *.*.* |
> | *. + + + .*. * + .. * + .* |
> 349 ++ * * * *. |
> 348 ++--------------------------------------------------------------------+
>
>
> will-it-scale.per_thread_ops
>
> 2.3e+06 ++----------------------------------------------------------------+
> | |
> 2.2e+06 ++*.*. .*. .*..*. .*.*. .*.*. .*.*..*.*.*.*.* |
> * *.* * * *.*.* *.*. .*.* |
> 2.1e+06 ++ * |
> 2e+06 ++ |
> | |
> 1.9e+06 ++ |
> | |
> 1.8e+06 ++ |
> 1.7e+06 ++ O O O O O O |
> O O O O O O O O O O |
> 1.6e+06 ++ O O O O O O O O O O O O O O O
> | O O |
> 1.5e+06 ++----------------------------------------------------------------+
>
> [*] bisect-good sample
> [O] bisect-bad sample
>
>
> Disclaimer:
> Results have been estimated based on internal Intel analysis and are provided
> for informational purposes only. Any difference in system hardware or software
> design or configuration may affect actual performance.
>
>
> Thanks,
> Xiaolong

--
Jeff Layton <[email protected]>


2017-06-01 11:49:45

by Benjamin Coddington

Subject: Re: [lkp-robot] [fs/locks] 9d21d181d0: will-it-scale.per_process_ops -14.1% regression

On 1 Jun 2017, at 7:41, Jeff Layton wrote:

> On Thu, 2017-06-01 at 10:05 +0800, kernel test robot wrote:
>> Greeting,
>>
>> FYI, we noticed a -14.1% regression of will-it-scale.per_process_ops
>> due to commit:
>>
>>
>> commit: 9d21d181d06acab9a8e80eac2ec4eed77b656793 ("fs/locks: Set
>> fl_nspid at file_lock allocation")
>> url:
>> https://github.com/0day-ci/linux/commits/Benjamin-Coddington/fs-locks-Alloc-file_lock-where-practical/20170527-050700
>>
>>
>
> Ouch, that's a rather nasty performance hit. In hindsight, maybe we
> shouldn't move those off the stack after all? Heck, if it's that
> significant, maybe we should move the F_SETLK callers to allocate
> these
> on the stack as well?

We can do that. But I think this is picking up the locks_mandatory_area()
allocation, which is now removed. The attached config has
CONFIG_MANDATORY_FILE_LOCKING=y, so there's an allocation on every
read/write.

Ben

2017-06-01 12:59:24

by Jeff Layton

Subject: Re: [lkp-robot] [fs/locks] 9d21d181d0: will-it-scale.per_process_ops -14.1% regression

On Thu, 2017-06-01 at 07:49 -0400, Benjamin Coddington wrote:
> On 1 Jun 2017, at 7:41, Jeff Layton wrote:
>
> > On Thu, 2017-06-01 at 10:05 +0800, kernel test robot wrote:
> > > Greeting,
> > >
> > > FYI, we noticed a -14.1% regression of will-it-scale.per_process_ops
> > > due to commit:
> > >
> > >
> > > commit: 9d21d181d06acab9a8e80eac2ec4eed77b656793 ("fs/locks: Set
> > > fl_nspid at file_lock allocation")
> > > url:
> > > https://github.com/0day-ci/linux/commits/Benjamin-Coddington/fs-locks-Alloc-file_lock-where-practical/20170527-050700
> > >
> > >
> >
> > Ouch, that's a rather nasty performance hit. In hindsight, maybe we
> > shouldn't move those off the stack after all? Heck, if it's that
> > significant, maybe we should move the F_SETLK callers to allocate
> > these
> > on the stack as well?
>
> We can do that. But, I think this is picking up the
> locks_mandatory_area()
> allocation which is now removed. The attached config has
> CONFIG_MANDATORY_FILE_LOCKING=y, so there's allocation on every
> read/write.
>

I'm not so sure. That would only be the case if the thing were marked
for mandatory locking (a really rare thing).

The test is really simple and I don't think any read/write activity is
involved:

https://github.com/antonblanchard/will-it-scale/blob/master/tests/lock1.c
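
For reference, the guts of that test boil down to something like this (my
own rough approximation, not the actual source; the real testcase also
reports its iteration count back to the harness):

#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	char tmpl[] = "/tmp/willitscale.XXXXXX";
	struct flock lck = { .l_whence = SEEK_SET, .l_start = 0, .l_len = 1 };
	int i, fd = mkstemp(tmpl);

	if (fd < 0)
		return 1;
	unlink(tmpl);		/* the file only needs to exist while open */

	/* bounded here so it runs standalone; the real test loops until
	 * the harness tells it to stop */
	for (i = 0; i < 10000000; i++) {
		lck.l_type = F_WRLCK;		/* take the byte lock... */
		if (fcntl(fd, F_SETLK, &lck) < 0)
			return 1;
		lck.l_type = F_UNLCK;		/* ...and drop it again */
		if (fcntl(fd, F_SETLK, &lck) < 0)
			return 1;
	}
	return 0;
}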

...and the 0-day bot bisected it down to this patch, IIUC:

https://github.com/0day-ci/linux/commit/9d21d181d06acab9a8e80eac2ec4eed77b656793

It seems likely that it's the extra get_pid/put_pid in the allocation
and free codepath. I expected those to be pretty cheap, but maybe
they're not?
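
Paraphrasing the patch from memory (so the details may be slightly off),
every locks_alloc_lock() now does roughly

	fl->fl_nspid = get_pid(task_tgid(current));

and the free side does the matching

	if (fl->fl_nspid) {
		put_pid(fl->fl_nspid);
		fl->fl_nspid = NULL;
	}

so every pass through the allocate/free path now includes an extra atomic
reference get and put on the pid, which would line up with the loop in
this test getting slower.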
--
Jeff Layton <[email protected]>

2017-06-01 15:14:16

by J. Bruce Fields

Subject: Re: [lkp-robot] [fs/locks] 9d21d181d0: will-it-scale.per_process_ops -14.1% regression

On Thu, Jun 01, 2017 at 08:59:21AM -0400, Jeff Layton wrote:
> On Thu, 2017-06-01 at 07:49 -0400, Benjamin Coddington wrote:
> > On 1 Jun 2017, at 7:41, Jeff Layton wrote:
> >
> > > On Thu, 2017-06-01 at 10:05 +0800, kernel test robot wrote:
> > > > Greeting,
> > > >
> > > > FYI, we noticed a -14.1% regression of will-it-scale.per_process_ops
> > > > due to commit:
> > > >
> > > >
> > > > commit: 9d21d181d06acab9a8e80eac2ec4eed77b656793 ("fs/locks: Set
> > > > fl_nspid at file_lock allocation")
> > > > url:
> > > > https://github.com/0day-ci/linux/commits/Benjamin-Coddington/fs-locks-Alloc-file_lock-where-practical/20170527-050700
> > > >
> > > >
> > >
> > > Ouch, that's a rather nasty performance hit. In hindsight, maybe we
> > > shouldn't move those off the stack after all? Heck, if it's that
> > > significant, maybe we should move the F_SETLK callers to allocate
> > > these
> > > on the stack as well?
> >
> > We can do that. But, I think this is picking up the
> > locks_mandatory_area()
> > allocation which is now removed. The attached config has
> > CONFIG_MANDATORY_FILE_LOCKING=y, so there's allocation on every
> > read/write.
> >
>
> I'm not so sure. That would only be the case if the thing were marked
> for mandatory locking (a really rare thing).
>
> The test is really simple and I don't think any read/write activity is
> involved:
>
> https://github.com/antonblanchard/will-it-scale/blob/master/tests/lock1.c

So it's just F_WRLCK/F_UNLCK in a loop spread across multiple cores?
I'd think real workloads do some work while holding the lock, and a 15%
regression on just the pure lock/unlock loop might not matter? But best
to be careful, I guess.

--b.

>
> ...and the 0 day bisected it down to this patch, IIUC:
>
> https://github.com/0day-ci/linux/commit/9d21d181d06acab9a8e80eac2ec4eed77b656793
>
> It seems likely that it's the extra get_pid/put_pid in the allocation
> and free codepath. I expected those to be pretty cheap, but maybe
> they're not?



2017-06-01 15:49:09

by Jeff Layton

Subject: Re: [lkp-robot] [fs/locks] 9d21d181d0: will-it-scale.per_process_ops -14.1% regression

On Thu, 2017-06-01 at 11:14 -0400, J. Bruce Fields wrote:
> On Thu, Jun 01, 2017 at 08:59:21AM -0400, Jeff Layton wrote:
> > On Thu, 2017-06-01 at 07:49 -0400, Benjamin Coddington wrote:
> > > On 1 Jun 2017, at 7:41, Jeff Layton wrote:
> > >
> > > > On Thu, 2017-06-01 at 10:05 +0800, kernel test robot wrote:
> > > > > Greeting,
> > > > >
> > > > > FYI, we noticed a -14.1% regression of will-it-scale.per_process_ops
> > > > > due to commit:
> > > > >
> > > > >
> > > > > commit: 9d21d181d06acab9a8e80eac2ec4eed77b656793 ("fs/locks: Set
> > > > > fl_nspid at file_lock allocation")
> > > > > url:
> > > > > https://github.com/0day-ci/linux/commits/Benjamin-Coddington/fs-locks-Alloc-file_lock-where-practical/20170527-050700
> > > > >
> > > > >
> > > >
> > > > Ouch, that's a rather nasty performance hit. In hindsight, maybe we
> > > > shouldn't move those off the stack after all? Heck, if it's that
> > > > significant, maybe we should move the F_SETLK callers to allocate
> > > > these
> > > > on the stack as well?
> > >
> > > We can do that. But, I think this is picking up the
> > > locks_mandatory_area()
> > > allocation which is now removed. The attached config has
> > > CONFIG_MANDATORY_FILE_LOCKING=y, so there's allocation on every
> > > read/write.
> > >
> >
> > I'm not so sure. That would only be the case if the thing were marked
> > for mandatory locking (a really rare thing).
> >
> > The test is really simple and I don't think any read/write activity is
> > involved:
> >
> > https://github.com/antonblanchard/will-it-scale/blob/master/tests/lock1.c
>
> So it's just F_WRLCK/F_UNLCK in a loop spread across multiple cores?
> I'd think real workloads do some work while holding the lock, and a 15%
> regression on just the pure lock/unlock loop might not matter? But best
> to be careful, I guess.
>
> --b.
>

Yeah, that's my take.

I was assuming that getting a pid reference would be essentially free,
but it doesn't seem to be.

So, I think we probably want to avoid taking it for a file_lock that we
use to request a lock, but do take it for a file_lock that is used to
record a lock. How best to code that up, I'm not quite sure...

> >
> > ...and the 0 day bisected it down to this patch, IIUC:
> >
> > https://github.com/0day-ci/linux/commit/9d21d181d06acab9a8e80eac2ec4eed77b656793
> >
> > It seems likely that it's the extra get_pid/put_pid in the allocation
> > and free codepath. I expected those to be pretty cheap, but maybe
> > they're not?
>
>

--
Jeff Layton <[email protected]>

2017-06-05 18:34:22

by Benjamin Coddington

Subject: Re: [lkp-robot] [fs/locks] 9d21d181d0: will-it-scale.per_process_ops -14.1% regression

On 1 Jun 2017, at 11:48, Jeff Layton wrote:

> On Thu, 2017-06-01 at 11:14 -0400, J. Bruce Fields wrote:
>> On Thu, Jun 01, 2017 at 08:59:21AM -0400, Jeff Layton wrote:
>>> I'm not so sure. That would only be the case if the thing were
>>> marked
>>> for mandatory locking (a really rare thing).
>>>
>>> The test is really simple and I don't think any read/write activity
>>> is
>>> involved:
>>>
>>> https://github.com/antonblanchard/will-it-scale/blob/master/tests/lock1.c
>>
>> So it's just F_WRLCK/F_UNLCK in a loop spread across multiple cores?
>> I'd think real workloads do some work while holding the lock, and a
>> 15%
>> regression on just the pure lock/unlock loop might not matter? But
>> best
>> to be careful, I guess.
>>
>> --b.
>>
>
> Yeah, that's my take.
>
> I was assuming that getting a pid reference would be essentially free,
> but it doesn't seem to be.
>
> So, I think we probably want to avoid taking it for a file_lock that
> we
> use to request a lock, but do take it for a file_lock that is used to
> record a lock. How best to code that up, I'm not quite sure...

Maybe as simple as only setting fl_nspid in locks_insert_lock_ctx(), but
that seems to just take us back to the problem of getting the pid wrong if
the lock is inserted later by a different worker than the one that created
the request.

I have a mind now to just drop fl_nspid off the struct file_lock
completely, and instead just carry fl_pid, and when we do F_GETLK, we can
do:

task = find_task_by_pid_ns(fl_pid, init_pid_ns)
fl_nspid = task_pid_nr_ns(task, task_active_pid_ns(current))

That moves all the work off into the F_GETLK case, which I think is not
used so much.
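
Filled out a bit (untested sketch, and the task lookup has to happen under
rcu_read_lock()), I'm imagining a small helper in fs/locks.c along these
lines:

/* sketch only: map an init-namespace pid number into the caller's
 * pid namespace at F_GETLK (or /proc/locks) time; returns 0 if the
 * owner is gone or not visible from here */
static pid_t locks_translate_pid(int init_nr)
{
	struct task_struct *task;
	pid_t vnr = 0;

	rcu_read_lock();
	task = find_task_by_pid_ns(init_nr, &init_pid_ns);
	if (task)
		vnr = task_pid_nr_ns(task, task_active_pid_ns(current));
	rcu_read_unlock();

	return vnr;
}

(the name and the exact not-found behavior are just placeholders).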

Ben

2017-06-05 22:02:45

by Jeff Layton

Subject: Re: [lkp-robot] [fs/locks] 9d21d181d0: will-it-scale.per_process_ops -14.1% regression

On Mon, 2017-06-05 at 14:34 -0400, Benjamin Coddington wrote:
> On 1 Jun 2017, at 11:48, Jeff Layton wrote:
>
> > On Thu, 2017-06-01 at 11:14 -0400, J. Bruce Fields wrote:
> > > On Thu, Jun 01, 2017 at 08:59:21AM -0400, Jeff Layton wrote:
> > > > I'm not so sure. That would only be the case if the thing were
> > > > marked
> > > > for mandatory locking (a really rare thing).
> > > >
> > > > The test is really simple and I don't think any read/write activity
> > > > is
> > > > involved:
> > > >
> > > > https://github.com/antonblanchard/will-it-scale/blob/master/tests/lock1.c
> > >
> > > So it's just F_WRLCK/F_UNLCK in a loop spread across multiple cores?
> > > I'd think real workloads do some work while holding the lock, and a
> > > 15%
> > > regression on just the pure lock/unlock loop might not matter? But
> > > best
> > > to be careful, I guess.
> > >
> > > --b.
> > >
> >
> > Yeah, that's my take.
> >
> > I was assuming that getting a pid reference would be essentially free,
> > but it doesn't seem to be.
> >
> > So, I think we probably want to avoid taking it for a file_lock that
> > we
> > use to request a lock, but do take it for a file_lock that is used to
> > record a lock. How best to code that up, I'm not quite sure...
>
> Maybe as simple as only setting fl_nspid in locks_insert_lock_ctx(), but
> that seems to just take us back to the problem of getting the pid wrong
> if
> the lock is inserted later by a different worker than created the
> request.
>
> I have a mind now to just drop fl_nspid off the struct file_lock
> completely,
> and instead just carry fl_pid, and when we do F_GETLK, we can do:
>
> task = find_task_by_pid_ns(fl_pid, init_pid_ns)
> fl_nspid = task_pid_nr_ns(task, task_active_pid_ns(current))
>
> That moves all the work off into the F_GETLK case, which I think is not
> used
> so much.
>

Actually I think what might work best is to:

- have locks_copy_conflock also copy the fl_nspid and take a reference
to it (as your patch #2 does)

- only set fl_nspid and take a reference there in locks_insert_lock_ctx
if it's not already set

- allow ->lock operations (like nfs) to set fl_nspid before they call
locks_lock_inode_wait to set the local lock. Might need to take an nspid
reference before dispatching an RPC so that you get the right thread
context.

Would that work?
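
For the middle point, I'm picturing locks_insert_lock_ctx() ending up
something like this (untested):

static void
locks_insert_lock_ctx(struct file_lock *fl, struct list_head *before)
{
	/* only pin a pid here if the caller (e.g. an NFS ->lock op
	 * running in rpciod/workqueue context) hasn't already
	 * supplied one */
	if (!fl->fl_nspid)
		fl->fl_nspid = get_pid(task_tgid(current));
	list_add_tail(&fl->fl_list, before);
	locks_insert_global_locks(fl);
}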
--
Jeff Layton <[email protected]>

2017-06-06 13:01:02

by Benjamin Coddington

Subject: Re: [lkp-robot] [fs/locks] 9d21d181d0: will-it-scale.per_process_ops -14.1% regression



On 5 Jun 2017, at 18:02, Jeff Layton wrote:

> On Mon, 2017-06-05 at 14:34 -0400, Benjamin Coddington wrote:
>> On 1 Jun 2017, at 11:48, Jeff Layton wrote:
>>
>>> On Thu, 2017-06-01 at 11:14 -0400, J. Bruce Fields wrote:
>>>> On Thu, Jun 01, 2017 at 08:59:21AM -0400, Jeff Layton wrote:
>>>>> I'm not so sure. That would only be the case if the thing were
>>>>> marked
>>>>> for mandatory locking (a really rare thing).
>>>>>
>>>>> The test is really simple and I don't think any read/write
>>>>> activity
>>>>> is
>>>>> involved:
>>>>>
>>>>> https://github.com/antonblanchard/will-it-scale/blob/master/tests/lock1.c
>>>>
>>>> So it's just F_WRLCK/F_UNLCK in a loop spread across multiple
>>>> cores?
>>>> I'd think real workloads do some work while holding the lock, and a
>>>> 15%
>>>> regression on just the pure lock/unlock loop might not matter? But
>>>> best
>>>> to be careful, I guess.
>>>>
>>>> --b.
>>>>
>>>
>>> Yeah, that's my take.
>>>
>>> I was assuming that getting a pid reference would be essentially
>>> free,
>>> but it doesn't seem to be.
>>>
>>> So, I think we probably want to avoid taking it for a file_lock that
>>> we
>>> use to request a lock, but do take it for a file_lock that is used
>>> to
>>> record a lock. How best to code that up, I'm not quite sure...
>>
>> Maybe as simple as only setting fl_nspid in locks_insert_lock_ctx(),
>> but
>> that seems to just take us back to the problem of getting the pid
>> wrong
>> if
>> the lock is inserted later by a different worker than created the
>> request.
>>
>> I have a mind now to just drop fl_nspid off the struct file_lock
>> completely,
>> and instead just carry fl_pid, and when we do F_GETLK, we can do:
>>
>> task = find_task_by_pid_ns(fl_pid, init_pid_ns)
>> fl_nspid = task_pid_nr_ns(task, task_active_pid_ns(current))
>>
>> That moves all the work off into the F_GETLK case, which I think is
>> not
>> used
>> so much.
>>
>
> Actually I think what might work best is to:
>
> - have locks_copy_conflock also copy the fl_nspid and take a reference
> to it (as your patch #2 does)
>
> - only set fl_nspid and take a reference there in
> locks_insert_lock_ctx
> if it's not already set
>
> - allow ->lock operations (like nfs) to set fl_nspid before they call
> locks_lock_inode_wait to set the local lock. Might need to take a
> nspid
> reference before dispatching an RPC so that you get the right thread
> context.

It would, but I think fl_nspid is completely unnecessary. The reason we
have it is so that we can translate the pid number into other namespaces,
the most common case being that F_GETLK and views of /proc/locks within a
namespace represent the same pid numbers as the processes in that
namespace that are holding the locks.

It is much simpler to just keep using fl_pid as the pid number in the init
namespace, but move the translation of that pid number to lookup time,
rather than creation time.

Ben

2017-06-06 13:15:28

by Jeff Layton

Subject: Re: [lkp-robot] [fs/locks] 9d21d181d0: will-it-scale.per_process_ops -14.1% regression

On Tue, 2017-06-06 at 09:00 -0400, Benjamin Coddington wrote:
>
> On 5 Jun 2017, at 18:02, Jeff Layton wrote:
>
> > On Mon, 2017-06-05 at 14:34 -0400, Benjamin Coddington wrote:
> > > On 1 Jun 2017, at 11:48, Jeff Layton wrote:
> > >
> > > > On Thu, 2017-06-01 at 11:14 -0400, J. Bruce Fields wrote:
> > > > > On Thu, Jun 01, 2017 at 08:59:21AM -0400, Jeff Layton wrote:
> > > > > > I'm not so sure. That would only be the case if the thing were
> > > > > > marked
> > > > > > for mandatory locking (a really rare thing).
> > > > > >
> > > > > > The test is really simple and I don't think any read/write
> > > > > > activity
> > > > > > is
> > > > > > involved:
> > > > > >
> > > > > > https://github.com/antonblanchard/will-it-scale/blob/master/tests/lock1.c
> > > > >
> > > > > So it's just F_WRLCK/F_UNLCK in a loop spread across multiple
> > > > > cores?
> > > > > I'd think real workloads do some work while holding the lock, and a
> > > > > 15%
> > > > > regression on just the pure lock/unlock loop might not matter? But
> > > > > best
> > > > > to be careful, I guess.
> > > > >
> > > > > --b.
> > > > >
> > > >
> > > > Yeah, that's my take.
> > > >
> > > > I was assuming that getting a pid reference would be essentially
> > > > free,
> > > > but it doesn't seem to be.
> > > >
> > > > So, I think we probably want to avoid taking it for a file_lock that
> > > > we
> > > > use to request a lock, but do take it for a file_lock that is used
> > > > to
> > > > record a lock. How best to code that up, I'm not quite sure...
> > >
> > > Maybe as simple as only setting fl_nspid in locks_insert_lock_ctx(),
> > > but
> > > that seems to just take us back to the problem of getting the pid
> > > wrong
> > > if
> > > the lock is inserted later by a different worker than created the
> > > request.
> > >
> > > I have a mind now to just drop fl_nspid off the struct file_lock
> > > completely,
> > > and instead just carry fl_pid, and when we do F_GETLK, we can do:
> > >
> > > task = find_task_by_pid_ns(fl_pid, init_pid_ns)
> > > fl_nspid = task_pid_nr_ns(task, task_active_pid_ns(current))
> > >
> > > That moves all the work off into the F_GETLK case, which I think is
> > > not
> > > used
> > > so much.
> > >
> >
> > Actually I think what might work best is to:
> >
> > - have locks_copy_conflock also copy the fl_nspid and take a reference
> > to it (as your patch #2 does)
> >
> > - only set fl_nspid and take a reference there in
> > locks_insert_lock_ctx
> > if it's not already set
> >
> > - allow ->lock operations (like nfs) to set fl_nspid before they call
> > locks_lock_inode_wait to set the local lock. Might need to take a
> > nspid
> > reference before dispatching an RPC so that you get the right thread
> > context.
>
> It would, but I think fl_nspid is completely unnecessary. The reason we
> have it is so that we can translate the pid number into other namespaces,
> the
> most common case being that F_GETLK and views of /proc/locks within a
> namespace represent the same pid numbers as the processes in that
> namespace
> that are holding the locks.
>
> It is much simpler to just keep using fl_pid as the pid number in the
> init
> namespace, but move the translation of that pid number to lookup time,
> rather than creation time.
>

I think that would also work and I like the idea of getting rid of a
field in file_lock.

So, to be clear:

fl_pid would then store the pid of the process in the init_pid_ns, and
you'd just translate it as appropriate to the requestor's namespace?

If we want to go that route, then you'll probably still need a flag of
some sort to indicate that the fl_pid is to be expressed "as is", for
remote filesystems.

OTOH, if the lock is held remotely, I wonder if we'd be better off
simply reporting the pid as '-1', like we do with OFD locks. Hardly
anything pays attention to l_pid anyway and it's more or less
meaningless once the filesystem extends beyond the machine you're on.

That said, I'd be inclined to do that in a separate set so we could
revert it if it caused problems somewhere.
--
Jeff Layton <[email protected]>

2017-06-06 13:21:24

by Benjamin Coddington

Subject: Re: [lkp-robot] [fs/locks] 9d21d181d0: will-it-scale.per_process_ops -14.1% regression

On 6 Jun 2017, at 9:15, Jeff Layton wrote:

> On Tue, 2017-06-06 at 09:00 -0400, Benjamin Coddington wrote:
>> It would, but I think fl_nspid is completely unnecessary. The reason we
>> have it is so that we can translate the pid number into other namespaces,
>> the
>> most common case being that F_GETLK and views of /proc/locks within a
>> namespace represent the same pid numbers as the processes in that
>> namespace
>> that are holding the locks.
>>
>> It is much simpler to just keep using fl_pid as the pid number in the
>> init
>> namespace, but move the translation of that pid number to lookup time,
>> rather than creation time.
>>
>
> I think that would also work and I like the idea of getting rid of a
> field in file_lock.
>
> So, to be clear:
>
> fl_pid would then store the pid of the process in the init_pid_ns, and
> you'd just translate it as appropriate to the requestor's namespace?

Right!

> If we want to go that route, then you'll probably still need a flag of
> some sort to indicate that the fl_pid is to be expressed "as is", for
> remote filesystems.

Yes, the hack I have now still uses that flag.

> OTOH, if the lock is held remotely, I wonder if we'd be better off
> simply reporting the pid as '-1', like we do with OFD locks. Hardly
> anything pays attention to l_pid anyway and it's more or less
> meaningless once the filesystem extends beyond the machine you're on.
>
> That said, I'd be inclined to do that in a separate set so we could
> revert it if it caused problems somewhere.

I think keeping that change separate is a good idea. It might be best to
just stay away from that change unless there's a compelling reason for it.
If l_sysid ever happens, then there's no need to change the behavior of
l_pid back again.

I'll send some patches in a bit when I am happy with them.

Ben