2024-01-26 13:02:44

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v2 10/13] nfsd: move th_cnt into nfsd_net

On Thu, 2024-01-25 at 16:56 -0500, Josef Bacik wrote:
> On Thu, Jan 25, 2024 at 04:01:27PM -0500, Chuck Lever wrote:
> > On Thu, Jan 25, 2024 at 02:53:20PM -0500, Josef Bacik wrote:
> > > This is the last global stat, move it into nfsd_net and adjust all the
> > > users to use that variant instead of the global one.
> >
> > Hm. I thought nfsd threads were a global resource -- they service
> > all network namespaces. So, shouldn't the same thread count be
> > surfaced to all containers? Won't they all see all of the nfsd
> > processes?
> >
>

Each container is going to start /proc/fs/nfsd/threads number of threads
regardless. I hadn't actually grokked that they just get tossed onto the
pile of threads that service requests.

Is is possible for one container to start a small number of threads but
have its client load be such that it spills over and ends up stealing
threads from other containers?

> I don't think we want the network namespaces seeing how many threads exist in
> the entire system right?
>
> Additionally it appears that we can have multiple threads per network namespace,
> so it's not like this will just show 1 for each individual nn, it'll show
> however many threads have been configured for that nfsd in that network
> namespace.
>
> I'm good either way, but it makes sense to me to only surface the network
> namespace related thread count. I could probably have a global counter and only
> surface the global counter if net == &init_net. Let me know what you prefer.
> Thanks,
>

The old stat meant "number of threads that are currently busy". The new
one will mean "number of threads that are busy servicing requests in
this namespace". The latter seems more useful on a per-ns basis.

In fact, it might be interesting to throw out some kind of warning when
the th_cnt exceeds the number of threads that you've allocated in the
container. That might be an indicator that you didn't spin up enough and
are poaching from other containers.

Or... maybe we should be queueing the RPC when that happens so that you
can't use more than you've allocated for the container?
--
Jeff Layton <[email protected]>


2024-01-26 13:48:37

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH v2 10/13] nfsd: move th_cnt into nfsd_net



> On Jan 26, 2024, at 8:01 AM, Jeff Layton <[email protected]> wrote:
>
> On Thu, 2024-01-25 at 16:56 -0500, Josef Bacik wrote:
>> On Thu, Jan 25, 2024 at 04:01:27PM -0500, Chuck Lever wrote:
>>> On Thu, Jan 25, 2024 at 02:53:20PM -0500, Josef Bacik wrote:
>>>> This is the last global stat, move it into nfsd_net and adjust all the
>>>> users to use that variant instead of the global one.
>>>
>>> Hm. I thought nfsd threads were a global resource -- they service
>>> all network namespaces. So, shouldn't the same thread count be
>>> surfaced to all containers? Won't they all see all of the nfsd
>>> processes?
>
> Each container is going to start /proc/fs/nfsd/threads number of threads
> regardless. I hadn't actually grokked that they just get tossed onto the
> pile of threads that service requests.
>
> Is is possible for one container to start a small number of threads but
> have its client load be such that it spills over and ends up stealing
> threads from other containers?

I haven't seen any code that manages resources based on namespace,
except in filecache.c to restrict writeback per namespace.

My impression is that any nfsd thread can serve any namespace. I'm
not sure it is currently meaningful for a particular net namespace to
"create" more threads.

If someone would like that level of control, we could implement a
cgroup mechanism and have one or more separate svc_pools per net
namespace, maybe? </hand wave>


>> I don't think we want the network namespaces seeing how many threads exist in
>> the entire system right?

If someone in a non-init net namespace does a "pgrep -c nfsd" don't
they see the total nfsd thread count for the host?


>> Additionally it appears that we can have multiple threads per network namespace,
>> so it's not like this will just show 1 for each individual nn, it'll show
>> however many threads have been configured for that nfsd in that network
>> namespace.

I've never tried this, so I'm speculating. But it seems like for
now, because all nfsd threads can serve all namespaces, they should
all see the global thread count stat.

Then later we can refine it.


>> I'm good either way, but it makes sense to me to only surface the network
>> namespace related thread count. I could probably have a global counter and only
>> surface the global counter if net == &init_net. Let me know what you prefer.


--
Chuck Lever


2024-01-26 14:08:36

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v2 10/13] nfsd: move th_cnt into nfsd_net

On Fri, 2024-01-26 at 13:48 +0000, Chuck Lever III wrote:
>
> > On Jan 26, 2024, at 8:01 AM, Jeff Layton <[email protected]> wrote:
> >
> > On Thu, 2024-01-25 at 16:56 -0500, Josef Bacik wrote:
> > > On Thu, Jan 25, 2024 at 04:01:27PM -0500, Chuck Lever wrote:
> > > > On Thu, Jan 25, 2024 at 02:53:20PM -0500, Josef Bacik wrote:
> > > > > This is the last global stat, move it into nfsd_net and adjust all the
> > > > > users to use that variant instead of the global one.
> > > >
> > > > Hm. I thought nfsd threads were a global resource -- they service
> > > > all network namespaces. So, shouldn't the same thread count be
> > > > surfaced to all containers? Won't they all see all of the nfsd
> > > > processes?
> >
> > Each container is going to start /proc/fs/nfsd/threads number of threads
> > regardless. I hadn't actually grokked that they just get tossed onto the
> > pile of threads that service requests.
> >
> > Is is possible for one container to start a small number of threads but
> > have its client load be such that it spills over and ends up stealing
> > threads from other containers?
>
> I haven't seen any code that manages resources based on namespace,
> except in filecache.c to restrict writeback per namespace.
>
> My impression is that any nfsd thread can serve any namespace. I'm
> not sure it is currently meaningful for a particular net namespace to
> "create" more threads.
>
> If someone would like that level of control, we could implement a
> cgroup mechanism and have one or more separate svc_pools per net
> namespace, maybe? </hand wave>
>

AFAICT, the total number of threads on the system will be the sum of the
threads started in each of the containers. They do just go into a big
pile, and whichever one wakes up will service the request, so the
threads aren't associated with the netns, per-se. The svc_rqst's however
_are_ per-netns. So, I don't see anything that ensures that a container
doesn't exceed the number of threads it started on its own behalf.

<hand wave>
I'm not sure we'd need to tie this in to cgroups. Now that Josef is
moving some of these key structures to be per-net, it should be fairly
simple to have nfsd() just look at the th_cnt and the thread count in
the current namespace, and just enqueue the RPC rather than doing it?
</hand wave>

OTOH, maybe I'm overly concerned here.


>
> > > I don't think we want the network namespaces seeing how many threads exist in
> > > the entire system right?
>
> If someone in a non-init net namespace does a "pgrep -c nfsd" don't
> they see the total nfsd thread count for the host?
>

Yes, they're kernel threads and they aren't associated with a particular
pid namespace.

>
> > > Additionally it appears that we can have multiple threads per network namespace,
> > > so it's not like this will just show 1 for each individual nn, it'll show
> > > however many threads have been configured for that nfsd in that network
> > > namespace.
>
> I've never tried this, so I'm speculating. But it seems like for
> now, because all nfsd threads can serve all namespaces, they should
> all see the global thread count stat.
>
> Then later we can refine it.
>

I don't think that info is particularly useful though, and it certainly
breaks expectations wrt container isolation.

Look at it this way:

Suppose I have access to a container and I spin up nfsd with a
particular number of threads. I now want to know "did I spin up enough
threads?" By making this per-namespace as Josef suggests it should be
fairly simple to tell whether my clients are regularly overrunning the
threads I started. With this info as global, I have no idea what netns
the RPCs being counted are against. I can't do anything with that info.

>
> > > I'm good either way, but it makes sense to me to only surface the network
> > > namespace related thread count. I could probably have a global counter and only
> > > surface the global counter if net == &init_net. Let me know what you prefer.
>

--
Jeff Layton <[email protected]>

2024-01-26 14:28:47

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH v2 10/13] nfsd: move th_cnt into nfsd_net



> On Jan 26, 2024, at 9:08 AM, Jeff Layton <[email protected]> wrote:
>
> On Fri, 2024-01-26 at 13:48 +0000, Chuck Lever III wrote:
>>
>>> On Jan 26, 2024, at 8:01 AM, Jeff Layton <[email protected]> wrote:
>>>
>>> On Thu, 2024-01-25 at 16:56 -0500, Josef Bacik wrote:
>>>> On Thu, Jan 25, 2024 at 04:01:27PM -0500, Chuck Lever wrote:
>>>>> On Thu, Jan 25, 2024 at 02:53:20PM -0500, Josef Bacik wrote:
>>>>>> This is the last global stat, move it into nfsd_net and adjust all the
>>>>>> users to use that variant instead of the global one.
>>>>>
>>>>> Hm. I thought nfsd threads were a global resource -- they service
>>>>> all network namespaces. So, shouldn't the same thread count be
>>>>> surfaced to all containers? Won't they all see all of the nfsd
>>>>> processes?
>>>
>>> Each container is going to start /proc/fs/nfsd/threads number of threads
>>> regardless. I hadn't actually grokked that they just get tossed onto the
>>> pile of threads that service requests.
>>>
>>> Is is possible for one container to start a small number of threads but
>>> have its client load be such that it spills over and ends up stealing
>>> threads from other containers?
>>
>> I haven't seen any code that manages resources based on namespace,
>> except in filecache.c to restrict writeback per namespace.
>>
>> My impression is that any nfsd thread can serve any namespace. I'm
>> not sure it is currently meaningful for a particular net namespace to
>> "create" more threads.
>>
>> If someone would like that level of control, we could implement a
>> cgroup mechanism and have one or more separate svc_pools per net
>> namespace, maybe? </hand wave>
>>
>
> AFAICT, the total number of threads on the system will be the sum of the
> threads started in each of the containers. They do just go into a big
> pile, and whichever one wakes up will service the request, so the
> threads aren't associated with the netns, per-se. The svc_rqst's however
> _are_ per-netns. So, I don't see anything that ensures that a container
> doesn't exceed the number of threads it started on its own behalf.
>
> <hand wave>
> I'm not sure we'd need to tie this in to cgroups.

Not a need, but cgroups are typically the way that such
resource constraints are managed, that's all.


> Now that Josef is
> moving some of these key structures to be per-net, it should be fairly
> simple to have nfsd() just look at the th_cnt and the thread count in
> the current namespace, and just enqueue the RPC rather than doing it?
> </hand wave>
>
> OTOH, maybe I'm overly concerned here.
>
>
>>
>>>> I don't think we want the network namespaces seeing how many threads exist in
>>>> the entire system right?
>>
>> If someone in a non-init net namespace does a "pgrep -c nfsd" don't
>> they see the total nfsd thread count for the host?
>>
>
> Yes, they're kernel threads and they aren't associated with a particular
> pid namespace.
>
>>
>>>> Additionally it appears that we can have multiple threads per network namespace,
>>>> so it's not like this will just show 1 for each individual nn, it'll show
>>>> however many threads have been configured for that nfsd in that network
>>>> namespace.
>>
>> I've never tried this, so I'm speculating. But it seems like for
>> now, because all nfsd threads can serve all namespaces, they should
>> all see the global thread count stat.
>>
>> Then later we can refine it.
>>
>
> I don't think that info is particularly useful though, and it certainly
> breaks expectations wrt container isolation.
>
> Look at it this way:
>
> Suppose I have access to a container and I spin up nfsd with a
> particular number of threads. I now want to know "did I spin up enough
> threads?"

It makes sense to me for a container to ask for one or more
NFSD listeners. But I'm not yet clear on what it means for
a container to try to alter the NFSD thread count, which is
currently global.


> By making this per-namespace as Josef suggests it should be
> fairly simple to tell whether my clients are regularly overrunning the
> threads I started. With this info as global, I have no idea what netns
> the RPCs being counted are against. I can't do anything with that info.

That's true. I'm just not sure we need to do anything
significant about it as part of this patch series.

I'm not at all advocating leaving it like this. I agree it
needs updating.

For now, just fill in that thread count stat with the global
thread count, and later when we have a better concept for
what it means to "adjust the nfsd thread count from inside a
container" we can come back and make it make sense.


--
Chuck Lever


2024-01-26 15:03:44

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v2 10/13] nfsd: move th_cnt into nfsd_net

On Fri, 2024-01-26 at 14:27 +0000, Chuck Lever III wrote:
>
> > On Jan 26, 2024, at 9:08 AM, Jeff Layton <[email protected]> wrote:
> >
> > On Fri, 2024-01-26 at 13:48 +0000, Chuck Lever III wrote:
> > >
> > > > On Jan 26, 2024, at 8:01 AM, Jeff Layton <[email protected]> wrote:
> > > >
> > > > On Thu, 2024-01-25 at 16:56 -0500, Josef Bacik wrote:
> > > > > On Thu, Jan 25, 2024 at 04:01:27PM -0500, Chuck Lever wrote:
> > > > > > On Thu, Jan 25, 2024 at 02:53:20PM -0500, Josef Bacik wrote:
> > > > > > > This is the last global stat, move it into nfsd_net and adjust all the
> > > > > > > users to use that variant instead of the global one.
> > > > > >
> > > > > > Hm. I thought nfsd threads were a global resource -- they service
> > > > > > all network namespaces. So, shouldn't the same thread count be
> > > > > > surfaced to all containers? Won't they all see all of the nfsd
> > > > > > processes?
> > > >
> > > > Each container is going to start /proc/fs/nfsd/threads number of threads
> > > > regardless. I hadn't actually grokked that they just get tossed onto the
> > > > pile of threads that service requests.
> > > >
> > > > Is is possible for one container to start a small number of threads but
> > > > have its client load be such that it spills over and ends up stealing
> > > > threads from other containers?
> > >
> > > I haven't seen any code that manages resources based on namespace,
> > > except in filecache.c to restrict writeback per namespace.
> > >
> > > My impression is that any nfsd thread can serve any namespace. I'm
> > > not sure it is currently meaningful for a particular net namespace to
> > > "create" more threads.
> > >
> > > If someone would like that level of control, we could implement a
> > > cgroup mechanism and have one or more separate svc_pools per net
> > > namespace, maybe? </hand wave>
> > >
> >
> > AFAICT, the total number of threads on the system will be the sum of the
> > threads started in each of the containers. They do just go into a big
> > pile, and whichever one wakes up will service the request, so the
> > threads aren't associated with the netns, per-se. The svc_rqst's however
> > _are_ per-netns. So, I don't see anything that ensures that a container
> > doesn't exceed the number of threads it started on its own behalf.
> >
> > <hand wave>
> > I'm not sure we'd need to tie this in to cgroups.
>
> Not a need, but cgroups are typically the way that such
> resource constraints are managed, that's all.
>
>
> > Now that Josef is
> > moving some of these key structures to be per-net, it should be fairly
> > simple to have nfsd() just look at the th_cnt and the thread count in
> > the current namespace, and just enqueue the RPC rather than doing it?
> > </hand wave>
> >
> > OTOH, maybe I'm overly concerned here.
> >
> >
> > >
> > > > > I don't think we want the network namespaces seeing how many threads exist in
> > > > > the entire system right?
> > >
> > > If someone in a non-init net namespace does a "pgrep -c nfsd" don't
> > > they see the total nfsd thread count for the host?
> > >
> >
> > Yes, they're kernel threads and they aren't associated with a particular
> > pid namespace.
> >
> > >
> > > > > Additionally it appears that we can have multiple threads per network namespace,
> > > > > so it's not like this will just show 1 for each individual nn, it'll show
> > > > > however many threads have been configured for that nfsd in that network
> > > > > namespace.
> > >
> > > I've never tried this, so I'm speculating. But it seems like for
> > > now, because all nfsd threads can serve all namespaces, they should
> > > all see the global thread count stat.
> > >
> > > Then later we can refine it.
> > >
> >
> > I don't think that info is particularly useful though, and it certainly
> > breaks expectations wrt container isolation.
> >
> > Look at it this way:
> >
> > Suppose I have access to a container and I spin up nfsd with a
> > particular number of threads. I now want to know "did I spin up enough
> > threads?"
>
> It makes sense to me for a container to ask for one or more
> NFSD listeners. But I'm not yet clear on what it means for
> a container to try to alter the NFSD thread count, which is
> currently global.
>

write_threads sets the number of nfsd threads for the svc_serv, which is
a per-net object, so serv->sv_nrthreads is only counting the threads for
its namespace.

Each container that starts an nfsd will contribute "threads" number of
nfsds to the pile. When you set "threads" to a different number, the
total pile is grown or reduced accordingly. In fact, I'm not even sure
we keep a systemwide total.

What _isn't_ done (unless I'm missing something) is any sort of
limitation on the use of those threads. So you could have a container
that starts one nfsd thread, but its clients can still have many more
RPCs in flight, up to the total number of nfsd's running on the host.
Limiting that might be possible, but I'm not yet convinced that it's a
good idea.

It might be interesting to gather some metrics though. How often do the
number of RPCs in flight exceed the number of threads that we started in
a namespace? Might also be good to gather a high-water mark too?

>
> > By making this per-namespace as Josef suggests it should be
> > fairly simple to tell whether my clients are regularly overrunning the
> > threads I started. With this info as global, I have no idea what netns
> > the RPCs being counted are against. I can't do anything with that info.
>
> That's true. I'm just not sure we need to do anything
> significant about it as part of this patch series.
>
> I'm not at all advocating leaving it like this. I agree it
> needs updating.
>
> For now, just fill in that thread count stat with the global
> thread count, and later when we have a better concept for
> what it means to "adjust the nfsd thread count from inside a
> container" we can come back and make it make sense.

It would be good to step back and decide how we think this should work.
What would this look like if we were designing it today? Then we can
decide how to get there.

Personally, I think keeping the view inside the container as isolated as
possible is the right thing to do.
--
Jeff Layton <[email protected]>

2024-01-26 15:16:30

by Chuck Lever III

[permalink] [raw]
Subject: Re: [PATCH v2 10/13] nfsd: move th_cnt into nfsd_net



> On Jan 26, 2024, at 10:03 AM, Jeff Layton <[email protected]> wrote:
>
> On Fri, 2024-01-26 at 14:27 +0000, Chuck Lever III wrote:
>>
>>> On Jan 26, 2024, at 9:08 AM, Jeff Layton <[email protected]> wrote:
>>>
>>> On Fri, 2024-01-26 at 13:48 +0000, Chuck Lever III wrote:
>>>>
>>>>> On Jan 26, 2024, at 8:01 AM, Jeff Layton <[email protected]> wrote:
>>>>>
>>>>> On Thu, 2024-01-25 at 16:56 -0500, Josef Bacik wrote:
>>>>>> On Thu, Jan 25, 2024 at 04:01:27PM -0500, Chuck Lever wrote:
>>>>>>> On Thu, Jan 25, 2024 at 02:53:20PM -0500, Josef Bacik wrote:
>>>>>>>> This is the last global stat, move it into nfsd_net and adjust all the
>>>>>>>> users to use that variant instead of the global one.
>>>>>>>
>>>>>>> Hm. I thought nfsd threads were a global resource -- they service
>>>>>>> all network namespaces. So, shouldn't the same thread count be
>>>>>>> surfaced to all containers? Won't they all see all of the nfsd
>>>>>>> processes?
>>>>>
>>>>> Each container is going to start /proc/fs/nfsd/threads number of threads
>>>>> regardless. I hadn't actually grokked that they just get tossed onto the
>>>>> pile of threads that service requests.
>>>>>
>>>>> Is is possible for one container to start a small number of threads but
>>>>> have its client load be such that it spills over and ends up stealing
>>>>> threads from other containers?
>>>>
>>>> I haven't seen any code that manages resources based on namespace,
>>>> except in filecache.c to restrict writeback per namespace.
>>>>
>>>> My impression is that any nfsd thread can serve any namespace. I'm
>>>> not sure it is currently meaningful for a particular net namespace to
>>>> "create" more threads.
>>>>
>>>> If someone would like that level of control, we could implement a
>>>> cgroup mechanism and have one or more separate svc_pools per net
>>>> namespace, maybe? </hand wave>
>>>>
>>>
>>> AFAICT, the total number of threads on the system will be the sum of the
>>> threads started in each of the containers. They do just go into a big
>>> pile, and whichever one wakes up will service the request, so the
>>> threads aren't associated with the netns, per-se. The svc_rqst's however
>>> _are_ per-netns. So, I don't see anything that ensures that a container
>>> doesn't exceed the number of threads it started on its own behalf.
>>>
>>> <hand wave>
>>> I'm not sure we'd need to tie this in to cgroups.
>>
>> Not a need, but cgroups are typically the way that such
>> resource constraints are managed, that's all.
>>
>>
>>> Now that Josef is
>>> moving some of these key structures to be per-net, it should be fairly
>>> simple to have nfsd() just look at the th_cnt and the thread count in
>>> the current namespace, and just enqueue the RPC rather than doing it?
>>> </hand wave>
>>>
>>> OTOH, maybe I'm overly concerned here.
>>>
>>>
>>>>
>>>>>> I don't think we want the network namespaces seeing how many threads exist in
>>>>>> the entire system right?
>>>>
>>>> If someone in a non-init net namespace does a "pgrep -c nfsd" don't
>>>> they see the total nfsd thread count for the host?
>>>>
>>>
>>> Yes, they're kernel threads and they aren't associated with a particular
>>> pid namespace.
>>>
>>>>
>>>>>> Additionally it appears that we can have multiple threads per network namespace,
>>>>>> so it's not like this will just show 1 for each individual nn, it'll show
>>>>>> however many threads have been configured for that nfsd in that network
>>>>>> namespace.
>>>>
>>>> I've never tried this, so I'm speculating. But it seems like for
>>>> now, because all nfsd threads can serve all namespaces, they should
>>>> all see the global thread count stat.
>>>>
>>>> Then later we can refine it.
>>>>
>>>
>>> I don't think that info is particularly useful though, and it certainly
>>> breaks expectations wrt container isolation.
>>>
>>> Look at it this way:
>>>
>>> Suppose I have access to a container and I spin up nfsd with a
>>> particular number of threads. I now want to know "did I spin up enough
>>> threads?"
>>
>> It makes sense to me for a container to ask for one or more
>> NFSD listeners. But I'm not yet clear on what it means for
>> a container to try to alter the NFSD thread count, which is
>> currently global.
>>
>
> write_threads sets the number of nfsd threads for the svc_serv, which is
> a per-net object, so serv->sv_nrthreads is only counting the threads for
> its namespace.

OK. I missed the part where struct svc_serv is per-net, and
each svc_serv has its own thread pool. Got it.


> Each container that starts an nfsd will contribute "threads" number of
> nfsds to the pile. When you set "threads" to a different number, the
> total pile is grown or reduced accordingly. In fact, I'm not even sure
> we keep a systemwide total.
>
> What _isn't_ done (unless I'm missing something) is any sort of
> limitation on the use of those threads. So you could have a container
> that starts one nfsd thread, but its clients can still have many more
> RPCs in flight, up to the total number of nfsd's running on the host.
> Limiting that might be possible, but I'm not yet convinced that it's a
> good idea.
>
> It might be interesting to gather some metrics though. How often do the
> number of RPCs in flight exceed the number of threads that we started in
> a namespace? Might also be good to gather a high-water mark too?

I had a bunch of trace points queued up for this purpose,
and all that got thrown out by Neil's recent work fixing
up our pool thread scheduler.

I'm not sure that queued requests are all that meaningful
or even quantifiable. A more meaningful metric is round
trip time measured on clients.


>>> By making this per-namespace as Josef suggests it should be
>>> fairly simple to tell whether my clients are regularly overrunning the
>>> threads I started. With this info as global, I have no idea what netns
>>> the RPCs being counted are against. I can't do anything with that info.
>>
>> That's true. I'm just not sure we need to do anything
>> significant about it as part of this patch series.
>>
>> I'm not at all advocating leaving it like this. I agree it
>> needs updating.
>>
>> For now, just fill in that thread count stat with the global
>> thread count, and later when we have a better concept for
>> what it means to "adjust the nfsd thread count from inside a
>> container" we can come back and make it make sense.
>
> It would be good to step back and decide how we think this should work.
> What would this look like if we were designing it today? Then we can
> decide how to get there.
>
> Personally, I think keeping the view inside the container as isolated as
> possible is the right thing to do.

I agree. I would like to get Josef's patches in quickly too.

Should that metric report svc_serv::sv_nrthreads instead of
the global thread count?


--
Chuck Lever


2024-01-26 15:36:06

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH v2 10/13] nfsd: move th_cnt into nfsd_net

On Fri, 2024-01-26 at 15:16 +0000, Chuck Lever III wrote:
>
> > On Jan 26, 2024, at 10:03 AM, Jeff Layton <[email protected]> wrote:
> >
> > On Fri, 2024-01-26 at 14:27 +0000, Chuck Lever III wrote:
> > >
> > > > On Jan 26, 2024, at 9:08 AM, Jeff Layton <[email protected]> wrote:
> > > >
> > > > On Fri, 2024-01-26 at 13:48 +0000, Chuck Lever III wrote:
> > > > >
> > > > > > On Jan 26, 2024, at 8:01 AM, Jeff Layton <[email protected]> wrote:
> > > > > >
> > > > > > On Thu, 2024-01-25 at 16:56 -0500, Josef Bacik wrote:
> > > > > > > On Thu, Jan 25, 2024 at 04:01:27PM -0500, Chuck Lever wrote:
> > > > > > > > On Thu, Jan 25, 2024 at 02:53:20PM -0500, Josef Bacik wrote:
> > > > > > > > > This is the last global stat, move it into nfsd_net and adjust all the
> > > > > > > > > users to use that variant instead of the global one.
> > > > > > > >
> > > > > > > > Hm. I thought nfsd threads were a global resource -- they service
> > > > > > > > all network namespaces. So, shouldn't the same thread count be
> > > > > > > > surfaced to all containers? Won't they all see all of the nfsd
> > > > > > > > processes?
> > > > > >
> > > > > > Each container is going to start /proc/fs/nfsd/threads number of threads
> > > > > > regardless. I hadn't actually grokked that they just get tossed onto the
> > > > > > pile of threads that service requests.
> > > > > >
> > > > > > Is is possible for one container to start a small number of threads but
> > > > > > have its client load be such that it spills over and ends up stealing
> > > > > > threads from other containers?
> > > > >
> > > > > I haven't seen any code that manages resources based on namespace,
> > > > > except in filecache.c to restrict writeback per namespace.
> > > > >
> > > > > My impression is that any nfsd thread can serve any namespace. I'm
> > > > > not sure it is currently meaningful for a particular net namespace to
> > > > > "create" more threads.
> > > > >
> > > > > If someone would like that level of control, we could implement a
> > > > > cgroup mechanism and have one or more separate svc_pools per net
> > > > > namespace, maybe? </hand wave>
> > > > >
> > > >
> > > > AFAICT, the total number of threads on the system will be the sum of the
> > > > threads started in each of the containers. They do just go into a big
> > > > pile, and whichever one wakes up will service the request, so the
> > > > threads aren't associated with the netns, per-se. The svc_rqst's however
> > > > _are_ per-netns. So, I don't see anything that ensures that a container
> > > > doesn't exceed the number of threads it started on its own behalf.
> > > >
> > > > <hand wave>
> > > > I'm not sure we'd need to tie this in to cgroups.
> > >
> > > Not a need, but cgroups are typically the way that such
> > > resource constraints are managed, that's all.
> > >
> > >
> > > > Now that Josef is
> > > > moving some of these key structures to be per-net, it should be fairly
> > > > simple to have nfsd() just look at the th_cnt and the thread count in
> > > > the current namespace, and just enqueue the RPC rather than doing it?
> > > > </hand wave>
> > > >
> > > > OTOH, maybe I'm overly concerned here.
> > > >
> > > >
> > > > >
> > > > > > > I don't think we want the network namespaces seeing how many threads exist in
> > > > > > > the entire system right?
> > > > >
> > > > > If someone in a non-init net namespace does a "pgrep -c nfsd" don't
> > > > > they see the total nfsd thread count for the host?
> > > > >
> > > >
> > > > Yes, they're kernel threads and they aren't associated with a particular
> > > > pid namespace.
> > > >
> > > > >
> > > > > > > Additionally it appears that we can have multiple threads per network namespace,
> > > > > > > so it's not like this will just show 1 for each individual nn, it'll show
> > > > > > > however many threads have been configured for that nfsd in that network
> > > > > > > namespace.
> > > > >
> > > > > I've never tried this, so I'm speculating. But it seems like for
> > > > > now, because all nfsd threads can serve all namespaces, they should
> > > > > all see the global thread count stat.
> > > > >
> > > > > Then later we can refine it.
> > > > >
> > > >
> > > > I don't think that info is particularly useful though, and it certainly
> > > > breaks expectations wrt container isolation.
> > > >
> > > > Look at it this way:
> > > >
> > > > Suppose I have access to a container and I spin up nfsd with a
> > > > particular number of threads. I now want to know "did I spin up enough
> > > > threads?"
> > >
> > > It makes sense to me for a container to ask for one or more
> > > NFSD listeners. But I'm not yet clear on what it means for
> > > a container to try to alter the NFSD thread count, which is
> > > currently global.
> > >
> >
> > write_threads sets the number of nfsd threads for the svc_serv, which is
> > a per-net object, so serv->sv_nrthreads is only counting the threads for
> > its namespace.
>
> OK. I missed the part where struct svc_serv is per-net, and
> each svc_serv has its own thread pool. Got it.
>
>
> > Each container that starts an nfsd will contribute "threads" number of
> > nfsds to the pile. When you set "threads" to a different number, the
> > total pile is grown or reduced accordingly. In fact, I'm not even sure
> > we keep a systemwide total.
> >
> > What _isn't_ done (unless I'm missing something) is any sort of
> > limitation on the use of those threads. So you could have a container
> > that starts one nfsd thread, but its clients can still have many more
> > RPCs in flight, up to the total number of nfsd's running on the host.
> > Limiting that might be possible, but I'm not yet convinced that it's a
> > good idea.
> >
> > It might be interesting to gather some metrics though. How often do the
> > number of RPCs in flight exceed the number of threads that we started in
> > a namespace? Might also be good to gather a high-water mark too?
>
> I had a bunch of trace points queued up for this purpose,
> and all that got thrown out by Neil's recent work fixing
> up our pool thread scheduler.
>
> I'm not sure that queued requests are all that meaningful
> or even quantifiable. A more meaningful metric is round
> trip time measured on clients.
>

I wouldn't try to queue anything just yet, but knowing when a particular
container is exceeding the number of threads it started seems like a
helpful metric. That could be done in a later patch though. I wouldn't
block this work on that.

>
> > > > By making this per-namespace as Josef suggests it should be
> > > > fairly simple to tell whether my clients are regularly overrunning the
> > > > threads I started. With this info as global, I have no idea what netns
> > > > the RPCs being counted are against. I can't do anything with that info.
> > >
> > > That's true. I'm just not sure we need to do anything
> > > significant about it as part of this patch series.
> > >
> > > I'm not at all advocating leaving it like this. I agree it
> > > needs updating.
> > >
> > > For now, just fill in that thread count stat with the global
> > > thread count, and later when we have a better concept for
> > > what it means to "adjust the nfsd thread count from inside a
> > > container" we can come back and make it make sense.
> >
> > It would be good to step back and decide how we think this should work.
> > What would this look like if we were designing it today? Then we can
> > decide how to get there.
> >
> > Personally, I think keeping the view inside the container as isolated as
> > possible is the right thing to do.
>
> I agree. I would like to get Josef's patches in quickly too.
>
> Should that metric report svc_serv::sv_nrthreads instead of
> the global thread count?
>

No. The sv_nrthreads is just the same value that /proc/fs/nfsd/threads
shows. The _metric_ should show the th_cnt for that particular
namespace.

It would also be nice to know what the high-water mark for th_cnt is.
IOW, if my containers clients exceeded the number of threads I've
started, then it would be helpful to know by how much.
--
Jeff Layton <[email protected]>