2017-03-06 19:07:35

by Tejun Heo

Subject: Re: counting file descriptors with a cgroup controller

Hello,

On Fri, Feb 17, 2017 at 12:37:11PM +0100, Krzysztof Opasiak wrote:
> > We need to limit and monitor the number of file descriptors processes
> > keep open. If a process exceeds a certain limit we'd like to terminate
> > it and restart it, or reboot the whole system. Currently the RLIMIT
> > API allows limiting the number of file descriptors, but to achieve our
> > goals we'd need to make sure all programs we run handle the EMFILE
> > errno properly. That is why we are considering developing a cgroup
> > controller that limits the number of open file descriptors of its
> > members (similar to the memory controller).
> >
> > Any comments? Is there any alternative that:
> >
> > + does not require modifications of user-land code,
> > + enables another process (e.g. init) to be notified and apply policy.

Hmm... I'm not quite sure fds qualify as an independent system-wide
resource. We did that for pids because pids are globally limited and
can run out way earlier than the memory backing them. I don't think we
have similar restrictions for fds, do we?

Thanks.

--
tejun


2017-03-07 12:35:04

by Krzysztof Opasiak

Subject: Re: counting file descriptors with a cgroup controller

Hi

On 03/06/2017 07:58 PM, Tejun Heo wrote:
> Hello,
>
> On Fri, Feb 17, 2017 at 12:37:11PM +0100, Krzysztof Opasiak wrote:
>>> We need to limit and monitor the number of file descriptors processes
>>> keep open. If a process exceeds a certain limit we'd like to terminate
>>> it and restart it, or reboot the whole system. Currently the RLIMIT
>>> API allows limiting the number of file descriptors, but to achieve our
>>> goals we'd need to make sure all programs we run handle the EMFILE
>>> errno properly. That is why we are considering developing a cgroup
>>> controller that limits the number of open file descriptors of its
>>> members (similar to the memory controller).
>>>
>>> Any comments? Is there any alternative that:
>>>
>>> + does not require modifications of user-land code,
>>> + enables another process (e.g. init) to be notified and apply policy.
>
> Hmm... I'm not quite sure fds qualify as an independent system-wide
> resource. We did that for pids because pids are globally limited and
> can run out way earlier than the memory backing them. I don't think we
> have similar restrictions for fds, do we?

Well, I'm not aware of such restrictions...

So maybe let me clarify our use case so we can have some more
discussion about this. We are dealing with the task of monitoring
system services on an IoT system. Such a system needs to run as long
as possible without a reboot, just like a server. In the server world
almost the whole system state is monitored by services like Nagios.
They measure each parameter (like cpu, memory etc.) at some interval.
Unfortunately we cannot use this in an embedded system due to power
consumption.

So generally now we consider two approaches:

1) Use rlimits when possible to limit resources for each process.

The problem here is that this creates an implicit requirement that all
system services are well written and able to detect that they have,
for example, run out of fds, and then exit with a suitable error code
instead of hanging forever and telling clients that they are unable to
handle their requests due to a lack of fds. This is especially hard
when a service uses a lot of libraries under the hood, because those
libraries also need to propagate this error code from each function
that opens files. It is harder still with proprietary services or
libraries for which we don't have access to the source code.
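
For reference, what I mean by the rlimit approach is roughly the
minimal sketch below (error handling omitted; the limit of 32 is just
an example):

#include <stdio.h>
#include <fcntl.h>
#include <sys/resource.h>

int main(void)
{
        /* cap this process (and its children) at 32 open fds */
        struct rlimit rl = { .rlim_cur = 32, .rlim_max = 32 };
        int fd;

        setrlimit(RLIMIT_NOFILE, &rl);

        /* once the fd table is full, open() fails with EMFILE and
         * every caller, including libraries, must cope with it
         */
        fd = open("/dev/null", O_RDONLY);
        if (fd < 0)
                perror("open");

        return 0;
}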

2) Use cgroups to limit and monitor resources usage

Generally systemd creates a cgroup for each service. Controllers like
the memory cgroup are able to notify userspace when memory usage
reaches some level. So, for example, systemd could get a notification
that one of the cgroups is using more memory than it should, but as
long as that is not the hard limit of the cgroup, the service is not
going to even notice. So instead of an error being returned from, for
example, malloc() in the service, systemd could just send a signal to
that service, ask it to exit gracefully, and then restart it. The
disadvantage of this solution is the need for a cgroup controller for
each resource we would like to monitor. For now we have suitable
cgroups for everything we need apart from file descriptors.
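
To illustrate, the memory notification we rely on boils down to
something like the sketch below (a cgroup v1 sketch without error
handling; the cgroup path and the 8 MB threshold are made-up
examples):

#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/eventfd.h>

int main(void)
{
        const char *cg = "/sys/fs/cgroup/memory/myservice"; /* example */
        char path[128], cmd[64];
        uint64_t ticks;
        int efd, ufd, cfd;

        efd = eventfd(0, 0);
        snprintf(path, sizeof(path), "%s/memory.usage_in_bytes", cg);
        ufd = open(path, O_RDONLY);
        snprintf(path, sizeof(path), "%s/cgroup.event_control", cg);
        cfd = open(path, O_WRONLY);

        /* "<event_fd> <usage_fd> <threshold>" arms a usage threshold */
        snprintf(cmd, sizeof(cmd), "%d %d %llu", efd, ufd,
                 (unsigned long long)(8 << 20));
        write(cfd, cmd, strlen(cmd));

        /* read() blocks until memory usage crosses the threshold */
        read(efd, &ticks, sizeof(ticks));
        printf("memory threshold crossed\n");
        return 0;
}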

What do you think about this? Maybe you have some other ideas for how
we could achieve this?

Best regards,
--
Krzysztof Opasiak
Samsung R&D Institute Poland
Samsung Electronics

2017-03-07 19:42:16

by Tejun Heo

Subject: Re: counting file descriptors with a cgroup controller

Hello, Krzysztof.

On Tue, Mar 07, 2017 at 12:19:52PM +0100, Krzysztof Opasiak wrote:
> So maybe let me clarify our use case so we can have some more
> discussion about this. We are dealing with the task of monitoring
> system services on an IoT system. Such a system needs to run as long
> as possible without a reboot, just like a server. In the server world
> almost the whole system state is monitored by services like Nagios.
> They measure each parameter (like cpu, memory etc.) at some interval.
> Unfortunately we cannot use this in an embedded system due to power
> consumption.

So, we don't add controllers for specific use case scenarios. The
target actually has to be a fundamental resource which can't be
isolated in a different way.

The use case you're describing is more about working around
shortcomings in userspace by implementing a major kernel feature, when
the said shortcomings can easily be controlled and mitigated from
userspace - e.g. if running out of fds can't be handled reliably from
within the target application for some reason and the application may
lock up in that condition, protect the base resources so that a
monitoring process can always run reliably, and let that process take
corrective action when such a condition is detected.

This doesn't really seem to qualify as dedicated kernel
functionality.

Thanks.

--
tejun

2017-03-07 20:07:38

by Krzysztof Opasiak

Subject: Re: counting file descriptors with a cgroup controller



On 03/07/2017 08:41 PM, Tejun Heo wrote:
> Hello, Krzysztof.
>
> On Tue, Mar 07, 2017 at 12:19:52PM +0100, Krzysztof Opasiak wrote:
>> So maybe let me clarify our use case so we can have some more
>> discussion about this. We are dealing with the task of monitoring
>> system services on an IoT system. Such a system needs to run as long
>> as possible without a reboot, just like a server. In the server world
>> almost the whole system state is monitored by services like Nagios.
>> They measure each parameter (like cpu, memory etc.) at some interval.
>> Unfortunately we cannot use this in an embedded system due to power
>> consumption.
>
> So, we don't add controllers for specific use case scenarios. The
> target actually has to be a fundamental resource which can't be
> isolated in a different way.
>
> The use case you're describing is more about working around
> shortcomings in userspace by implementing a major kernel feature,
> when the said shortcomings can easily be controlled and mitigated
> from userspace - e.g. if running out of fds can't be handled reliably
> from within the target application for some reason and the
> application may lock up in that condition, protect the base resources
> so that a monitoring process can always run reliably, and let that
> process take corrective action when such a condition is detected.
>

In theory that's what we plan to do, but we are looking for an
efficient method of detecting that a particular application is using
more fds than it should (a limit declared by the developer).

Personally, I don't want to use rlimits for this, as they end up
returning an error code from, for example, open() when we hit the
limit. This may lead to some unpredictable crashes in services (esp.
those poor proprietary binary blobs). Instead of injecting errors
into the service, we would like to just get a notification that the
service has more open fds than it should and ask it to restart in a
polite way.

For memory this seems quite easy to achieve, as we can just get an
eventfd notification when the application passes a given memory
usage, using the memory cgroup controller. Maybe you know of some
efficient method to do the same for fds?

Best regards,
--
Krzysztof Opasiak
Samsung R&D Institute Poland
Samsung Electronics

2017-03-08 03:00:15

by Parav Pandit

Subject: Re: counting file descriptors with a cgroup controller

Hi,

On Tue, Mar 7, 2017 at 2:48 PM, Tejun Heo <[email protected]> wrote:
>
> Hello,
>
> On Tue, Mar 07, 2017 at 09:06:49PM +0100, Krzysztof Opasiak wrote:
> > Personally, I don't want to use rlimits for this, as they end up
> > returning an error code from, for example, open() when we hit the
> > limit. This may lead to some unpredictable crashes in services (esp.
> > those poor proprietary binary blobs). Instead of injecting errors
> > into the service, we would like to just get a notification that the
> > service has more open fds than it should and ask it to restart in a
> > polite way.
> >

How do those poor proprietary binary blobs remain polite after a restart?
Do you mean you want to keep restarting them whenever they reach the limit?

> > For memory this seems quite easy to achieve, as we can just get an
> > eventfd notification when the application passes a given memory
> > usage, using the memory cgroup controller. Maybe you know of some
> > efficient method to do the same for fds?
>
> So, if all you wanna do is reliably detect open(2) failures, can't
> you do that with bpf tracing?
>
> Thanks.
>
> --
> tejun

2017-03-08 10:22:55

by Krzysztof Opasiak

Subject: Re: counting file descriptors with a cgroup controller



On 03/08/2017 03:59 AM, Parav Pandit wrote:
> Hi,
>
> On Tue, Mar 7, 2017 at 2:48 PM, Tejun Heo <[email protected]> wrote:
>>
>> Hello,
>>
>> On Tue, Mar 07, 2017 at 09:06:49PM +0100, Krzysztof Opasiak wrote:
>>> Personally, I don't want to use rlimits for this, as they end up
>>> returning an error code from, for example, open() when we hit the
>>> limit. This may lead to some unpredictable crashes in services (esp.
>>> those poor proprietary binary blobs). Instead of injecting errors
>>> into the service, we would like to just get a notification that the
>>> service has more open fds than it should and ask it to restart in a
>>> polite way.
>>>
>
> How do those poor proprietary binary blobs remain polite after a restart?

They won't.

> Do you mean you want to keep restarting them whenever they reach the limit?

We'd like to restart them each time they reach the limit declared by
the developer.

Best regards,
--
Krzysztof Opasiak
Samsung R&D Institute Poland
Samsung Electronics

2017-03-08 12:07:51

by Krzysztof Opasiak

Subject: Re: counting file descriptors with a cgroup controller



On 03/07/2017 09:48 PM, Tejun Heo wrote:
> Hello,
>
> On Tue, Mar 07, 2017 at 09:06:49PM +0100, Krzysztof Opasiak wrote:
>> Personally, I don't want to use rlimits for this, as they end up
>> returning an error code from, for example, open() when we hit the
>> limit. This may lead to some unpredictable crashes in services (esp.
>> those poor proprietary binary blobs). Instead of injecting errors
>> into the service, we would like to just get a notification that the
>> service has more open fds than it should and ask it to restart in a
>> polite way.
>>
>> For memory this seems quite easy to achieve, as we can just get an
>> eventfd notification when the application passes a given memory
>> usage, using the memory cgroup controller. Maybe you know of some
>> efficient method to do the same for fds?
>
> So, if all you wanna do is reliably detect open(2) failures, can't
> you do that with bpf tracing?
>

Well, detecting failures of open(2) is not enough, and it has a couple
of problems:

1) open(2) is not the only syscall which creates fds. Other syscalls
like socket(2) and dup(2), and some ioctl()s on drivers (for example
video), also create fds. I'm not sure if we have any mechanism other
than grepping through the kernel source to find out which ioctl()s
create fds and which do not.

2) As far as I know (I'm not a bpf specialist, so please correct me if
I'm wrong), with bpf we are only able to detect such events; we are
unable to prevent them from reaching the caller. That means the
service will know that it has run out of fds and will need to handle
this properly. If there is a bug in this error path, the service may
crash.
What we would like is just a notification to an external process that
some limit has been reached, without returning an error to the service
itself.

3) Theoretically we could do this using bpf or syscall auditing and
count fds for each userspace process, or check /proc/<PID> after each
notification, but that gets very heavy for a production environment.
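
For concreteness, the /proc variant boils down to the sketch below,
run per process and per check (the pid is a made-up example); this is
exactly the kind of periodic work we want to avoid:

#include <stdio.h>
#include <dirent.h>
#include <sys/types.h>

/* count the open fds of a process by listing /proc/<pid>/fd */
static int count_fds(pid_t pid)
{
        char path[32];
        struct dirent *de;
        int n = 0;
        DIR *dir;

        snprintf(path, sizeof(path), "/proc/%d/fd", (int)pid);
        dir = opendir(path);
        if (!dir)
                return -1;
        while ((de = readdir(dir)))
                if (de->d_name[0] != '.') /* skip "." and ".." */
                        n++;
        closedir(dir);
        return n;
}

int main(void)
{
        pid_t pid = 1234; /* example pid of a monitored service */

        printf("pid %d has %d open fds\n", (int)pid, count_fds(pid));
        return 0;
}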

Best regards,
--
Krzysztof Opasiak
Samsung R&D Institute Poland
Samsung Electronics

2017-03-08 14:49:34

by Tejun Heo

Subject: Re: counting file descriptors with a cgroup controller

Hello,

On Tue, Mar 07, 2017 at 09:06:49PM +0100, Krzysztof Opasiak wrote:
> Personally, I don't want to use rlimits for this, as they end up
> returning an error code from, for example, open() when we hit the
> limit. This may lead to some unpredictable crashes in services (esp.
> those poor proprietary binary blobs). Instead of injecting errors
> into the service, we would like to just get a notification that the
> service has more open fds than it should and ask it to restart in a
> polite way.
>
> For memory this seems quite easy to achieve, as we can just get an
> eventfd notification when the application passes a given memory
> usage, using the memory cgroup controller. Maybe you know of some
> efficient method to do the same for fds?

So, if all you wanna do is reliably detect open(2) failures, can't
you do that with bpf tracing?

Thanks.

--
tejun

2017-03-08 19:01:49

by Tejun Heo

Subject: Re: counting file descriptors with a cgroup controller

On Wed, Mar 08, 2017 at 10:52:18AM +0100, Krzysztof Opasiak wrote:
> Well, detecting failures of open(2) is not enough, and it has a
> couple of problems:
>
> 1) open(2) is not the only syscall which creates fds. Other syscalls
> like socket(2) and dup(2), and some ioctl()s on drivers (for example
> video), also create fds. I'm not sure if we have any mechanism other
> than grepping through the kernel source to find out which ioctl()s
> create fds and which do not.
>
> 2) As far as I know (I'm not a bpf specialist, so please correct me
> if I'm wrong), with bpf we are only able to detect such events; we
> are unable to prevent them from reaching the caller. That means the
> service will know that it has run out of fds and will need to handle
> this properly. If there is a bug in this error path, the service may
> crash.
> What we would like is just a notification to an external process
> that some limit has been reached, without returning an error to the
> service itself.
>
> 3) Theoretically we could do this using bpf or syscall auditing and
> count fds for each userspace process, or check /proc/<PID> after
> each notification, but that gets very heavy for a production
> environment.

We simply can't design the kernel to accommodate band-aid workarounds
for grossly misbehaving applications. If you can find something which
can solve the problem using wider-scope tools like bpf, seccomp, and
whatnot, great. If not, too bad, but we can't burden everyone else
with workarounds for the extremely specific and contrived issues that
you're seeing.

Thanks.

--
tejun