2021-02-04 16:39:07

by Christian König

Subject: Possible deny of service with memfd_create()

Hi Michal,

as requested in the other mail thread the following sample code gets my
test system down within seconds.

The issue is that the memory allocated for the file descriptor is not
accounted to the process allocating it, so the OOM killer picks whatever
process it thinks is good but never my small test program.

Since memfd_create() doesn't need any special permission this is a
rather nice denial of service and as far as I can see also works with a
standard Ubuntu 5.4.0-65-generic kernel.

Cheers,
Christian.

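#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>

unsigned char page[4096];

int main(void)
{
        int i, fd;

        /* Fill the page with a non-zero pattern. */
        for (i = 0; i < 4096; ++i)
                page[i] = i;

        /* Anonymous, memory-backed file; needs no special permission. */
        fd = memfd_create("test", 0);
        if (fd < 0)
                return 1;

        /* Grow the file without bound; none of it is charged to our RSS. */
        while (1)
                write(fd, page, 4096);
}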


2021-02-04 17:18:32

by Michal Hocko

Subject: Re: Possible deny of service with memfd_create()

On Thu 04-02-21 17:32:20, Christian König wrote:
> Hi Michal,
>
> as requested in the other mail thread the following sample code gets my test
> system down within seconds.
>
> The issue is that the memory allocated for the file descriptor is not
> accounted to the process allocating it, so the OOM killer picks whatever
> process it thinks is good but never my small test program.
>
> Since memfd_create() doesn't need any special permission this is a rather
> nice denial of service and as far as I can see also works with a standard
> Ubuntu 5.4.0-65-generic kernel.

Thanks for following up. This is really nasty but now that I am looking
at it more closely, this is not really different from tmpfs in general.
You are free to create files and eat the memory without being accounted
for that memory because that is not seen as your memory from the system
POV. You would have to map that memory to be part of your rss.
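
To make the accounting point concrete, here is a minimal sketch (not part
of the original mail) that compares the RSS of a process after write()-ing
into a memfd with its RSS after mapping and touching the same pages. The
helper names rss_pages() and memfd_rss_demo() are made up for illustration,
and the sketch assumes that procfs is mounted so /proc/self/statm can be
read.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Resident set size of this process in pages (2nd field of statm). */
static long rss_pages(void)
{
        long size, resident;
        FILE *f = fopen("/proc/self/statm", "r");

        if (!f)
                return -1;
        if (fscanf(f, "%ld %ld", &size, &resident) != 2)
                resident = -1;
        fclose(f);
        return resident;
}

/*
 * Hypothetical helper: write len bytes (a multiple of 4096) into a memfd,
 * then map the file and read every page. Stores the RSS growth in pages
 * after each step. Returns 0 on success, -1 on error.
 */
static int memfd_rss_demo(size_t len, long *after_write, long *after_map)
{
        static unsigned char buf[4096];
        volatile unsigned char sink = 0;
        unsigned char *map;
        size_t off;
        long base;
        int fd;

        memset(buf, 0xaa, sizeof(buf));
        fd = memfd_create("rss-demo", 0);
        if (fd < 0)
                return -1;

        base = rss_pages();
        for (off = 0; off < len; off += sizeof(buf))
                if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
                        goto fail;
        *after_write = rss_pages() - base;  /* written pages are not mapped */

        map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED)
                goto fail;
        for (off = 0; off < len; off += sizeof(buf))
                sink ^= map[off];           /* fault each page in */
        *after_map = rss_pages() - base;    /* mapped pages now count as RSS */

        munmap(map, len);
        close(fd);
        return 0;
fail:
        close(fd);
        return -1;
}
```

Calling memfd_rss_demo(8 << 20, &w, &m) from a test program should leave w
small while m grows by roughly the number of pages in the file.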

The only existing protection right now is to use the memory cgroup
controller because the tmpfs memory is accounted to the process which
faults the memory in (or writes to the file).

I am not sure there is a good way to handle this in general,
unfortunately. Shmem is just tricky (e.g. how do you deal with
leftovers after the fd is closed?). Maybe memfd_create can be more clever
and account memory to all owners of the fd but even that sounds far from
trivial from the accounting POV. It is true that tmpfs can at least
control who can write to it which is not the case for memfd but then we
hit the backward compatibility wall.
--
Michal Hocko
SUSE Labs

2021-02-05 01:58:11

by Hugh Dickins

Subject: Re: Possible deny of service with memfd_create()

On Thu, 4 Feb 2021, Michal Hocko wrote:
> On Thu 04-02-21 17:32:20, Christian Koenig wrote:
> > Hi Michal,
> >
> > as requested in the other mail thread the following sample code gets my test
> > system down within seconds.
> >
> > The issue is that the memory allocated for the file descriptor is not
> > accounted to the process allocating it, so the OOM killer picks whatever
> > process it thinks is good but never my small test program.
> >
> > Since memfd_create() doesn't need any special permission this is a rather
> > nice denial of service and as far as I can see also works with a standard
> > Ubuntu 5.4.0-65-generic kernel.
>
> Thanks for following up. This is really nasty but now that I am looking
> at it more closely, this is not really different from tmpfs in general.
> You are free to create files and eat the memory without being accounted
> for that memory because that is not seen as your memory from the system
> POV. You would have to map that memory to be part of your rss.
>
> The only existing protection right now is to use the memory cgroup
> controller because the tmpfs memory is accounted to the process which
> faults the memory in (or writes to the file).
>
> I am not sure there is a good way to handle this in general,
> unfortunately. Shmem is just tricky (e.g. how do you deal with
> leftovers after the fd is closed?). Maybe memfd_create can be more clever
> and account memory to all owners of the fd but even that sounds far from
> trivial from the accounting POV. It is true that tmpfs can at least
> control who can write to it which is not the case for memfd but then we
> hit the backward compatibility wall.

Yes, no solution satisfactory, and memcg best, but don't forget
echo 2 >/proc/sys/vm/overcommit_memory
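
For reference, a small sketch for checking which overcommit mode a machine
is currently running in before relying on this. It only reads the sysctl
through procfs; the helper name overcommit_mode() is invented here, and
changing the value still needs root, as with the echo above.

```c
#include <stdio.h>

/*
 * Returns the current vm.overcommit_memory mode:
 *   0 = heuristic overcommit, 1 = always overcommit, 2 = strict accounting,
 * or -1 if the sysctl cannot be read.
 */
static int overcommit_mode(void)
{
        int mode = -1;
        FILE *f = fopen("/proc/sys/vm/overcommit_memory", "r");

        if (!f)
                return -1;
        if (fscanf(f, "%d", &mode) != 1)
                mode = -1;
        fclose(f);
        return mode;
}
```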

Hugh

2021-02-05 07:57:53

by Christian König

Subject: Re: Possible deny of service with memfd_create()

Am 05.02.21 um 01:32 schrieb Hugh Dickins:
> On Thu, 4 Feb 2021, Michal Hocko wrote:
>> On Thu 04-02-21 17:32:20, Christian Koenig wrote:
>>> Hi Michal,
>>>
>>> as requested in the other mail thread the following sample code gets my test
>>> system down within seconds.
>>>
>>> The issue is that the memory allocated for the file descriptor is not
>>> accounted to the process allocating it, so the OOM killer picks whatever
>>> process it thinks is good but never my small test program.
>>>
>>> Since memfd_create() doesn't need any special permission this is a rather
>>> nice denial of service and as far as I can see also works with a standard
>>> Ubuntu 5.4.0-65-generic kernel.
>> Thanks for following up. This is really nasty but now that I am looking
>> at it more closely, this is not really different from tmpfs in general.
>> You are free to create files and eat the memory without being accounted
>> for that memory because that is not seen as your memory from the system
>> POV. You would have to map that memory to be part of your rss.

I mostly agree. The big difference is that tmpfs is only available when
mounted.

And tmpfs can be restricted in size per mount point as well as per user
quotas IIRC. Looking at my desktop system those restrictions are
actually exactly what I see there.

But memfd_create() is just free for all, you don't have any size limit
nor access restriction as far as I can see.
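
As an illustration of the per-mount limit, the sketch below asks the kernel
for the capacity of whatever filesystem holds a given path. Pointed at a
tmpfs mount it reports the size cap of that mount. That /dev/shm is a tmpfs
mount is an assumption about the system at hand, and fs_capacity_bytes() is
a made-up helper name. As far as I understand, a memfd is not created on
such a user-visible mount, which is why no mount-level limit applies to it.

```c
#include <sys/statvfs.h>

/*
 * Total capacity in bytes of the filesystem containing path,
 * or 0 if the path cannot be queried.
 */
static unsigned long long fs_capacity_bytes(const char *path)
{
        struct statvfs sv;

        if (statvfs(path, &sv) != 0)
                return 0;
        /* f_blocks is counted in units of f_frsize. */
        return (unsigned long long)sv.f_blocks * sv.f_frsize;
}
```

For example, fs_capacity_bytes("/dev/shm") can be compared against the
amount of RAM in the machine to see how much a single tmpfs mount may
consume.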

>> The only existing protection right now is to use the memory cgroup
>> controller because the tmpfs memory is accounted to the process which
>> faults the memory in (or writes to the file).

Agreed, but having to rely on cgroup is not really satisfying when you
have to maintain a hardened server.

>> I am not sure there is a good way to handle this in general,
>> unfortunately. Shmem is just tricky (e.g. how do you deal with
>> leftovers after the fd is closed?). Maybe memfd_create can be more clever
>> and account memory to all owners of the fd but even that sounds far from
>> trivial from the accounting POV. It is true that tmpfs can at least
>> control who can write to it which is not the case for memfd but then we
>> hit the backward compatibility wall.
> Yes, no solution satisfactory, and memcg best, but don't forget
> echo 2 >/proc/sys/vm/overcommit_memory

Good point as well.

Regards,
Christian.

>
> Hugh

2021-02-05 10:55:00

by Michal Hocko

Subject: Re: Possible deny of service with memfd_create()

On Fri 05-02-21 08:54:31, Christian König wrote:
> Am 05.02.21 um 01:32 schrieb Hugh Dickins:
> > On Thu, 4 Feb 2021, Michal Hocko wrote:
> > > On Thu 04-02-21 17:32:20, Christian Koenig wrote:
> > > > Hi Michal,
> > > >
> > > > as requested in the other mail thread the following sample code gets my test
> > > > system down within seconds.
> > > >
> > > > The issue is that the memory allocated for the file descriptor is not
> > > > accounted to the process allocating it, so the OOM killer picks whatever
> > > > process it thinks is good but never my small test program.
> > > >
> > > > Since memfd_create() doesn't need any special permission this is a rather
> > > > nice denial of service and as far as I can see also works with a standard
> > > > Ubuntu 5.4.0-65-generic kernel.
> > > Thanks for following up. This is really nasty but now that I am looking
> > > at it more closely, this is not really different from tmpfs in general.
> > > You are free to create files and eat the memory without being accounted
> > > for that memory because that is not seen as your memory from the system
> > > POV. You would have to map that memory to be part of your rss.
>
> I mostly agree. The big difference is that tmpfs is only available when
> mounted.
>
> And tmpfs can be restricted in size per mount point as well as per user
> quotas IIRC. Looking at my desktop system those restrictions are actually
> exactly what I see there.

I cannot find anything about per user quotas for tmpfs in the tmpfs man
page. Or maybe I am looking at a wrong layer and there is a generic
handling somewhere in the vfs core?

> But memfd_create() is just free for all, you don't have any size limit nor
> access restriction as far as I can see.

Yes, this is unfortunate and a design decision that should have been
considered when the syscall was introduced. But that boat sailed looong
ago; it cannot be changed now without risking userspace breakage.

> > > The only existing protection right now is to use the memory cgroup
> > > controller because the tmpfs memory is accounted to the process which
> > > faults the memory in (or writes to the file).
>
> Agreed, but having to rely on cgroup is not really satisfying when you have
> to maintain a hardened server.

Yes I do recognize the pain. The only other way to mitigate the risk is
to disallow the syscall to untrusted users in a hardened environment.
You should be very strict in tmpfs usage there already.
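
One way to disallow the syscall for a process tree is a seccomp filter
installed before untrusted code runs. This is a minimal sketch under
assumptions, not a hardened policy: deny_memfd_create() is a hypothetical
helper, EPERM is an arbitrary choice of error code, and the filter skips
the architecture check (seccomp_data.arch) that a real filter should
perform first.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: make memfd_create() fail with EPERM in the calling
 * thread and everything it later forks or execs. Returns 0 on success.
 */
static int deny_memfd_create(void)
{
        struct sock_filter filter[] = {
                /* Load the syscall number. */
                BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                         offsetof(struct seccomp_data, nr)),
                /* memfd_create: fall through to the errno return,
                 * anything else: skip one instruction to the allow. */
                BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_memfd_create, 0, 1),
                BPF_STMT(BPF_RET | BPF_K,
                         SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
                BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
        };
        struct sock_fprog prog = {
                .len = sizeof(filter) / sizeof(filter[0]),
                .filter = filter,
        };

        /* Required so an unprivileged process may install a filter. */
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
                return -1;
        return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

The filter cannot be removed once installed, so it is typically set up by
a launcher right before it executes the untrusted program.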

--
Michal Hocko
SUSE Labs

2021-02-05 12:33:45

by Michal Hocko

Subject: Re: Possible deny of service with memfd_create()

On Fri 05-02-21 11:57:09, Christian König wrote:
> Am 05.02.21 um 11:50 schrieb Michal Hocko:
> > On Fri 05-02-21 08:54:31, Christian König wrote:
> > > Am 05.02.21 um 01:32 schrieb Hugh Dickins:
> > > > On Thu, 4 Feb 2021, Michal Hocko wrote:
[...]
> > > > > The only existing protection right now is to use the memory cgroup
> > > > > controller because the tmpfs memory is accounted to the process which
> > > > > faults the memory in (or writes to the file).
> > > Agreed, but having to rely on cgroup is not really satisfying when you have
> > > to maintain a hardened server.
> > Yes I do recognize the pain. The only other way to mitigate the risk is
> > to disallow the syscall to untrusted users in a hardened environment.
> > You should be very strict in tmpfs usage there already.
> >
>
> Well it is perfectly valid for a process to use as much memory as it wants,
> the problem is that we are not holding the process accountable for it.
>
> As I said we have similar problems with GPU drivers and I think we just need
> a way to do this.
>
> Let me think about it a bit, maybe we can somehow use the file owner for
> this.

There are some land mines to watch for on the way. The most obvious one
would be to not double account a populated file with its mapping. Those
two might live in separate processes. So you would need an rmap walk just
to evaluate oom_badness. Also you need to consider files which are not
open anymore or which have been passed on to another process. And
then the question is what to do about them. Killing their owner doesn't
help anything because the file is still left behind. I do expect you
will discover more problems on the way but I definitely do not want to
discourage you from this endeavor.
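
The "file left behind" case can be reproduced in a few lines. In this
sketch (the helper name and the pipe-based handshake are choices made for
illustration only) the creator fills a memfd, forks, and closes its own
descriptor; the child still holds the file and sees the full size, so the
pages stay allocated although the creator no longer references them.

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if a child still sees all len bytes
 * (len a multiple of 4096) after the creating process closed its fd,
 * 0 if it does not, -1 on error.
 */
static int memfd_outlives_creator(size_t len)
{
        static const unsigned char chunk[4096];
        int sync_pipe[2], fd, status;
        struct stat st;
        size_t off;
        pid_t pid;
        char c;

        fd = memfd_create("handoff", 0);
        if (fd < 0)
                return -1;
        for (off = 0; off < len; off += sizeof(chunk))
                if (write(fd, chunk, sizeof(chunk)) != (ssize_t)sizeof(chunk))
                        goto fail;
        if (pipe(sync_pipe) != 0)
                goto fail;

        pid = fork();
        if (pid < 0) {
                close(sync_pipe[0]);
                close(sync_pipe[1]);
                goto fail;
        }
        if (pid == 0) {
                /* Child: wait for EOF, i.e. the parent closed everything. */
                close(sync_pipe[1]);
                if (read(sync_pipe[0], &c, 1) != 0 || fstat(fd, &st) != 0)
                        _exit(2);
                _exit(st.st_size == (off_t)len ? 0 : 1);
        }

        /* Parent (the creator): drop the memfd, then release the child. */
        close(fd);
        close(sync_pipe[0]);
        close(sync_pipe[1]);
        if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status) ||
            WEXITSTATUS(status) == 2)
                return -1;
        return WEXITSTATUS(status) == 0;
fail:
        close(fd);
        return -1;
}
```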
--
Michal Hocko
SUSE Labs

2021-02-06 00:38:05

by Christian König

Subject: Re: Possible deny of service with memfd_create()

Am 05.02.21 um 11:50 schrieb Michal Hocko:
> On Fri 05-02-21 08:54:31, Christian König wrote:
>> Am 05.02.21 um 01:32 schrieb Hugh Dickins:
>>> On Thu, 4 Feb 2021, Michal Hocko wrote:
>>>> On Thu 04-02-21 17:32:20, Christian Koenig wrote:
>>>>> Hi Michal,
>>>>>
>>>>> as requested in the other mail thread the following sample code gets my test
>>>>> system down within seconds.
>>>>>
>>>>> The issue is that the memory allocated for the file descriptor is not
>>>>> accounted to the process allocating it, so the OOM killer picks whatever
>>>>> process it thinks is good but never my small test program.
>>>>>
>>>>> Since memfd_create() doesn't need any special permission this is a rather
>>>>> nice denial of service and as far as I can see also works with a standard
>>>>> Ubuntu 5.4.0-65-generic kernel.
>>>> Thanks for following up. This is really nasty but now that I am looking
>>>> at it more closely, this is not really different from tmpfs in general.
>>>> You are free to create files and eat the memory without being accounted
>>>> for that memory because that is not seen as your memory from the system
>>>> POV. You would have to map that memory to be part of your rss.
>> I mostly agree. The big difference is that tmpfs is only available when
>> mounted.
>>
>> And tmpfs can be restricted in size per mount point as well as per user
>> quotas IIRC. Looking at my desktop system those restrictions are actually
>> exactly what I see there.
> I cannot find anything about per user quotas for tmpfs in the tmpfs man
> page. Or maybe I am looking at a wrong layer and there is a generic
> handling somewhere in the vfs core?

I think so, yes. I vaguely remember a discussion about how to implement
quotas for tmpfs, but that was a really long time ago and I didn't
follow it to the end.

>> But memfd_create() is just free for all, you don't have any size limit nor
>> access restriction as far as I can see.
> Yes, this is unfortunate and a design decision that should have been
> considered when the syscall was introduced. But that boat sailed looong
> ago; it cannot be changed now without risking userspace breakage.
>
>>>> The only existing protection right now is to use the memory cgroup
>>>> controller because the tmpfs memory is accounted to the process which
>>>> faults the memory in (or writes to the file).
>> Agreed, but having to rely on cgroup is not really satisfying when you have
>> to maintain a hardened server.
> Yes I do recognize the pain. The only other way to mitigate the risk is
> to disallow the syscall to untrusted users in a hardened environment.
> You should be very strict in tmpfs usage there already.
>

Well it is perfectly valid for a process to use as much memory as it
wants, the problem is that we are not holding the process accountable
for it.

As I said we have similar problems with GPU drivers and I think we just
need a way to do this.

Let me think about it a bit, maybe we can somehow use the file owner for
this.

Thanks,
Christian.