2021-12-20 21:11:40

by Borislav Petkov

Subject: Re: [PATCH 1/2] x86/sgx: Add accounting for tracking overcommit

Bah, that thread is not on lkml. Please always Cc lkml in the future.

On Mon, Dec 20, 2021 at 12:39:19PM -0800, Kristen Carlson Accardi wrote:
> If a malicious or just extra-large enclave is loaded, or even just a
> lot of enclaves, it can eat up all the normal RAM on the system.
> Normal methods of finding out where all the memory on the system is
> being used wouldn't be able to find this usage since it is shared
> memory. In addition, the OOM killer wouldn't be able to kill any
> enclaves.

So you need some sort of limiting against malicious enclaves anyway,
regardless of this knob. IOW, you could set a percentage limit on
per-enclave memory, capped at an amount that won't prevent the system
from operating normally. For example.

> I completely agree - so I'm trying to make sure I understand this
> comment, as the value is currently set to a default that is applied
> automatically and is based on EPC memory present, not a fixed value.
> So I'd like to understand what you'd like to see done differently.
> Are you saying you'd like to eliminate the ability to override the
> automatic default? Or just that you'd rather calculate the percentage
> based on total normal system RAM rather than EPC memory?

So you say that there are cases where swapping to normal RAM can eat
up all RAM.

So the first heuristic should be: do not allow for *all* enclaves
together to use up more than X percent of normal RAM during EPC reclaim.

X percent being, say, 90% of all RAM. For example. I guess 10% should
be enough for normal operation but someone who's more knowledgeable in
system limits could chime in here.

Then, define a per-enclave limit which says an enclave can use Y% of
memory for swapping when trying to reclaim EPC memory. And that can be
something like:

90 % RAM
--------
total amount of enclaves currently on the system
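
As a quick numeric sketch of the heuristic above (illustrative only;
the 90% figure and the even split across enclaves are just the example
from this mail, not anything the kernel implements):

```python
def per_enclave_backing_limit(total_ram_pages, num_enclaves, global_pct=90):
    """Illustrative version of the heuristic above: all enclaves
    together may use global_pct% of RAM as EPC-reclaim backing,
    split evenly across the enclaves currently on the system."""
    global_limit = total_ram_pages * global_pct // 100
    return global_limit // max(num_enclaves, 1)

# e.g. 16 GiB of RAM in 4 KiB pages, 8 enclaves on the system:
pages = 16 * 1024 * 1024 // 4  # 4194304 pages total
print(per_enclave_backing_limit(pages, 8))  # 471859 pages per enclave
```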

And you can obviously create scenarios where creating too many enclaves
can bring the system into a situation where it doesn't do any forward
progress.

But you can probably cause the same with overcommitting VMs, so
perhaps it would be a good idea to look at how VM overcommitting and
its limits are handled.

Bottom line is: the logic should make the most common cases function
properly, out of the box, without knobs. And then it should keep the
system operational by preventing enclaves from grinding it to a halt
just by doing EPC reclaim.

Does that make more sense?

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette


2021-12-20 21:35:08

by Kristen Carlson Accardi

Subject: Re: [PATCH 1/2] x86/sgx: Add accounting for tracking overcommit

On Mon, 2021-12-20 at 22:11 +0100, Borislav Petkov wrote:
> Bah, that thread is not on lkml. Please always Cc lkml in the future.
>
> On Mon, Dec 20, 2021 at 12:39:19PM -0800, Kristen Carlson Accardi
> wrote:
> > If a malicious or just extra-large enclave is loaded, or even just
> > a lot of enclaves, it can eat up all the normal RAM on the system.
> > Normal methods of finding out where all the memory on the system is
> > being used wouldn't be able to find this usage since it is shared
> > memory. In addition, the OOM killer wouldn't be able to kill any
> > enclaves.
>
> So you need some sort of limiting against malicious enclaves anyway,
> regardless of this knob. IOW, you could set a percentage limit on
> per-enclave memory, capped at an amount that won't prevent the system
> from operating normally. For example.
>
> > I completely agree - so I'm trying to make sure I understand this
> > comment, as the value is currently set to a default that is applied
> > automatically and is based on EPC memory present, not a fixed value.
> > So I'd like to understand what you'd like to see done differently.
> > Are you saying you'd like to eliminate the ability to override the
> > automatic default? Or just that you'd rather calculate the
> > percentage based on total normal system RAM rather than EPC memory?
>
> So you say that there are cases where swapping to normal RAM can eat
> up all RAM.
>
> So the first heuristic should be: do not allow for *all* enclaves
> together to use up more than X percent of normal RAM during EPC
> reclaim.

So, in your proposal, you would first change the calculated number of
maximum available backing pages to be based on total system RAM rather
than EPC memory, got it.

>
> X percent being, say, 90% of all RAM. For example. I guess 10%
> should be enough for normal operation but someone who's more
> knowledgeable in system limits could chime in here.
>
> Then, define a per-enclave limit which says an enclave can use Y%
> of memory for swapping when trying to reclaim EPC memory. And that
> can be something like:
>
> 90 % RAM
> --------
> total amount of enclaves currently on the system
>

This would require recalculating the max number of allowed backing
pages per enclave at runtime whenever a new enclave is loaded - but
all the existing enclaves may have already used more than the new max
number of per-enclave allowable pages. How would you handle that
scenario? This would add a lot of complexity for sure - and it does
make me wonder whether any additional benefit of per-enclave limiting
would be worth it.
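
To put hypothetical numbers on that scenario (nothing here reflects
the actual patch): with a fixed pool split evenly across enclaves,
loading one more enclave shrinks everyone's cap, and enclaves that fit
under the old cap can instantly be over the new one:

```python
POOL = 90  # percent of RAM available as backing, per the example above

def per_enclave_cap(num_enclaves):
    # even split of the global pool across current enclaves
    return POOL / num_enclaves

# three enclaves, each already using 28% of RAM as backing
usage = [28.0, 28.0, 28.0]
old_cap = per_enclave_cap(len(usage))      # 30.0 -- everyone fits
new_cap = per_enclave_cap(len(usage) + 1)  # 22.5 -- a 4th enclave loads
over = [u for u in usage if u > new_cap]
print(old_cap, new_cap, len(over))  # all three now exceed the new cap
```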

> And you can obviously create scenarios where creating too many
> enclaves
> can bring the system into a situation where it doesn't do any forward
> progress.
>
> But you probably can cause the same with overcommitting with VMs so
> perhaps it would be a good idea to look how overcommitting VMs and
> limits there are handled.
>
> Bottom line is: the logic should make the most common cases function
> properly, out of the box, without knobs. And then it should keep the
> system operational by preventing enclaves from grinding it to a halt
> just by doing EPC reclaim.
>
> Does that make more sense?
>

Thanks for your more detailed explanation - I will take a look at the
VM overcommit limits. Since obviously the original implementation did
have a default value set, I still had a remaining specific question
about your comments. Are you suggesting that there should not be a way
to override any overcommit limit at all? So drop the parameter
altogether?


2021-12-20 22:48:25

by Borislav Petkov

Subject: Re: [PATCH 1/2] x86/sgx: Add accounting for tracking overcommit

On Mon, Dec 20, 2021 at 01:35:03PM -0800, Kristen Carlson Accardi wrote:
> So, in your proposal you would first change the calculated number of
> maximum available backing pages to be based on total system RAM rather
> than EPC memory, got it.

This was just an example. My point is to try to make it automatic and
not introduce another knob. And to decide on the limits and behavior
by using common sense and addressing the common use cases first.

> This would require recalculating the max number of allowed backing
> pages per enclave at runtime whenever a new enclave is loaded - but
> all the existing enclaves may have already used more than the new max
> number of per-enclave allowable pages. How would you handle that
> scenario? This would add a lot of complexity for sure - and it does
> make me wonder whether any additional benefit of per-enclave limiting
> would be worth it.

See above - this was just an example. And as you've shown, an example of
what *not* to do.

> Thanks for your more detailed explanation - I will take a look at the
> VM overcommit limits. Since obviously the original implementation did
> have a default value set, I still had a remaining specific question
> about your comments. Are you suggesting that there should not be a way
> to override any overcommit limit at all? So drop the parameter
> altogether?

So let me ask you a counter-question:

Imagine you're a sysadmin. Or a general, common system user if there
ever is one.

When your system starts thrashing and you're running a bunch of
enclaves, how do you find out there even is a knob which might
potentially help you?

And after you find it, how would you use that knob?

Or would you rather prefer that the system did the right thing for you
instead of having to figure out what the sensible values for that knob
would be?

My point is: put yourself in the user's shoes and try to think about
what would be the optimal thing the system should do.

Once that is cleanly and properly defined, then we can deal with knobs
and policies etc.

I sincerely hope that makes more sense.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2021-12-21 15:53:50

by Dave Hansen

Subject: Re: [PATCH 1/2] x86/sgx: Add accounting for tracking overcommit

On 12/20/21 2:48 PM, Borislav Petkov wrote:
> On Mon, Dec 20, 2021 at 01:35:03PM -0800, Kristen Carlson Accardi wrote:
>> So, in your proposal you would first change the calculated number of
>> maximum available backing pages to be based on total system RAM rather
>> than EPC memory, got it.
>
> This was just an example. My point is to try to make it automatic and
> not introduce another knob. And to decide on the limits and behavior
> by using common sense and addressing the common use cases first.

The common case is clearly a few enclaves on systems where the
overcommit levels are modest. The "100%" limit will work great there.

I'd personally be fine with just enforcing that limit as the default
and ignoring everything else. It makes me a bit nervous, though,
because suddenly enforcing a limit is an ABI break. The tunable
effectively gives us a way out if we either screw up the limit's
quantity or someone needs the old ABI back.

That said, we don't *need* it to be tunable, boot parameter or not. If
you're concerned about the tunable, I think we should drop it and not
look back.

> Imagine you're a sysadmin. Or a general, common system user if there
> ever is one.
>
> When your system starts thrashing and you're running a bunch of
> enclaves, how do you find out there even is a knob which might
> potentially help you?

I'm selfish. The tunable isn't for end users. It's for me.

The scenario I envisioned is that a user upgrades to a new kernel and
their enclaves start crashing. They'll eventually find us, the
maintainers of the SGX code, and we'll have a tool as kernel developers
to help them. The tunable helps _me_ in two ways:
1. It helps me easily get users back to pre-5.17 (or whatever) behavior
2. If we got the "100%" value wrong, end users can help us experiment
to find a better value.

BTW, all the chat about "malicious" enclaves and so forth... I
*totally* agree that this is a problem and one worth solving. It just
can't be solved today. We need real cgroup support. It's coming soon.

2021-12-22 14:21:52

by Dave Hansen

Subject: Re: [PATCH 1/2] x86/sgx: Add accounting for tracking overcommit

On 12/20/21 1:35 PM, Kristen Carlson Accardi wrote:
> Thanks for your more detailed explanation - I will take a look at the
> VM overcommit limits. Since obviously the original implementation did
> have a default value set, I still had a remaining specific question
> about your comments. Are you suggesting that there should not be a way
> to override any overcommit limit at all? So drop the parameter
> altogether?

Yes, let's kill the exposed tunable.

We don't have any strong, practical need for it at the moment other than
paranoia.