Currently, the drop_caches proc file and sysctl read back the last
value written, suggesting this is somehow a stateful setting instead
of a one-time command. Make it write-only, like e.g. compact_memory.
Signed-off-by: Johannes Weiner <[email protected]>
---
kernel/sysctl.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 31ece1120aa4..50373984a5e2 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1474,7 +1474,7 @@ static struct ctl_table vm_table[] = {
.procname = "drop_caches",
.data = &sysctl_drop_caches,
.maxlen = sizeof(int),
- .mode = 0644,
+ .mode = 0200,
.proc_handler = drop_caches_sysctl_handler,
.extra1 = SYSCTL_ONE,
.extra2 = &four,
--
2.23.0
On Thu, 31 Oct 2019 18:16:02 -0400 Johannes Weiner <[email protected]> wrote:
> Currently, the drop_caches proc file and sysctl read back the last
> value written, suggesting this is somehow a stateful setting instead
> of a one-time command. Make it write-only, like e.g. compact_memory.
>
> ...
>
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1474,7 +1474,7 @@ static struct ctl_table vm_table[] = {
> .procname = "drop_caches",
> .data = &sysctl_drop_caches,
> .maxlen = sizeof(int),
> - .mode = 0644,
> + .mode = 0200,
> .proc_handler = drop_caches_sysctl_handler,
> .extra1 = SYSCTL_ONE,
> .extra2 = &four,
hm.
Risk: some (odd) userspace code will break. Fixable by manually chmodding
it back again.
Reward: very little.
Is the reward worth the risk?
Johannes Weiner writes:
>Currently, the drop_caches proc file and sysctl read back the last
>value written, suggesting this is somehow a stateful setting instead
>of a one-time command. Make it write-only, like e.g. compact_memory.
>
>Signed-off-by: Johannes Weiner <[email protected]>
Since we already have the kmsg notifier when it's used, this seems reasonable
to me.
Acked-by: Chris Down <[email protected]>
Chris Down writes:
>There is evidence that this has already caused confusion[0] for many,
>judging by the number of views and votes. I think the reward is higher
>than stated here, since it makes the intent and lack of persistent API
>in the API clearer, and less likely to cause confusion in future.
>
>0: https://unix.stackexchange.com/q/17936/10762
s/persistent API/persistent state/
Hm, not sure why my client didn't show this reply.
Andrew Morton writes:
>Risk: some (odd) userspace code will break. Fixable by manually chmodding
>it back again.
The only scenario I can construct in my head is that someone has built
something to watch drop_caches for modification, but we already have the kmsg
output for that.
>Reward: very little.
>
>Is the reward worth the risk?
There is evidence that this has already caused confusion[0] for many, judging
by the number of views and votes. I think the reward is higher than stated
here, since it makes the intent and lack of persistent API in the API clearer,
and less likely to cause confusion in future.
0: https://unix.stackexchange.com/q/17936/10762
On Fri, Nov 01, 2019 at 11:09:01AM +0000, Chris Down wrote:
> Hm, not sure why my client didn't show this reply.
>
> Andrew Morton writes:
> > Risk: some (odd) userspace code will break. Fixable by manually chmodding
> > it back again.
>
> The only scenario I can construct in my head is that someone has built
> something to watch drop_caches for modification, but we already have the
> kmsg output for that.
>
> > Reward: very little.
> >
> > Is the reward worth the risk?
>
> There is evidence that this has already caused confusion[0] for many,
> judging by the number of views and votes. I think the reward is higher than
> stated here, since it makes the intent and lack of persistent API in the API
> clearer, and less likely to cause confusion in future.
>
> 0: https://unix.stackexchange.com/q/17936/10762
Yes, I should have mentioned this in the changelog, but:
While mitigating a VM problem at scale in our fleet, there was
confusion about whether writing to this file will permanently switch
the kernel into a non-caching mode. This influences the decision
making in a tense situation, where tens of people are trying to fix
tens of thousands of affected machines: Do we need a rollback
strategy? What are the performance implications of operating in a
non-caching state for several days? It also caused confusion when the
kernel team said we may need to write the file several times to make
sure it's effective ("But it already reads back 3?").
On Fri, 1 Nov 2019 10:45:40 -0400 Johannes Weiner <[email protected]> wrote:
> On Fri, Nov 01, 2019 at 11:09:01AM +0000, Chris Down wrote:
> > Hm, not sure why my client didn't show this reply.
> >
> > Andrew Morton writes:
> > > Risk: some (odd) userspace code will break. Fixable by manually chmodding
> > > it back again.
> >
> > The only scenario I can construct in my head is that someone has built
> > something to watch drop_caches for modification, but we already have the
> > kmsg output for that.
The scenario is that something opens /proc/sys/vm/drop_caches for
reading, gets unexpected EPERM and blows up?
> > > Reward: very little.
> > >
> > > Is the reward worth the risk?
> >
> > There is evidence that this has already caused confusion[0] for many,
> > judging by the number of views and votes. I think the reward is higher than
> > stated here, since it makes the intent and lack of persistent API in the API
> > clearer, and less likely to cause confusion in future.
> >
> > 0: https://unix.stackexchange.com/q/17936/10762
>
> Yes, I should have mentioned this in the changelog, but:
>
> While mitigating a VM problem at scale in our fleet, there was
> confusion about whether writing to this file will permanently switch
> the kernel into a non-caching mode. This influences the decision
> making in a tense situation, where tens of people are trying to fix
> tens of thousands of affected machines: Do we need a rollback
> strategy? What are the performance implications of operating in a
> non-caching state for several days? It also caused confusion when the
> kernel team said we may need to write the file several times to make
> sure it's effective ("But it already reads back 3?").
OK. What if we make reads always return "0"? That will fix the
misleading output and is more backwards-compatible?
On Fri, 1 Nov 2019 19:24:05 +0000 Chris Down <[email protected]> wrote:
> Andrew Morton writes:
> >> > The only scenario I can construct in my head is that someone has built
> >> > something to watch drop_caches for modification, but we already have the
> >> > kmsg output for that.
> >
> >The scenario is that something opens /proc/sys/vm/drop_caches for
> >reading, gets unexpected EPERM and blows up?
>
> Right, but...
>
> >OK. What if we make reads always return "0"? That will fix the
> >misleading output and is more backwards-compatible?
>
> ...I'm not convinced that if an application has no error boundary for that
> EPERM that it can tolerate a change in behaviour, either. I mean, if it's
> opening it at all, presumably it intends to do *something* based on the value
> (regardless of import or lack thereof). It may do nothing, but it's not
> possible to know whether that's better or worse than blowing up.
>
> I have mixed feelings on this one. Pragmatically, as someone who programs in
> userspace, I'd like failures based on changes in infrastructure to be loud, not
> silent. If I'm doing something which doesn't work, I'd like to know about it.
> Of course, one can make the argument that as a user of such an application,
> sometimes you don't have that luxury.
>
> Either change is an upgrade from the current situation, at least. I prefer
> towards whatever makes the API the least confusing, which appears to be
> Johannes' original change, but I'd support a patch which always set it to
> 0 instead if it was deemed safer.
On the other hand.. As I mentioned earlier, if someone's code is
failing because of the permissions change, they can chmod
/proc/sys/vm/drop_caches at boot time and be happy. They have no such
workaround if their software misbehaves due to a read always returning
"0".
On Fri, 1 Nov 2019 12:29:20 -0700 Andrew Morton <[email protected]> wrote:
> > Either change is an upgrade from the current situation, at least. I prefer
> > towards whatever makes the API the least confusing, which appears to be
> > Johannes' original change, but I'd support a patch which always set it to
> > 0 instead if it was deemed safer.
>
> On the other hand.. As I mentioned earlier, if someone's code is
> failing because of the permissions change, they can chmod
> /proc/sys/vm/drop_caches at boot time and be happy. They have no such
> workaround if their software misbehaves due to a read always returning
> "0".
I lied. I can chmod things in /proc but I can't chmod things in
/proc/sys/vm. Huh, why did we do that?
Andrew Morton writes:
>> > The only scenario I can construct in my head is that someone has built
>> > something to watch drop_caches for modification, but we already have the
>> > kmsg output for that.
>
>The scenario is that something opens /proc/sys/vm/drop_caches for
>reading, gets unexpected EPERM and blows up?
Right, but...
>OK. What if we make reads always return "0"? That will fix the
>misleading output and is more backwards-compatible?
...I'm not convinced that if an application has no error boundary for that
EPERM that it can tolerate a change in behaviour, either. I mean, if it's
opening it at all, presumably it intends to do *something* based on the value
(regardless of import or lack thereof). It may do nothing, but it's not
possible to know whether that's better or worse than blowing up.
I have mixed feelings on this one. Pragmatically, as someone who programs in
userspace, I'd like failures based on changes in infrastructure to be loud, not
silent. If I'm doing something which doesn't work, I'd like to know about it.
Of course, one can make the argument that as a user of such an application,
sometimes you don't have that luxury.
Either change is an upgrade from the current situation, at least. I prefer
towards whatever makes the API the least confusing, which appears to be
Johannes' original change, but I'd support a patch which always set it to
0 instead if it was deemed safer.
On Fri, Nov 01, 2019 at 12:35:44PM -0700, Andrew Morton wrote:
> On Fri, 1 Nov 2019 12:29:20 -0700 Andrew Morton <[email protected]> wrote:
>
> > > Either change is an upgrade from the current situation, at least. I prefer
> > > towards whatever makes the API the least confusing, which appears to be
> > > Johannes' original change, but I'd support a patch which always set it to
> > > 0 instead if it was deemed safer.
> >
> > On the other hand.. As I mentioned earlier, if someone's code is
> > failing because of the permissions change, they can chmod
> > /proc/sys/vm/drop_caches at boot time and be happy. They have no such
> > workaround if their software misbehaves due to a read always returning
> > "0".
>
> I lied. I can chmod things in /proc but I can't chmod things in
> /proc/sys/vm. Huh, why did we do that?
To conserve memory! It was in 2007.
For the record I support 0200 on vm.drop_caches.
commit 77b14db502cb85a031fe8fde6c85d52f3e0acb63
[PATCH] sysctl: reimplement the sysctl proc support
+static int proc_sys_setattr(struct dentry *dentry, struct iattr *attr)
+{
+ struct inode *inode = dentry->d_inode;
+ int error;
+
+ if (attr->ia_valid & (ATTR_MODE | ATTR_UID | ATTR_GID))
+ return -EPERM;
Alexey Dobriyan <[email protected]> writes:
> On Fri, Nov 01, 2019 at 12:35:44PM -0700, Andrew Morton wrote:
>> On Fri, 1 Nov 2019 12:29:20 -0700 Andrew Morton <[email protected]> wrote:
>>
>> > > Either change is an upgrade from the current situation, at least. I prefer
>> > > towards whatever makes the API the least confusing, which appears to be
>> > > Johannes' original change, but I'd support a patch which always set it to
>> > > 0 instead if it was deemed safer.
>> >
>> > On the other hand.. As I mentioned earlier, if someone's code is
>> > failing because of the permissions change, they can chmod
>> > /proc/sys/vm/drop_caches at boot time and be happy. They have no such
>> > workaround if their software misbehaves due to a read always returning
>> > "0".
>>
>> I lied. I can chmod things in /proc but I can't chmod things in
>> /proc/sys/vm. Huh, why did we do that?
>
> To conserve memory! It was in 2007.
> For the record I support 0200 on vm.drop_caches.
>
> commit 77b14db502cb85a031fe8fde6c85d52f3e0acb63
> [PATCH] sysctl: reimplement the sysctl proc support
>
> +static int proc_sys_setattr(struct dentry *dentry, struct iattr *attr)
> +{
> + struct inode *inode = dentry->d_inode;
> + int error;
> +
> + if (attr->ia_valid & (ATTR_MODE | ATTR_UID | ATTR_GID))
> + return -EPERM;
Almost.
The rewrite was both to concerve memory and to support the network
namespace. Which required a different view of proc files.
But in this case we have always unconditionally called sysctl_perm. The
change above at best removed a layer of obfuscation that made it look
like some other permission check was being honored.
Eric
On 31.10.19 23:16, Johannes Weiner wrote:
> Currently, the drop_caches proc file and sysctl read back the last
> value written, suggesting this is somehow a stateful setting instead
> of a one-time command. Make it write-only, like e.g. compact_memory.
>
> Signed-off-by: Johannes Weiner <[email protected]>
> ---
> kernel/sysctl.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 31ece1120aa4..50373984a5e2 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1474,7 +1474,7 @@ static struct ctl_table vm_table[] = {
> .procname = "drop_caches",
> .data = &sysctl_drop_caches,
> .maxlen = sizeof(int),
> - .mode = 0644,
> + .mode = 0200,
> .proc_handler = drop_caches_sysctl_handler,
> .extra1 = SYSCTL_ONE,
> .extra2 = &four,
>
Makes perfect sense to me (and we might notice while in next/master if
this breaks something, hopefully)
Acked-by: David Hildenbrand <[email protected]>
--
Thanks,
David / dhildenb
On 10/31/19 11:16 PM, Johannes Weiner wrote:
> Currently, the drop_caches proc file and sysctl read back the last
> value written, suggesting this is somehow a stateful setting instead
> of a one-time command. Make it write-only, like e.g. compact_memory.
>
> Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Generic tools such as "sysctl -a" will be fine as they already are fine
with compact_memory being 0200, and if somebody has a specific script,
they are maybe already making wrong decisions, so at least hopefully
they will learn and adjust.
> ---
> kernel/sysctl.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 31ece1120aa4..50373984a5e2 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1474,7 +1474,7 @@ static struct ctl_table vm_table[] = {
> .procname = "drop_caches",
> .data = &sysctl_drop_caches,
> .maxlen = sizeof(int),
> - .mode = 0644,
> + .mode = 0200,
> .proc_handler = drop_caches_sysctl_handler,
> .extra1 = SYSCTL_ONE,
> .extra2 = &four,
>
On Thu 31-10-19 18:16:02, Johannes Weiner wrote:
> Currently, the drop_caches proc file and sysctl read back the last
> value written, suggesting this is somehow a stateful setting instead
> of a one-time command. Make it write-only, like e.g. compact_memory.
>
> Signed-off-by: Johannes Weiner <[email protected]>
Please add a note about the confusion this has caused already.
The link posted by Chris would be useful as well.
Acked-by: Michal Hocko <[email protected]>
> ---
> kernel/sysctl.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 31ece1120aa4..50373984a5e2 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1474,7 +1474,7 @@ static struct ctl_table vm_table[] = {
> .procname = "drop_caches",
> .data = &sysctl_drop_caches,
> .maxlen = sizeof(int),
> - .mode = 0644,
> + .mode = 0200,
> .proc_handler = drop_caches_sysctl_handler,
> .extra1 = SYSCTL_ONE,
> .extra2 = &four,
> --
> 2.23.0
--
Michal Hocko
SUSE Labs