2000-11-13 21:36:56

by Szabolcs Szakacsits

Subject: [PATCH] Re: reliability of linux-vm subsystem


On Mon, Nov 13, 2000 Erik Mouw wrote:
> On Mon, Nov 13, 2000 at 05:29:48PM +0530, [email protected] wrote:
> > System becomes useless till all of the instance of this programming are
> > killed by vmm.
> Good, so the OOM killer works.

But it doesn't work for this kind of application misbehaviour (or
user attack):

main() { while(1) if (fork()) malloc(1); }

or using IPC shared memory (code by Michal Zalewski)

int i,d=1; char*x; main(){ while(1){ x=shmat(shmget(0,10000000/d,511),0,0);
if(x==-1){ d*=10; continue; } for(i=0;i<10000000/d;i++) if(*(x+i)); } }

Linux 2.[24] "deadlocks" (without quotas). BTW, apparently FreeBSD, OpenBSD and
SCO also become unusable, while e.g. Solaris and Tru64 survive (root can
clean up) in both non-overcommit and overcommit mode (no user quotas in
any case).

With the patch below [tried only with 2.2.18pre21, but it's easy to port to
2.4 and should apply to any late 2.2 kernel] Linux should also survive in
both cases without any performance loss (well, with 1.66% extra swap added,
thrashing would start at about the same time as with the original kernel).

> Sounds quite normal to me. If you don't enforce process limits, you
> allow a normal user to thrash the system.

Home users don't set quotas for themselves, so they must hit the reset
button. Is this really the most the kernel can do? Also, many enterprises
expect that the OS won't deadlock when an application misbehaves, so that
they don't have to care about quota setup and can keep the good performance.
This shortcoming^Wfeature of the kernel is one of the reasons Linux is still
considered a toy or hobby OS by many ....

Szaka

PS: The reserved system memory protection could be much better, but I'm
pessimistic about Linux kernel developers caring about this kind of user
issue [it's a longstanding, persistent problem that nobody has ever tried
to solve].

diff -urw linux-2.2.18pre21/include/linux/sysctl.h linux/include/linux/sysctl.h
--- linux-2.2.18pre21/include/linux/sysctl.h Thu Nov 9 08:20:19 2000
+++ linux/include/linux/sysctl.h Thu Nov 9 06:30:11 2000
@@ -122,7 +122,8 @@
VM_PAGECACHE=7, /* struct: Set cache memory thresholds */
VM_PAGERDAEMON=8, /* struct: Control kswapd behaviour */
VM_PGT_CACHE=9, /* struct: Set page table cache parameters */
- VM_PAGE_CLUSTER=10 /* int: set number of pages to swap together */
+ VM_PAGE_CLUSTER=10, /* int: set number of pages to swap together */
+ VM_RESERVED=11 /* int: number of pages reserved for root */
};


diff -urw linux-2.2.18pre21/ipc/shm.c linux/ipc/shm.c
--- linux-2.2.18pre21/ipc/shm.c Wed Jun 7 17:26:44 2000
+++ linux/ipc/shm.c Mon Nov 13 03:50:51 2000
@@ -101,8 +101,8 @@
return -ENOMEM;
}

- shp->shm_pages = (ulong *) vmalloc (numpages*sizeof(ulong));
- if (!shp->shm_pages) {
+ if (!vm_enough_memory(numpages)
+ || !(shp->shm_pages = (ulong *) vmalloc(numpages*sizeof(ulong)))) {
shm_segs[id] = (struct shmid_kernel *) IPC_UNUSED;
wake_up (&shm_lock);
kfree(shp);
diff -urw linux-2.2.18pre21/kernel/sysctl.c linux/kernel/sysctl.c
--- linux-2.2.18pre21/kernel/sysctl.c Thu Nov 9 08:20:19 2000
+++ linux/kernel/sysctl.c Fri Nov 10 06:29:56 2000
@@ -32,6 +32,7 @@
#if defined(CONFIG_SYSCTL)

/* External variables not in a header file. */
+extern int vm_reserved;
extern int panic_timeout;
extern int console_loglevel, C_A_D;
extern int bdf_prm[], bdflush_min[], bdflush_max[];
@@ -249,6 +250,8 @@
&bdflush_min, &bdflush_max},
{VM_OVERCOMMIT_MEMORY, "overcommit_memory", &sysctl_overcommit_memory,
sizeof(sysctl_overcommit_memory), 0644, NULL, &proc_dointvec},
+ {VM_RESERVED, "reserved",
+ &vm_reserved, sizeof(int), 0644, NULL, &proc_dointvec},
{VM_BUFFERMEM, "buffermem",
&buffer_mem, sizeof(buffer_mem_t), 0644, NULL, &proc_dointvec},
{VM_PAGECACHE, "pagecache",
diff -urw linux-2.2.18pre21/mm/mmap.c linux/mm/mmap.c
--- linux-2.2.18pre21/mm/mmap.c Thu Nov 9 08:20:19 2000
+++ linux/mm/mmap.c Mon Nov 13 11:27:56 2000
@@ -40,6 +40,7 @@
kmem_cache_t *vm_area_cachep;

int sysctl_overcommit_memory;
+int vm_reserved;

/* Check that a process has enough memory to allocate a
* new virtual mapping.
@@ -67,6 +68,8 @@
free += nr_free_pages;
free += nr_swap_pages;
free -= (page_cache.min_percent + buffer_mem.min_percent + 2)*num_physpages/100;
+ if (vm_reserved > 0 && current->uid && free < vm_reserved)
+ return 0;
return free > pages;
}

@@ -872,6 +875,23 @@

void __init vma_init(void)
{
+ struct sysinfo i;
+
+ /*
+ * Setup default reserved VM pages for root. You can tune it
+ * via /proc/sys/vm/reserved. Default value is based on RAM size
+ * - no reserved pages if RAM is less than 8MB
+ * - 5MB should be enough on boxes w/ RAM > 100 MB
+ * - otherwise reserve 5%
+ */
+ si_meminfo(&i);
+ if (i.totalram < 8 * 1024 * 1024)
+ vm_reserved = 0;
+ else if (i.totalram > 100 * 1024 * 1024)
+ vm_reserved = 5 * 1024 * 1024 >> PAGE_SHIFT;
+ else
+ vm_reserved = (i.totalram >> PAGE_SHIFT) / 20;
+
vm_area_cachep = kmem_cache_create("vm_area_struct",
sizeof(struct vm_area_struct),
0, SLAB_HWCACHE_ALIGN,
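
For readers who want to try the reserve policy outside the kernel, here is a minimal user-space sketch of the two pieces of logic visible in the diff above: the default sizing from the vma_init() comment and the extra test added to vm_enough_memory(). It is only a model under stated assumptions. The function names, the fixed 4 KiB page size and the explicit uid parameter are illustrative and are not part of the patch.

```c
#include <stdbool.h>

/* Assumed page size for the sketch: 4 KiB, as on i386. */
#define SKETCH_PAGE_SHIFT 12

/*
 * Default reserve in pages, following the comment in the patch:
 * nothing below 8 MB of RAM, a flat 5 MB above 100 MB, otherwise 5% of RAM.
 */
unsigned long default_reserved_pages(unsigned long long total_ram_bytes)
{
    const unsigned long long mb = 1024ULL * 1024ULL;

    if (total_ram_bytes < 8 * mb)
        return 0;
    if (total_ram_bytes > 100 * mb)
        return (unsigned long)((5 * mb) >> SKETCH_PAGE_SHIFT);
    return (unsigned long)((total_ram_bytes >> SKETCH_PAGE_SHIFT) / 20);
}

/*
 * Model of the patched check: a non-root caller is refused as soon as the
 * free page count drops below the reserve; root falls through to the
 * original "free > pages" comparison.
 */
bool enough_memory(long free_pages, long wanted_pages,
                   long reserved_pages, unsigned int uid)
{
    if (reserved_pages > 0 && uid != 0 && free_pages < reserved_pages)
        return false;
    return free_pages > wanted_pages;
}
```

With these assumptions a 256 MB machine reserves 1280 pages (5 MB), and a request from uid 1000 is refused once fewer than 1280 pages are free, while the same request from uid 0 still succeeds.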


2000-11-14 00:25:07

by Erik Mouw

Subject: Re: [PATCH] Re: reliability of linux-vm subsystem

On Mon, Nov 13, 2000 at 10:50:05PM +0100, Szabolcs Szakacsits wrote:
> On Mon, Nov 13, 2000 Erik Mouw wrote:
> > Good, so the OOM killer works.
>
> But it doesn't work for this kind of application misbehaviours (or
> user attacks):
>
> main() { while(1) if (fork()) malloc(1); }

Proper process limits stop the fork bomb.

> or using IPC shared memory (code by Michal Zalewski)
>
> int i,d=1; char*x; main(){ while(1){ x=shmat(shmget(0,10000000/d,511),0,0);
> if(x==-1){ d*=10; continue; } for(i=0;i<10000000/d;i++) if(*(x+i)); } }

I don't remember whether this has already been fixed.

> Linux 2.[24] "deadlocks" (without quotas). BTW, apparently FreeBSD, OpenBSD,
> SCO also become unusable while e.g. Solaris and Tru64 survives (root can
> clean up) both in non-overcommit and overcommit mode (no user quotas in
> any case).
>
> With the patch below [tried only with 2.2.18pre21 but it's easy to port to
> 2.4 and should apply to any late 2.2 kernels] Linux should also survive in
> both cases without any performance loss (well, thrashing would start about
> the same time by adding 1.66% extra swap as the original one).

Looks like a nice feature to me. Any VM guru that cares to comment?

> > Sounds quite normal to me. If you don't enforce process limits, you
> > allow a normal user to thrash the system.
>
> Home users don't quote themself so they must hit the reset button. Really
> is this the maximum that the kernel can do? Also many enterprises expect
> the OS won't deadlock in case of application misbehaviours so they don't
> have to care about quota setup and can keep the good performance. This
> shortcoming^Wfeature of the kernel is one of the reasons Linux is still
> considered a toy or hobby OS by many ....

This is a mechanism vs. policy issue. The kernel hands you enough
mechanisms (well, except your patch) to handle misbehaving users. It is
up to the sysadmin to enforce the policy. For the home user it means
that the distribution providers have to set decent limits, for
enterprises it means that they have to hire a sysadmin.


Erik

--
J.A.K. (Erik) Mouw, Information and Communication Theory Group, Department
of Electrical Engineering, Faculty of Information Technology and Systems,
Delft University of Technology, PO BOX 5031, 2600 GA Delft, The Netherlands
Phone: +31-15-2783635 Fax: +31-15-2781843 Email: [email protected]
WWW: http://www-ict.its.tudelft.nl/~erik/

2000-11-14 01:12:47

by Michael Peddemors

Subject: Re: [PATCH] Re: reliability of linux-vm subsystem

> up to the sysadmin to enforce the policy. For the home user it means
> that the distribution providers have to set decent limits,

What is decent today may not be with tomorrow's newest software....

> for enterprises it means that they have to hire a sysadmin.

That is one of the reasons that small businesses are afraid to go to Linux
now, because of the difficulty in finding skilled Linux sysadmins..

"At least with the 'XX' OS, all they need to do is hire someone that can
click buttons, either on the computer, or to the tech support line" is the
perception, and with Linux they are already worried enough that they have to
find a 'genius' to work on their systems full-time..

It would be nice if 'advanced administration' can be kept to the minimum, so
we can service MORE than one enterprise each :>

--------------------------------------------------------
Michael Peddemors - Senior Consultant
Unix Administration - WebSite Hosting
Network Services - Programming
Wizard Internet Services http://www.wizard.ca
Linux Support Specialist - http://www.linuxmagic.com
--------------------------------------------------------
(604) 589-0037 Beautiful British Columbia, Canada
--------------------------------------------------------

2000-11-14 09:58:42

by Helge Hafting

Subject: Re: [PATCH] Re: reliability of linux-vm subsystem

Michael Peddemors wrote:
>
> > up to the sysadmin to enforce the policy. For the home user it means
> > that the distribution providers have to set decent limits,
>
> What is decent today may not be with tomorrow's newest software....
>
Which is why you upgrade your distribution now and then. Or have
a script setting a dynamic limit depending on available
memory & swap.
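
A limit that scales with available memory and swap could be sketched as follows, here written in C rather than as a shell script. This is only an illustration under assumptions: the helper names and the percentage parameter are hypothetical, sysinfo(2) is Linux-specific, and RLIMIT_AS is just one of several limits such a script might choose to set.

```c
#include <stdint.h>
#include <sys/resource.h>
#include <sys/sysinfo.h>

/* Share of RAM + swap that one process may map, as a whole percentage. */
uint64_t dynamic_limit_bytes(uint64_t ram_bytes, uint64_t swap_bytes,
                             unsigned int percent)
{
    uint64_t total = ram_bytes + swap_bytes;

    /* Divide first so that large totals cannot overflow. */
    return (total / 100) * percent;
}

/*
 * Read the current RAM and swap sizes and apply the result as the soft
 * address-space limit of the calling process (inherited by its children).
 * Returns 0 on success, -1 on error.
 */
int apply_dynamic_limit(unsigned int percent)
{
    struct sysinfo si;
    struct rlimit rl;
    uint64_t limit;

    if (sysinfo(&si) != 0 || getrlimit(RLIMIT_AS, &rl) != 0)
        return -1;

    limit = dynamic_limit_bytes((uint64_t)si.totalram * si.mem_unit,
                                (uint64_t)si.totalswap * si.mem_unit,
                                percent);

    /* The soft limit may not exceed the hard limit. */
    if (rl.rlim_max != RLIM_INFINITY && limit > rl.rlim_max)
        limit = rl.rlim_max;

    rl.rlim_cur = (rlim_t)limit;
    return setrlimit(RLIMIT_AS, &rl);
}
```

A login wrapper could call apply_dynamic_limit(50) before starting the user's shell, so the limit follows the machine instead of being hard-coded by the distribution.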

> > for enterprises it means that they have to hire a sysadmin.
>
> That is one of the reasons that small businesses are afraid to go to Linux
> now, because of the difficulty in finding skilled Linux sysadmins..
>
The small business should use the distribution provided limit.

> "At least with the 'XX' Os, all they need to do is hire someone that can
> click buttons, either on the computer, or to the tech support line" is the
> perception, and with Linux they are already worried enough that they have to
> find a 'genius' to work on their systems fulltime..
>
There are tech support lines for linux too, if you _pay_ for a
distribution. You pay if you need it.

> It would be nice if 'advanced administration' can be kept to the minimum, so
> we can service MORE than one enterprise each :>

Sure. My impression is that most of the advanced stuff is in the
installation and initial configuration. There is very little
regular maintenance with linux. Less than your typical GUI os anyway.
Easy installation loses its charm when you have to do it twice or more.

Helge Hafting

2000-11-14 13:15:24

by Szabolcs Szakacsits

Subject: Re: [PATCH] Re: reliability of linux-vm subsystem


On Tue, 14 Nov 2000, Erik Mouw wrote:
> On Mon, Nov 13, 2000 at 10:50:05PM +0100, Szabolcs Szakacsits wrote:
> > But it doesn't work for this kind of application misbehaviours (or
> > user attacks):
> > main() { while(1) if (fork()) malloc(1); }
> Proper process limits stop the fork bomb.

You completely missed the point. Other OSes can survive without
process limits, Linux can't. Also, a plain fork bomb hasn't really been an
issue for years; the above is a memory-exhausting fork bomb [mallocing 4096
is more effective/fast]. There is no guarantee that a user space solution
can do its job [especially if the kernel kills it], even if there are
"decent" limits; otherwise the limits would have to be so strict that the
system would be close to unusable for the user.

> > or using IPC shared memory (code by Michal Zalewski)
> > int i,d=1; char*x; main(){ while(1){ x=shmat(shmget(0,10000000/d,511),0,0);
> > if(x==-1){ d*=10; continue; } for(i=0;i<10000000/d;i++) if(*(x+i)); } }
> I don't remember if this already fixed.

It was, long ago. But process limits aren't enough, you should also set
/proc/sys/kernel/shmall. The default system-wide IPC shared memory
limit is 16 GB (2^34), but you can't use more than SHMMAX*SHMMNI
(2^25 * 2^7 = 2^32, i.e. 4 GB, on i386 by default). So the above code could
be modified to kill any small (< 4 GB RAM) Linux box if
/proc/sys/kernel/shmall isn't set.
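
The arithmetic behind the 4 GB figure can be checked with a few lines of C. The constants below simply restate the i386 defaults quoted in this message as powers of two; the SKETCH_ names and helper functions are made up for the illustration and are not kernel symbols.

```c
#include <stdint.h>

/* Defaults as quoted in the message (assumed here, not read from a kernel). */
#define SKETCH_SHMMAX (UINT64_C(1) << 25) /* bytes per segment: 32 MB   */
#define SKETCH_SHMMNI (UINT64_C(1) << 7)  /* number of segments: 128    */
#define SKETCH_SHMALL (UINT64_C(1) << 34) /* system-wide limit: 16 GB   */

/* Most shared memory reachable through the per-segment limits alone. */
uint64_t shm_reachable_bytes(uint64_t shmmax, uint64_t shmmni)
{
    return shmmax * shmmni;
}

/* The effective ceiling is the smaller of the two bounds. */
uint64_t shm_effective_limit(uint64_t shmall_bytes, uint64_t shmmax,
                             uint64_t shmmni)
{
    uint64_t reachable = shm_reachable_bytes(shmmax, shmmni);

    return reachable < shmall_bytes ? reachable : shmall_bytes;
}
```

With the defaults, the product is 2^32 bytes (4 GB), which is below the 16 GB system-wide value, so the per-segment limits are the ones that actually bind. On a box with less than 4 GB of RAM plus swap neither bound protects the system.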

> > With the patch below [tried only with 2.2.18pre21 but it's easy to port to
> > 2.4 and should apply to any late 2.2 kernels] Linux should also survive in
> > both cases without any performance loss (well, thrashing would start about
> > the same time by adding 1.66% extra swap as the original one).
> Looks like a nice feature to me. Any VM guru that cares to comment?

I'm not a VM guru but I can comment on my patch :)

- It's more like a hack not a clean solution. But IMHO a clean
solution is a 2.5 topic.

- Race on SMP. A user can acquire the pages reserved for the superuser.
An ugly and not foolproof workaround is vm_reserved *= NR_CPU

- It interferes with the OOM killer. If too many VM pages are reserved, so
that there is still free swap, then processes won't be killed; the system
just thrashes, waiting for root to clean up. Well, this can even be
useful, but it is still a potential deadlock if root can't log in for
some reason [e.g. a user used up other resources].

- I'm quite sure it's not a perfect solution. The kernel should be audited
for places where users can steal additional VM (e.g. network buffers)
and for other resources they can exhaust or block (e.g. file
descriptors; processes are OK).

- Maybe euid=0 and CAP_SYS_ADMIN should also be allowed to use
the reserved memory in case of emergency

- compile time option

- optional real-time priority for root in emergency

> This is a mechanism vs. policy issue. The kernel hands you enough
> mechanisms (well, except your patch) to handle misbehaving users. It is
> up to the sysadmin to enforce the policy. For the home user it means
> that the distribution providers have to set decent limits, for
> enterprises it means that they have to hire a sysadmin.

I have advocated much the same for the last 5 years, teaching and helping
people to set up a safe box [even though it's not my job], but you know,
after some time, when the complaints just keep growing, you really start
to feel that something is badly broken .... especially if it could be
solved easily on the kernel side, saving a lot of training and setup time
and/or money. There are still lots of areas where they could be spent [and
please don't get me wrong again, limits are *also* very important].

Szaka

2000-11-14 15:51:26

by Chris Swiedler

Subject: RE: [PATCH] Re: reliability of linux-vm subsystem

> > Good, so the OOM killer works.
>
> But it doesn't work for this kind of application misbehaviours (or
> user attacks):
>
> main() { while(1) if (fork()) malloc(1); }

This seems to be a fork() bomb, not a VM issue. The system is overwhelmed by
the forks, not by the space consumed by the allocations themselves. For
one thing, I've found that

main() { while(1) malloc(1024*1024); }

does not kill your system very quickly (if at all). Without actually writing
to the memory, it doesn't seem to be "really" allocated. Adding a memset()
will kill your system much more quickly.

chris
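
The difference between allocating memory and actually touching it can be observed safely with a bounded allocation. The sketch below is an illustration under assumptions: it relies on getrusage(2) reporting the peak resident set size in kilobytes, as Linux does, and it writes one byte per page instead of calling memset(), which has the same effect on residency. The function names are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <unistd.h>

/* Peak resident set size of this process (kilobytes on Linux), or -1. */
long peak_rss_kb(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_maxrss;
}

/*
 * Allocate `bytes`, optionally write to every page, and return how much
 * the peak resident set size grew (in KB). Returns -1 on failure.
 */
long rss_growth_kb(size_t bytes, int touch)
{
    long page = sysconf(_SC_PAGESIZE);
    long before = peak_rss_kb();
    long after;
    char *p = malloc(bytes);

    if (p == NULL || before < 0 || page <= 0) {
        free(p);
        return -1;
    }
    if (touch) {
        /* Volatile stores cannot be optimised away by the compiler. */
        volatile char *vp = p;
        for (size_t off = 0; off < bytes; off += (size_t)page)
            vp[off] = 1;
    }
    after = peak_rss_kb();
    free(p);
    return after < 0 ? -1 : after - before;
}
```

Calling rss_growth_kb() with a 32 MB size and touch set to 0 should report almost no growth, while the same size with touch set to 1 should report growth close to the full allocation. Because the measurement uses the peak value, only the first touched call in a process shows the increase.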