Message-ID: <495D9222.1060306@oracle.com> Date: Thu, 01 Jan 2009 20:03:46 -0800 From: Randy Dunlap Organization: Oracle Linux Engineering To: Peter W Morreale CC: linux-kernel@vger.kernel.org, comandante@zaralinux.com, bb@ricochet.net, Rik van Riel, linux-mm@kvack.org, Andrew Morton Subject: Re: [PATCH] Update of Documentation/ (VM sysctls) In-Reply-To: <20081231212615.12868.97088.stgit@hermosa.site> Peter W Morreale wrote: > This patch updates Documentation/sysctl/vm.txt and > Documentation/filesystems/proc.txt. More specifically, the section on > /proc/sys/vm in Documentation/filesystems/proc.txt was removed and a > link to Documentation/sysctl/vm.txt added. Hi mm people: This patch moves all vm sysctl help text from Documentation/filesystems/proc.txt to Documentation/sysctl/vm.txt. Parts of it were duplicated in those 2 files, but there were also some missing docs for (newer) sysctls. I.e., those files hadn't been updated in quite a long time. Acked-by: Randy Dunlap Thanks, Peter. > Most of the verbiage from proc.txt was simply moved into vm.txt, with new > additional text for "swappiness" and "stat_interval". > > This update reflects the current state of 2.6.27. > > It assumes that patch: http://lkml.org/lkml/2008/12/31/219 has been applied. > This is probably wrong since that patch is still being reviewed and not > officially accepted as of this patch. Not sure how to handle this at > all. Yes, this patch should be done first/regardless of your other (pending) patch. > Comments welcome. > > -PWM > --- > > Signed-off-by: Peter W Morreale > > Documentation/filesystems/proc.txt | 265 ---------------- > Documentation/sysctl/vm.txt | 604 +++++++++++++++++++++++++----------- > 2 files changed, 422 insertions(+), 447 deletions(-) > > diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt > index f566ad9..6e6afe9 100644 > --- a/Documentation/filesystems/proc.txt > +++ b/Documentation/filesystems/proc.txt > @@ -5,9 +5,11 @@ > Bodo Bauer > > 2.4.x update Jorge Nerin November 14 2000 > +2.6.x update Peter W. Morreale December 31 2008 > ------------------------------------------------------------------------------ > -Version 1.3 Kernel version 2.2.12 > +Version 1.4 Kernel version 2.2.12 > Kernel version 2.4.0-test11-pre4 > + section 2.4 update to 2.6.27 > ------------------------------------------------------------------------------ > > Table of Contents > @@ -1362,265 +1364,8 @@ auto_msgmni default value is 1.
> 2.4 /proc/sys/vm - The virtual memory subsystem > ----------------------------------------------- > > -The files in this directory can be used to tune the operation of the virtual > -memory (VM) subsystem of the Linux kernel. > - > -vfs_cache_pressure > ------------------- > - > -Controls the tendency of the kernel to reclaim the memory which is used for > -caching of directory and inode objects. > - > -At the default value of vfs_cache_pressure=100 the kernel will attempt to > -reclaim dentries and inodes at a "fair" rate with respect to pagecache and > -swapcache reclaim. Decreasing vfs_cache_pressure causes the kernel to prefer > -to retain dentry and inode caches. Increasing vfs_cache_pressure beyond 100 > -causes the kernel to prefer to reclaim dentries and inodes. > - > -dirty_background_ratio > ----------------------- > - > -Contains, as a percentage of total system memory, the number of pages at which > -the pdflush background writeback daemon will start writing out dirty data. > - > -dirty_ratio > ------------------ > - > -Contains, as a percentage of total system memory, the number of pages at which > -a process which is generating disk writes will itself start writing out dirty > -data. > - > -dirty_writeback_centisecs > -------------------------- > - > -The pdflush writeback daemons will periodically wake up and write `old' data > -out to disk. This tunable expresses the interval between those wakeups, in > -100'ths of a second. > - > -Setting this to zero disables periodic writeback altogether. > - > -dirty_expire_centisecs > ----------------------- > - > -This tunable is used to define when dirty data is old enough to be eligible > -for writeout by the pdflush daemons. It is expressed in 100'ths of a second. > -Data which has been dirty in-memory for longer than this interval will be > -written out next time a pdflush daemon wakes up. > - > -highmem_is_dirtyable > --------------------- > - > -Only present if CONFIG_HIGHMEM is set. > - > -This defaults to 0 (false), meaning that the ratios set above are calculated > -as a percentage of lowmem only. This protects against excessive scanning > -in page reclaim, swapping and general VM distress. > - > -Setting this to 1 can be useful on 32 bit machines where you want to make > -random changes within an MMAPed file that is larger than your available > -lowmem without causing large quantities of random IO. Is is safe if the > -behavior of all programs running on the machine is known and memory will > -not be otherwise stressed. > - > -legacy_va_layout > ----------------- > - > -If non-zero, this sysctl disables the new 32-bit mmap mmap layout - the kernel > -will use the legacy (2.4) layout for all processes. > - > -lowmem_reserve_ratio > ---------------------- > - > -For some specialised workloads on highmem machines it is dangerous for > -the kernel to allow process memory to be allocated from the "lowmem" > -zone. This is because that memory could then be pinned via the mlock() > -system call, or by unavailability of swapspace. > - > -And on large highmem machines this lack of reclaimable lowmem memory > -can be fatal. > - > -So the Linux page allocator has a mechanism which prevents allocations > -which _could_ use highmem from using too much lowmem. This means that > -a certain amount of lowmem is defended from the possibility of being > -captured into pinned user memory. > - > -(The same argument applies to the old 16 megabyte ISA DMA region. 
This > -mechanism will also defend that region from allocations which could use > -highmem or lowmem). > - > -The `lowmem_reserve_ratio' tunable determines how aggressive the kernel is > -in defending these lower zones. > - > -If you have a machine which uses highmem or ISA DMA and your > -applications are using mlock(), or if you are running with no swap then > -you probably should change the lowmem_reserve_ratio setting. > - > -The lowmem_reserve_ratio is an array. You can see them by reading this file. > -- > -% cat /proc/sys/vm/lowmem_reserve_ratio > -256 256 32 > -- > -Note: # of this elements is one fewer than number of zones. Because the highest > - zone's value is not necessary for following calculation. > - > -But, these values are not used directly. The kernel calculates # of protection > -pages for each zones from them. These are shown as array of protection pages > -in /proc/zoneinfo like followings. (This is an example of x86-64 box). > -Each zone has an array of protection pages like this. > - > -- > -Node 0, zone DMA > - pages free 1355 > - min 3 > - low 3 > - high 4 > - : > - : > - numa_other 0 > - protection: (0, 2004, 2004, 2004) > - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > - pagesets > - cpu: 0 pcp: 0 > - : > -- > -These protections are added to score to judge whether this zone should be used > -for page allocation or should be reclaimed. > - > -In this example, if normal pages (index=2) are required to this DMA zone and > -pages_high is used for watermark, the kernel judges this zone should not be > -used because pages_free(1355) is smaller than watermark + protection[2] > -(4 + 2004 = 2008). If this protection value is 0, this zone would be used for > -normal page requirement. If requirement is DMA zone(index=0), protection[0] > -(=0) is used. > - > -zone[i]'s protection[j] is calculated by following expression. > - > -(i < j): > - zone[i]->protection[j] > - = (total sums of present_pages from zone[i+1] to zone[j] on the node) > - / lowmem_reserve_ratio[i]; > -(i = j): > - (should not be protected. = 0; > -(i > j): > - (not necessary, but looks 0) > - > -The default values of lowmem_reserve_ratio[i] are > - 256 (if zone[i] means DMA or DMA32 zone) > - 32 (others). > -As above expression, they are reciprocal number of ratio. > -256 means 1/256. # of protection pages becomes about "0.39%" of total present > -pages of higher zones on the node. > - > -If you would like to protect more pages, smaller values are effective. > -The minimum value is 1 (1/1 -> 100%). > - > -page-cluster > ------------- > - > -page-cluster controls the number of pages which are written to swap in > -a single attempt. The swap I/O size. > - > -It is a logarithmic value - setting it to zero means "1 page", setting > -it to 1 means "2 pages", setting it to 2 means "4 pages", etc. > - > -The default value is three (eight pages at a time). There may be some > -small benefits in tuning this to a different value if your workload is > -swap-intensive. > - > -overcommit_memory > ------------------ > - > -Controls overcommit of system memory, possibly allowing processes > -to allocate (but not use) more memory than is actually available. > - > - > -0 - Heuristic overcommit handling. Obvious overcommits of > - address space are refused. Used for a typical system. It > - ensures a seriously wild allocation fails while allowing > - overcommit to reduce swap usage. root is allowed to > - allocate slightly more memory in this mode. This is the > - default. > - > -1 - Always overcommit. 
Appropriate for some scientific > - applications. > - > -2 - Don't overcommit. The total address space commit > - for the system is not permitted to exceed swap plus a > - configurable percentage (default is 50) of physical RAM. > - Depending on the percentage you use, in most situations > - this means a process will not be killed while attempting > - to use already-allocated memory but will receive errors > - on memory allocation as appropriate. > - > -overcommit_ratio > ----------------- > - > -Percentage of physical memory size to include in overcommit calculations > -(see above.) > - > -Memory allocation limit = swapspace + physmem * (overcommit_ratio / 100) > - > - swapspace = total size of all swap areas > - physmem = size of physical memory in system > - > -nr_hugepages and hugetlb_shm_group > ----------------------------------- > - > -nr_hugepages configures number of hugetlb page reserved for the system. > - > -hugetlb_shm_group contains group id that is allowed to create SysV shared > -memory segment using hugetlb page. > - > -hugepages_treat_as_movable > --------------------------- > - > -This parameter is only useful when kernelcore= is specified at boot time to > -create ZONE_MOVABLE for pages that may be reclaimed or migrated. Huge pages > -are not movable so are not normally allocated from ZONE_MOVABLE. A non-zero > -value written to hugepages_treat_as_movable allows huge pages to be allocated > -from ZONE_MOVABLE. > - > -Once enabled, the ZONE_MOVABLE is treated as an area of memory the huge > -pages pool can easily grow or shrink within. Assuming that applications are > -not running that mlock() a lot of memory, it is likely the huge pages pool > -can grow to the size of ZONE_MOVABLE by repeatedly entering the desired value > -into nr_hugepages and triggering page reclaim. > - > -laptop_mode > ------------ > - > -laptop_mode is a knob that controls "laptop mode". All the things that are > -controlled by this knob are discussed in Documentation/laptops/laptop-mode.txt. > - > -block_dump > ----------- > - > -block_dump enables block I/O debugging when set to a nonzero value. More > -information on block I/O debugging is in Documentation/laptops/laptop-mode.txt. > - > -swap_token_timeout > ------------------- > - > -This file contains valid hold time of swap out protection token. The Linux > -VM has token based thrashing control mechanism and uses the token to prevent > -unnecessary page faults in thrashing situation. The unit of the value is > -second. The value would be useful to tune thrashing behavior. > - > -drop_caches > ------------ > - > -Writing to this will cause the kernel to drop clean caches, dentries and > -inodes from memory, causing that memory to become free. > - > -To free pagecache: > - echo 1 > /proc/sys/vm/drop_caches > -To free dentries and inodes: > - echo 2 > /proc/sys/vm/drop_caches > -To free pagecache, dentries and inodes: > - echo 3 > /proc/sys/vm/drop_caches > - > -As this is a non-destructive operation and dirty objects are not freeable, the > -user should run `sync' first. > +Please refer to: Documentation/sysctl/vm.txt for a complete description > +of these controls. 
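One thought while the overcommit text is on the move: a short worked example of the commit limit might help readers of vm.txt. Something roughly like the following could go under overcommit_ratio; the numbers are purely illustrative, and CommitLimit/Committed_AS are just the /proc/meminfo fields that report the limit and the current commitment:

    With overcommit_memory = 2, 2 GB of swap, 4 GB of RAM and the
    default overcommit_ratio = 50:

        Memory allocation limit = 2 GB + 4 GB * (50 / 100) = 4 GB

    To see the current limit and commitment:
        grep -E 'CommitLimit|Committed_AS' /proc/meminfo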
> > > 2.5 /proc/sys/dev - Device specific parameters > diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt > index c2a257a..50515f9 100644 > --- a/Documentation/sysctl/vm.txt > +++ b/Documentation/sysctl/vm.txt > @@ -1,12 +1,13 @@ > -Documentation for /proc/sys/vm/* kernel version 2.2.10 > +Documentation for /proc/sys/vm/* kernel version 2.6.27 > (c) 1998, 1999, Rik van Riel > + (c) 2008 Peter W. Morreale > > For general info and legal blurb, please look in README. > > ============================================================== > > This file contains the documentation for the sysctl files in > -/proc/sys/vm and is valid for Linux kernel version 2.2. > +/proc/sys/vm and is valid for Linux kernel version 2.6.27. > > The files in this directory can be used to tune the operation > of the virtual memory (VM) subsystem of the Linux kernel and > @@ -16,109 +17,223 @@ Default values and initialization routines for most of these > files can be found in mm/swap.c. > > Currently, these files are in /proc/sys/vm: > -- overcommit_memory > -- page-cluster > -- dirty_ratio > + > +- block_dump > - dirty_background_ratio > - dirty_expire_centisecs > +- dirty_ratio > - dirty_writeback_centisecs > -- nr_pdflush_threads_min > -- nr_pdflush_threads_max > -- highmem_is_dirtyable (only if CONFIG_HIGHMEM set) > +- drop_caches > +- hugepages_treat_as_movable > +- hugetlb_shm_group > +- laptop_mode > +- legacy_va_layout > +- lowmem_reserve_ratio > - max_map_count > - min_free_kbytes > -- laptop_mode > -- block_dump > -- drop-caches > -- zone_reclaim_mode > -- min_unmapped_ratio > - min_slab_ratio > -- panic_on_oom > -- oom_dump_tasks > -- oom_kill_allocating_task > -- mmap_min_address > -- numa_zonelist_order > +- min_unmapped_ratio > +- mmap_min_addr > - nr_hugepages > - nr_overcommit_hugepages > +- nr_pdflush_threads > +- nr_pdflush_threads_max > +- nr_pdflush_threads_min > +- numa_zonelist_order > +- oom_dump_tasks > +- oom_kill_allocating_task > +- overcommit_memory > +- overcommit_ratio > +- page-cluster > +- panic_on_oom > +- percpu_pagelist_fraction > +- stat_interval > +- swappiness > +- vfs_cache_pressure > +- zone_reclaim_mode > + > + > +============================================================== > + > +block_dump > + > +block_dump enables block I/O debugging when set to a nonzero value. More > +information on block I/O debugging is in Documentation/laptops/laptop-mode.txt. > > ============================================================== > > -dirty_ratio, dirty_background_ratio, dirty_expire_centisecs, > -dirty_writeback_centisecs, highmem_is_dirtyable, > -vfs_cache_pressure, laptop_mode, block_dump, swap_token_timeout, > -drop-caches, hugepages_treat_as_movable: > +dirty_background_ratio > > -See Documentation/filesystems/proc.txt > +Contains, as a percentage of total system memory, the number of pages at which > +the pdflush background writeback daemon will start writing out dirty data. > > ============================================================== > > -nr_pdflush_threads_min > +dirty_expire_centisecs > > -This value controls the minimum number of pdflush threads. > +This tunable is used to define when dirty data is old enough to be eligible > +for writeout by the pdflush daemons. It is expressed in 100'ths of a second. > +Data which has been dirty in-memory for longer than this interval will be > +written out next time a pdflush daemon wakes up. > > -At boot time, the kernel will create and maintain 'nr_pdflush_threads_min' > -threads for the kernel's lifetime. 
> +============================================================== > > -The default value is 2. The minimum value you can specify is 1, and > -the maximum value is the current setting of 'nr_pdflush_threads_max'. > +dirty_ratio > > -See 'nr_pdflush_threads_max' below for more information. > +Contains, as a percentage of total system memory, the number of pages at which > +a process which is generating disk writes will itself start writing out dirty > +data. > > ============================================================== > > -nr_pdflush_threads_max > +dirty_writeback_centisecs > > -This value controls the maximum number of pdflush threads that can be > -created. The pdflush algorithm will create a new pdflush thread (up to > -this maximum) if no pdflush threads have been available for >= 1 second. > +The pdflush writeback daemons will periodically wake up and write `old' data > +out to disk. This tunable expresses the interval between those wakeups, in > +100'ths of a second. > > -The default value is 8. The minimum value you can specify is the > -current value of 'nr_pdflush_threads_min' and the > -maximum is 1000. > +Setting this to zero disables periodic writeback altogether. > > ============================================================== > > -overcommit_memory: > +drop_caches > > -This value contains a flag that enables memory overcommitment. > +Writing to this will cause the kernel to drop clean caches, dentries and > +inodes from memory, causing that memory to become free. > > -When this flag is 0, the kernel attempts to estimate the amount > -of free memory left when userspace requests more memory. > +To free pagecache: > + echo 1 > /proc/sys/vm/drop_caches > +To free dentries and inodes: > + echo 2 > /proc/sys/vm/drop_caches > +To free pagecache, dentries and inodes: > + echo 3 > /proc/sys/vm/drop_caches > > -When this flag is 1, the kernel pretends there is always enough > -memory until it actually runs out. > +As this is a non-destructive operation and dirty objects are not freeable, the > +user should run `sync' first. > > -When this flag is 2, the kernel uses a "never overcommit" > -policy that attempts to prevent any overcommit of memory. > +============================================================== > > -This feature can be very useful because there are a lot of > -programs that malloc() huge amounts of memory "just-in-case" > -and don't use much of it. > +hugepages_treat_as_movable > > -The default value is 0. > +This parameter is only useful when kernelcore= is specified at boot time to > +create ZONE_MOVABLE for pages that may be reclaimed or migrated. Huge pages > +are not movable so are not normally allocated from ZONE_MOVABLE. A non-zero > +value written to hugepages_treat_as_movable allows huge pages to be allocated > +from ZONE_MOVABLE. > > -See Documentation/vm/overcommit-accounting and > -security/commoncap.c::cap_vm_enough_memory() for more information. > +Once enabled, the ZONE_MOVABLE is treated as an area of memory the huge > +pages pool can easily grow or shrink within. Assuming that applications are > +not running that mlock() a lot of memory, it is likely the huge pages pool > +can grow to the size of ZONE_MOVABLE by repeatedly entering the desired value > +into nr_hugepages and triggering page reclaim. 
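A small usage snippet here could make the grow-the-pool procedure concrete. Roughly along these lines -- the page count is only illustrative, and this assumes the kernel was booted with kernelcore= so that ZONE_MOVABLE actually exists:

    echo 1 > /proc/sys/vm/hugepages_treat_as_movable
    # request 512 huge pages; repeat if reclaim needs several passes
    echo 512 > /proc/sys/vm/nr_hugepages
    grep HugePages_Total /proc/meminfo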
> > ============================================================== > > -overcommit_ratio: > +hugetlb_shm_group > > -When overcommit_memory is set to 2, the committed address > -space is not permitted to exceed swap plus this percentage > -of physical RAM. See above. > +hugetlb_shm_group contains the group id that is allowed to create SysV > +shared memory segments using hugetlb pages. > > ============================================================== > > -page-cluster: > +laptop_mode > + > +laptop_mode is a knob that controls "laptop mode". All the things that are > +controlled by this knob are discussed in Documentation/laptops/laptop-mode.txt. > > -The Linux VM subsystem avoids excessive disk seeks by reading > -multiple pages on a page fault. The number of pages it reads > -is dependent on the amount of memory in your machine. > +============================================================== > > -The number of pages the kernel reads in at once is equal to > -2 ^ page-cluster. Values above 2 ^ 5 don't make much sense > -for swap because we only cluster swap data in 32-page groups. > +legacy_va_layout > + > +If non-zero, this sysctl disables the new 32-bit mmap layout - the kernel > +will use the legacy (2.4) layout for all processes. > + > +============================================================== > + > +lowmem_reserve_ratio > + > +For some specialised workloads on highmem machines it is dangerous for > +the kernel to allow process memory to be allocated from the "lowmem" > +zone. This is because that memory could then be pinned via the mlock() > +system call, or by unavailability of swapspace. > + > +And on large highmem machines this lack of reclaimable lowmem memory > +can be fatal. > + > +So the Linux page allocator has a mechanism which prevents allocations > +which _could_ use highmem from using too much lowmem. This means that > +a certain amount of lowmem is defended from the possibility of being > +captured into pinned user memory. > + > +(The same argument applies to the old 16 megabyte ISA DMA region. This > +mechanism will also defend that region from allocations which could use > +highmem or lowmem). > + > +The `lowmem_reserve_ratio' tunable determines how aggressive the kernel is > +in defending these lower zones. > + > +If you have a machine which uses highmem or ISA DMA and your > +applications are using mlock(), or if you are running with no swap then > +you probably should change the lowmem_reserve_ratio setting. > + > +The lowmem_reserve_ratio is an array. You can see them by reading this file. > +- > +% cat /proc/sys/vm/lowmem_reserve_ratio > +256 256 32 > +- > +Note: # of this elements is one fewer than number of zones. Because the highest > + zone's value is not necessary for following calculation. > + > +But, these values are not used directly. The kernel calculates # of protection > +pages for each zones from them. These are shown as array of protection pages > +in /proc/zoneinfo like followings. (This is an example of x86-64 box). > +Each zone has an array of protection pages like this. > + > +- > +Node 0, zone DMA > + pages free 1355 > + min 3 > + low 3 > + high 4 > + : > + : > + numa_other 0 > + protection: (0, 2004, 2004, 2004) > + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > + pagesets > + cpu: 0 pcp: 0 > + : > +- > +These protections are added to score to judge whether this zone should be used > +for page allocation or should be reclaimed.
> + > +In this example, if normal pages (index=2) are required to this DMA zone and > +pages_high is used for watermark, the kernel judges this zone should not be > +used because pages_free(1355) is smaller than watermark + protection[2] > +(4 + 2004 = 2008). If this protection value is 0, this zone would be used for > +normal page requirement. If requirement is DMA zone(index=0), protection[0] > +(=0) is used. > + > +zone[i]'s protection[j] is calculated by following expression. > + > +(i < j): > + zone[i]->protection[j] > + = (total sums of present_pages from zone[i+1] to zone[j] on the node) > + / lowmem_reserve_ratio[i]; > +(i = j): > + (should not be protected. = 0; > +(i > j): > + (not necessary, but looks 0) > + > +The default values of lowmem_reserve_ratio[i] are > + 256 (if zone[i] means DMA or DMA32 zone) > + 32 (others). > +As above expression, they are reciprocal number of ratio. > +256 means 1/256. # of protection pages becomes about "0.39%" of total present > +pages of higher zones on the node. > + > +If you would like to protect more pages, smaller values are effective. > +The minimum value is 1 (1/1 -> 100%). > > ============================================================== > > @@ -150,116 +265,149 @@ become subtly broken, and prone to deadlock under high loads. > > Setting this too high will OOM your machine instantly. > > +============================================================= > + > +min_slab_ratio: > + > +This is available only on NUMA kernels. > + > +A percentage of the total pages in each zone. On Zone reclaim > +(fallback from the local zone occurs) slabs will be reclaimed if more > +than this percentage of pages in a zone are reclaimable slab pages. > +This insures that the slab growth stays under control even in NUMA > +systems that rarely perform global reclaim. > + > +The default is 5 percent. > + > +Note that slab reclaim is triggered in a per zone / node fashion. > +The process of reclaiming slab memory is currently not node specific > +and may not be fast. > + > +============================================================= > + > +min_unmapped_ratio: > + > +This is available only on NUMA kernels. > + > +A percentage of the total pages in each zone. Zone reclaim will only > +occur if more than this percentage of pages are file backed and unmapped. > +This is to insure that a minimal amount of local pages is still available for > +file I/O even if the node is overallocated. > + > +The default is 1 percent. > + > ============================================================== > > -percpu_pagelist_fraction > +mmap_min_addr > > -This is the fraction of pages at most (high mark pcp->high) in each zone that > -are allocated for each per cpu page list. The min value for this is 8. It > -means that we don't allow more than 1/8th of pages in each zone to be > -allocated in any single per_cpu_pagelist. This entry only changes the value > -of hot per cpu pagelists. User can specify a number like 100 to allocate > -1/100th of each zone to each per cpu page list. > +This file indicates the amount of address space which a user process will > +be restricted from mmaping. Since kernel null dereference bugs could > +accidentally operate based on the information in the first couple of pages > +of memory userspace processes should not be allowed to write to them. By > +default this value is set to 0 and no protections will be enforced by the > +security module. 
Setting this value to something like 64k will allow the > +vast majority of applications to work correctly and provide defense in depth > +against future potential kernel bugs. > > -The batch value of each per cpu pagelist is also updated as a result. It is > -set to pcp->high/4. The upper limit of batch is (PAGE_SHIFT * 8) > +============================================================== > > -The initial value is zero. Kernel does not use this value at boot time to set > -the high water marks for each per cpu page list. > +nr_hugepages > > -=============================================================== > +Change the minimum size of the hugepage pool. > > -zone_reclaim_mode: > +See Documentation/vm/hugetlbpage.txt > > -Zone_reclaim_mode allows someone to set more or less aggressive approaches to > -reclaim memory when a zone runs out of memory. If it is set to zero then no > -zone reclaim occurs. Allocations will be satisfied from other zones / nodes > -in the system. > +============================================================== > > -This is value ORed together of > +nr_overcommit_hugepages > > -1 = Zone reclaim on > -2 = Zone reclaim writes dirty pages out > -4 = Zone reclaim swaps pages > +Change the maximum size of the hugepage pool. The maximum is > +nr_hugepages + nr_overcommit_hugepages. > > -zone_reclaim_mode is set during bootup to 1 if it is determined that pages > -from remote zones will cause a measurable performance reduction. The > -page allocator will then reclaim easily reusable pages (those page > -cache pages that are currently not used) before allocating off node pages. > +See Documentation/vm/hugetlbpage.txt > > -It may be beneficial to switch off zone reclaim if the system is > -used for a file server and all of memory should be used for caching files > -from disk. In that case the caching effect is more important than > -data locality. > +============================================================== > > -Allowing zone reclaim to write out pages stops processes that are > -writing large amounts of data from dirtying pages on other nodes. Zone > -reclaim will write out dirty pages if a zone fills up and so effectively > -throttle the process. This may decrease the performance of a single process > -since it cannot use all of system memory to buffer the outgoing writes > -anymore but it preserve the memory on other nodes so that the performance > -of other processes running on other nodes will not be affected. > +nr_pdflush_threads > > -Allowing regular swap effectively restricts allocations to the local > -node unless explicitly overridden by memory policies or cpuset > -configurations. > +The current number of pdflush threads. This value is read-only. > +The value changes according to the number of dirty pages in the system. > > -============================================================= > +When necessary, additional pdflush threads are created, one per second, up to > +nr_pdflush_threads_max. > > -min_unmapped_ratio: > +============================================================== > > -This is available only on NUMA kernels. > +nr_pdflush_threads_min > > -A percentage of the total pages in each zone. Zone reclaim will only > -occur if more than this percentage of pages are file backed and unmapped. > -This is to insure that a minimal amount of local pages is still available for > -file I/O even if the node is overallocated. > +This value controls the minimum number of pdflush threads. > > -The default is 1 percent.
> +At boot time, the kernel will create and maintain 'nr_pdflush_threads_min' > +threads for the kernel's lifetime. > > -============================================================= > +The default value is 2. The minimum value you can specify is 1, and > +the maximum value is the current setting of 'nr_pdflush_threads_max'. > > -min_slab_ratio: > +See 'nr_pdflush_threads_max' below for more information. > > -This is available only on NUMA kernels. > +============================================================== > > -A percentage of the total pages in each zone. On Zone reclaim > -(fallback from the local zone occurs) slabs will be reclaimed if more > -than this percentage of pages in a zone are reclaimable slab pages. > -This insures that the slab growth stays under control even in NUMA > -systems that rarely perform global reclaim. > +nr_pdflush_threads_max > > -The default is 5 percent. > +This value controls the maximum number of pdflush threads that can be > +created. The pdflush algorithm will create a new pdflush thread (up to > +this maximum) if no pdflush threads have been available for >= 1 second. > > -Note that slab reclaim is triggered in a per zone / node fashion. > -The process of reclaiming slab memory is currently not node specific > -and may not be fast. > +The default value is 8. The minimum value you can specify is the > +current value of 'nr_pdflush_threads_min' and the > +maximum is 1000. > > -============================================================= > +============================================================== > > -panic_on_oom > +numa_zonelist_order > > -This enables or disables panic on out-of-memory feature. > +This sysctl is only for NUMA. > +'where the memory is allocated from' is controlled by zonelists. > +(This documentation ignores ZONE_HIGHMEM/ZONE_DMA32 for simple explanation. > + you may be able to read ZONE_DMA as ZONE_DMA32...) > > -If this is set to 0, the kernel will kill some rogue process, > -called oom_killer. Usually, oom_killer can kill rogue processes and > -system will survive. > +In non-NUMA case, a zonelist for GFP_KERNEL is ordered as following. > +ZONE_NORMAL -> ZONE_DMA > +This means that a memory allocation request for GFP_KERNEL will > +get memory from ZONE_DMA only when ZONE_NORMAL is not available. > > -If this is set to 1, the kernel panics when out-of-memory happens. > -However, if a process limits using nodes by mempolicy/cpusets, > -and those nodes become memory exhaustion status, one process > -may be killed by oom-killer. No panic occurs in this case. > -Because other nodes' memory may be free. This means system total status > -may be not fatal yet. > +In NUMA case, you can think of following 2 types of order. > +Assume 2 node NUMA and below is zonelist of Node(0)'s GFP_KERNEL > > -If this is set to 2, the kernel panics compulsorily even on the > -above-mentioned. > +(A) Node(0) ZONE_NORMAL -> Node(0) ZONE_DMA -> Node(1) ZONE_NORMAL > +(B) Node(0) ZONE_NORMAL -> Node(1) ZONE_NORMAL -> Node(0) ZONE_DMA. > > -The default value is 0. > -1 and 2 are for failover of clustering. Please select either > -according to your policy of failover. > +Type(A) offers the best locality for processes on Node(0), but ZONE_DMA > +will be used before ZONE_NORMAL exhaustion. This increases possibility of > +out-of-memory(OOM) of ZONE_DMA because ZONE_DMA is tend to be small. > > -============================================================= > +Type(B) cannot offer the best locality but is more robust against OOM of > +the DMA zone. 
> + > +Type(A) is called as "Node" order. Type (B) is "Zone" order. > + > +"Node order" orders the zonelists by node, then by zone within each node. > +Specify "[Nn]ode" for node order > + > +"Zone Order" orders the zonelists by zone type, then by node within each > +zone. Specify "[Zz]one" for zone order. > + > +Specify "[Dd]efault" to request automatic configuration. Autoconfiguration > +will select "node" order in following case. > +(1) if the DMA zone does not exist or > +(2) if the DMA zone comprises greater than 50% of the available memory or > +(3) if any node's DMA zone comprises greater than 60% of its local memory and > + the amount of local memory is big enough. > + > +Otherwise, "zone" order will be selected. Default order is recommended unless > +this is causing problems for your system/application. > + > +============================================================== > > oom_dump_tasks > > @@ -280,7 +428,7 @@ OOM killer actually kills a memory-hogging task. > > The default value is 0. > > -============================================================= > +============================================================== > > oom_kill_allocating_task > > @@ -303,75 +451,157 @@ The default value is 0. > > ============================================================== > > -mmap_min_addr > +overcommit_memory: > > -This file indicates the amount of address space which a user process will > -be restricted from mmaping. Since kernel null dereference bugs could > -accidentally operate based on the information in the first couple of pages > -of memory userspace processes should not be allowed to write to them. By > -default this value is set to 0 and no protections will be enforced by the > -security module. Setting this value to something like 64k will allow the > -vast majority of applications to work correctly and provide defense in depth > -against future potential kernel bugs. > +This value contains a flag that enables memory overcommitment. > + > +When this flag is 0, the kernel attempts to estimate the amount > +of free memory left when userspace requests more memory. > + > +When this flag is 1, the kernel pretends there is always enough > +memory until it actually runs out. > + > +When this flag is 2, the kernel uses a "never overcommit" > +policy that attempts to prevent any overcommit of memory. > + > +This feature can be very useful because there are a lot of > +programs that malloc() huge amounts of memory "just-in-case" > +and don't use much of it. > + > +The default value is 0. > + > +See Documentation/vm/overcommit-accounting and > +security/commoncap.c::cap_vm_enough_memory() for more information. > > ============================================================== > > -numa_zonelist_order > +overcommit_ratio: > > -This sysctl is only for NUMA. > -'where the memory is allocated from' is controlled by zonelists. > -(This documentation ignores ZONE_HIGHMEM/ZONE_DMA32 for simple explanation. > - you may be able to read ZONE_DMA as ZONE_DMA32...) > +When overcommit_memory is set to 2, the committed address > +space is not permitted to exceed swap plus this percentage > +of physical RAM. See above. > > -In non-NUMA case, a zonelist for GFP_KERNEL is ordered as following. > -ZONE_NORMAL -> ZONE_DMA > -This means that a memory allocation request for GFP_KERNEL will > -get memory from ZONE_DMA only when ZONE_NORMAL is not available. > +============================================================== > > -In NUMA case, you can think of following 2 types of order.
> -Assume 2 node NUMA and below is zonelist of Node(0)'s GFP_KERNEL > +page-cluster > > -(A) Node(0) ZONE_NORMAL -> Node(0) ZONE_DMA -> Node(1) ZONE_NORMAL > -(B) Node(0) ZONE_NORMAL -> Node(1) ZONE_NORMAL -> Node(0) ZONE_DMA. > +page-cluster controls the number of pages which are written to swap in > +a single attempt. The swap I/O size. > > -Type(A) offers the best locality for processes on Node(0), but ZONE_DMA > -will be used before ZONE_NORMAL exhaustion. This increases possibility of > -out-of-memory(OOM) of ZONE_DMA because ZONE_DMA is tend to be small. > +It is a logarithmic value - setting it to zero means "1 page", setting > +it to 1 means "2 pages", setting it to 2 means "4 pages", etc. > > -Type(B) cannot offer the best locality but is more robust against OOM of > -the DMA zone. > +The default value is three (eight pages at a time). There may be some > +small benefits in tuning this to a different value if your workload is > +swap-intensive. > > -Type(A) is called as "Node" order. Type (B) is "Zone" order. > +============================================================= > > -"Node order" orders the zonelists by node, then by zone within each node. > -Specify "[Nn]ode" for zone order > +panic_on_oom > > -"Zone Order" orders the zonelists by zone type, then by node within each > -zone. Specify "[Zz]one"for zode order. > +This enables or disables panic on out-of-memory feature. > > -Specify "[Dd]efault" to request automatic configuration. Autoconfiguration > -will select "node" order in following case. > -(1) if the DMA zone does not exist or > -(2) if the DMA zone comprises greater than 50% of the available memory or > -(3) if any node's DMA zone comprises greater than 60% of its local memory and > - the amount of local memory is big enough. > +If this is set to 0, the kernel will kill some rogue process, > +called oom_killer. Usually, oom_killer can kill rogue processes and > +system will survive. > > -Otherwise, "zone" order will be selected. Default order is recommended unless > -this is causing problems for your system/application. > +If this is set to 1, the kernel panics when out-of-memory happens. > +However, if a process limits using nodes by mempolicy/cpusets, > +and those nodes become memory exhaustion status, one process > +may be killed by oom-killer. No panic occurs in this case. > +Because other nodes' memory may be free. This means system total status > +may be not fatal yet. > + > +If this is set to 2, the kernel panics compulsorily even on the > +above-mentioned. > + > +The default value is 0. > +1 and 2 are for failover of clustering. Please select either > +according to your policy of failover. > + > +============================================================= > + > +percpu_pagelist_fraction > + > +This is the fraction of pages at most (high mark pcp->high) in each zone that > +are allocated for each per cpu page list. The min value for this is 8. It > +means that we don't allow more than 1/8th of pages in each zone to be > +allocated in any single per_cpu_pagelist. This entry only changes the value > +of hot per cpu pagelists. User can specify a number like 100 to allocate > +1/100th of each zone to each per cpu page list. > + > +The batch value of each per cpu pagelist is also updated as a result. It is > +set to pcp->high/4. The upper limit of batch is (PAGE_SHIFT * 8) > + > +The initial value is zero. Kernel does not use this value at boot time to set > +the high water marks for each per cpu page list. 
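While the page-cluster text above is being moved, it could also spell out the arithmetic with a quick example. Something like the following, where the byte sizes assume 4 KB pages and are only illustrative:

    # pages per swap I/O attempt = 2^page-cluster
    cat /proc/sys/vm/page-cluster        # default 3  -> 2^3 = 8 pages (32 KB)
    echo 4 > /proc/sys/vm/page-cluster   # 2^4 = 16 pages (64 KB) per attempt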
> > ============================================================== > > -nr_hugepages > +stat_interval > > -Change the minimum size of the hugepage pool. > +The time interval, in seconds, at which vm statistics are updated. The default > +is 1 second. > > -See Documentation/vm/hugetlbpage.txt > +============================================================== > + > +swappiness > + > +This control is used to define how aggressively the kernel will swap > +memory pages. Higher values increase aggressiveness, lower values > +decrease the amount of swap. > + > +The default value is 60. > > ============================================================== > > -nr_overcommit_hugepages > +vfs_cache_pressure > > -Change the maximum size of the hugepage pool. The maximum is > -nr_hugepages + nr_overcommit_hugepages. > +Controls the tendency of the kernel to reclaim the memory which is used for > +caching of directory and inode objects. > > -See Documentation/vm/hugetlbpage.txt > +At the default value of vfs_cache_pressure=100 the kernel will attempt to > +reclaim dentries and inodes at a "fair" rate with respect to pagecache and > +swapcache reclaim. Decreasing vfs_cache_pressure causes the kernel to prefer > +to retain dentry and inode caches. Increasing vfs_cache_pressure beyond 100 > +causes the kernel to prefer to reclaim dentries and inodes. > + > +============================================================== > + > +zone_reclaim_mode: > + > +Zone_reclaim_mode allows someone to set more or less aggressive approaches to > +reclaim memory when a zone runs out of memory. If it is set to zero then no > +zone reclaim occurs. Allocations will be satisfied from other zones / nodes > +in the system. > + > +This is value ORed together of > + > +1 = Zone reclaim on > +2 = Zone reclaim writes dirty pages out > +4 = Zone reclaim swaps pages > + > +zone_reclaim_mode is set during bootup to 1 if it is determined that pages > +from remote zones will cause a measurable performance reduction. The > +page allocator will then reclaim easily reusable pages (those page > +cache pages that are currently not used) before allocating off node pages. > + > +It may be beneficial to switch off zone reclaim if the system is > +used for a file server and all of memory should be used for caching files > +from disk. In that case the caching effect is more important than > +data locality. > + > +Allowing zone reclaim to write out pages stops processes that are > +writing large amounts of data from dirtying pages on other nodes. Zone > +reclaim will write out dirty pages if a zone fills up and so effectively > +throttle the process. This may decrease the performance of a single process > +since it cannot use all of system memory to buffer the outgoing writes > +anymore but it preserve the memory on other nodes so that the performance > +of other processes running on other nodes will not be affected. > + > +Allowing regular swap effectively restricts allocations to the local > +node unless explicitly overridden by memory policies or cpuset > +configurations. > + > +============ End of Document ================================= -- ~Randy