Date: Fri, 23 Jul 2010 13:36:22 -0400 (EDT)
From: caiqian@redhat.com
To: Nitin Gupta
Cc: linux-mm, linux-kernel, Pekka Enberg, Hugh Dickins, Andrew Morton, Greg KH, Dan Magenheimer, Rik van Riel, Avi Kivity
Message-ID: <907206525.1113481279906582013.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com>
In-Reply-To: <575348163.1113381279906498028.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com>
Subject: Re: [PATCH 0/8] zcache: page cache compression support

----- "Nitin Gupta" wrote:

> Frequently accessed filesystem data is stored in memory to reduce access to
> (much) slower backing disks. Under memory pressure, these pages are freed and
> when needed again, they have to be read from disks again. When combined working
> set of all running application exceeds amount of physical RAM, we get extreme
> slowdown as reading a page from disk can take time in order of milliseconds.
>
> Memory compression increases effective memory size and allows more pages to
> stay in RAM. Since de/compressing memory pages is several orders of magnitude
> faster than disk I/O, this can provide significant performance gains for many
> workloads. Also, with multi-cores becoming common, benefits of reduced disk I/O
> should easily outweigh the problem of increased CPU usage.
>
> It is implemented as a "backend" for cleancache_ops [1] which provides
> callbacks for events such as when a page is to be removed from the page cache
> and when it is required again. We use them to implement a 'second chance' cache
> for these evicted page cache pages by compressing and storing them in memory
> itself.
>
> We only keep pages that compress to PAGE_SIZE/2 or less. Compressed chunks are
> stored using xvmalloc memory allocator which is already being used by zram
> driver for the same purpose. Zero-filled pages are checked and no memory is
> allocated for them.
>
> A separate "pool" is created for each mount instance for a cleancache-aware
> filesystem. Each incoming page is identified with <pool_id, inode_no, index>
> where inode_no identifies file within the filesystem corresponding to pool_id
> and index is offset of the page within this inode. Within a pool, inodes are
> maintained in an rb-tree and each of its nodes points to a separate radix-tree
> which maintains list of pages within that inode.
>
> While compression reduces disk I/O, it also reduces the space available for
> normal (uncompressed) page cache. This can result in more frequent page cache
> reclaim and thus higher CPU overhead. Thus, it's important to maintain good hit
> rate for compressed cache or increased CPU overhead can nullify any other
> benefits. This requires adaptive (compressed) cache resizing and page
> replacement policies that can maintain optimal cache size and quickly reclaim
> unused compressed chunks. This work is yet to be done. However, in the current
> state, it allows manually resizing cache size using (per-pool) sysfs node
> 'memlimit' which in turn frees any excess pages *sigh* randomly.
>
> Finally, it uses percpu stats and compression buffers to allow better
> performance on multi-cores. Still, there are known bottlenecks like a single
> xvmalloc mempool per zcache pool and few others. I will work on this when I
> start with profiling.
>
> * Performance numbers:
>   - Tested using iozone filesystem benchmark
>   - 4 CPUs, 1G RAM
>   - Read performance gain: ~2.5X
>   - Random read performance gain: ~3X
>   - In general, performance gains for every kind of I/O
>
> Test details with graphs can be found here:
> http://code.google.com/p/compcache/wiki/zcacheIOzone
>
> If I can get some help with testing, it would be interesting to find its
> effect in more real-life workloads. In particular, I'm interested in finding
> out its effect in KVM virtualization case where it can potentially allow
> running more number of VMs per-host for a given amount of RAM. With zcache
> enabled, VMs can be assigned much smaller amount of memory since host can now
> hold bulk of page-cache pages, allowing VMs to maintain similar level of
> performance while a greater number of them can be hosted.
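To make the storage policy quoted above concrete (allocate nothing for zero-filled pages, and keep only pages that compress to PAGE_SIZE/2 or less), here is a minimal userspace sketch. It is not taken from the patches: the names SKETCH_PAGE_SIZE, store_decision, page_is_zero_filled and decide_store are hypothetical, a 4096-byte page is assumed, and the compressed length is passed in by the caller rather than produced by LZO.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: 4096-byte pages; the kernel code would use PAGE_SIZE. */
#define SKETCH_PAGE_SIZE 4096u

/* Hypothetical outcome of offering one evicted page to the compressed cache. */
enum store_decision {
	STORE_ZERO_MARKER,	/* zero-filled: remember it, allocate nothing */
	STORE_COMPRESSED,	/* compressed to PAGE_SIZE/2 or less: keep it */
	STORE_REJECT		/* compresses poorly: do not keep it */
};

/* Return true if every word in the page is zero. */
static bool page_is_zero_filled(const unsigned long *page)
{
	size_t i;

	assert(page != NULL);
	for (i = 0; i < SKETCH_PAGE_SIZE / sizeof(*page); i++)
		if (page[i] != 0)
			return false;
	return true;
}

/*
 * Decide what to do with a page. compressed_len is whatever the
 * compressor reported for it; it is ignored for zero-filled pages.
 */
static enum store_decision decide_store(const unsigned long *page,
					size_t compressed_len)
{
	if (page_is_zero_filled(page))
		return STORE_ZERO_MARKER;
	if (compressed_len <= SKETCH_PAGE_SIZE / 2)
		return STORE_COMPRESSED;
	return STORE_REJECT;
}
```

With these assumptions, a non-zero page whose compressed size is 2048 bytes is kept, while one that compresses to 2049 bytes is rejected.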
>
> * How to test:
> All patches are against 2.6.35-rc5:
>
>  - First, apply all prerequisite patches here:
>    http://compcache.googlecode.com/hg/sub-projects/zcache_base_patches
>
>  - Then apply this patch series; also uploaded here:
>    http://compcache.googlecode.com/hg/sub-projects/zcache_patches
>
>
> Nitin Gupta (8):
>   Allow sharing xvmalloc for zram and zcache
>   Basic zcache functionality
>   Create sysfs nodes and export basic statistics
>   Shrink zcache based on memlimit
>   Eliminate zero-filled pages
>   Compress pages using LZO
>   Use xvmalloc to store compressed chunks
>   Document sysfs entries
>
>  Documentation/ABI/testing/sysfs-kernel-mm-zcache |   53 +
>  drivers/staging/Makefile                         |    2 +
>  drivers/staging/zram/Kconfig                     |   22 +
>  drivers/staging/zram/Makefile                    |    5 +-
>  drivers/staging/zram/xvmalloc.c                  |    8 +
>  drivers/staging/zram/zcache_drv.c                | 1312 ++++++++++++++++++++++
>  drivers/staging/zram/zcache_drv.h                |   90 ++
>  7 files changed, 1491 insertions(+), 1 deletions(-)
>  create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-zcache
>  create mode 100644 drivers/staging/zram/zcache_drv.c
>  create mode 100644 drivers/staging/zram/zcache_drv.h

I tested these patches on top of Linus' tree at commit
d0c6f6258478e1dba532bf7c28e2cd6e1047d3a4, and the OOM killer was triggered
even though there still appeared to be plenty of free swap.

# free -m
             total       used       free     shared    buffers     cached
Mem:           852        379        473          0          3         15
-/+ buffers/cache:        359        492
Swap:         2015         14       2001

# ./usemem 1024
0: Mallocing 32 megabytes
1: Mallocing 32 megabytes
2: Mallocing 32 megabytes
3: Mallocing 32 megabytes
4: Mallocing 32 megabytes
5: Mallocing 32 megabytes
6: Mallocing 32 megabytes
7: Mallocing 32 megabytes
8: Mallocing 32 megabytes
9: Mallocing 32 megabytes
10: Mallocing 32 megabytes
11: Mallocing 32 megabytes
12: Mallocing 32 megabytes
13: Mallocing 32 megabytes
14: Mallocing 32 megabytes
15: Mallocing 32 megabytes
Connection to 192.168.122.193 closed.

usemem invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0
usemem cpuset=/ mems_allowed=0
Pid: 1829, comm: usemem Not tainted 2.6.35-rc5+ #5
Call Trace:
 [] ? _raw_spin_unlock+0x2b/0x40
 [] dump_header+0x70/0x190
 [] oom_kill_process+0x81/0x180
 [] __out_of_memory+0x58/0xd0
 [] ? out_of_memory+0x15c/0x1f0
 [] out_of_memory+0x10f/0x1f0
 [] __alloc_pages_nodemask+0x7af/0x7c0
 [] alloc_page_vma+0x89/0x140
 [] handle_mm_fault+0x6d6/0x990
 [] ? _raw_spin_unlock+0x2b/0x40
 [] ? follow_page+0x19d/0x350
 [] __get_user_pages+0x16c/0x480
 [] ? sched_clock+0x9/0x10
 [] __mlock_vma_pages_range+0xef/0x1f0
 [] mlock_vma_pages_range+0x91/0xa0
 [] mmap_region+0x307/0x5b0
 [] do_mmap_pgoff+0x354/0x3a0
 [] ? sys_mmap_pgoff+0x5c/0x200
 [] sys_mmap_pgoff+0x7a/0x200
 [] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [] sys_mmap+0x29/0x30
 [] system_call_fastpath+0x16/0x1b
Mem-Info:
Node 0 DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
CPU 1: hi: 0, btch: 1 usd: 0
Node 0 DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 140
CPU 1: hi: 186, btch: 31 usd: 47
active_anon:128 inactive_anon:140 isolated_anon:0
 active_file:0 inactive_file:9 isolated_file:0
 unevictable:126855 dirty:0 writeback:125 unstable:0
 free:1996 slab_reclaimable:4445 slab_unreclaimable:23646
 mapped:923 shmem:7 pagetables:778 bounce:0
Node 0 DMA free:4032kB min:60kB low:72kB high:88kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:11896kB isolated(anon):0kB isolated(file):0kB present:15756kB mlocked:11896kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:24kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 994 994 994
Node 0 DMA32 free:3952kB min:4000kB low:5000kB high:6000kB active_anon:512kB inactive_anon:560kB active_file:0kB inactive_file:36kB unevictable:495524kB isolated(anon):0kB isolated(file):0kB present:1018060kB mlocked:495524kB dirty:0kB writeback:500kB mapped:3692kB shmem:28kB slab_reclaimable:17780kB slab_unreclaimable:94584kB kernel_stack:1296kB pagetables:3088kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1726 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 0*4kB 2*8kB 1*16kB 1*32kB 2*64kB 2*128kB 2*256kB 0*512kB 1*1024kB 1*2048kB 0*4096kB = 4032kB
Node 0 DMA32: 476*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 1*2048kB 0*4096kB = 3952kB
1146 total pagecache pages
215 pages in swap cache
Swap cache stats: add 19633, delete 19418, find 941/1333
Free swap  = 2051080kB
Total swap = 2064380kB
262138 pages RAM
43914 pages reserved
4832 pages shared
155665 pages non-shared
Out of memory: kill process 1727 (console-kit-dae) score 1027939 or a child
Killed process 1727 (console-kit-dae) vsz:4111756kB, anon-rss:0kB, file-rss:600kB

console-kit-dae invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0
console-kit-dae cpuset=/ mems_allowed=0
Pid: 1752, comm: console-kit-dae Not tainted 2.6.35-rc5+ #5
Call Trace:
 [] ? _raw_spin_unlock+0x2b/0x40
 [] dump_header+0x70/0x190
 [] oom_kill_process+0x81/0x180
 [] __out_of_memory+0x58/0xd0
 [] ? out_of_memory+0x15c/0x1f0
 [] out_of_memory+0x10f/0x1f0
 [] __alloc_pages_nodemask+0x7af/0x7c0
 [] kmem_getpages+0x6e/0x180
 [] fallback_alloc+0x1c9/0x2b0
 [] ? cache_grow+0x4b2/0x520
 [] ____cache_alloc_node+0xab/0x200
 [] ? taskstats_exit+0x305/0x3b0
 [] kmem_cache_alloc+0x1fb/0x290
 [] taskstats_exit+0x305/0x3b0
 [] do_exit+0x12b/0x890
 [] ? trace_hardirqs_off+0xd/0x10
 [] ? cpu_clock+0x6f/0x80
 [] ? lock_release_holdtime+0x3d/0x190
 [] ? _raw_spin_unlock_irq+0x30/0x40
 [] do_group_exit+0x5e/0xd0
 [] get_signal_to_deliver+0x2d4/0x490
 [] ? inode_has_perm+0x7d/0xf0
 [] do_signal+0x75/0x7b0
 [] ? vfs_ioctl+0x3d/0xf0
 [] ? do_vfs_ioctl+0x84/0x570
 [] do_notify_resume+0x65/0x80
 [] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [] int_signal+0x12/0x17
Mem-Info:
Node 0 DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
CPU 1: hi: 0, btch: 1 usd: 0
Node 0 DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 151
CPU 1: hi: 186, btch: 31 usd: 61
active_anon:128 inactive_anon:165 isolated_anon:0
 active_file:0 inactive_file:9 isolated_file:0
 unevictable:126855 dirty:0 writeback:25 unstable:0
 free:1965 slab_reclaimable:4445 slab_unreclaimable:23646
 mapped:923 shmem:7 pagetables:778 bounce:0
Node 0 DMA free:4032kB min:60kB low:72kB high:88kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:11896kB isolated(anon):0kB isolated(file):0kB present:15756kB mlocked:11896kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:24kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 994 994 994
Node 0 DMA32 free:3828kB min:4000kB low:5000kB high:6000kB active_anon:512kB inactive_anon:660kB active_file:0kB inactive_file:36kB unevictable:495524kB isolated(anon):0kB isolated(file):0kB present:1018060kB mlocked:495524kB dirty:0kB writeback:100kB mapped:3692kB shmem:28kB slab_reclaimable:17780kB slab_unreclaimable:94584kB kernel_stack:1296kB pagetables:3088kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1726 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 0*4kB 2*8kB 1*16kB 1*32kB 2*64kB 2*128kB 2*256kB 0*512kB 1*1024kB 1*2048kB 0*4096kB = 4032kB
Node 0 DMA32: 445*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 1*2048kB 0*4096kB = 3828kB
1146 total pagecache pages
230 pages in swap cache
Swap cache stats: add 19649, delete 19419, find 942/1336
Free swap  = 2051084kB
Total swap = 2064380kB
262138 pages RAM
43914 pages reserved
4818 pages shared
155685 pages non-shared
Out of memory: kill process 1806 (sshd) score 9474 or a child
Killed process 1810 (bash) vsz:108384kB, anon-rss:0kB, file-rss:656kB

console-kit-dae invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0
console-kit-dae cpuset=/ mems_allowed=0
Pid: 1752, comm: console-kit-dae Not tainted 2.6.35-rc5+ #5
Call Trace:
 [] ? _raw_spin_unlock+0x2b/0x40
 [] dump_header+0x70/0x190
 [] oom_kill_process+0x81/0x180
 [] __out_of_memory+0x58/0xd0
 [] ? out_of_memory+0x15c/0x1f0
 [] out_of_memory+0x10f/0x1f0
 [] __alloc_pages_nodemask+0x7af/0x7c0
 [] kmem_getpages+0x6e/0x180
 [] fallback_alloc+0x1c9/0x2b0
 [] ? cache_grow+0x4b2/0x520
 [] ____cache_alloc_node+0xab/0x200
 [] ? taskstats_exit+0x305/0x3b0
 [] kmem_cache_alloc+0x1fb/0x290
 [] taskstats_exit+0x305/0x3b0
 [] do_exit+0x12b/0x890
 [] ? trace_hardirqs_off+0xd/0x10
 [] ? cpu_clock+0x6f/0x80
 [] ? lock_release_holdtime+0x3d/0x190
 [] ? _raw_spin_unlock_irq+0x30/0x40
 [] do_group_exit+0x5e/0xd0
 [] get_signal_to_deliver+0x2d4/0x490
 [] ? inode_has_perm+0x7d/0xf0
 [] do_signal+0x75/0x7b0
 [] ? vfs_ioctl+0x3d/0xf0
 [] ? do_vfs_ioctl+0x84/0x570
 [] do_notify_resume+0x65/0x80
 [] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [] int_signal+0x12/0x17
Mem-Info:
Node 0 DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
CPU 1: hi: 0, btch: 1 usd: 0
Node 0 DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 119
CPU 1: hi: 186, btch: 31 usd: 73
active_anon:50 inactive_anon:175 isolated_anon:0
 active_file:0 inactive_file:9 isolated_file:0
 unevictable:126855 dirty:0 writeback:25 unstable:0
 free:1996 slab_reclaimable:4445 slab_unreclaimable:23663
 mapped:923 shmem:7 pagetables:778 bounce:0
Node 0 DMA free:4032kB min:60kB low:72kB high:88kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:11896kB isolated(anon):0kB isolated(file):0kB present:15756kB mlocked:11896kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:24kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 994 994 994
Node 0 DMA32 free:3952kB min:4000kB low:5000kB high:6000kB active_anon:200kB inactive_anon:700kB active_file:0kB inactive_file:36kB unevictable:495524kB isolated(anon):0kB isolated(file):0kB present:1018060kB mlocked:495524kB dirty:0kB writeback:100kB mapped:3692kB shmem:28kB slab_reclaimable:17780kB slab_unreclaimable:94652kB kernel_stack:1296kB pagetables:3088kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1536 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 0*4kB 2*8kB 1*16kB 1*32kB 2*64kB 2*128kB 2*256kB 0*512kB 1*1024kB 1*2048kB 0*4096kB = 4032kB
Node 0 DMA32: 470*4kB 3*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 1*2048kB 0*4096kB = 3952kB
1146 total pagecache pages
221 pages in swap cache
Swap cache stats: add 19848, delete 19627, find 970/1386
Free swap  = 2051428kB
Total swap = 2064380kB
262138 pages RAM
43914 pages reserved
4669 pages shared
155659 pages non-shared
Out of memory: kill process 1829 (usemem) score 8253 or a child
Killed process 1829 (usemem) vsz:528224kB, anon-rss:502468kB, file-rss:376kB

# cat usemem.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define CHUNKS 32

int main(int argc, char *argv[])
{
	/* Lock every mapping created from now on into RAM. */
	mlockall(MCL_FUTURE);

	unsigned long mb;
	char *buf[CHUNKS];
	int i;

	if (argc < 2) {
		fprintf(stderr, "usage: usemem megabytes\n");
		exit(1);
	}
	mb = strtoul(argv[1], NULL, 0);

	/* Allocate the requested size in CHUNKS equal pieces. */
	for (i = 0; i < CHUNKS; i++) {
		fprintf(stderr, "%d: Mallocing %lu megabytes\n", i, mb/CHUNKS);
		buf[i] = (char *)malloc(mb/CHUNKS * 1024L * 1024L);
		if (!buf[i]) {
			fprintf(stderr, "malloc failure\n");
			exit(1);
		}
	}

	/* Touch every byte of every chunk. */
	for (i = 0; i < CHUNKS; i++) {
		fprintf(stderr, "%d: Zeroing %lu megabytes at %p\n",
			i, mb/CHUNKS, buf[i]);
		memset(buf[i], 0, mb/CHUNKS * 1024L * 1024L);
	}

	exit(0);
}
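
As a reading aid for the Mem-Info dumps above: the global counters (unevictable:, free:) are in pages, while the per-zone lines are in kB. Below is a minimal sketch for cross-checking the two, assuming 4 kB pages (the buddy-list lines such as 476*4kB suggest that). The helper names pages_to_kb, counter_matches_zones and self_check are hypothetical and not part of any kernel interface; the numbers in self_check() are copied from the first report.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: 4 kB pages, as suggested by the "476*4kB" buddy-list lines. */
#define ASSUMED_PAGE_KB 4UL

/* Convert a Mem-Info page counter to kilobytes. */
static unsigned long pages_to_kb(unsigned long pages)
{
	return pages * ASSUMED_PAGE_KB;
}

/* True if a global page counter equals the sum of the two per-zone kB values. */
static bool counter_matches_zones(unsigned long pages,
				  unsigned long dma_kb,
				  unsigned long dma32_kb)
{
	return pages_to_kb(pages) == dma_kb + dma32_kb;
}

/* Cross-check the first OOM report against itself. */
static void self_check(void)
{
	/* unevictable:126855 vs. mlocked:11896kB (DMA) + mlocked:495524kB (DMA32) */
	assert(counter_matches_zones(126855UL, 11896UL, 495524UL));
	/* free:1996 vs. free:4032kB (DMA) + free:3952kB (DMA32) */
	assert(counter_matches_zones(1996UL, 4032UL, 3952UL));
}
```

Under that assumption, unevictable:126855 is 507420 kB, which equals the DMA and DMA32 mlocked values added together. So essentially all unevictable memory in these reports is mlocked, which is consistent with usemem calling mlockall(MCL_FUTURE).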