Date: Fri, 23 Jul 2010 13:41:31 -0400 (EDT)
From: CAI Qian
To: caiqian@redhat.com
Cc: linux-mm, linux-kernel, Pekka Enberg, Hugh Dickins, Andrew Morton,
    Greg KH, Dan Magenheimer, Rik van Riel, Avi Kivity, Nitin Gupta
Message-ID: <645277378.1113891279906891174.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com>
In-Reply-To: <907206525.1113481279906582013.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com>
Subject: Re: [PATCH 0/8] zcache: page cache compression support

----- caiqian@redhat.com wrote:

> ----- "Nitin Gupta" wrote:
> >
> > Frequently accessed filesystem data is stored in memory to reduce
> > access to (much) slower backing disks. Under memory pressure, these
> > pages are freed and, when needed again, they have to be read back
> > from disk. When the combined working set of all running applications
> > exceeds the amount of physical RAM, we get extreme slowdown, as
> > reading a page from disk can take time on the order of milliseconds.
> >
> > Memory compression increases the effective memory size and allows
> > more pages to stay in RAM. Since de/compressing memory pages is
> > several orders of magnitude faster than disk I/O, this can provide
> > significant performance gains for many workloads. Also, with
> > multi-cores becoming common, the benefit of reduced disk I/O should
> > easily outweigh the problem of increased CPU usage.
> >
> > It is implemented as a "backend" for cleancache_ops [1], which
> > provides callbacks for events such as when a page is about to be
> > removed from the page cache and when it is required again. We use
> > these to implement a 'second chance' cache for evicted page cache
> > pages by compressing and storing them in memory itself.
> >
> > We only keep pages that compress to PAGE_SIZE/2 or less. Compressed
> > chunks are stored using the xvmalloc memory allocator, which is
> > already being used by the zram driver for the same purpose.
> > Zero-filled pages are detected, and no memory is allocated for them.
> >
> > A separate "pool" is created for each mount instance of a
> > cleancache-aware filesystem. Each incoming page is identified by
> > <pool_id, inode_no, index>, where inode_no identifies the file within
> > the filesystem corresponding to pool_id, and index is the offset of
> > the page within that inode. Within a pool, inodes are maintained in
> > an rb-tree, and each of its nodes points to a separate radix tree
> > which maintains the list of pages within that inode.
> >
> > While compression reduces disk I/O, it also reduces the space
> > available for the normal (uncompressed) page cache. This can result
> > in more frequent page cache reclaim and thus higher CPU overhead.
> > It is therefore important to maintain a good hit rate for the
> > compressed cache, or the increased CPU overhead can nullify any
> > other benefits. This requires adaptive (compressed) cache resizing
> > and page replacement policies that can maintain an optimal cache
> > size and quickly reclaim unused compressed chunks. This work is yet
> > to be done. However, in the current state, the cache can be resized
> > manually using the (per-pool) sysfs node 'memlimit', which in turn
> > frees any excess pages *sigh* randomly.
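[As an aside, to make the pool layout described above concrete: a
per-pool lookup along those lines might look roughly like the sketch
below. The struct and function names are invented for illustration
and are not the actual zcache_drv.c definitions.]

/*
 * Illustrative sketch only -- not the real zcache code.  One pool per
 * cleancache-aware mount; within a pool, inodes are kept in an rb-tree
 * and each inode node carries a radix tree mapping page index to the
 * stored compressed chunk.
 */
#include <linux/rbtree.h>
#include <linux/radix-tree.h>
#include <linux/spinlock.h>
#include <linux/types.h>

struct zcache_inode_sketch {
	struct rb_node rb_node;			/* linked into pool->inode_tree */
	unsigned long inode_no;			/* key: inode number within the fs */
	struct radix_tree_root page_tree;	/* page index -> compressed chunk */
};

struct zcache_pool_sketch {
	struct rb_root inode_tree;		/* all inodes seen for this mount */
	spinlock_t tree_lock;			/* protects inode_tree */
	u64 memlimit;				/* per-pool sysfs 'memlimit' (bytes) */
};

/* Look up an inode node in a pool; caller holds tree_lock. */
static struct zcache_inode_sketch *
zcache_find_inode(struct zcache_pool_sketch *pool, unsigned long inode_no)
{
	struct rb_node *node = pool->inode_tree.rb_node;

	while (node) {
		struct zcache_inode_sketch *znode =
			rb_entry(node, struct zcache_inode_sketch, rb_node);

		if (inode_no < znode->inode_no)
			node = node->rb_left;
		else if (inode_no > znode->inode_no)
			node = node->rb_right;
		else
			return znode;
	}
	return NULL;
}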
> > Finally, it uses percpu stats and compression buffers to allow
> > better performance on multi-cores. Still, there are known
> > bottlenecks, such as a single xvmalloc mempool per zcache pool, and
> > a few others. I will work on these when I start profiling.
> >
> > * Performance numbers:
> >  - Tested using the iozone filesystem benchmark
> >  - 4 CPUs, 1G RAM
> >  - Read performance gain: ~2.5X
> >  - Random read performance gain: ~3X
> >  - In general, performance gains for every kind of I/O
> >
> > Test details with graphs can be found here:
> > http://code.google.com/p/compcache/wiki/zcacheIOzone
> >
> > If I can get some help with testing, it would be interesting to find
> > out its effect on more real-life workloads. In particular, I'm
> > interested in its effect in the KVM virtualization case, where it
> > can potentially allow running more VMs per host for a given amount
> > of RAM. With zcache enabled, VMs can be assigned a much smaller
> > amount of memory, since the host can now hold the bulk of the
> > page-cache pages, allowing VMs to maintain a similar level of
> > performance while a greater number of them can be hosted.
> >
> > * How to test:
> > All patches are against 2.6.35-rc5:
> >
> >  - First, apply all prerequisite patches here:
> >    http://compcache.googlecode.com/hg/sub-projects/zcache_base_patches
> >
> >  - Then apply this patch series; also uploaded here:
> >    http://compcache.googlecode.com/hg/sub-projects/zcache_patches
> >
> > Nitin Gupta (8):
> >   Allow sharing xvmalloc for zram and zcache
> >   Basic zcache functionality
> >   Create sysfs nodes and export basic statistics
> >   Shrink zcache based on memlimit
> >   Eliminate zero-filled pages
> >   Compress pages using LZO
> >   Use xvmalloc to store compressed chunks
> >   Document sysfs entries
> >
> >  Documentation/ABI/testing/sysfs-kernel-mm-zcache |   53 +
> >  drivers/staging/Makefile                         |    2 +
> >  drivers/staging/zram/Kconfig                     |   22 +
> >  drivers/staging/zram/Makefile                    |    5 +-
> >  drivers/staging/zram/xvmalloc.c                  |    8 +
> >  drivers/staging/zram/zcache_drv.c                | 1312 ++++++++++++++++++++++
> >  drivers/staging/zram/zcache_drv.h                |   90 ++
> >  7 files changed, 1491 insertions(+), 1 deletions(-)
> >  create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-zcache
> >  create mode 100644 drivers/staging/zram/zcache_drv.c
> >  create mode 100644 drivers/staging/zram/zcache_drv.h
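[A second illustrative aside: the store-side policy described in the
quoted text -- keep a page only if it compresses to PAGE_SIZE/2 or
less, and allocate nothing for zero-filled pages -- could be sketched
as follows. The function names are made up for illustration; the real
code is in zcache_drv.c and would additionally copy the result into an
xvmalloc-backed chunk.]

/* Illustrative sketch only -- not the real zcache code. */
#include <linux/errno.h>
#include <linux/lzo.h>
#include <linux/mm.h>
#include <linux/types.h>

static bool page_is_zero_filled(const unsigned long *mem)
{
	unsigned int pos;

	for (pos = 0; pos < PAGE_SIZE / sizeof(*mem); pos++)
		if (mem[pos])
			return false;
	return true;
}

/*
 * 'src' is the kmapped page, 'dst' a per-cpu compression buffer and
 * 'wrkmem' a per-cpu LZO1X_1_MEM_COMPRESS sized scratch area.
 * Returns the compressed length to store, 0 for a zero-filled page
 * (only a flag needs to be remembered), or -EINVAL when the page
 * should not be cached at all.
 */
static int zcache_compress_page(const void *src, void *dst, void *wrkmem)
{
	size_t clen;
	int ret;

	if (page_is_zero_filled(src))
		return 0;

	ret = lzo1x_1_compress(src, PAGE_SIZE, dst, &clen, wrkmem);
	if (ret != LZO_E_OK || clen > PAGE_SIZE / 2)
		return -EINVAL;	/* compressed poorly: do not cache */

	return clen;
}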
> After testing those patches on top of the Linus tree at commit
> d0c6f6258478e1dba532bf7c28e2cd6e1047d3a4, the OOM killer was triggered
> even though there still looked to be lots of swap free.
>
> # free -m
>              total       used       free     shared    buffers     cached
> Mem:           852        379        473          0          3         15
> -/+ buffers/cache:        359        492
> Swap:         2015         14       2001
>
> # ./usemem 1024
> 0: Mallocing 32 megabytes
> 1: Mallocing 32 megabytes
> 2: Mallocing 32 megabytes
> 3: Mallocing 32 megabytes
> 4: Mallocing 32 megabytes
> 5: Mallocing 32 megabytes
> 6: Mallocing 32 megabytes
> 7: Mallocing 32 megabytes
> 8: Mallocing 32 megabytes
> 9: Mallocing 32 megabytes
> 10: Mallocing 32 megabytes
> 11: Mallocing 32 megabytes
> 12: Mallocing 32 megabytes
> 13: Mallocing 32 megabytes
> 14: Mallocing 32 megabytes
> 15: Mallocing 32 megabytes
> Connection to 192.168.122.193 closed.
>
> usemem invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0
> usemem cpuset=/ mems_allowed=0
> Pid: 1829, comm: usemem Not tainted 2.6.35-rc5+ #5
> Call Trace:
> [] ? _raw_spin_unlock+0x2b/0x40
> [] dump_header+0x70/0x190
> [] oom_kill_process+0x81/0x180
> [] __out_of_memory+0x58/0xd0
> [] ? out_of_memory+0x15c/0x1f0
> [] out_of_memory+0x10f/0x1f0
> [] __alloc_pages_nodemask+0x7af/0x7c0
> [] alloc_page_vma+0x89/0x140
> [] handle_mm_fault+0x6d6/0x990
> [] ? _raw_spin_unlock+0x2b/0x40
> [] ? follow_page+0x19d/0x350
> [] __get_user_pages+0x16c/0x480
> [] ? sched_clock+0x9/0x10
> [] __mlock_vma_pages_range+0xef/0x1f0
> [] mlock_vma_pages_range+0x91/0xa0
> [] mmap_region+0x307/0x5b0
> [] do_mmap_pgoff+0x354/0x3a0
> [] ? sys_mmap_pgoff+0x5c/0x200
> [] sys_mmap_pgoff+0x7a/0x200
> [] ? trace_hardirqs_on_thunk+0x3a/0x3f
> [] sys_mmap+0x29/0x30
> [] system_call_fastpath+0x16/0x1b
> Mem-Info:
> Node 0 DMA per-cpu:
> CPU 0: hi: 0, btch: 1 usd: 0
> CPU 1: hi: 0, btch: 1 usd: 0
> Node 0 DMA32 per-cpu:
> CPU 0: hi: 186, btch: 31 usd: 140
> CPU 1: hi: 186, btch: 31 usd: 47
> active_anon:128 inactive_anon:140 isolated_anon:0
>  active_file:0 inactive_file:9 isolated_file:0
>  unevictable:126855 dirty:0 writeback:125 unstable:0
>  free:1996 slab_reclaimable:4445 slab_unreclaimable:23646
>  mapped:923 shmem:7 pagetables:778 bounce:0
> Node 0 DMA free:4032kB min:60kB low:72kB high:88kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:11896kB isolated(anon):0kB isolated(file):0kB present:15756kB mlocked:11896kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:24kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
> lowmem_reserve[]: 0 994 994 994
> Node 0 DMA32 free:3952kB min:4000kB low:5000kB high:6000kB active_anon:512kB inactive_anon:560kB active_file:0kB inactive_file:36kB unevictable:495524kB isolated(anon):0kB isolated(file):0kB present:1018060kB mlocked:495524kB dirty:0kB writeback:500kB mapped:3692kB shmem:28kB slab_reclaimable:17780kB slab_unreclaimable:94584kB kernel_stack:1296kB pagetables:3088kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1726 all_unreclaimable? yes
> lowmem_reserve[]: 0 0 0 0
> Node 0 DMA: 0*4kB 2*8kB 1*16kB 1*32kB 2*64kB 2*128kB 2*256kB 0*512kB 1*1024kB 1*2048kB 0*4096kB = 4032kB
> Node 0 DMA32: 476*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 1*2048kB 0*4096kB = 3952kB
> 1146 total pagecache pages
> 215 pages in swap cache
> Swap cache stats: add 19633, delete 19418, find 941/1333
> Free swap = 2051080kB
> Total swap = 2064380kB
> 262138 pages RAM
> 43914 pages reserved
> 4832 pages shared
> 155665 pages non-shared
> Out of memory: kill process 1727 (console-kit-dae) score 1027939 or a child
> Killed process 1727 (console-kit-dae) vsz:4111756kB, anon-rss:0kB, file-rss:600kB
>
> console-kit-dae invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0
> console-kit-dae cpuset=/ mems_allowed=0
> Pid: 1752, comm: console-kit-dae Not tainted 2.6.35-rc5+ #5
> Call Trace:
> [] ? _raw_spin_unlock+0x2b/0x40
> [] dump_header+0x70/0x190
> [] oom_kill_process+0x81/0x180
> [] __out_of_memory+0x58/0xd0
> [] ? out_of_memory+0x15c/0x1f0
> [] out_of_memory+0x10f/0x1f0
> [] __alloc_pages_nodemask+0x7af/0x7c0
> [] kmem_getpages+0x6e/0x180
> [] fallback_alloc+0x1c9/0x2b0
> [] ? cache_grow+0x4b2/0x520
> [] ____cache_alloc_node+0xab/0x200
> [] ? taskstats_exit+0x305/0x3b0
> [] kmem_cache_alloc+0x1fb/0x290
> [] taskstats_exit+0x305/0x3b0
> [] do_exit+0x12b/0x890
> [] ? trace_hardirqs_off+0xd/0x10
> [] ? cpu_clock+0x6f/0x80
> [] ? lock_release_holdtime+0x3d/0x190
> [] ? _raw_spin_unlock_irq+0x30/0x40
> [] do_group_exit+0x5e/0xd0
> [] get_signal_to_deliver+0x2d4/0x490
> [] ? inode_has_perm+0x7d/0xf0
> [] do_signal+0x75/0x7b0
> [] ? vfs_ioctl+0x3d/0xf0
> [] ? do_vfs_ioctl+0x84/0x570
> [] do_notify_resume+0x65/0x80
> [] ? trace_hardirqs_on_thunk+0x3a/0x3f
> [] int_signal+0x12/0x17
> Mem-Info:
> Node 0 DMA per-cpu:
> CPU 0: hi: 0, btch: 1 usd: 0
> CPU 1: hi: 0, btch: 1 usd: 0
> Node 0 DMA32 per-cpu:
> CPU 0: hi: 186, btch: 31 usd: 151
> CPU 1: hi: 186, btch: 31 usd: 61
> active_anon:128 inactive_anon:165 isolated_anon:0
>  active_file:0 inactive_file:9 isolated_file:0
>  unevictable:126855 dirty:0 writeback:25 unstable:0
>  free:1965 slab_reclaimable:4445 slab_unreclaimable:23646
>  mapped:923 shmem:7 pagetables:778 bounce:0
> Node 0 DMA free:4032kB min:60kB low:72kB high:88kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:11896kB isolated(anon):0kB isolated(file):0kB present:15756kB mlocked:11896kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:24kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
> lowmem_reserve[]: 0 994 994 994
> Node 0 DMA32 free:3828kB min:4000kB low:5000kB high:6000kB active_anon:512kB inactive_anon:660kB active_file:0kB inactive_file:36kB unevictable:495524kB isolated(anon):0kB isolated(file):0kB present:1018060kB mlocked:495524kB dirty:0kB writeback:100kB mapped:3692kB shmem:28kB slab_reclaimable:17780kB slab_unreclaimable:94584kB kernel_stack:1296kB pagetables:3088kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1726 all_unreclaimable? yes
> lowmem_reserve[]: 0 0 0 0
> Node 0 DMA: 0*4kB 2*8kB 1*16kB 1*32kB 2*64kB 2*128kB 2*256kB 0*512kB 1*1024kB 1*2048kB 0*4096kB = 4032kB
> Node 0 DMA32: 445*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 1*2048kB 0*4096kB = 3828kB
> 1146 total pagecache pages
> 230 pages in swap cache
> Swap cache stats: add 19649, delete 19419, find 942/1336
> Free swap = 2051084kB
> Total swap = 2064380kB
> 262138 pages RAM
> 43914 pages reserved
> 4818 pages shared
> 155685 pages non-shared
> Out of memory: kill process 1806 (sshd) score 9474 or a child
> Killed process 1810 (bash) vsz:108384kB, anon-rss:0kB, file-rss:656kB
>
> console-kit-dae invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0
> console-kit-dae cpuset=/ mems_allowed=0
> Pid: 1752, comm: console-kit-dae Not tainted 2.6.35-rc5+ #5
> Call Trace:
> [] ? _raw_spin_unlock+0x2b/0x40
> [] dump_header+0x70/0x190
> [] oom_kill_process+0x81/0x180
> [] __out_of_memory+0x58/0xd0
> [] ? out_of_memory+0x15c/0x1f0
> [] out_of_memory+0x10f/0x1f0
> [] __alloc_pages_nodemask+0x7af/0x7c0
> [] kmem_getpages+0x6e/0x180
> [] fallback_alloc+0x1c9/0x2b0
> [] ? cache_grow+0x4b2/0x520
> [] ____cache_alloc_node+0xab/0x200
> [] ? taskstats_exit+0x305/0x3b0
> [] kmem_cache_alloc+0x1fb/0x290
> [] taskstats_exit+0x305/0x3b0
> [] do_exit+0x12b/0x890
> [] ? trace_hardirqs_off+0xd/0x10
> [] ? cpu_clock+0x6f/0x80
> [] ? lock_release_holdtime+0x3d/0x190
> [] ? _raw_spin_unlock_irq+0x30/0x40
> [] do_group_exit+0x5e/0xd0
> [] get_signal_to_deliver+0x2d4/0x490
> [] ? inode_has_perm+0x7d/0xf0
> [] do_signal+0x75/0x7b0
> [] ? vfs_ioctl+0x3d/0xf0
> [] ? do_vfs_ioctl+0x84/0x570
> [] do_notify_resume+0x65/0x80
> [] ? trace_hardirqs_on_thunk+0x3a/0x3f
> [] int_signal+0x12/0x17
> Mem-Info:
> Node 0 DMA per-cpu:
> CPU 0: hi: 0, btch: 1 usd: 0
> CPU 1: hi: 0, btch: 1 usd: 0
> Node 0 DMA32 per-cpu:
> CPU 0: hi: 186, btch: 31 usd: 119
> CPU 1: hi: 186, btch: 31 usd: 73
> active_anon:50 inactive_anon:175 isolated_anon:0
>  active_file:0 inactive_file:9 isolated_file:0
>  unevictable:126855 dirty:0 writeback:25 unstable:0
>  free:1996 slab_reclaimable:4445 slab_unreclaimable:23663
>  mapped:923 shmem:7 pagetables:778 bounce:0
> Node 0 DMA free:4032kB min:60kB low:72kB high:88kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:11896kB isolated(anon):0kB isolated(file):0kB present:15756kB mlocked:11896kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:24kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
> lowmem_reserve[]: 0 994 994 994
> Node 0 DMA32 free:3952kB min:4000kB low:5000kB high:6000kB active_anon:200kB inactive_anon:700kB active_file:0kB inactive_file:36kB unevictable:495524kB isolated(anon):0kB isolated(file):0kB present:1018060kB mlocked:495524kB dirty:0kB writeback:100kB mapped:3692kB shmem:28kB slab_reclaimable:17780kB slab_unreclaimable:94652kB kernel_stack:1296kB pagetables:3088kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1536 all_unreclaimable? yes
> lowmem_reserve[]: 0 0 0 0
> Node 0 DMA: 0*4kB 2*8kB 1*16kB 1*32kB 2*64kB 2*128kB 2*256kB 0*512kB 1*1024kB 1*2048kB 0*4096kB = 4032kB
> Node 0 DMA32: 470*4kB 3*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 1*2048kB 0*4096kB = 3952kB
> 1146 total pagecache pages
> 221 pages in swap cache
> Swap cache stats: add 19848, delete 19627, find 970/1386
> Free swap = 2051428kB
> Total swap = 2064380kB
> 262138 pages RAM
> 43914 pages reserved
> 4669 pages shared
> 155659 pages non-shared
> Out of memory: kill process 1829 (usemem) score 8253 or a child
> Killed process 1829 (usemem) vsz:528224kB, anon-rss:502468kB, file-rss:376kB
>
> # cat usemem.c
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <sys/mman.h>
> #define CHUNKS 32
>
> int
> main(int argc, char *argv[])
> {
> 	mlockall(MCL_FUTURE);
>
> 	unsigned long mb;
> 	char *buf[CHUNKS];
> 	int i;
>
> 	if (argc < 2) {
> 		fprintf(stderr, "usage: usemem megabytes\n");
> 		exit(1);
> 	}
> 	mb = strtoul(argv[1], NULL, 0);
>
> 	for (i = 0; i < CHUNKS; i++) {
> 		fprintf(stderr, "%d: Mallocing %lu megabytes\n", i, mb/CHUNKS);
> 		buf[i] = (char *)malloc(mb/CHUNKS * 1024L * 1024L);
> 		if (!buf[i]) {
> 			fprintf(stderr, "malloc failure\n");
> 			exit(1);
> 		}
> 	}
>
> 	for (i = 0; i < CHUNKS; i++) {
> 		fprintf(stderr, "%d: Zeroing %lu megabytes at %p\n",
> 			i, mb/CHUNKS, buf[i]);
> 		memset(buf[i], 0, mb/CHUNKS * 1024L * 1024L);
> 	}
>
> 	exit(0);
> }

In case it is relevant: this was tested inside a KVM guest. The host
was RHEL6 with THP enabled.