Date: Fri, 23 Jul 2010 14:02:16 -0400 (EDT)
From: CAI Qian
To: linux-mm, linux-kernel, Pekka Enberg, Hugh Dickins, Andrew Morton,
    Greg KH, Dan Magenheimer, Rik van Riel, Avi Kivity, Nitin Gupta
Message-ID: <1507379750.1116011279908136772.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com>
In-Reply-To: <645277378.1113891279906891174.JavaMail.root@zmail06.collab.prod.int.phx2.redhat.com>
Subject: Re: [PATCH 0/8] zcache: page cache compression support
X-Mailing-List: linux-kernel@vger.kernel.org

Ignore me. The test case should not be using mlockall()!

----- "CAI Qian" wrote:

> ----- caiqian@redhat.com wrote:
>
> > ----- "Nitin Gupta" wrote:
> >
> > > Frequently accessed filesystem data is stored in memory to reduce
> > > access to (much) slower backing disks. Under memory pressure, these
> > > pages are freed and, when needed again, they have to be read from
> > > disk again. When the combined working set of all running applications
> > > exceeds the amount of physical RAM, we get extreme slowdown as
> > > reading a page from disk can take time on the order of milliseconds.
> > >
> > > Memory compression increases the effective memory size and allows
> > > more pages to stay in RAM. Since de/compressing memory pages is
> > > several orders of magnitude faster than disk I/O, this can provide
> > > significant performance gains for many workloads. Also, with
> > > multi-cores becoming common, the benefits of reduced disk I/O should
> > > easily outweigh the problem of increased CPU usage.
> > >
> > > It is implemented as a "backend" for cleancache_ops [1] which
> > > provides callbacks for events such as when a page is to be removed
> > > from the page cache and when it is required again. We use them to
> > > implement a 'second chance' cache for these evicted page cache pages
> > > by compressing and storing them in memory itself.
> > >
> > > We only keep pages that compress to PAGE_SIZE/2 or less. Compressed
> > > chunks are stored using the xvmalloc memory allocator, which is
> > > already being used by the zram driver for the same purpose.
> > > Zero-filled pages are checked and no memory is allocated for them.
> > >
> > > A separate "pool" is created for each mount instance of a
> > > cleancache-aware filesystem. Each incoming page is identified with
> > > <pool_id, inode_no, index>, where inode_no identifies the file within
> > > the filesystem corresponding to pool_id and index is the offset of
> > > the page within this inode. Within a pool, inodes are maintained in
> > > an rb-tree and each of its nodes points to a separate radix-tree
> > > which maintains the list of pages within that inode.
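A minimal, kernel-style sketch of the lookup structure just described (an
rb-tree of inodes per pool, each holding a radix-tree of cached pages) may
make the layout easier to picture. All names below are hypothetical; this is
an illustration, not the code from drivers/staging/zram/zcache_drv.c:

/* Illustrative sketch only -- hypothetical names, not zcache_drv.c. */
#include <linux/rbtree.h>
#include <linux/radix-tree.h>

struct zc_inode {                     /* one node per inode cached in a pool */
        unsigned long inode_no;       /* key in the pool's rb-tree */
        struct rb_node rb_node;
        struct radix_tree_root pages; /* page index -> compressed chunk */
};

struct zc_pool {                      /* one pool per cleancache-aware mount */
        struct rb_root inodes;
};

/* Find the node for @inode_no, or NULL if nothing is cached for that file. */
static struct zc_inode *zc_find_inode(struct zc_pool *pool,
                                      unsigned long inode_no)
{
        struct rb_node *n = pool->inodes.rb_node;

        while (n) {
                struct zc_inode *zi = rb_entry(n, struct zc_inode, rb_node);

                if (inode_no < zi->inode_no)
                        n = n->rb_left;
                else if (inode_no > zi->inode_no)
                        n = n->rb_right;
                else
                        return zi;
        }
        return NULL;
}

/* Look up the compressed copy of page @index of file @inode_no, if any. */
static void *zc_find_page(struct zc_pool *pool, unsigned long inode_no,
                          unsigned long index)
{
        struct zc_inode *zi = zc_find_inode(pool, inode_no);

        return zi ? radix_tree_lookup(&zi->pages, index) : NULL;
}

On a cleancache "put" the evicted page would be compressed and the resulting
chunk stored at its index in the inode's radix-tree; on a "get" a lookup like
the one above would find the chunk to decompress back into the page cache.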
> > > While compression reduces disk I/O, it also reduces the space
> > > available for normal (uncompressed) page cache. This can result in
> > > more frequent page cache reclaim and thus higher CPU overhead. It is
> > > therefore important to maintain a good hit rate for the compressed
> > > cache, or the increased CPU overhead can nullify any other benefits.
> > > This requires adaptive (compressed) cache resizing and page
> > > replacement policies that can maintain an optimal cache size and
> > > quickly reclaim unused compressed chunks. This work is yet to be
> > > done. However, in the current state, it allows manually resizing the
> > > cache using the (per-pool) sysfs node 'memlimit' which in turn frees
> > > any excess pages *sigh* randomly.
> > >
> > > Finally, it uses percpu stats and compression buffers to allow better
> > > performance on multi-cores. Still, there are known bottlenecks like a
> > > single xvmalloc mempool per zcache pool and a few others. I will work
> > > on this when I start with profiling.
> > >
> > > * Performance numbers:
> > >  - Tested using the iozone filesystem benchmark
> > >  - 4 CPUs, 1G RAM
> > >  - Read performance gain: ~2.5X
> > >  - Random read performance gain: ~3X
> > >  - In general, performance gains for every kind of I/O
> > >
> > > Test details with graphs can be found here:
> > > http://code.google.com/p/compcache/wiki/zcacheIOzone
> > >
> > > If I can get some help with testing, it would be interesting to find
> > > its effect in more real-life workloads. In particular, I'm interested
> > > in finding out its effect in the KVM virtualization case, where it
> > > can potentially allow running a larger number of VMs per host for a
> > > given amount of RAM. With zcache enabled, VMs can be assigned a much
> > > smaller amount of memory since the host can now hold the bulk of the
> > > page-cache pages, allowing the VMs to maintain a similar level of
> > > performance while a greater number of them can be hosted.
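As an aside on the 'memlimit' node mentioned a few paragraphs up: the
following stand-alone userspace toy (hypothetical, not the patch's code)
models the current shrink behaviour, where lowering the limit simply frees
cached compressed chunks in no particular order until usage fits again:

/* Toy userspace model of "shrink to memlimit"; not kernel code. */
#include <stdio.h>
#include <stdlib.h>

struct chunk {
        struct chunk *next;
        size_t size;            /* compressed size of one cached page */
};

struct pool {
        struct chunk *chunks;   /* unordered list of cached chunks */
        size_t used;            /* total bytes currently cached */
        size_t memlimit;        /* what the 'memlimit' sysfs node would set */
};

/* Free arbitrary chunks (here: the list head) until the pool fits memlimit. */
static void shrink_to_memlimit(struct pool *p)
{
        while (p->used > p->memlimit && p->chunks) {
                struct chunk *victim = p->chunks;

                p->chunks = victim->next;
                p->used -= victim->size;
                free(victim);
        }
}

int main(void)
{
        struct pool p = { NULL, 0, 4096 };
        int i;

        /* Cache eight fake "compressed pages" of 1 KB each. */
        for (i = 0; i < 8; i++) {
                struct chunk *c = malloc(sizeof(*c));

                c->size = 1024;
                c->next = p.chunks;
                p.chunks = c;
                p.used += c->size;
        }

        printf("before shrink: %zu bytes cached\n", p.used);
        shrink_to_memlimit(&p);         /* drops chunks until <= memlimit */
        printf("after shrink:  %zu bytes cached\n", p.used);
        return 0;
}

The real driver would of course have to drop xvmalloc-backed chunks under the
pool's locks; the point is only that, until real replacement policies exist,
the evicted chunks are effectively chosen arbitrarily.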
> > > * How to test:
> > > All patches are against 2.6.35-rc5:
> > >
> > > - First, apply all prerequisite patches here:
> > > http://compcache.googlecode.com/hg/sub-projects/zcache_base_patches
> > >
> > > - Then apply this patch series; also uploaded here:
> > > http://compcache.googlecode.com/hg/sub-projects/zcache_patches
> > >
> > >
> > > Nitin Gupta (8):
> > >   Allow sharing xvmalloc for zram and zcache
> > >   Basic zcache functionality
> > >   Create sysfs nodes and export basic statistics
> > >   Shrink zcache based on memlimit
> > >   Eliminate zero-filled pages
> > >   Compress pages using LZO
> > >   Use xvmalloc to store compressed chunks
> > >   Document sysfs entries
> > >
> > >  Documentation/ABI/testing/sysfs-kernel-mm-zcache |   53 +
> > >  drivers/staging/Makefile                         |    2 +
> > >  drivers/staging/zram/Kconfig                     |   22 +
> > >  drivers/staging/zram/Makefile                    |    5 +-
> > >  drivers/staging/zram/xvmalloc.c                  |    8 +
> > >  drivers/staging/zram/zcache_drv.c                | 1312 ++++++++++++++++++++++
> > >  drivers/staging/zram/zcache_drv.h                |   90 ++
> > >  7 files changed, 1491 insertions(+), 1 deletions(-)
> > >  create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-zcache
> > >  create mode 100644 drivers/staging/zram/zcache_drv.c
> > >  create mode 100644 drivers/staging/zram/zcache_drv.h
> >
> > Testing those patches on top of the Linus tree at commit
> > d0c6f6258478e1dba532bf7c28e2cd6e1047d3a4, the OOM killer was triggered
> > even though there still appeared to be plenty of swap.
> >
> > # free -m
> >              total       used       free     shared    buffers     cached
> > Mem:           852        379        473          0          3         15
> > -/+ buffers/cache:        359        492
> > Swap:         2015         14       2001
> >
> > # ./usemem 1024
> > 0: Mallocing 32 megabytes
> > 1: Mallocing 32 megabytes
> > 2: Mallocing 32 megabytes
> > 3: Mallocing 32 megabytes
> > 4: Mallocing 32 megabytes
> > 5: Mallocing 32 megabytes
> > 6: Mallocing 32 megabytes
> > 7: Mallocing 32 megabytes
> > 8: Mallocing 32 megabytes
> > 9: Mallocing 32 megabytes
> > 10: Mallocing 32 megabytes
> > 11: Mallocing 32 megabytes
> > 12: Mallocing 32 megabytes
> > 13: Mallocing 32 megabytes
> > 14: Mallocing 32 megabytes
> > 15: Mallocing 32 megabytes
> > Connection to 192.168.122.193 closed.
> >
> > usemem invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0
> > usemem cpuset=/ mems_allowed=0
> > Pid: 1829, comm: usemem Not tainted 2.6.35-rc5+ #5
> > Call Trace:
> >  [] ? _raw_spin_unlock+0x2b/0x40
> >  [] dump_header+0x70/0x190
> >  [] oom_kill_process+0x81/0x180
> >  [] __out_of_memory+0x58/0xd0
> >  [] ? out_of_memory+0x15c/0x1f0
> >  [] out_of_memory+0x10f/0x1f0
> >  [] __alloc_pages_nodemask+0x7af/0x7c0
> >  [] alloc_page_vma+0x89/0x140
> >  [] handle_mm_fault+0x6d6/0x990
> >  [] ? _raw_spin_unlock+0x2b/0x40
> >  [] ? follow_page+0x19d/0x350
> >  [] __get_user_pages+0x16c/0x480
> >  [] ? sched_clock+0x9/0x10
> >  [] __mlock_vma_pages_range+0xef/0x1f0
> >  [] mlock_vma_pages_range+0x91/0xa0
> >  [] mmap_region+0x307/0x5b0
> >  [] do_mmap_pgoff+0x354/0x3a0
> >  [] ? sys_mmap_pgoff+0x5c/0x200
> >  [] sys_mmap_pgoff+0x7a/0x200
> >  [] ? trace_hardirqs_on_thunk+0x3a/0x3f
> >  [] sys_mmap+0x29/0x30
> >  [] system_call_fastpath+0x16/0x1b
> > Mem-Info:
> > Node 0 DMA per-cpu:
> > CPU 0: hi: 0, btch: 1 usd: 0
> > CPU 1: hi: 0, btch: 1 usd: 0
> > Node 0 DMA32 per-cpu:
> > CPU 0: hi: 186, btch: 31 usd: 140
> > CPU 1: hi: 186, btch: 31 usd: 47
> > active_anon:128 inactive_anon:140 isolated_anon:0
> >  active_file:0 inactive_file:9 isolated_file:0
> >  unevictable:126855 dirty:0 writeback:125 unstable:0
> >  free:1996 slab_reclaimable:4445 slab_unreclaimable:23646
> >  mapped:923 shmem:7 pagetables:778 bounce:0
> > Node 0 DMA free:4032kB min:60kB low:72kB high:88kB active_anon:0kB
> >  inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:11896kB
> >  isolated(anon):0kB isolated(file):0kB present:15756kB mlocked:11896kB
> >  dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB
> >  slab_unreclaimable:0kB kernel_stack:0kB pagetables:24kB unstable:0kB
> >  bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
> > lowmem_reserve[]: 0 994 994 994
> > Node 0 DMA32 free:3952kB min:4000kB low:5000kB high:6000kB
> >  active_anon:512kB inactive_anon:560kB active_file:0kB
> >  inactive_file:36kB unevictable:495524kB isolated(anon):0kB
> >  isolated(file):0kB present:1018060kB mlocked:495524kB dirty:0kB
> >  writeback:500kB mapped:3692kB shmem:28kB slab_reclaimable:17780kB
> >  slab_unreclaimable:94584kB kernel_stack:1296kB pagetables:3088kB
> >  unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1726
> >  all_unreclaimable? yes
> > lowmem_reserve[]: 0 0 0 0
> > Node 0 DMA: 0*4kB 2*8kB 1*16kB 1*32kB 2*64kB 2*128kB 2*256kB 0*512kB
> >  1*1024kB 1*2048kB 0*4096kB = 4032kB
> > Node 0 DMA32: 476*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB
> >  0*512kB 0*1024kB 1*2048kB 0*4096kB = 3952kB
> > 1146 total pagecache pages
> > 215 pages in swap cache
> > Swap cache stats: add 19633, delete 19418, find 941/1333
> > Free swap  = 2051080kB
> > Total swap = 2064380kB
> > 262138 pages RAM
> > 43914 pages reserved
> > 4832 pages shared
> > 155665 pages non-shared
> > Out of memory: kill process 1727 (console-kit-dae) score 1027939 or a child
> > Killed process 1727 (console-kit-dae) vsz:4111756kB, anon-rss:0kB, file-rss:600kB
> > console-kit-dae invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0
> > console-kit-dae cpuset=/ mems_allowed=0
> > Pid: 1752, comm: console-kit-dae Not tainted 2.6.35-rc5+ #5
> > Call Trace:
> >  [] ? _raw_spin_unlock+0x2b/0x40
> >  [] dump_header+0x70/0x190
> >  [] oom_kill_process+0x81/0x180
> >  [] __out_of_memory+0x58/0xd0
> >  [] ? out_of_memory+0x15c/0x1f0
> >  [] out_of_memory+0x10f/0x1f0
> >  [] __alloc_pages_nodemask+0x7af/0x7c0
> >  [] kmem_getpages+0x6e/0x180
> >  [] fallback_alloc+0x1c9/0x2b0
> >  [] ? cache_grow+0x4b2/0x520
> >  [] ____cache_alloc_node+0xab/0x200
> >  [] ? taskstats_exit+0x305/0x3b0
> >  [] kmem_cache_alloc+0x1fb/0x290
> >  [] taskstats_exit+0x305/0x3b0
> >  [] do_exit+0x12b/0x890
> >  [] ? trace_hardirqs_off+0xd/0x10
> >  [] ? cpu_clock+0x6f/0x80
> >  [] ? lock_release_holdtime+0x3d/0x190
> >  [] ? _raw_spin_unlock_irq+0x30/0x40
> >  [] do_group_exit+0x5e/0xd0
> >  [] get_signal_to_deliver+0x2d4/0x490
> >  [] ? inode_has_perm+0x7d/0xf0
> >  [] do_signal+0x75/0x7b0
> >  [] ? vfs_ioctl+0x3d/0xf0
> >  [] ? do_vfs_ioctl+0x84/0x570
> >  [] do_notify_resume+0x65/0x80
> >  [] ? trace_hardirqs_on_thunk+0x3a/0x3f
> >  [] int_signal+0x12/0x17
> > Mem-Info:
> > Node 0 DMA per-cpu:
> > CPU 0: hi: 0, btch: 1 usd: 0
> > CPU 1: hi: 0, btch: 1 usd: 0
> > Node 0 DMA32 per-cpu:
> > CPU 0: hi: 186, btch: 31 usd: 151
> > CPU 1: hi: 186, btch: 31 usd: 61
> > active_anon:128 inactive_anon:165 isolated_anon:0
> >  active_file:0 inactive_file:9 isolated_file:0
> >  unevictable:126855 dirty:0 writeback:25 unstable:0
> >  free:1965 slab_reclaimable:4445 slab_unreclaimable:23646
> >  mapped:923 shmem:7 pagetables:778 bounce:0
> > Node 0 DMA free:4032kB min:60kB low:72kB high:88kB active_anon:0kB
> >  inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:11896kB
> >  isolated(anon):0kB isolated(file):0kB present:15756kB mlocked:11896kB
> >  dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB
> >  slab_unreclaimable:0kB kernel_stack:0kB pagetables:24kB unstable:0kB
> >  bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
> > lowmem_reserve[]: 0 994 994 994
> > Node 0 DMA32 free:3828kB min:4000kB low:5000kB high:6000kB
> >  active_anon:512kB inactive_anon:660kB active_file:0kB
> >  inactive_file:36kB unevictable:495524kB isolated(anon):0kB
> >  isolated(file):0kB present:1018060kB mlocked:495524kB dirty:0kB
> >  writeback:100kB mapped:3692kB shmem:28kB slab_reclaimable:17780kB
> >  slab_unreclaimable:94584kB kernel_stack:1296kB pagetables:3088kB
> >  unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1726
> >  all_unreclaimable? yes
> > lowmem_reserve[]: 0 0 0 0
> > Node 0 DMA: 0*4kB 2*8kB 1*16kB 1*32kB 2*64kB 2*128kB 2*256kB 0*512kB
> >  1*1024kB 1*2048kB 0*4096kB = 4032kB
> > Node 0 DMA32: 445*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB
> >  0*512kB 0*1024kB 1*2048kB 0*4096kB = 3828kB
> > 1146 total pagecache pages
> > 230 pages in swap cache
> > Swap cache stats: add 19649, delete 19419, find 942/1336
> > Free swap  = 2051084kB
> > Total swap = 2064380kB
> > 262138 pages RAM
> > 43914 pages reserved
> > 4818 pages shared
> > 155685 pages non-shared
> > Out of memory: kill process 1806 (sshd) score 9474 or a child
> > Killed process 1810 (bash) vsz:108384kB, anon-rss:0kB, file-rss:656kB
> > console-kit-dae invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0
> > console-kit-dae cpuset=/ mems_allowed=0
> > Pid: 1752, comm: console-kit-dae Not tainted 2.6.35-rc5+ #5
> > Call Trace:
> >  [] ? _raw_spin_unlock+0x2b/0x40
> >  [] dump_header+0x70/0x190
> >  [] oom_kill_process+0x81/0x180
> >  [] __out_of_memory+0x58/0xd0
> >  [] ? out_of_memory+0x15c/0x1f0
> >  [] out_of_memory+0x10f/0x1f0
> >  [] __alloc_pages_nodemask+0x7af/0x7c0
> >  [] kmem_getpages+0x6e/0x180
> >  [] fallback_alloc+0x1c9/0x2b0
> >  [] ? cache_grow+0x4b2/0x520
> >  [] ____cache_alloc_node+0xab/0x200
> >  [] ? taskstats_exit+0x305/0x3b0
> >  [] kmem_cache_alloc+0x1fb/0x290
> >  [] taskstats_exit+0x305/0x3b0
> >  [] do_exit+0x12b/0x890
> >  [] ? trace_hardirqs_off+0xd/0x10
> >  [] ? cpu_clock+0x6f/0x80
> >  [] ? lock_release_holdtime+0x3d/0x190
> >  [] ? _raw_spin_unlock_irq+0x30/0x40
> >  [] do_group_exit+0x5e/0xd0
> >  [] get_signal_to_deliver+0x2d4/0x490
> >  [] ? inode_has_perm+0x7d/0xf0
> >  [] do_signal+0x75/0x7b0
> >  [] ? vfs_ioctl+0x3d/0xf0
> >  [] ? do_vfs_ioctl+0x84/0x570
> >  [] do_notify_resume+0x65/0x80
> >  [] ? trace_hardirqs_on_thunk+0x3a/0x3f
> >  [] int_signal+0x12/0x17
> > Mem-Info:
> > Node 0 DMA per-cpu:
> > CPU 0: hi: 0, btch: 1 usd: 0
> > CPU 1: hi: 0, btch: 1 usd: 0
> > Node 0 DMA32 per-cpu:
> > CPU 0: hi: 186, btch: 31 usd: 119
> > CPU 1: hi: 186, btch: 31 usd: 73
> > active_anon:50 inactive_anon:175 isolated_anon:0
> >  active_file:0 inactive_file:9 isolated_file:0
> >  unevictable:126855 dirty:0 writeback:25 unstable:0
> >  free:1996 slab_reclaimable:4445 slab_unreclaimable:23663
> >  mapped:923 shmem:7 pagetables:778 bounce:0
> > Node 0 DMA free:4032kB min:60kB low:72kB high:88kB active_anon:0kB
> >  inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:11896kB
> >  isolated(anon):0kB isolated(file):0kB present:15756kB mlocked:11896kB
> >  dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB
> >  slab_unreclaimable:0kB kernel_stack:0kB pagetables:24kB unstable:0kB
> >  bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
> > lowmem_reserve[]: 0 994 994 994
> > Node 0 DMA32 free:3952kB min:4000kB low:5000kB high:6000kB
> >  active_anon:200kB inactive_anon:700kB active_file:0kB
> >  inactive_file:36kB unevictable:495524kB isolated(anon):0kB
> >  isolated(file):0kB present:1018060kB mlocked:495524kB dirty:0kB
> >  writeback:100kB mapped:3692kB shmem:28kB slab_reclaimable:17780kB
> >  slab_unreclaimable:94652kB kernel_stack:1296kB pagetables:3088kB
> >  unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1536
> >  all_unreclaimable? yes
> > lowmem_reserve[]: 0 0 0 0
> > Node 0 DMA: 0*4kB 2*8kB 1*16kB 1*32kB 2*64kB 2*128kB 2*256kB 0*512kB
> >  1*1024kB 1*2048kB 0*4096kB = 4032kB
> > Node 0 DMA32: 470*4kB 3*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB
> >  0*512kB 0*1024kB 1*2048kB 0*4096kB = 3952kB
> > 1146 total pagecache pages
> > 221 pages in swap cache
> > Swap cache stats: add 19848, delete 19627, find 970/1386
> > Free swap  = 2051428kB
> > Total swap = 2064380kB
> > 262138 pages RAM
> > 43914 pages reserved
> > 4669 pages shared
> > 155659 pages non-shared
> > Out of memory: kill process 1829 (usemem) score 8253 or a child
> > Killed process 1829 (usemem) vsz:528224kB, anon-rss:502468kB, file-rss:376kB
> >
> > # cat usemem.c
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <string.h>
> > #include <sys/mman.h>
> > #define CHUNKS 32
> >
> > int
> > main(int argc, char *argv[])
> > {
> >         /* MCL_FUTURE locks every future mapping into RAM, so none of
> >            the memory allocated below is evictable -- hence the OOM
> >            reports above. */
> >         mlockall(MCL_FUTURE);
> >
> >         unsigned long mb;
> >         char *buf[CHUNKS];
> >         int i;
> >
> >         if (argc < 2) {
> >                 fprintf(stderr, "usage: usemem megabytes\n");
> >                 exit(1);
> >         }
> >         mb = strtoul(argv[1], NULL, 0);
> >
> >         for (i = 0; i < CHUNKS; i++) {
> >                 fprintf(stderr, "%d: Mallocing %lu megabytes\n", i, mb/CHUNKS);
> >                 buf[i] = (char *)malloc(mb/CHUNKS * 1024L * 1024L);
> >                 if (!buf[i]) {
> >                         fprintf(stderr, "malloc failure\n");
> >                         exit(1);
> >                 }
> >         }
> >
> >         for (i = 0; i < CHUNKS; i++) {
> >                 fprintf(stderr, "%d: Zeroing %lu megabytes at %p\n",
> >                         i, mb/CHUNKS, buf[i]);
> >                 memset(buf[i], 0, mb/CHUNKS * 1024L * 1024L);
> >         }
> >
> >         exit(0);
> > }
>
> In case it is relevant, this was tested inside a KVM guest. The host was
> RHEL6 with THP enabled.
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: email@kvack.org
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/