From: Minchan Kim
To: Greg Kroah-Hartman, Andrew Morton, Jens Axboe
Cc: Seth Jennings, Nitin Gupta, Konrad Rzeszutek Wilk, Luigi Semenzato, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Pekka Enberg, Mel Gorman, lliubbo@gmail.com, Minchan Kim
Subject: [PATCH v7 0/5] zram/zsmalloc promotion
Date: Wed, 21 Aug 2013 15:16:26 +0900
Message-Id: <1377065791-2959-1-git-send-email-minchan@kernel.org>

This is the 7th attempt at zram/zsmalloc promotion. I have rewritten the
cover letter entirely based on the previous discussion.

The main obstacle to zram promotion was the lack of review of the zsmalloc
part, while Jens, the block maintainer, had already acked the zram part.
At that time, zsmalloc was used by zram, zcache and zswap, so everybody
wanted to make it general, and Mel finally reviewed it when zswap was
submitted for mainline merge a few months ago. Most of the review
concerned the zswap writeback mechanism, which can page out compressed
in-memory pages to real swap storage at runtime, and the conclusion was
that zsmalloc is not a good fit for zswap writeback, so zswap borrowed the
zbud allocator from zcache to replace zsmalloc. zbud has a worse memory
compression ratio (capped at 2:1, since a zpage holds at most two
compressed pages), but that also makes its behavior very predictable.
The other review comments were minor.
http://lkml.indiana.edu/hypermail/linux/kernel/1304.1/04334.html

Zcache no longer uses zsmalloc either, so zram is now zsmalloc's only
user, and this patchset moves zsmalloc into the zram directory. Recently,
Bob tried to move zsmalloc under the mm directory to unify zram and zswap
by adding a pseudo block device to zswap (which looks very weird to me),
but he simply ignored zram's block device feature (a.k.a. zram-blk) and
considered only the swap usecase of zram; that, in turn, loses zram's
good concept.

Mel raised another issue in v6: "maintenance headache". He argued that
zswap and zram share a similar goal, compressing swap pages, so if we
promote zram, a maintenance headache will eventually arise from the
diverging implementations of zswap and zram; therefore he wants to unify
them. To that end, he wants zswap to implement a pseudo block device, as
Bob did, to emulate zram, so that zswap can gain zram's benefits as well
as its writeback advantage.

But I wonder whether frontswap-based zswap writeback is really a good
approach from a writeback point of view. I think the problem is not
specific to zswap: if we want to configure a multi-level swap hierarchy
across devices of varying speed, such as RAM, NVRAM, SSD, eMMC or NAS,
it is a general problem, so we should look for a more general approach.
At a glance, I can see two approaches. First, the VM could be made aware
of heterogeneous swap configurations, so that it can build a cache
hierarchy among swap devices. That may need an indirection layer on swap,
which has already been discussed, so the VM can migrate a block from A to
B easily. It could support various configurations with VM hints, maybe,
in the future.

http://lkml.indiana.edu/hypermail/linux/kernel/1203.3/03812.html

Second, as a more practical solution, we could use device mapper, e.g.
dm-cache (https://lwn.net/Articles/540996/), which is very flexible.
It already supports various configurations and cache policies (block
size, writeback/writethrough, LRU, MFU, although only MQ is merged now),
so it would be a good fit for our purpose. It can even make zram support
writeback. I tested the following scenarios in a KVM guest with 4 CPUs
and 1G DRAM, with a background 800M memory hogger that allocates random
data up to 800M.

1) zram swap disk 1G, untar kernel.tgz to tmpfs, build -j 4
   Fails to untar due to a shortage of memory space caused by tmpfs's
   default size limit.

2) zram swap disk 1G, untar kernel.tgz to ext2 on zram-blk, build -j 4
   OOM happens while building the kernel, but the untar onto ext2 on
   zram-blk succeeds. The OOM happened because zram could not find free
   pages in main memory to store swapped-out pages, even though plenty of
   empty swap space remained.

3) dm-cache swap disk 1G, untar kernel.tgz to ext2 on zram-blk, build -j 4
   The dm-cache consists of zram-meta 10M, zram-cache 1G and real swap
   storage 1G. No OOM happens and the build completes successfully.

The tests above prove that zram can support writeback to real swap
storage, so that zram-cache always has free space. If necessary, we could
add a new plugin to dm-cache. I find it a really flexible and well-layered
architecture, so zram-blk's concept is good for us, and it has a lot of
potential to be enhanced by MM/FS/block developers.

Another disadvantage of zswap writeback is that frontswap's API is
synchronous, so zswap must decompress the in-memory zpage right before
writeback, and it even writes pages one by one rather than in batches.
We could improve that by extending the frontswap API, but I believe we
can do better in the device mapper layer, which is aware of block
alignment, bandwidth, mapping tables, asynchronous I/O and lots of other
hints from the block layer.
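For reference, test scenario 3 above could be set up along the following
lines. This is only a hedged sketch: the device names (/dev/sdb1 as the
real 1G swap partition), the dm target name "zswapcache", the cache block
size and the use of two zram devices for metadata and cache are my
illustrative assumptions, not commands taken from the actual test.

```shell
#!/bin/sh
# Sketch of scenario 3: zram devices as dm-cache metadata/cache in front
# of a real swap partition, so cold compressed pages can be written back.
modprobe zram num_devices=2                    # two zram devices (assumed)

echo $((10*1024*1024))   > /sys/block/zram0/disksize   # 10M metadata dev
echo $((1024*1024*1024)) > /sys/block/zram1/disksize   # 1G cache dev

ORIGIN=/dev/sdb1                               # assumed 1G real swap store
SECTORS=$(blockdev --getsz "$ORIGIN")          # origin length in sectors

# dm-cache table: <start> <len> cache <metadata dev> <cache dev>
#                 <origin dev> <block size> <#features> <features...>
#                 <policy> <#policy args>
dmsetup create zswapcache --table \
  "0 $SECTORS cache /dev/zram0 /dev/zram1 $ORIGIN 512 1 writeback default 0"

mkswap /dev/mapper/zswapcache                  # swap on the cached device
swapon /dev/mapper/zswapcache
```

With this layout, swap-out lands in the compressed zram cache first, and
dm-cache demotes blocks to the real storage when the cache fills, which
is exactly why no OOM was observed in scenario 3.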
Nonetheless, if we really have to merge zram's functionality with zswap,
I think zram should absorb zswap's functionality (though I hope that
never happens), because the older zram already has lots of real users,
unlike the young zswap, so it is handier to unify them while keeping the
changelog, which is one of the valuable things gained from a long stay in
staging.

The reason zram has not supported writeback until now is simply a lack of
need. zram's main customers were embedded people, for whom writeback to
real swap storage is bad for interactivity and for wear-leveling on
low-end flash devices. But as shown above, zram has the potential to
support writeback via other block drivers or more reasonable VM
enhancements, so I'd like to argue that zram's block-device concept is
really good.

Another zram-blk usecase is as follows. The admin can format /dev/zramX
with any filesystem and mount it. It can help small-memory systems, too.
For example, many embedded systems have no swap, so although tmpfs can
swap out, that is pointless there. Now assume a temp file that grows to
half of system memory once in a while. We don't want to write it to flash
because of wear-leveling and response-time issues, so we want to keep it
in memory. But with tmpfs, half of the working set would have to be
evicted to make room when the file reaches its peak size. In that case,
zram-blk is a good fit, too.

I'd like to enhance zram with more features, such as compaction to
prevent fragmentation, but zram developers cannot do that now because
Greg, the staging maintainer, doesn't want new features added until
promotion is done, since zram has been in staging for a very long time.
Actually, some enhancement patches have been pending for a long time.
It's time to promote zram and make further enhancements.

Patch 1 adds a new Kconfig option for zram to use the page table method
instead of copying; Andrew suggested it.
Patch 2 adds lots of comments to zsmalloc.
Patch 3 moves zsmalloc under drivers/staging/zram because zram is now
zsmalloc's only user.
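The zram-blk usecase above can be sketched as follows. The 256M disksize
and the /mnt/zram mount point are illustrative assumptions; the sysfs
disksize attribute and /dev/zramX device are what zram actually exposes.

```shell
#!/bin/sh
# Sketch: use zram-blk as a compressed, RAM-backed block device with a
# regular filesystem on top, e.g. for large short-lived temp files.
modprobe zram                                  # creates /dev/zram0

echo $((256*1024*1024)) > /sys/block/zram0/disksize   # 256M (assumed size)

mkfs.ext2 /dev/zram0          # any filesystem works; ext2 avoids a journal
mkdir -p /mnt/zram
mount /dev/zram0 /mnt/zram    # temp files now stay compressed in memory
```

Unlike tmpfs, data written here never competes with the page cache for
swap, and nothing touches flash, which is the wear-leveling point made
above.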
Patch 4 makes unmap_kernel_range an exported function: zsmalloc already
uses map_vm_area, which is exported, and needs unmap_kernel_range as well
in order to be built as a module.
Patch 5 finally moves zram from drivers/staging to drivers/block.

This series touches mm, staging and block, so I am not sure who the right
maintainer is; I am Cc'ing Andrew, Jens and Greg.

Minchan Kim (4):
  zsmalloc: add Kconfig for enabling page table method
  zsmalloc: move it under zram
  mm: export unmap_kernel_range
  zram: promote zram from staging

Nitin Gupta (1):
  zsmalloc: add more comment

 drivers/block/Kconfig                    |    2 +
 drivers/block/Makefile                   |    1 +
 drivers/block/zram/Kconfig               |   37 +
 drivers/block/zram/Makefile              |    3 +
 drivers/block/zram/zram.txt              |   71 ++
 drivers/block/zram/zram_drv.c            |  987 +++++++++++++++++++++++++++
 drivers/block/zram/zsmalloc.c            | 1084 ++++++++++++++++++++++++++++++
 drivers/staging/Kconfig                  |    4 -
 drivers/staging/Makefile                 |    2 -
 drivers/staging/zram/Kconfig             |   25 -
 drivers/staging/zram/Makefile            |    3 -
 drivers/staging/zram/zram.txt            |   77 ---
 drivers/staging/zram/zram_drv.c          |  984 ---------------------------
 drivers/staging/zram/zram_drv.h          |  125 ----
 drivers/staging/zsmalloc/Kconfig         |   10 -
 drivers/staging/zsmalloc/Makefile        |    3 -
 drivers/staging/zsmalloc/zsmalloc-main.c | 1063 -----------------------------
 drivers/staging/zsmalloc/zsmalloc.h      |   43 --
 include/linux/zram.h                     |  123 ++++
 include/linux/zsmalloc.h                 |   52 ++
 mm/vmalloc.c                             |    1 +
 21 files changed, 2361 insertions(+), 2339 deletions(-)
 create mode 100644 drivers/block/zram/Kconfig
 create mode 100644 drivers/block/zram/Makefile
 create mode 100644 drivers/block/zram/zram.txt
 create mode 100644 drivers/block/zram/zram_drv.c
 create mode 100644 drivers/block/zram/zsmalloc.c
 delete mode 100644 drivers/staging/zram/Kconfig
 delete mode 100644 drivers/staging/zram/Makefile
 delete mode 100644 drivers/staging/zram/zram.txt
 delete mode 100644 drivers/staging/zram/zram_drv.c
 delete mode 100644 drivers/staging/zram/zram_drv.h
 delete mode 100644 drivers/staging/zsmalloc/Kconfig
 delete mode 100644 drivers/staging/zsmalloc/Makefile
 delete mode 100644 drivers/staging/zsmalloc/zsmalloc-main.c
 delete mode 100644 drivers/staging/zsmalloc/zsmalloc.h
 create mode 100644 include/linux/zram.h
 create mode 100644 include/linux/zsmalloc.h

--
1.7.9.5