Date: Thu, 07 Aug 2008 16:25:12 +0900 (JST)
Message-Id: <20080807.162512.22162413.taka@valinux.co.jp>
To: kamezawa.hiroyu@jp.fujitsu.com
Cc: Balbir Singh, ryov@valinux.co.jp, xen-devel@lists.xensource.com, containers@lists.linux-foundation.org, linux-kernel@vger.kernel.org, virtualization@lists.linux-foundation.org, dm-devel@redhat.com, agk@sourceware.org
Subject: Re: [PATCH 4/7] bio-cgroup: Split the cgroup memory subsystem into two parts
From: Hirokazu Takahashi
In-Reply-To: <16255819.1218030343593.kamezawa.hiroyu@jp.fujitsu.com>
References: <20080804.175748.189722512.ryov@valinux.co.jp> <20080806165421.f76edd47.kamezawa.hiroyu@jp.fujitsu.com> <16255819.1218030343593.kamezawa.hiroyu@jp.fujitsu.com>

Hi,

> >> > This patch splits the cgroup memory subsystem into two parts.
> >> > One is for tracking pages to find out their owners. The other is
> >> > for controlling how much memory should be assigned to each cgroup.
> >> >
> >> > With this patch, you can use the page tracking mechanism even if
> >> > the memory subsystem is off.
> >> >
> >> > Based on 2.6.27-rc1-mm1
> >> > Signed-off-by: Ryo Tsuruta
> >> > Signed-off-by: Hirokazu Takahashi
> >> >
> >>
> >> Please CC me or Balbir or Pavel (see the maintainer list) when you try this ;)
> >>
> >> After this patch, the total structure is
> >>
> >>   page <-> page_cgroup <-> bio_cgroup.
> >>   (multiple bio_cgroups can be attached to a page_cgroup)
> >>
> >> Will this pointer chain add
> >>   - significant performance regression, or
> >>   - new race conditions?
> >
> > I don't think it will cause significant performance loss, because
> > the link between a page and a page_cgroup already exists; the
> > memory resource controller prepared it. Bio-cgroup uses this link
> > as it is, and does nothing more with it.
> >
> > And the link between a page_cgroup and a bio_cgroup isn't protected
> > by any additional spin-locks, since the associated bio_cgroup is
> > guaranteed to exist as long as it owns pages.
>
> Hmm, I think page_cgroup's cost is visible when
>   1. a page changes to the in-use state (fault or radix-tree insert),
>   2. a page changes to the out-of-use state (fault or radix-tree removal),
>   3. memcg hits its limit or the global LRU reclaim runs.
> "1" and "2" can be measured as a 5% loss of exec throughput.
> "3" has not been measured (because the LRU walk itself is heavy).
>
> What new occasions to access page_cgroup will you add?
> I'll have to take them into account.

I haven't added any at this moment, but I think some people may want
to move pages in the page cache from one cgroup to another. When that
time comes, I'll try to keep the cost minimal: I will probably only
update the link between a page_cgroup and a bio_cgroup and leave the
others untouched.

> > I've just noticed that most of the overhead comes from the
> > spin-locks taken when reclaiming the pages inside mem_cgroups and
> > the spin-locks that protect the links between pages and
> > page_cgroups.
> The overhead of the page <-> page_cgroup lock cannot be caught by
> lock_stat now. Do you have numbers?
> But OK, there are too many locks ;(

The problem is that every time the lock is held, the associated cache
line is flushed.

> > The latter overhead comes from the policy your team has chosen:
> > page_cgroup structures are allocated on demand. I still feel this
> > approach doesn't make sense, because the Linux kernel tries to
> > make use of as many pages as it can, so most of them end up
> > needing an associated page_cgroup. It would make us happy if
> > page_cgroups were allocated at boot time.
>
> Now, multi-sized-page-cache has been discussed for a long time. If
> that is our direction, on-demand page_cgroup allocation makes sense.

I don't think I can agree with this. When multi-sized-page-cache is
introduced, some data structures will be allocated to manage the
multi-sized pages. I think page_cgroups should be allocated at the
same time; this approach will keep things simple.

The on-demand allocation approach seems to bring not only overhead
but also complexity and a lot of race conditions. If you allocate
page_cgroups when allocating page structures, you can get rid of most
of the locks, and you no longer have to care about allocation errors
for page_cgroups. It will also give us the flexibility that memcg
related data can be referred to and updated inside critical sections.

> >> For example, adding a simple function:
> >> ==
> >> int get_page_io_id(struct page *)
> >>   - returns an I/O cgroup ID for this page. If no ID is found,
> >>     -1 is returned. The ID is not guaranteed to be a valid value
> >>     (the ID can be obsolete).
> >> ==
> >> And just store the cgroup ID in the page_cgroup at page
> >> allocation. Then make bio_cgroup independent from page_cgroup,
> >> get the ID if available, and avoid too much pointer walking.
> >
> > I don't think there are any differences between a pointer and an ID.
> > I think this ID is just an encoded version of the pointer.
>
> An ID can be obsolete; a pointer cannot. Does the memory cgroup have
> to take care of bio-cgroup's race conditions? (As for race
> conditions, it's already complicated enough.)

Bio-cgroup just expects that the callbacks it provides are called
whenever the status of a page_cgroup changes.

> To be honest, I think adding a new (4- or 8-byte) member to struct
> page and recording the bio-control information there is the more
> straightforward approach. But as you might think, "there is no room."

But only if everyone allows me to add some new members to "struct
page." I think the same thing goes for the memcg you're working on.

Thank you,
Hirokazu Takahashi.