From: Theodore Tso
Subject: Re: No one seems to be using AOP_WRITEPAGE_ACTIVATE?
Date: Mon, 26 Apr 2010 10:50:45 -0400
Message-ID: <7B2E5B6F-3C25-4EF5-AC2F-AE62E9C643C2@mit.edu>
In-Reply-To: <20100426094837.2E5E.A69D9226@jp.fujitsu.com>
To: KOSAKI Motohiro
Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org, linux-btrfs@vger.kernel.org, Hugh Dickins

On Apr 26, 2010, at 6:18 AM, KOSAK

> AOP_WRITEPAGE_ACTIVATE was introduced for ramdisk and tmpfs thing
> (and later rd choosed to use another way).
> Then, It assume writepage refusing aren't happen on majority pages.
> IOW, the VM assume other many pages can writeout although the page can't.
> Then, the VM only make page activation if AOP_WRITEPAGE_ACTIVATE is returned.
> but now ext4 and btrfs refuse all writepage(). (right?)

No, not exactly. Btrfs refuses the writepage() in the direct reclaim cases (i.e., if PF_MEMALLOC is set), but will do writepage() in the case of zone scanning. I don't want to speak for Chris, but I assume it's due to stack depth concerns --- if it was just due to worrying about fs recursion issues, I assume all of the btrfs allocations could be done GFP_NOFS.

Ext4 is slightly different; it refuses writepage() if the inode blocks for the page haven't yet been allocated. (Regardless of whether it's happening for direct reclaim or zone scanning.) However, if the on-disk block has been assigned (i.e., this isn't a delalloc case), ext4 will honor the writepage(). (i.e., if this is an mmap of an already existing file, or if the space has been pre-allocated using fallocate()).
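
For illustration only, the two refusal policies described above can be modelled in userspace C. This is a sketch under stated assumptions, not kernel code: struct model_page, enum wp_result, and both model_*_writepage() functions are hypothetical names, and the direct_reclaim flag merely stands in for the PF_MEMALLOC check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct model_page {
    bool blocks_allocated;  /* false means the page is still delalloc */
};

enum wp_result { WP_WRITTEN, WP_REFUSED };

/*
 * Btrfs as described above: refuse only in direct reclaim
 * (modelled here by the direct_reclaim flag), write otherwise.
 */
enum wp_result model_btrfs_writepage(const struct model_page *page,
                                     bool direct_reclaim)
{
    assert(page != NULL);
    return direct_reclaim ? WP_REFUSED : WP_WRITTEN;
}

/*
 * ext4 as described above: refuse whenever the on-disk blocks have not
 * been allocated yet, regardless of who is calling.
 */
enum wp_result model_ext4_writepage(const struct model_page *page,
                                    bool direct_reclaim)
{
    assert(page != NULL);
    (void)direct_reclaim;  /* the calling context does not matter here */
    return page->blocks_allocated ? WP_WRITTEN : WP_REFUSED;
}
```

For a delalloc page, model_ext4_writepage() refuses in both contexts, while model_btrfs_writepage() refuses only when direct_reclaim is true.
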
The reason for ext4's concern is lock ordering, although I'm investigating whether I can fix this. If we call set_page_writeback() to set PG_writeback (plus set the various bits of magic fs accounting), and then drop the page lock, does that protect us from random changes happening to the page (i.e., from vmtruncate, etc.)?

>
> IOW, I don't think such documentation suppose delayed allocation issue ;)
>
> The point is, Our dirty page accounting only account per-system-memory
> dirty ratio and per-task dirty pages. but It doesn't account per-numa-node
> nor per-zone dirty ratio. and then, to refuse write page and fake numa
> abusing can make confusing our vm easily. if _all_ pages in our VM LRU
> list (it's per-zone), page activation doesn't help. It also lead to OOM.
>
> And I'm sorry. I have to say now all vm developers fake numa is not
> production level quority yet. afaik, nobody have seriously tested our
> vm code on such environment. (linux/arch/x86/Kconfig says "This is only
> useful for debugging".)

So I'm sorry I mentioned the fake numa bit, since I think this is a bit of a red herring. That code is in production here, and we've made all sorts of changes so it can be used for more than just debugging. So please ignore it, it's our local hack, and if it breaks that's our problem. More importantly, just two weeks ago I talked to someone in the financial sector, who was testing out ext4 on an upstream kernel, and not using our hacks that force 128MB zones, and he ran into the ext4/OOM problem while using an upstream kernel. It involved Oracle pinning down 3G worth of pages, and him trying to do a huge streaming backup (which of course wasn't using fallocate or direct I/O) under ext4, and he had the same issue --- an OOM, that I'm pretty sure was caused by the fact that ext4_writepage() was refusing the writepage() and most of the pages that weren't nailed down by Oracle were delalloc.
The same test scenario using ext3 worked just fine, of course.

Under normal cases it's not a problem since statistically there should be enough other pages in the system compared to the number of pages that are subject to delalloc, such that pages can usually get pushed out until the writeback code can get around to writing out the pages. But in cases where the zones have been made artificially small, or you have a big program like Oracle pinning down a large number of pages, then of course we have problems.

I'm trying to fix things from the file system side, which means trying to understand magic flags like AOP_WRITEPAGE_ACTIVATE, which is described in Documentation/filesystems/Locking as something which MUST be used if writepage() is going to refuse a page. And then I discovered no one is actually using it. So that's why I was asking whether the Locking documentation file was out of date, or whether all of the file systems are doing it wrong.

As a related example of how file system code isn't necessarily following what is required/recommended by the Locking documentation, ext2 and ext3 are both NOT using set_page_writeback()/end_page_writeback(), but are rather keeping the page locked until after they call block_write_full_page(), because of concerns of truncate coming in and screwing things up. But now looking at Locking, it appears that set_page_writeback() is as good as lock_page() for preventing the truncate code from coming in and screwing everything up? It's not clear to me exactly what locking guarantees are provided against truncate by set_page_writeback(). And suppose we are writing out a whole cluster of pages, say 4MB worth of pages; do we need to call set_page_writeback() on every single page in the cluster before we do the I/O to make sure things don't change out from under us?
(I'm pretty sure at least some of the other filesystems that are submitting huge numbers of pages using bio instead of 4k at a time like ext2/3/4 aren't calling set_page_writeback() on all of the pages first.)

Part of the problem is that the writeback Locking semantics aren't well documented, and where they are documented, it's not clear they are up to date --- and all of the file systems that are doing delayed allocation writeback are doing things slightly differently, or in some cases very differently. (And even without delalloc, as I've pointed out, ext2/3 don't use set_page_writeback() --- if this is a MUST USE as implied by the Locking file, why didn't whoever added this requirement go in and modify common filesystems like ext2 and ext3 to use the set_page_writeback()/end_page_writeback() calls?)

I'm happy to change things in ext4; in fact I'm pretty sure ext4 probably isn't completely right here. But it's not clear what "right" actually is, and when I look to see what protects writepage() racing with vmtruncate(), it's enough to give me a headache. :-(

Hence my question: wouldn't it be simpler if we simply added more high-level locking to prevent truncate from racing against writepage/writeback?

-- Ted