Date: Thu, 31 Jul 2008 09:56:02 -0700 (PDT)
From: Linus Torvalds
To: Evgeniy Polyakov
cc: Jamie Lokier, Miklos Szeredi, jens.axboe@oracle.com,
    akpm@linux-foundation.org, nickpiggin@yahoo.com.au,
    linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
    linux-mm@kvack.org
Subject: Re: [patch v3] splice: fix race with page invalidation

On Thu, 31 Jul 2008, Evgeniy Polyakov wrote:
>
> It depends... COW can DoS the system: consider an attacker who sends a
> page, writes there, sends again and so on in lots of threads. Depending
> on link capacity eventually COW will eat the whole RAM.

Yes, COW is complex, and the complexity would be part of the cost. But
the much bigger cost is the fact that COW is simply more costly than
copying the data in the first place.

A _single_ page fault is usually much more expensive than copying a
page, especially if you can do the copy well wrt caches. For example, a
very common case is that the data you're writing is already in the CPU
caches.

In fact, even if you can avoid the fault, the cost of doing all the
locking and looking up the pages for COW is likely already bigger than
the memcpy. The memcpy() is a nice linear access pattern that both the
CPU and the memory controller can optimize, and that gets almost
perfect CPU throughput.

In contrast, doing a COW implies a lot of random walking over multiple
data structures. And _if_ it's all in cache, it's probably ok, but
you're totally screwed if you need to send an IPI to another CPU to
actually flush the TLB (to make the COW actually take effect!).

So yes, you can win by COW'ing, but it's rare, and it mainly happens in
benchmarks.

For example, I had a trial patch long long ago (I think ten years by
now) to do pipe transfers by copying pages around with COW. It was
absolutely _beautiful_ in benchmarks. I could transfer gigabytes per
second, and this was on something like a Pentium/MMX which had what,
7-10MB/s memcpy performance?

In other words, I don't dispute at all that COW schemes can perform
really really well.

HOWEVER - the COW scheme actually performed _worse_ in any real world
workload, including just compiling the kernel (we used to use -pipe to
gcc to transfer data between cc/as).

The reason? The microbenchmark worked _really_ well, because what it
did was basically this trivial loop:

	for (;;) {
		write(fd, buffer, bufsize);
	}

and do you see something unrealistic there?

Right: it never actually touched the buffer itself, so it never
triggered the COW case at all. The memory got marked read-only the
first time around, it never took a single fault after that, and the TLB
never needed to be flushed again because the page stayed read-only.

That's simply not _realistic_. It's hardly how any real load works.
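To put that in code, here is a minimal user-space sketch of the two
cases (the buffer size and the "produce new data" step are made up for
illustration, and neither loop checks the write() return value). The
first loop is the benchmark: the buffer is never dirtied after the
first send, so a COW scheme maps it read-only once and never faults
again. The second is closer to a real producer: the data is regenerated
before every send, so every iteration would eat the COW fault (and
possibly a cross-CPU TLB flush) before the page could go out again.

	#include <string.h>
	#include <unistd.h>

	#define BUFSIZE 4096

	/* The benchmark case: the same untouched buffer is sent over and
	 * over, so a COW pipe would map it read-only once and then never
	 * take another fault. */
	static void benchmark_loop(int fd, char *buffer)
	{
		for (;;)
			write(fd, buffer, BUFSIZE);
	}

	/* The realistic case: new data is produced before every send, so
	 * every write() would hit the COW fault (and possibly a cross-CPU
	 * TLB shootdown) before the page could be reused. */
	static void realistic_loop(int fd, char *buffer)
	{
		for (;;) {
			memset(buffer, 0, BUFSIZE);	/* stand-in for generating new data */
			write(fd, buffer, BUFSIZE);
		}
	}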
> > > There was a linux aio_sendfile() too. Google still knows about its
> > > numbers, graphs and so on... :)
> >
> > I vaguely remember its performance didn't seem that good.
>
> Benchmark of the 100 1MB files transfer (files are in VFS already)
> using sync sendfile() against aio_sendfile_path() shows about 10MB/s
> performance win (78 MB/s vs 66-72 MB/s over 1 Gb network, sendfile
> sending server is one-way AMD Athlon 64 3500+) for aio_sendfile_path().
>
> So, it was really better than sync sendfile :)

I suspect it wasn't any better with small files and small transfers.

Yes, some people do big files. Physicists have special things where
they get a few terabytes per second from some high-energy experiment.
The various people spying on you have special setups where they move
gigabytes of satellite map data around to visualize it.

So there are undeniably cases like that, but they are also usually so
special that they really don't even care about COW, because they sure
as hell don't care about somebody else modifying the file they're
sending at the same time. In fact the whole point is that they don't
touch the data with the CPU _at_all_, and the reason they want
zero-copy sending is that they literally want to DMA from disk buffers
to memory, and then from memory to a network interface, and they don't
want the CPU _ever_ seeing it with all the cache invalidations etc.

And _that_ is the case where you should use sendfile(). If your CPU has
actually touched the data, you should probably just use read()/write().

Of course, one of the really nice things about splice() (which was not
true of sendfile()) is that you can actually mix-and-match. You can
splice data from kernel buffers, but you can also splice data from user
VM, or you can do regular "write()" calls to fill (or read) the data in
the splice pipe.

This is useful in ways that sendfile() never was. You can write()
headers into the pipe buffer, and then splice() the file data into it,
and the user only sees a pipe (and can either read from it or splice it
or tee it or whatever); a rough sketch of that pattern is at the end of
this mail. IOW, splice very much has the UNIX model of "everything is a
pipe", taken to one (admittedly odd) extreme.

Anyway, the correct way to use splice() is either to just know the data
is "safe" (because you are _ok_ with people modifying it after the
splice op took place, or because you know nobody will), or to expect an
acknowledgement from the other side, because then you know the buffer
is done.

		Linus
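PS. To make the mix-and-match point concrete, the "write() the headers,
splice() the file" pattern might look something like the sketch below.
It's only an illustration, not code from the tree: error handling is
minimal, and send_with_header(), sock_fd, file_fd and the header string
are made-up names for whatever your application actually has. The point
is simply that the receiver sees one plain byte stream and cannot tell
which parts went through user space and which were spliced straight out
of the page cache.

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <string.h>
	#include <sys/types.h>
	#include <unistd.h>

	/* Send "header" followed by "count" bytes of "file_fd" to "sock_fd",
	 * staging both through an anonymous pipe. */
	static int send_with_header(int sock_fd, int file_fd,
				    const char *header, size_t count)
	{
		int pipefd[2];
		loff_t off = 0;
		size_t in_pipe;

		if (pipe(pipefd) < 0)
			return -1;

		/* A regular write() puts the header into the pipe buffers... */
		in_pipe = strlen(header);
		if (write(pipefd[1], header, in_pipe) < 0)
			goto err;

		while ((size_t)off < count || in_pipe > 0) {
			/* ...splice() attaches page-cache pages of the file to
			 * the same pipe, a chunk at a time, without copying
			 * them through user space... */
			if ((size_t)off < count) {
				ssize_t n = splice(file_fd, &off, pipefd[1], NULL,
						   count - off, 0);
				if (n < 0)
					goto err;
				if (n == 0)
					count = off;	/* short file: stop feeding */
				in_pipe += n;
			}
			/* ...and whatever is in the pipe goes on to the socket. */
			while (in_pipe > 0) {
				ssize_t n = splice(pipefd[0], NULL, sock_fd, NULL,
						   in_pipe, SPLICE_F_MOVE);
				if (n <= 0)
					goto err;
				in_pipe -= n;
			}
		}

		close(pipefd[0]);
		close(pipefd[1]);
		return 0;
	err:
		close(pipefd[0]);
		close(pipefd[1]);
		return -1;
	}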