From: Chris Mason
Subject: Re: [PATCH 1/2] fs: Do not dispatch FITRIM through separate super_operation
Date: Fri, 19 Nov 2010 14:29:03 -0500
Message-ID: <1290194167-sup-3275@think>
References: <1290100750.3041.72.camel@mulgrave.site> <1290102098.3041.77.camel@mulgrave.site> <4CE59E57.2090009@teksavvy.com> <4CE5C616.7070706@teksavvy.com> <20101119115516.GA1152@infradead.org> <20101119163828.GA8023@infradead.org>
To: Lukas Czerner
Cc: Christoph Hellwig, Greg Freemyer, Mark Lord, "Martin K. Petersen", James Bottomley, Jeff Moyer, Matthew Wilcox, Josef Bacik, tytso, linux-ext4, linux-kernel, linux-fsdevel, sandeen

Excerpts from Lukas Czerner's message of 2010-11-19 13:06:16 -0500:
> On Fri, 19 Nov 2010, Christoph Hellwig wrote:
>
> > On Fri, Nov 19, 2010 at 08:20:58AM -0800, Greg Freemyer wrote:
> > > The kernel team has been coding around some Utopian SSD TRIM
> > > implementation for at least 2 years with the basic assumption that
> > > SSDs can handle thousands of trims per second. Just mix em in with
> > > the rest of the i/o. No problem. Intel swore to us it's the right
> > > thing to do.
> >
> > Thanks Greg, good that you told us what we've been doing. I would have
> > forgotten myself if you hadn't reminded me.
> >
> > > I'm still waiting to see the first benchmark report from anywhere
> > > (SSD, Thin Provisioned SCSI) that the online approach used by mount -o
> > > discard is a win performance wise. Linux has a history of designing
> > > for reality, but for some reason when it comes to SSDs reality seems
> > > not to be a big concern.
> >
> > Both Lukas and I have done extensive benchmarks on various SSDs and
> > thinly provisioned raids. Unfortunately most of the hardware is only
> > available under NDA so we can't publish it.
> >
> > For the XFS side which I've looked at I can summarize that we do have
> > arrays that do the online discard without measurable performance
> > penalty on various workloads, and we have devices (both SSDs and arrays)
> > where the overhead is incredibly huge. I can also say that doing the
> > walk of the freespace btrees similar to the offline discard, but every
> > 30 seconds or at a similarly high interval is a sure way to completely
> > kill performance.
> >
> > Or in short we haven't found the holy grail yet.
> >
>
> Indeed we have not. But speaking of benchmarks I have just finished a
> quick run (well, not so quick:)) of my discard-kit for the btrfs
> filesystem and here are the results. Note that the tool used for this
> benchmark is postmark, hence it might not be the most realistic use
> case, but it provides a nice comparison between the ext4 (below) and
> btrfs online discard implementations (FITRIM is NOT involved).

Thanks a lot for posting these, I know it takes forever to run them. I
hesitate to trust postmark too much for comparing the ext4 trim with the
btrfs trim because we might have dramatically different lifetimes on the
blocks. So if I manage to just do fewer allocations than ext4, I'll
also do fewer trims.

I'd also be curious to see how many trims each of us did, maybe running
with blktrace could show that?

The btrfs online discard will trim all the metadata blocks as they are
freed, and in a COW filesystem this makes for a very noisy trim. We
could reduce our trim load considerably by only trimming data blocks,
and only trimming metadata when we make a big free extent.

The default btrfs options duplicate metadata, so we actually end up
doing 2 trims for every metadata block we free.

At any rate, I definitely think both the online trim and the FITRIM have
their uses.

One thing that has burnt us in the past is coding too much for the
performance of the current crop of SSDs when the next crop ends up
making our optimizations useless.

This is the main reason I think the online trim is going to be better
and better. The FS has a ton of low hanging fruit in there and the
devices are going to improve. At some point the biggest perf problem
will just be the non-queueable trim command.

One thing I haven't seen benchmarked is how trim changes the performance
of the SSD as the poor little log structured squirrels inside run out of
places to store things. Does it get rid of the cliffs in performance as
the drive ages, and how do we measure that in general?

-chris

>
>
> (Sadly the table is too wide so you have to...well, you guys can manage
> it somehow, right?).
>
> BTRFS
> -----
>
>                                   | BUFFERING ENABLED                 | BUFFERING DISABLED                |
> -----------------------------------------------------------------------------------------------------------
> Type                              |NODISCARD    DISCARD      DIFF     |NODISCARD    DISCARD      DIFF     |
> ===========================================================================================================
> Total_duration                    |230.90       336.20       45.60%   |232.00       335.00       44.40%   |
> Duration_of_transactions          |159.60       266.10       66.73%   |158.90       264.60       66.52%   |
> Transactions/s                    |313.32       188.01       -39.99%  |314.70       189.07       -39.92%  |
> Files_created/s                   |323.84       222.48       -31.30%  |322.28       223.28       -30.72%  |
> Creation_alone/s                  |778.08       796.37       2.35%    |756.66       787.68       4.10%    |
> Creation_mixed_with_transaction/s |155.16       93.11        -39.99%  |155.84       93.63        -39.92%  |
> Read/s                            |156.50       93.91        -39.99%  |157.18       94.44        -39.92%  |
> Append/s                          |156.82       94.10        -39.99%  |157.50       94.63        -39.92%  |
> Deleted/s                         |323.84       222.48       -31.30%  |322.28       223.28       -30.72%  |
> Deletion_alone/s                  |770.64       788.75       2.35%    |749.42       780.15       4.10%    |
> Deletion_mixed_with_transaction/s |158.16       94.90        -40.00%  |158.85       95.44        -39.92%  |
> Read_B/s                          |11925050.90  8192800.35   -31.30%  |11867797.20  8221997.40   -30.72%  |
> Write_B/s                         |37318466.00  25638695.00  -31.30%  |37139294.00  25730064.60  -30.72%  |
> ===========================================================================================================
>
> EXT4
> ----
>                                   | BUFFERING ENABLED                 | BUFFERING DISABLED                |
> -----------------------------------------------------------------------------------------------------------
> Type                              |NODISCARD    DISCARD      DIFF     |NODISCARD    DISCARD      DIFF     |
> ===========================================================================================================
> Total_duration                    |306.10       512.70       67.49%   |301.60       516.10       71.12%   |
> Duration_of_transactions          |243.50       449.80       84.72%   |239.00       453.90       89.92%   |
> Transactions/s                    |205.43       111.19       -45.87%  |209.32       110.17       -47.37%  |
> Files_created/s                   |244.30       145.85       -40.30%  |247.97       144.87       -41.58%  |
> Creation_alone/s                  |834.88       830.60       -0.51%   |830.60       833.42       0.34%    |
> Creation_mixed_with_transaction/s |101.73       55.06        -45.88%  |103.66       54.55        -47.38%  |
> Read/s                            |102.61       55.54        -45.87%  |104.55       55.03        -47.36%  |
> Append/s                          |102.82       55.65        -45.88%  |104.76       55.14        -47.37%  |
> Deleted/s                         |244.30       145.85       -40.30%  |247.97       144.87       -41.58%  |
> Deletion_alone/s                  |826.90       822.66       -0.51%   |822.66       825.46       0.34%    |
> Deletion_mixed_with_transaction/s |103.70       56.13        -45.87%  |105.66       55.61        -47.37%  |
> Read_B/s                          |8996110.60   5370694.40   -40.30%  |9131349.20   5334560.40   -41.58%  |
> Write_B/s                         |28152588.40  16807146.60  -40.30%  |28575806.40  16694068.00  -41.58%  |
> ===========================================================================================================
>
>
> (Buffering means that C library functions like fopen, fread, fwrite are
> used instead of open, read, write. I have used the word buffering in the
> same way as it is used in the postmark test)
>
> So, you can see that Btrfs handles online discard quite a bit better
> than ext4 (about 20% difference), but it is still a pretty massive
> performance loss on a not-so-good-but-I-have-seen-worse SSD. So, I would
> say that you guys (Josef?) should at least consider the possibility of
> using FITRIM as well.
>
> Thanks!
>
> -Lukas
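
The DIFF columns in the quoted tables look like the relative change from
the NODISCARD run to the DISCARD run. The sketch below recomputes a few
of them under that assumption, using values copied from the
buffering-enabled columns. The function name relative_change is only
illustrative and is not part of postmark or the discard-kit.

```python
def relative_change(nodiscard, discard):
    """Percentage change going from the NODISCARD value to the DISCARD value."""
    return (discard - nodiscard) / nodiscard * 100.0


# (NODISCARD, DISCARD) pairs from the buffering-enabled columns.
rows = {
    "btrfs Total_duration": (230.90, 336.20),
    "ext4 Total_duration": (306.10, 512.70),
    "btrfs Transactions/s": (313.32, 188.01),
    "ext4 Transactions/s": (205.43, 111.19),
}

diffs = {name: round(relative_change(a, b), 2) for name, (a, b) in rows.items()}

for name, diff in diffs.items():
    print(f"{name:<22} {diff:+.2f}%")

# Gap between the two slowdowns in total run time, in percentage points.
duration_gap = diffs["ext4 Total_duration"] - diffs["btrfs Total_duration"]
print(f"ext4 minus btrfs slowdown: {duration_gap:.2f} points")
```

The recomputed values agree with the posted DIFF entries for these rows,
and the gap in total-duration slowdown is one possible reading of the
"about 20% difference" mentioned in the quoted text.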