From: Ric Wheeler
Subject: batching support for transactions
Date: Tue, 02 Oct 2007 08:57:53 -0400
Message-ID: <47024051.2030303@emc.com>
Reply-To: ric@emc.com
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: "Feld, Andy", Jens Axboe
To: linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org, reiserfs-devel@vger.kernel.org
Return-path:
Sender: linux-fsdevel-owner@vger.kernel.org
List-Id: linux-ext4.vger.kernel.org

After several years of helping tune file systems for normal (ATA/S-ATA) drives, we have been doing some performance work on ext3 & reiserfs on disk arrays. One thing that jumps out is that the way we currently batch synchronous workloads into transactions does really horrible things to performance on storage devices which have very low latency.

For example, on a mid-range CLARiiON box, a single thread can write around 750 (10240 byte) files/sec to a single directory in ext3. That works out to an average of about 1.3ms per file. With 2 threads writing to the same directory, we instantly drop down to 234 files/sec.

The culprit seems to be the assumptions in journal_stop(), which throws in a call to schedule_timeout_uninterruptible(1):

	/*
	 * Implement synchronous transaction batching.  If the handle
	 * was synchronous, don't force a commit immediately.  Let's
	 * yield and let another thread piggyback onto this transaction.
	 * Keep doing that while new threads continue to arrive.
	 * It doesn't cost much - we're about to run a commit and sleep
	 * on IO anyway.  Speeds up many-threaded, many-dir operations
	 * by 30x or more...
	 *
	 * But don't do this if this process was the most recent one to
	 * perform a synchronous write.  We do this to detect the case where a
	 * single process is doing a stream of sync writes.  No point in waiting
	 * for joiners in that case.
	 */
	pid = current->pid;
	if (handle->h_sync && journal->j_last_sync_writer != pid) {
		journal->j_last_sync_writer = pid;
		do {
			old_handle_count = transaction->t_handle_count;
			schedule_timeout_uninterruptible(1);
		} while (old_handle_count != transaction->t_handle_count);
	}

reiserfs and ext4 have similar, if not exactly the same, logic.

What seems to be needed here is either a static per-file-system/per-device tunable that would let us change this timeout (maybe with "0" falling back to the old reiserfs trick of simply doing a yield()?), or a more dynamic, per-device way to keep track of the average time it takes to commit a transaction to disk. Based on that rate, we could adjust the batching logic to account for lower latency devices.

A couple of last thoughts. First, if for some reason you don't have a low latency storage array handy and want to test this for yourselves, you can test the worst case by using a ram disk. Second, the test we used was fs_mark with 10240 byte files, writing to one shared directory while varying the number of threads from 1 up to 40. In the ext3 case, it takes 8 concurrent threads to catch up to the single thread writing case.

We are continuing to play with the code and try out some ideas, but I wanted to bounce this off the broader list to see if this makes sense...

ric
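
To make the dynamic idea a bit more concrete, here is a rough, untested sketch of how journal_stop() could decide whether the 1-jiffy sleep is worth taking. It assumes a hypothetical j_avg_commit_time field (in jiffies) added to journal_t and kept up to date by the commit code; neither that field nor the threshold below exists in the current tree, so treat this purely as illustration rather than a proposed patch:

	/*
	 * Sketch only.  Assumes journal_commit_transaction() maintains a
	 * hypothetical journal->j_avg_commit_time as a running average of
	 * how many jiffies recent commits have taken on this device.
	 */
	pid = current->pid;
	if (handle->h_sync && journal->j_last_sync_writer != pid) {
		journal->j_last_sync_writer = pid;
		if (journal->j_avg_commit_time < 2) {
			/*
			 * Commits finish in well under a jiffy on this
			 * device (low latency array, ram disk), so a
			 * 1-jiffy sleep costs more than it saves; just
			 * give other threads a brief chance to join.
			 */
			yield();
		} else {
			/* Slow device: wait for joiners as we do today. */
			do {
				old_handle_count = transaction->t_handle_count;
				schedule_timeout_uninterruptible(1);
			} while (old_handle_count != transaction->t_handle_count);
		}
	}

A static per-mount tunable could drive the same branch instead, with a value of 0 meaning "skip the sleep and just yield()".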