From: Bruce Allen
Subject: Re: slow NFS performance extracting bits from large files
Date: Mon, 16 Dec 2002 14:10:59 -0600 (CST)
To: Paul Haas <paulh@hamjudo.com>, nfs@lists.sourceforge.net
Cc: Bruce Allen

Hi Paul,

Well, this thread is about to terminate.  You nailed it the first time.
The results are (after dirtying the local disk cache):

  On local disk:  2 min 45 s = 165 sec
  On NFS disk:    2 min 50 s = 170 sec

In the long run we may want to reformat/redistribute the data so that
we're not seek-limited like this.

Thank you very much for explaining this to me.  I was really puzzled
about what was going on.

Bruce Allen

On Mon, 16 Dec 2002, Bruce Allen wrote:

> Hi Paul,
>
> Thanks for the quick reply.
>
> > > I am working on a scientific data analysis application which needs to
> > > extract a little bit of data from around 10,000 files.  The files
> > > themselves are each around 1 MB in size.  The application does:
> > >
> > > for (i=0; i<10000; i++){
> > >     open() ith file
> > >     read() 32 bytes from beginning
> > >     lseek() to somewhere in the middle of the file
> > >     read() 620 bytes from the middle of the file
> > >     close() ith file
> > > }
> >
> > So that's 20,000 reads.  Readahead wouldn't help much with this access
> > pattern, so it looks like 20,000 reads and 20,000 seeks.
> >
> > > If I run this with the data set on a local disk, the total run time is
> > > around 1 sec -- very reasonable since I am transferring around 6 MB of
> > > data.
> >
> > Ok, 20,000 seeks per second shows an average seek time of 50
> > microseconds.  You can't do that with a spinning disk.  You've
> > discovered how well disk caching works.
>
> Thank you very much.  This is a reasonable (obvious!) explanation.  Let
> me see if I understand correctly.
>
> I just looked up the specs for the local disk that I was using for the
> comparison testing.  It averages around 600 512-byte sectors (300 kB)
> per track, with a 1 ms average track-to-track seek time.  So (assuming
> little fragmentation) a typical file is spread across 3 to 4 tracks of
> the disk.  Thus, if I seek to a data set in the middle of a file, and
> assuming that the files are "more or less" contiguous on the disk, a
> typical iteration of the loop above, starting from the first read,
> should involve:
>
>   read 32 bytes
>   seek to middle (2 tracks)      2 msec
>   read 620 bytes
>   seek to next file (2 tracks)   2 msec
>
> hence 4 msec/file x 10k files = 40 sec read time if the kernel disk
> cache buffers are all dirty.  Does the estimate above look reasonable?
>
> I'll do the experiment that you suggest, using wc to clear the cache
> first, and see how this compares to what's above.
>
> > > If I run this on an NFS-mounted disk the performance is 100 times
> > > worse.  Both client and server are stock RH 7.3.  Note that if I run
> > > it a second time, when the data is presumably cached, it takes just
> > > a few seconds to run.
> >
> > 20,000 seeks in 100 seconds, that's an average seek time of 5 ms.
> > That's possible on real hardware.
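For reference, here is roughly what the pseudo-code loop quoted above
looks like as compilable C.  The directory, the file-name pattern, and
the 460776-byte offset are only illustrative (patterned on the strace
output quoted further down); the real application computes these per
file.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char path[256];
    char header[32], sample[620];
    int  i, fd;

    for (i = 0; i < 10000; i++) {
        /* illustrative file name; the real code knows the actual names */
        snprintf(path, sizeof(path),
                 "/scratch/xavi/CAL_L1_Data/CAL_SFT.%d", 714299460 + 60 * i);
        fd = open(path, O_RDONLY);
        if (fd < 0) {
            perror(path);
            continue;
        }
        read(fd, header, sizeof(header));   /* 32 bytes from the start   */
        lseek(fd, 460776, SEEK_CUR);        /* jump into the middle      */
        read(fd, sample, sizeof(sample));   /* 620 bytes from the middle */
        close(fd);
    }
    return 0;
}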
>
> The files on the NFS server live on an ATA RAID array built from a
> 3ware 7850 controller running in hardware RAID-5, with the data striped
> across 7 disks (the 8th disk is a hot spare).  I think the stripe size
> is configured to the maximum value, but I have forgotten what that is.
> I have no idea what the equivalent "seek time" on such a system is.
>
> The client and the server are networked via gigabit ethernet, through a
> Foundry Networks switch.  Neither the switch nor the client or server
> is oversubscribed.
>
> > > Is it obvious what needs to be changed/tuned to improve this
> > > performance?  Or if not, could someone suggest a tool and a strategy
> > > for tracing down the cause of the poor performance?
> >
> > Clear the cache in the system with the local disk and measure the
> > performance.  You can clear the cache by reading files bigger than all
> > of RAM.  I tend to use "wc *" in a directory with some large tar
> > files.  wc kindly tells me how many bytes I've read, so I know if I've
> > read enough.
> >
> > With a clear cache you'll see local performance numbers that are
> > closer to the NFS numbers.
>
> OK, I'll try this.  In retrospect, it's obvious.
>
> > What's the ping time?  The NFS performance should have a penalty of
> > something like 3 network transactions per file, so your NFS times
> > should be 30,000 packet round trips longer than the local case.
> > (Maybe 4 network transactions per file if there is also a stat call;
> > a network trace would show all the transactions.)
>
> There's no stat call -- I've used strace to verify that I am only doing
> the operations described in my pseudo-code above.
>
> open("/scratch/xavi/CAL_L1_Data//CAL_SFT.714299460", O_RDONLY) = 3
> read(3, "\0\0\0\0\0\0\360?DX\223*\0\0\0\0\0\0\0\0\0\0N@\270\v\0"..., 32) = 32
> lseek(3, 460776, SEEK_CUR) = 460808
> read(3, "\336\203z&\tM\216&/\342\227\244\307\333\352\245\233\27"..., 620) = 620
> close(3) = 0
> open("/scratch/xavi/CAL_L1_Data//CAL_SFT.714299520", O_RDONLY) = 3
>
> There's no stat().  So I think it's 3, not 4.  Here are the ping times:
>
> [ballen@medusa-slave123 ballen]$ ping storage1
> PING storage1.medusa.phys.uwm.edu (129.89.201.244) from 129.89.200.133: 56(84) bytes of data.
> 64 bytes from storage1.medusa.phys.uwm.edu (129.89.201.244): icmp_seq=1 ttl=255 time=0.197 ms
> 64 bytes from storage1.medusa.phys.uwm.edu (129.89.201.244): icmp_seq=2 ttl=255 time=0.264 ms
> ....
> --- storage1.medusa.phys.uwm.edu ping statistics ---
> 26 packets transmitted, 26 received, 0% loss, time 25250ms
> rtt min/avg/max/mdev = 0.156/0.230/0.499/0.063 ms
>
> so 230 usec (average) x 30,000 = 7 seconds
>
> OK, so if all is well, then if I run the code locally on the NFS server
> itself (with disk cache buffers all dirty) it should take around 7
> seconds less time than if I run it on an NFS client.
>
> I'll check the numbers and report back.
>
> Thanks again for the help.  I'd not thought carefully enough about what
> the disk was really doing.  The point is that reading a contiguous 6 MB
> file from the local disk only involves ~20 track-to-track seeks, which
> is negligible in comparison with what I am doing.
>
> Cheers,
> Bruce


-------------------------------------------------------
This sf.net email is sponsored by:
With Great Power, Comes Great Responsibility
Learn to use your power at OSDN's High Performance Computing Channel
http://hpc.devchannel.org/
_______________________________________________
NFS maillist - NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs