Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755977AbYAWRyv (ORCPT ); Wed, 23 Jan 2008 12:54:51 -0500 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1752873AbYAWRyo (ORCPT ); Wed, 23 Jan 2008 12:54:44 -0500 Received: from styx.suse.cz ([82.119.242.94]:42907 "EHLO duck.suse.cz" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1752433AbYAWRyn (ORCPT ); Wed, 23 Jan 2008 12:54:43 -0500 Date: Wed, 23 Jan 2008 18:54:41 +0100 From: Jan Kara To: linux-kernel@vger.kernel.org Cc: Andrew Morton , Mark Fasheh Subject: [PATCH RESEND] Handle i_size > s_maxbytes correctly Message-ID: <20080123175440.GF10144@duck.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.5.16 (2007-06-09) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 12195 Lines: 319 Hi, I'm resending improved patch for the problem: Shortly, the problem is that we can mount a filesystem which has files with i_size greater than s_maxbytes (or generally the maximum size VFS can handle) - the easiest to test is to export NFS from a 64 bit machine to a 32 bit one. Create a 16TB+4KB (sparse) file on the server and you'll notice that you can read only first 4 KB from the client (because end_index overflows in do_generic_mapping_read()). Clustered filesystems have similar problems and classical local filesystems can be affected as well (if you create a filesystem on a different machine than where you read it). The patch introduces two functions for conversion of an offset into index / block number and these functions handle overflows. Andrew was concerned about the cost of checks for such exotic cases, so currently the cost is around 600 bytes of text size on i386 without LFS, 74 bytes on x86_64, impact on run time should be quite small (only one additional test when converting i_size to indexes or block numbers). A different solution (even with smaller impact) would be to not allow files with i_size > s_maxbytes in VFS at all. For local filesystems we can just check this on open and everything is fine but with remote filesystems such as OCFS2 (or NFS) filesize can be changed on the fly from a different machine. So to avoid problems we can either introduce some locking to prevent changes of i_size from other machines while we are in critical sections (awww, I really don't think this is better) or truncate i_size to s_maxbytes when we update i_size from what we've received via network / shared storage (but then we'd have to track whether user truncated file to some size or whether fs truncated it just for safety and apps could be confused too). So I don't think this is really feasible. Any suggestions for further improvement or other solutions welcome. Andrew, can I talk you to including this version of the fix ;)? Honza -- Jan Kara SUSE Labs, CR --- Although we don't allow writes over s_maxbytes, it can happen that a file's size is larger than s_maxbytes. For example we can write the file from a computer with a different architecture (which has larger s_maxbytes), boot a kernel with a different set of config options (CONFIG_LBD...), or if two nodes in a [Ocfs2, and likely Gfs2] cluster have mounted the same file system and have different s_maxbytes. Thus we have to make sure we don't crash / corrupt data when seeing such file (page offset of the last page needn't fit into pgoff_t, block offset into sector_t). Firstly, we make read() and mmap() return error when user tries to access the file above s_maxbytes, secondly we introduce functions offset_to_index_max() and offset_to_block_max() which return maximum number fitting into pgoff_t or sector_t, respectively, in case the number computed from the given offset would overflow. These functions should be used for conversion whenever we aren't sure whether the result fits in the destination type (which is actually only if converting i_size because all other cases are handled earlier in read/write/mmap). Signed-off-by: Jan Kara CC: Mark Fasheh diff --git a/fs/buffer.c b/fs/buffer.c index 7249e01..4323a6d 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -1623,7 +1623,8 @@ static int __block_write_full_page(struct inode *inode, struct page *page, BUG_ON(!PageLocked(page)); - last_block = (i_size_read(inode) - 1) >> inode->i_blkbits; + last_block = offset_to_block_max(i_size_read(inode) - 1, + inode->i_blkbits); if (!page_has_buffers(page)) { create_empty_buffers(page, blocksize, @@ -2084,7 +2085,8 @@ int block_read_full_page(struct page *page, get_block_t *get_block) head = page_buffers(page); iblock = (sector_t)page->index << (PAGE_CACHE_SHIFT - inode->i_blkbits); - lblock = (i_size_read(inode)+blocksize-1) >> inode->i_blkbits; + lblock = offset_to_block_max(i_size_read(inode)+blocksize-1, + inode->i_blkbits); bh = head; nr = 0; i = 0; @@ -2604,7 +2606,7 @@ int nobh_writepage(struct page *page, get_block_t *get_block, { struct inode * const inode = page->mapping->host; loff_t i_size = i_size_read(inode); - const pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT; + const pgoff_t end_index = offset_to_index_max(i_size); unsigned offset; int ret; @@ -2804,7 +2806,7 @@ int block_write_full_page(struct page *page, get_block_t *get_block, { struct inode * const inode = page->mapping->host; loff_t i_size = i_size_read(inode); - const pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT; + const pgoff_t end_index = offset_to_index_max(i_size); unsigned offset; /* Is the page fully inside i_size? */ diff --git a/fs/mpage.c b/fs/mpage.c index d54f8f8..4d1d12c 100644 --- a/fs/mpage.c +++ b/fs/mpage.c @@ -190,7 +190,8 @@ do_mpage_readpage(struct bio *bio, struct page *page, unsigned nr_pages, block_in_file = (sector_t)page->index << (PAGE_CACHE_SHIFT - blkbits); last_block = block_in_file + nr_pages * blocks_per_page; - last_block_in_file = (i_size_read(inode) + blocksize - 1) >> blkbits; + last_block_in_file = offset_to_block_max(i_size_read(inode) + + blocksize - 1, blkbits); if (last_block > last_block_in_file) last_block = last_block_in_file; page_block = 0; @@ -526,7 +527,7 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc, */ BUG_ON(!PageUptodate(page)); block_in_file = (sector_t)page->index << (PAGE_CACHE_SHIFT - blkbits); - last_block = (i_size - 1) >> blkbits; + last_block = offset_to_block_max(i_size - 1, blkbits); map_bh.b_page = page; for (page_block = 0; page_block < blocks_per_page; ) { @@ -557,7 +558,7 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc, first_unmapped = page_block; page_is_mapped: - end_index = i_size >> PAGE_CACHE_SHIFT; + end_index = offset_to_index_max(i_size); if (page->index >= end_index) { /* * The page straddles i_size. It must be zeroed out on each diff --git a/fs/read_write.c b/fs/read_write.c index ea1f94c..ed91acc 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -16,6 +16,7 @@ #include #include #include +#include #include "read_write.h" #include @@ -263,6 +264,11 @@ ssize_t vfs_read(struct file *file, char __user *buf, size_t count, loff_t *pos) return -EINVAL; if (unlikely(!access_ok(VERIFY_WRITE, buf, count))) return -EFAULT; + if (unlikely(*pos + count > file->f_vfsmnt->mnt_sb->s_maxbytes)) { + if (*pos >= file->f_vfsmnt->mnt_sb->s_maxbytes) + return -EOVERFLOW; + count = file->f_vfsmnt->mnt_sb->s_maxbytes - *pos; + } ret = rw_verify_area(READ, file, pos, count); if (ret >= 0) { diff --git a/include/linux/fs.h b/include/linux/fs.h index b3ec4a4..826e718 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1394,6 +1394,17 @@ static inline void inode_dec_link_count(struct inode *inode) mark_inode_dirty(inode); } +/* Convert byte offset to block number, if number would overflow + * sector_t, return maximum value fitting there. */ +static inline sector_t offset_to_block_max(loff_t off, int shift) +{ + loff_t block = off >> shift; + + if (unlikely(block != (sector_t)block)) + return ~(sector_t)0; + return block; +} + extern void touch_atime(struct vfsmount *mnt, struct dentry *dentry); static inline void file_accessed(struct file *file) { diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index db8a410..adfd195 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -148,6 +148,19 @@ static inline loff_t page_offset(struct page *page) return ((loff_t)page->index) << PAGE_CACHE_SHIFT; } +/* + * Return index of a page on given offset, if index would overflow pgoff_t, + * return the maximum number fitting there. + */ +static inline pgoff_t offset_to_index_max(loff_t off) +{ + loff_t index = off >> PAGE_CACHE_SHIFT; + + if (unlikely(index != (pgoff_t)index)) + return ~((pgoff_t)0); + return index; +} + static inline pgoff_t linear_page_index(struct vm_area_struct *vma, unsigned long address) { diff --git a/mm/filemap.c b/mm/filemap.c index 01e260f..e37be0c 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -380,7 +380,7 @@ int filemap_fdatawait(struct address_space *mapping) return 0; return wait_on_page_writeback_range(mapping, 0, - (i_size - 1) >> PAGE_CACHE_SHIFT); + offset_to_index_max(i_size - 1)); } EXPORT_SYMBOL(filemap_fdatawait); @@ -925,7 +925,7 @@ page_ok: */ isize = i_size_read(inode); - end_index = (isize - 1) >> PAGE_CACHE_SHIFT; + end_index = offset_to_index_max(isize - 1); if (unlikely(!isize || index > end_index)) { page_cache_release(page); goto out; @@ -1310,7 +1310,7 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf) int did_readaround = 0; int ret = 0; - size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; + size = offset_to_index_max(i_size_read(inode) + PAGE_CACHE_SIZE - 1); if (vmf->pgoff >= size) return VM_FAULT_SIGBUS; @@ -1385,7 +1385,7 @@ retry_find: goto page_not_uptodate; /* Must recheck i_size under page lock */ - size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; + size = offset_to_index_max(i_size_read(inode) + PAGE_CACHE_SIZE - 1); if (unlikely(vmf->pgoff >= size)) { unlock_page(page); page_cache_release(page); diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c index 5fa3fbf..e264b37 100644 --- a/mm/filemap_xip.c +++ b/mm/filemap_xip.c @@ -68,7 +68,7 @@ do_xip_mapping_read(struct address_space *mapping, if (!isize) goto out; - end_index = (isize - 1) >> PAGE_CACHE_SHIFT; + end_index = offset_to_index_max(isize - 1); for (;;) { struct page *page; unsigned long nr, ret; @@ -220,7 +220,7 @@ static int xip_file_fault(struct vm_area_struct *area, struct vm_fault *vmf) /* XXX: are VM_FAULT_ codes OK? */ - size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; + size = offset_to_index_max(i_size_read(inode) + PAGE_CACHE_SIZE - 1); if (vmf->pgoff >= size) return VM_FAULT_SIGBUS; diff --git a/mm/mmap.c b/mm/mmap.c index 15678aa..e28ca6e 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -983,6 +983,13 @@ unsigned long do_mmap_pgoff(struct file * file, unsigned long addr, if (locks_verify_locked(inode)) return -EAGAIN; + /* + * Make sure we don't map more than fs is able to handle + */ + if ((((loff_t)pgoff) << PAGE_SHIFT) + len > + inode->i_sb->s_maxbytes) + return -EINVAL; + vm_flags |= VM_SHARED | VM_MAYSHARE; if (!(file->f_mode & FMODE_WRITE)) vm_flags &= ~(VM_MAYWRITE | VM_SHARED); diff --git a/mm/readahead.c b/mm/readahead.c index c9c50ca..67ccd91 100644 --- a/mm/readahead.c +++ b/mm/readahead.c @@ -136,7 +136,7 @@ __do_page_cache_readahead(struct address_space *mapping, struct file *filp, if (isize == 0) goto out; - end_index = ((isize - 1) >> PAGE_CACHE_SHIFT); + end_index = offset_to_index_max(isize - 1); /* * Preallocate as many pages as we will need. diff --git a/mm/swapfile.c b/mm/swapfile.c index f071648..0951c7a 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1082,7 +1082,7 @@ static int setup_swap_extents(struct swap_info_struct *sis, sector_t *span) */ probe_block = 0; page_no = 0; - last_block = i_size_read(inode) >> blkbits; + last_block = offset_to_block_max(i_size_read(inode), blkbits); while ((probe_block + blocks_per_page) <= last_block && page_no < sis->max) { unsigned block_in_page; @@ -1517,6 +1517,7 @@ asmlinkage long sys_swapon(const char __user * specialfile, int swap_flags) goto bad_swap; } + /* This can possibly overflow but we discover that later */ swapfilesize = i_size_read(inode) >> PAGE_SHIFT; /* -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/