Venerable cramfs fs Linear XIP patch originally from MontaVista, used in
the embedded Linux community for years, updated for 2.6.21. Tested on
several systems with NOR Flash. PXA270, TI OMAP2430, ARM Versatile and
Freescale iMX31ADS.
---
Richard
---
On Tue, 22 May 2007 15:09:39 -0700
Richard Griffiths <[email protected]> wrote:
> Venerable cramfs fs Linear XIP patch originally from MontaVista, used in
> the embedded Linux community for years, updated for 2.6.21. Tested on
> several systems with NOR Flash. PXA270, TI OMAP2430, ARM Versatile and
> Freescale iMX31ADS.
It's good to see some of these secret embedded patches heading in
the mainline direction.
But we'd expected and hoped that flash-based XIP would be able to use
the existing xip infrastructure, in mm/filemap_xip.c. Not possible?
The de-constification of cramfs_ops and cramfs_dir_inode_operations
was, I assume, a mistake?
On Tue, 2007-05-22 at 15:49 -0700, Andrew Morton wrote:
> On Tue, 22 May 2007 15:09:39 -0700
> Richard Griffiths <[email protected]> wrote:
>
> > Venerable cramfs fs Linear XIP patch originally from MontaVista, used in
> > the embedded Linux community for years, updated for 2.6.21. Tested on
> > several systems with NOR Flash. PXA270, TI OMAP2430, ARM Versatile and
> > Freescale iMX31ADS.
>
> It's good to see some of these secret embedded patches heading in
> the mainline direction.
>
> But we'd expected and hoped that flash-based XIP would be able to use
> the existing xip infrastructure, in mm/filemap_xip.c. Not possible?
I don't know I'll look into it and see if it is possible.
>
> The de-constification of cramfs_ops and cramfs_dir_inode_operations
> was, I assume, a mistake?
oops. Yes, Thanks Andrew.
Richard
Andrew Morton wrote:
> But we'd expected and hoped that flash-based XIP would be able to use
> the existing xip infrastructure, in mm/filemap_xip.c. Not possible?
Thanks for the heads up Andrew. Reading the cramfs patch, I've found a
lot of similarities between the current mainline xip stack and this
nice piece of work. I also observed interresting differences:
The current xip stack does use the block device abstraction added with
an extra direct memory access operation. The cramfs patch does not
abstract between the memory segment and the file system, it rather
uses an address as mount parameter.
Both methods can coexist, and a file system that does not use the
block device abstraction may still provide get_xip_page address space
operation and benefit from the xip stack infrastructure. Each method
makes sense in the context of the corresponding file systems.
The current xip stack relies on having struct page behind the memory
segment. This causes few impact on memory management, but occupies
some more memory. The cramfs patch chose to modify copy on write in
order to deal with vmas that don't have struct page behind.
So far, Hugh and Linus have shown strong opposition against copy on
write with no struct page behind. If this implementation is acceptable
to the them, it seems preferable to me over wasting memory. The xip
stack should be modified to use this vma flag in that case.
This implementation extends cramfs while the current xip stack extends
ext2. I would love to see a cramfs implementation based on
filemap_xip, choice is always a good thing. To me, it looks like this
work can be modified to use filemap_xip.
so long,
Carsten
On Tuesday 22 May 2007 23:09:39 Richard Griffiths wrote:
> Venerable cramfs fs Linear XIP patch originally from MontaVista, used in
> the embedded Linux community for years, updated for 2.6.21. Tested on
> several systems with NOR Flash. PXA270, TI OMAP2430, ARM Versatile and
> Freescale iMX31ADS.
> +#else /* CONFIG_CRAMFS_LINEAR */
> + /*
> + * The physical location of the cramfs image is specified as
> + * a mount parameter. This parameter is mandatory for obvious
You've used spaces here instead of tabs, probably worth replacing.
> +/*
> + * We hold the mm semaphore for reading and vma->vm_mm->page_table_lock
> + */
> +static inline void break_cow(struct vm_area_struct * vma, struct page *
new_page, unsigned long address,
> + pte_t *page_table)
> +{
> + pte_t entry;
Here again.
> + /*
> + * Handle COW of XIP memory.
> + * Note that the source memory actually isn't a ram
> + * page so no struct page is associated to the
source
> + * pte.
> + */
This whole section too.
--
Cheers,
Alistair.
Final year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.
On Wed, 2007-05-23 at 09:51 +0200, Carsten Otte wrote:
> Andrew Morton wrote:
> > But we'd expected and hoped that flash-based XIP would be able to use
> > the existing xip infrastructure, in mm/filemap_xip.c. Not possible?
> Thanks for the heads up Andrew. Reading the cramfs patch, I've found a
> lot of similarities between the current mainline xip stack and this
> nice piece of work. I also observed interresting differences:
>
> The current xip stack does use the block device abstraction added with
> an extra direct memory access operation. The cramfs patch does not
> abstract between the memory segment and the file system, it rather
> uses an address as mount parameter.
> Both methods can coexist, and a file system that does not use the
> block device abstraction may still provide get_xip_page address space
> operation and benefit from the xip stack infrastructure. Each method
> makes sense in the context of the corresponding file systems.
>
> The current xip stack relies on having struct page behind the memory
> segment. This causes few impact on memory management, but occupies
> some more memory. The cramfs patch chose to modify copy on write in
> order to deal with vmas that don't have struct page behind.
> So far, Hugh and Linus have shown strong opposition against copy on
> write with no struct page behind. If this implementation is acceptable
> to the them, it seems preferable to me over wasting memory. The xip
> stack should be modified to use this vma flag in that case.
>
> This implementation extends cramfs while the current xip stack extends
> ext2. I would love to see a cramfs implementation based on
> filemap_xip, choice is always a good thing. To me, it looks like this
> work can be modified to use filemap_xip.
One question I have is the difference in the two models. If I understand
correctly the filemap_xip expects the ext2 to mount as XIP while the
Linear Cramfs only enables XIP for apps that have the sticky bit set.
Hence the Application XIP moniker.
Does the filemap_xip expect the entire filesystem is XIP?
Thanks,
Richard
> To me, it looks like this
> work can be modified to use filemap_xip.
How?
Richard Griffiths (wrs) wrote:
> One question I have is the difference in the two models. If I understand
> correctly the filemap_xip expects the ext2 to mount as XIP while the
> Linear Cramfs only enables XIP for apps that have the sticky bit set.
> Hence the Application XIP moniker.
> Does the filemap_xip expect the entire filesystem is XIP?
The file operations from filemap_xip are only effective if the struct
file_operations vector refers to them (see ext2_xip_file_operations in
fs/ext2/file.c as example). In oder to function proper, the
corresponding address_space_operations vector needs to provide an
implementation of get_xip_page (see ext2_get_xip_page in
fs/ext2/xip.c). Both vectors are per inode, one can chose to have XIP
for individual files in a file system.
so long,
Carsten
Jared Hulbert wrote:
>> To me, it looks like this
>> work can be modified to use filemap_xip.
>
> How?
First, get a struct page behind the subject file system image. A good
idea to get there is to use vmem_map, which allows have a
discontiguous mem_map array in virtual storage. Then add the memory
segment's address range to it. Sparsemem also helps, if vmem_map is
not available.
Second, implement get_xip_page for this file system. This address
space operation needs to be provided for all files that are supposed
to be XIP. readpage(s)/writepage(s) should not be provided for these
files, becuase they rely on using the page cache.
Third, use file operations from mm/filemap_xip.c.
so long,
Carsten
On 5/22/07, Richard Griffiths <[email protected]> wrote:
> Venerable cramfs fs Linear XIP patch originally from MontaVista, used in
> the embedded Linux community for years, updated for 2.6.21. Tested on
> several systems with NOR Flash. PXA270, TI OMAP2430, ARM Versatile and
> Freescale iMX31ADS.
>
When trying to verify this patch on our PXA270 system we get the
following error when running an XIP rootfs:
cramfs: checking physical address 0xa00000 for linear cramfs image
cramfs: linear cramfs image appears to be 3236 KB in size
VFS: Mounted root (cramfs filesystem) readonly.
Freeing init memory: 96K
/sbin/init: error while loading shared libraries: libgcc_s.so.1:
failed to map segment from shared object: Error 11
Kernel panic - not syncing: Attempted to kill init!
However, if our busybox binary is XIP while the libgcc_s.so.1 is not
XIP, busybox runs fine.
Richard, May I email you the rootfs tarball so you can recreate what
we are seeing? It is a little less than 2MiB. The filing system
executables will only run on a PXA27x processor.
On Thu, 2007-05-24 at 13:22 -0700, Jared Hulbert wrote:
> On 5/22/07, Richard Griffiths <[email protected]> wrote:
> > Venerable cramfs fs Linear XIP patch originally from MontaVista, used in
> > the embedded Linux community for years, updated for 2.6.21. Tested on
> > several systems with NOR Flash. PXA270, TI OMAP2430, ARM Versatile and
> > Freescale iMX31ADS.
> >
>
> When trying to verify this patch on our PXA270 system we get the
> following error when running an XIP rootfs:
>
> cramfs: checking physical address 0xa00000 for linear cramfs image
> cramfs: linear cramfs image appears to be 3236 KB in size
> VFS: Mounted root (cramfs filesystem) readonly.
> Freeing init memory: 96K
> /sbin/init: error while loading shared libraries: libgcc_s.so.1:
> failed to map segment from shared object: Error 11
> Kernel panic - not syncing: Attempted to kill init!
>
> However, if our busybox binary is XIP while the libgcc_s.so.1 is not
> XIP, busybox runs fine.
> Richard, May I email you the rootfs tarball so you can recreate what
> we are seeing? It is a little less than 2MiB. The filing system
> executables will only run on a PXA27x processor.
Yes you can, but I won't have access to a PXA270 for a few weeks. I
assume you don't see the issue if you static link busybox? BTW I'm using
uclibc rather than glibc - much smaller fs.
Richard
> Yes you can, but I won't have access to a PXA270 for a few weeks. I
> assume you don't see the issue if you static link busybox?
I don't know.
Carsten Otte wrote:
> The current xip stack relies on having struct page behind the memory
> segment. This causes few impact on memory management, but occupies some
> more memory. The cramfs patch chose to modify copy on write in order to
> deal with vmas that don't have struct page behind.
> So far, Hugh and Linus have shown strong opposition against copy on
> write with no struct page behind. If this implementation is acceptable
> to the them, it seems preferable to me over wasting memory. The xip
> stack should be modified to use this vma flag in that case.
I would rather not :P
We can copy on write without a struct page behind the source today, no?
What is insufficient for the XIP code with the current COW?
--
SUSE Labs, Novell Inc.
> > The current xip stack relies on having struct page behind the memory
> > segment. This causes few impact on memory management, but occupies some
> > more memory. The cramfs patch chose to modify copy on write in order to
> > deal with vmas that don't have struct page behind.
> > So far, Hugh and Linus have shown strong opposition against copy on
> > write with no struct page behind. If this implementation is acceptable
> > to the them, it seems preferable to me over wasting memory. The xip
> > stack should be modified to use this vma flag in that case.
>
> I would rather not :P
>
> We can copy on write without a struct page behind the source today, no?
The existing COW techniques fail on some corner cases. I'm not up to
speed on the vm code. I'll try to look into this a little more but it
might be useful if I knew what questions I need to answer so you vm
experts can understand the problem.
Let me give one example. If you try to debug an XIP application
without this patch, bad things happen. XIP in this sense is synomous
with executing directly out of Flash and you can't just change the
physical memory to redirect it to the debugger so easily in Flash.
Now I don't know exactly why yet some, but not all applications,
trigger this added vm hack. I'm not sure exactly why it would get
triggered under normal circumstances. Why would a read-only map get
written to?
> What is insufficient for the XIP code with the current COW?
So I think the problem may have something to do with the nature of the
memory in question. We are using Flash that is ioremap()'ed to a
usable virtual address. And yet we go on to try to use it as if it
were plain old system memory, like any RAM page. We need it to be
presented as any other memory page only physically read-only.
ioremap() seems to be a hacky way of accomplishing that, but I can't
think of better way. In ARM we even had to invent ioremap_cached() to
improve performance. Thoughts?
Jared Hulbert wrote:
>> > The current xip stack relies on having struct page behind the memory
>> > segment. This causes few impact on memory management, but occupies some
>> > more memory. The cramfs patch chose to modify copy on write in order to
>> > deal with vmas that don't have struct page behind.
>> > So far, Hugh and Linus have shown strong opposition against copy on
>> > write with no struct page behind. If this implementation is acceptable
>> > to the them, it seems preferable to me over wasting memory. The xip
>> > stack should be modified to use this vma flag in that case.
>>
>> I would rather not :P
>>
>> We can copy on write without a struct page behind the source today, no?
>
>
> The existing COW techniques fail on some corner cases. I'm not up to
> speed on the vm code. I'll try to look into this a little more but it
> might be useful if I knew what questions I need to answer so you vm
> experts can understand the problem.
Previously I believe we couldn't do COW without a struct page for the
source memory, nor could we COW with a source that is not readable
from the kernel virtual mapping.
Now we can do both. cow_user_page in mm/memory.c does a copy_from_user
if there is no source page, so it uses the user mappings and does not
require a struct page.
The question is, why is that not enough (I haven't looked at these
patches enough to work out if there is anything more they provide).
>
> Let me give one example. If you try to debug an XIP application
> without this patch, bad things happen. XIP in this sense is synomous
> with executing directly out of Flash and you can't just change the
> physical memory to redirect it to the debugger so easily in Flash.
> Now I don't know exactly why yet some, but not all applications,
> trigger this added vm hack. I'm not sure exactly why it would get
> triggered under normal circumstances. Why would a read-only map get
> written to?
>
>> What is insufficient for the XIP code with the current COW?
>
>
> So I think the problem may have something to do with the nature of the
> memory in question. We are using Flash that is ioremap()'ed to a
> usable virtual address. And yet we go on to try to use it as if it
> were plain old system memory, like any RAM page. We need it to be
> presented as any other memory page only physically read-only.
> ioremap() seems to be a hacky way of accomplishing that, but I can't
> think of better way. In ARM we even had to invent ioremap_cached() to
> improve performance. Thoughts?
>
--
SUSE Labs, Novell Inc.
Nick Piggin wrote:
> The question is, why is that not enough (I haven't looked at these
> patches enough to work out if there is anything more they provide).
I think, it just takes trying things out. From reading the code, I
think this should work well for the filemap_xip code with no struct page.
Also, we need eliminate nopage() to get rid of the struct page.
Unfortunately I don't find time to try this out for now, and on 390 we
can live with struct page for the time being. In contrast to the
embedded platforms, the mem_mep array gets swapped out to disk by our
hypervisor.
so long,
Carsten
On 6/4/07, Carsten Otte <[email protected]> wrote:
> Nick Piggin wrote:
> > The question is, why is that not enough (I haven't looked at these
> > patches enough to work out if there is anything more they provide).
> I think, it just takes trying things out. From reading the code, I
> think this should work well for the filemap_xip code with no struct page.
> Also, we need eliminate nopage() to get rid of the struct page.
> Unfortunately I don't find time to try this out for now, and on 390 we
> can live with struct page for the time being. In contrast to the
> embedded platforms, the mem_mep array gets swapped out to disk by our
> hypervisor.
>
> so long,
> Carsten
Okay, so here is a patch that implements the filemap_xip.c among other
things such as coexistence of block and linear addressing mounts and
enabling reading of XIP binaries from block. I've been messing with
those patches for a long time and I couldn't stand all the #ifdefs any
longer :)
I have two system:
(1) UML which populates cramfs_sb_info.linear_virt_addr with find_iomem().
(2) PXA270 which populates cramfs_sb_info.linear_virt_addr is
populated with ioremap()
I run two test:
(A) 'cmp /mnt/cramfs/bin/busybox /bin/busybox'
(B) execute '/mnt/cramfs/bin/busybox'
On both systems (A) works fine. (B) works on (1) but not on (2).
(2) failed with the following messages. (This wasn't really busybox.
It was xxd, not statically link, hence the issue with ld.so)
Inconsistency detected by ld.so: rtld.cBad pte = a4c6d05f, process =
???, vm_flags = 1875, vaddr = 8000: 1202: dl_main: Assertion
`_rtld_local._dl_rtld_map.l_libname' [<c0024ca8>] failed!
(dump_stack+0x0/0x14) from [<c0066b28>] (print_bad_pte+0x54/0x64)
[<c0066ad4>] (print_bad_pte+0x0/0x64) from [<c0066ba0>]
(vm_normal_page+0x68/0xb4)
r4 = 00008000
[<c0066b38>] (vm_normal_page+0x0/0xb4) from [<c0067498>]
(unmap_vmas+0x4bc/0x660)
r5 = 00008000 r4 = C3806820
[<c0066fdc>] (unmap_vmas+0x0/0x660) from [<c006c9a0>] (exit_mmap+0x68/0x124)
[<c006c938>] (exit_mmap+0x0/0x124) from [<c0034320>] (mmput+0x40/0xc0)
r8 = 00007F00 r7 = C0336040 r6 = C3ECB3D4 r5 = C0336040
r4 = C3ECB3A0
[<c00342e0>] (mmput+0x0/0xc0) from [<c0038998>] (exit_mm+0x80/0xe0)
r4 = C3ECB3A0
[<c0038918>] (exit_mm+0x0/0xe0) from [<c0038f10>] (do_exit+0xf4/0x82c)
r6 = 00000001 r5 = 40021910 r4 = 00007F00
[<c0038e1c>] (do_exit+0x0/0x82c) from [<c00396fc>] (do_group_exit+0x78/0x9c)
[<c0039684>] (do_group_exit+0x0/0x9c) from [<c0039738>]
(sys_exit_group+0x18/0x1c)
r4 = 40016F70
[<c0039720>] (sys_exit_group+0x0/0x1c) from [<c0020c00>]
(ret_fast_syscall+0x0/0x2c)
So what's going on?
diff -r 5114d0f3b003 -r 83e92f97054c fs/Kconfig
--- a/fs/Kconfig Tue Jun 05 11:37:21 2007 -0700
+++ b/fs/Kconfig Wed Jun 06 03:27:31 2007 -0700
@@ -65,7 +65,6 @@ config FS_XIP
config FS_XIP
# execute in place
bool
- depends on EXT2_FS_XIP
default y
config EXT3_FS
@@ -1368,6 +1367,40 @@ config CRAMFS
To compile this as a module, choose M here: the module will be called
cramfs. Note that the root file system (the one containing the
directory /) cannot be compiled as a module.
+
+ The CramFs driver can load data directly from a linear addressed memory
+ range (usually non volatile memory like flash) instead of going through
+ the block device layer. This saves some memory since no intermediate
+ buffering is necessary.
+
+ The location of the CramFs image in memory is board dependent.
+ Therefore you must know the proper physical address where to store
+ the CramFs image and specify it using the physaddr=0x******** mount
+ option. With UML kernels iomem=***** can be used instead after
+ having passed the kernel a parameter such as 'iomem=bob,cramfs'.
+ (for example: "mount -t cramfs -o physaddr=0x100000 none /mnt" or
+ "mount -t cramfs -o iomem=bob none /mnt")
+
+ To mount as the root pass the command line parameter
+ "root=/dev/null rootflags=physaddr=0x********" or
+ "root=/dev/null rootflags=iomem=*****" to the kernel
+
+ Replace 0x******** with the physical address location of the linear
+ CramFs image to boot with.
+
+config CRAMFS_XIP
+ bool "Support XIP on cramfs"
+ depends on FS_XIP
+ help
+ You must say Y to this option if you want to be able to run
+ applications directly from non-volatile memory. XIP
+ applications are marked by setting the sticky bit (ie, "chmod
+ +t <app name>"). A CramFs file system then needs to be
+ created using mkcramfs (with XIP cramfs support in
+ it). Applications marked for XIP execution will not be
+ compressed since they have to run directly from flash.
+
+ This will also allow mounting an XIP binary on a block device.
If unsure, say N.
diff -r 5114d0f3b003 -r 83e92f97054c fs/cramfs/inode.c
--- a/fs/cramfs/inode.c Tue Jun 05 11:37:21 2007 -0700
+++ b/fs/cramfs/inode.c Wed Jun 06 03:27:31 2007 -0700
@@ -9,6 +9,46 @@
/*
* These are the VFS interfaces to the compressed rom filesystem.
* The actual compression is based on zlib, see the other files.
+ */
+
+/* Linear Addressing code
+ *
+ * Copyright (C) 2000 Shane Nay.
+ *
+ * Allows you to have a linearly addressed cramfs filesystem.
+ * Saves the need for buffer, and the munging of the buffer.
+ * Savings a bit over 32k with default PAGE_SIZE, BUFFER_SIZE
+ * etc. Usefull on embedded platform with ROM :-).
+ *
+ * Downsides- Currently linear addressed cramfs partitions
+ * don't co-exist with block cramfs partitions.
+ *
+ */
+
+/*
+ * 28-Dec-2000: XIP mode for linear cramfs
+ * Copyright (C) 2000 Robert Leslie <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+/* filemap_xip.c interfaces - Jared Hulbert 2007
+ * linear + block coexisting - Jared Hulbert 2007
+ * (inspired by patches from Kyungmin Park of Samsung and others at
+ * Motorola from the EZX phones)
+ *
*/
#include <linux/module.h>
@@ -24,22 +64,25 @@
#include <linux/vfs.h>
#include <linux/mutex.h>
#include <asm/semaphore.h>
+#include <linux/vmalloc.h>
#include <asm/uaccess.h>
-
-static const struct super_operations cramfs_ops;
-static const struct inode_operations cramfs_dir_inode_operations;
+#include <asm/tlbflush.h>
+#ifdef CONFIG_UML
+#include <mem_user.h>
+#endif
+
+static struct super_operations cramfs_ops;
+static struct inode_operations cramfs_dir_inode_operations;
static const struct file_operations cramfs_directory_operations;
static const struct address_space_operations cramfs_aops;
static DEFINE_MUTEX(read_mutex);
-
/* These two macros may change in future, to provide better st_ino
semantics. */
#define CRAMINO(x) (((x)->offset && (x)->size)?(x)->offset<<2:1)
#define OFFSET(x) ((x)->i_ino)
-
static int cramfs_iget5_test(struct inode *inode, void *opaque)
{
@@ -99,13 +142,77 @@ static int cramfs_iget5_set(struct inode
return 0;
}
+static int cramfs_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ struct inode *inode = file->f_dentry->d_inode;
+ struct cramfs_sb_info *sbi = CRAMFS_SB(inode->i_sb);
+
+ if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE))
+ return -EINVAL;
+
+ if ((CRAMFS_INODE_IS_XIP(inode)) && !(vma->vm_flags & VM_WRITE) &&
+ (LINEAR_CRAMFS(sbi)))
+ return xip_file_mmap(file, vma);
+
+ return generic_file_mmap(file, vma);
+}
+
+struct page *cramfs_get_xip_page(struct address_space *mapping,
sector_t offset,
+ int create)
+{
+ unsigned long address;
+ unsigned long offs = offset;
+ struct inode *inode = mapping->host;
+ struct super_block *sb = inode->i_sb;
+ struct cramfs_sb_info *sbi = CRAMFS_SB(sb);
+
+ address = PAGE_ALIGN((unsigned long)(sbi->linear_virt_addr
+ + OFFSET(inode)));
+ offs *= 512; /* FIXME -- This shouldn't be hard coded */
+ address += offs;
+
+ return virt_to_page(address);
+}
+
+ssize_t cramfs_file_read(struct file *file, char __user * buf, size_t len,
+ loff_t * ppos)
+{
+ struct inode *inode = file->f_dentry->d_inode;
+ struct cramfs_sb_info *sbi = CRAMFS_SB(inode->i_sb);
+
+ if ((CRAMFS_INODE_IS_XIP(inode)) && (LINEAR_CRAMFS(sbi)))
+ return xip_file_read(file, buf, len, ppos);
+
+ return do_sync_read(file, buf, len, ppos);
+}
+
+static struct file_operations cramfs_linear_xip_fops = {
+ aio_read: generic_file_aio_read,
+ read: cramfs_file_read,
+ mmap: cramfs_mmap,
+};
+
+static struct backing_dev_info cramfs_backing_dev_info = {
+ .ra_pages = 0, /* No readahead */
+};
+
static struct inode *get_cramfs_inode(struct super_block *sb,
struct cramfs_inode * cramfs_inode)
{
+ struct cramfs_sb_info *sbi = CRAMFS_SB(sb);
struct inode *inode = iget5_locked(sb, CRAMINO(cramfs_inode),
cramfs_iget5_test, cramfs_iget5_set,
cramfs_inode);
if (inode && (inode->i_state & I_NEW)) {
+ if (LINEAR_CRAMFS(sbi))
+ inode->i_mapping->backing_dev_info =
+ &cramfs_backing_dev_info;
+
+#ifdef CONFIG_CRAMFS_XIP
+ if(S_ISREG(inode->i_mode))
+ inode->i_fop = CRAMFS_INODE_IS_XIP(inode) ?
+ &cramfs_linear_xip_fops : &generic_ro_fops;
+#endif
unlock_new_inode(inode);
}
return inode;
@@ -135,16 +242,18 @@ static struct inode *get_cramfs_inode(st
#define BLKS_PER_BUF (1 << BLKS_PER_BUF_SHIFT)
#define BUFFER_SIZE (BLKS_PER_BUF*PAGE_CACHE_SIZE)
-static unsigned char read_buffers[READ_BUFFERS][BUFFER_SIZE];
-static unsigned buffer_blocknr[READ_BUFFERS];
-static struct super_block * buffer_dev[READ_BUFFERS];
+static unsigned char **read_buffers = NULL;
+static unsigned *buffer_blocknr = NULL;
+static struct super_block **buffer_dev = NULL;
static int next_buffer;
+static int block_filesystems = 0;
/*
* Returns a pointer to a buffer containing at least LEN bytes of
* filesystem starting at byte offset OFFSET into the filesystem.
*/
-static void *cramfs_read(struct super_block *sb, unsigned int offset,
unsigned int len)
+static void *cramfs_read_block(struct super_block *sb, unsigned int offset,
+ unsigned int len)
{
struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping;
struct page *pages[BLKS_PER_BUF];
@@ -219,8 +328,37 @@ static void *cramfs_read(struct super_bl
return read_buffers[buffer] + offset;
}
+static void *cramfs_read(struct super_block *sb, unsigned int offset,
+ unsigned int len)
+{
+ struct cramfs_sb_info *sbi = CRAMFS_SB(sb);
+
+ if (LINEAR_CRAMFS(sbi)) {
+ if (!len)
+ return NULL;
+ return (void*)(sbi->linear_virt_addr + offset);
+ }
+
+ return cramfs_read_block(sb, offset, len);
+}
+
static void cramfs_put_super(struct super_block *sb)
{
+ int i;
+ struct cramfs_sb_info *sbi = CRAMFS_SB(sb);
+
+ if (!LINEAR_CRAMFS(sbi)) {
+ if (!--block_filesystems) {
+ for (i = 0; i < READ_BUFFERS; i++) {
+ if (read_buffers[i])
+ vfree(read_buffers[i]);
+ }
+ vfree(read_buffers);
+ vfree(buffer_blocknr);
+ vfree(buffer_dev);
+ }
+ }
+
kfree(sb->s_fs_info);
sb->s_fs_info = NULL;
}
@@ -231,26 +369,13 @@ static int cramfs_remount(struct super_b
return 0;
}
-static int cramfs_fill_super(struct super_block *sb, void *data, int silent)
-{
- int i;
+static int cramfs_check_super(struct super_block *sb, int silent)
+{
+ struct cramfs_sb_info *sbi = CRAMFS_SB(sb);
struct cramfs_super super;
unsigned long root_offset;
- struct cramfs_sb_info *sbi;
struct inode *root;
-
- sb->s_flags |= MS_RDONLY;
-
- sbi = kzalloc(sizeof(struct cramfs_sb_info), GFP_KERNEL);
- if (!sbi)
- return -ENOMEM;
- sb->s_fs_info = sbi;
-
- /* Invalidate the read buffers on mount: think disk change.. */
- mutex_lock(&read_mutex);
- for (i = 0; i < READ_BUFFERS; i++)
- buffer_blocknr[i] = -1;
-
+
/* Read the first block and get the superblock from it */
memcpy(&super, cramfs_read(sb, 0, sizeof(super)), sizeof(super));
mutex_unlock(&read_mutex);
@@ -259,7 +384,8 @@ static int cramfs_fill_super(struct supe
if (super.magic != CRAMFS_MAGIC) {
/* check at 512 byte offset */
mutex_lock(&read_mutex);
- memcpy(&super, cramfs_read(sb, 512, sizeof(super)), sizeof(super));
+ memcpy(&super, cramfs_read(sb, 512, sizeof(super)),
+ sizeof(super));
mutex_unlock(&read_mutex);
if (super.magic != CRAMFS_MAGIC) {
if (!silent)
@@ -311,11 +437,170 @@ static int cramfs_fill_super(struct supe
iput(root);
goto out;
}
+
+ printk(KERN_INFO "cramfs: cramfs image appears to be %lu KB in size\n",
+ sbi->size/1024);
+
+ return 0;
+out:
+ return -EINVAL;
+}
+
+static int cramfs_fill_super_physaddr(struct super_block *sb, void *data,
+ int silent)
+{
+ struct cramfs_sb_info *sbi = CRAMFS_SB(sb);
+ char *p;
+ int ret = -EINVAL;
+
+ if (!(p = strstr((char *)data, "physaddr=")))
+ goto out;
+
+ sbi->linear_phys_addr = simple_strtoul(p + 9, NULL, 0);
+ if (sbi->linear_phys_addr & (PAGE_SIZE-1)) {
+ printk(KERN_ERR "cramfs: physical address 0x%lx for linear"
+ "cramfs isn't aligned to a page boundary\n",
+ sbi->linear_phys_addr);
+ goto out;
+ }
+ if (sbi->linear_phys_addr == 0) {
+ printk(KERN_ERR "cramfs: physical address for linear cramfs"
+ "image can't be 0\n");
+ goto out;
+ }
+ printk(KERN_INFO "cramfs: checking physical address 0x%lx for linear"
+ "cramfs image\n", sbi->linear_phys_addr);
+
+ /* Map only one page for now. Will remap it when fs size is known. */
+ sbi->linear_virt_addr =
+ ioremap(sbi->linear_phys_addr, PAGE_SIZE);
+ if (!sbi->linear_virt_addr) {
+ printk(KERN_ERR "cramfs: ioremap of the linear cramfs image"
+ "failed\n");
+ goto out;
+ }
+ mutex_lock(&read_mutex);
+ ret = cramfs_check_super(sb, silent);
+ if (ret != 0)
+ goto out;
+
+ /* Remap the whole filesystem now */
+ iounmap(sbi->linear_virt_addr);
+ sbi->linear_virt_addr =
+ CRAMFS_REMAP(sbi->linear_phys_addr, sbi->size);
+
+ if (!sbi->linear_virt_addr) {
+ printk(KERN_ERR "cramfs: ioremap of the linear cramfs image"
+ " failed\n");
+ goto out;
+ }
+
+ return 0;
+out:
+ if (sbi->linear_virt_addr)
+ iounmap(sbi->linear_virt_addr);
+ return ret;
+}
+
+static int cramfs_fill_super_iomem(struct super_block *sb, void *data,
+ int silent)
+{
+#ifdef CONFIG_UML
+ /* this allows users to mount with a -o iomem=bob from UML kernels */
+ struct cramfs_sb_info *sbi = CRAMFS_SB(sb);
+ char *p;
+ int ret = -EINVAL;
+ unsigned long len, virtaddr;
+
+ if (!(p = strstr((char *)data, "iomem=")))
+ goto out;
+
+ sbi->linear_phys_addr = 0;
+ virtaddr = find_iomem(p + 6, &len);
+ sbi->linear_virt_addr = (void *)virtaddr;
+
+ if ((unsigned long)(sbi->linear_virt_addr) & (PAGE_SIZE-1)) {
+ printk(KERN_ERR "cramfs: virtual address 0x%lx for linear"
+ " cramfs isn't aligned to a page boundary\n",
+ sbi->linear_phys_addr);
+ goto out;
+ }
+ if ((unsigned long)(sbi->linear_virt_addr) == 0) {
+ printk(KERN_ERR "cramfs: virtual address for linear cramfs"
+ " image can't be 0\n");
+ goto out;
+ }
+ printk(KERN_INFO "cramfs: checking virtual address 0x%lx for"
+ " linear cramfs image\n",
+ (unsigned long)sbi->linear_virt_addr);
+ mutex_lock(&read_mutex);
+ ret = cramfs_check_super(sb, silent);
+ if (ret != 0)
+ goto out;
+
+ return 0;
+out:
+ return ret;
+#else
+ return -EINVAL;
+#endif
+}
+
+static int cramfs_fill_super_block(struct super_block *sb, int silent)
+{
+ int i;
+ int ret = -EINVAL;
+
+ printk("fill_super_block\n");
+ mutex_lock(&read_mutex);
+ if (!block_filesystems++) {
+ read_buffers = vmalloc(sizeof(*read_buffers) * READ_BUFFERS);
+ for (i = 0; i < READ_BUFFERS; i++) {
+ read_buffers[i] = vmalloc(sizeof(**read_buffers)
+ * BUFFER_SIZE);
+ }
+ buffer_blocknr = vmalloc(sizeof(*buffer_blocknr)
+ * READ_BUFFERS);
+ buffer_dev = vmalloc(sizeof(*buffer_dev) * READ_BUFFERS);
+ }
+ /* Invalidate the read buffers on mount: think disk change.. */
+ for (i = 0; i < READ_BUFFERS; i++)
+ buffer_blocknr[i] = -1;
+ ret = cramfs_check_super(sb, silent);
+ if (ret != 0)
+ goto out;
+
+ return 0;
+out:
+ return ret;
+}
+
+static int cramfs_fill_super(struct super_block *sb, void *data, int silent)
+{
+ int ret = 0;
+ struct cramfs_sb_info *sbi;
+
+ sb->s_flags |= MS_RDONLY;
+
+ sbi = kzalloc(sizeof(struct cramfs_sb_info), GFP_KERNEL);
+ if (!sbi)
+ return -ENOMEM;
+ sb->s_fs_info = sbi;
+
+ ret = cramfs_fill_super_physaddr(sb, data, silent);
+ ret = cramfs_fill_super_iomem(sb, data, silent);
+
+ if (!LINEAR_CRAMFS(sbi))
+ ret = cramfs_fill_super_block(sb, silent);
+
+ if (ret)
+ goto out;
+
return 0;
out:
kfree(sbi);
sb->s_fs_info = NULL;
- return -EINVAL;
+ return ret;
}
static int cramfs_statfs(struct dentry *dentry, struct kstatfs *buf)
@@ -404,7 +689,8 @@ static int cramfs_readdir(struct file *f
/*
* Lookup and fill in the inode data..
*/
-static struct dentry * cramfs_lookup(struct inode *dir, struct dentry
*dentry, struct nameidata *nd)
+static struct dentry * cramfs_lookup(struct inode *dir, struct dentry *dentry,
+ struct nameidata *nd)
{
unsigned int offset = 0;
int sorted;
@@ -416,7 +702,8 @@ static struct dentry * cramfs_lookup(str
char *name;
int namelen, retval;
- de = cramfs_read(dir->i_sb, OFFSET(dir) + offset, sizeof(*de)+256);
+ de = cramfs_read(dir->i_sb, OFFSET(dir) + offset,
+ sizeof(*de)+256);
name = (char *)(de+1);
/* Try to take advantage of sorted directories */
@@ -469,28 +756,47 @@ static int cramfs_readpage(struct file *
bytes_filled = 0;
if (page->index < maxblock) {
struct super_block *sb = inode->i_sb;
- u32 blkptr_offset = OFFSET(inode) + page->index*4;
- u32 start_offset, compr_len;
-
- start_offset = OFFSET(inode) + maxblock*4;
+ u32 blkptr_off = OFFSET(inode) + page->index*4;
+ u32 start_off, compr_len;
+
+#ifdef CONFIG_CRAMFS_XIP
+ if(CRAMFS_INODE_IS_XIP(inode)) {
+ blkptr_off =
+ PAGE_ALIGN(OFFSET(inode)) +
+ page->index * PAGE_CACHE_SIZE;
+ mutex_lock(&read_mutex);
+ memcpy(page_address(page),
+ cramfs_read(sb, blkptr_off, PAGE_CACHE_SIZE),
+ PAGE_CACHE_SIZE);
+ mutex_unlock(&read_mutex);
+ bytes_filled = PAGE_CACHE_SIZE;
+ pgdata = kmap(page);
+ } else {
+#endif /* CONFIG_CRAMFS_XIP */
+ start_off = OFFSET(inode) + maxblock*4;
mutex_lock(&read_mutex);
if (page->index)
- start_offset = *(u32 *) cramfs_read(sb, blkptr_offset-4, 4);
- compr_len = (*(u32 *) cramfs_read(sb, blkptr_offset, 4) - start_offset);
+ start_off = *(u32 *) cramfs_read(sb, blkptr_off-4, 4);
+ compr_len = (*(u32 *) cramfs_read(sb, blkptr_off, 4)
+ - start_off);
mutex_unlock(&read_mutex);
pgdata = kmap(page);
if (compr_len == 0)
; /* hole */
else if (compr_len > (PAGE_CACHE_SIZE << 1))
- printk(KERN_ERR "cramfs: bad compressed blocksize %u\n", compr_len);
+ printk(KERN_ERR "cramfs: bad compressed blocksize"
+ " %u\n", compr_len);
else {
mutex_lock(&read_mutex);
bytes_filled = cramfs_uncompress_block(pgdata,
PAGE_CACHE_SIZE,
- cramfs_read(sb, start_offset, compr_len),
+ cramfs_read(sb, start_off, compr_len),
compr_len);
mutex_unlock(&read_mutex);
}
+#ifdef CONFIG_CRAMFS_XIP
+ }
+#endif /* CONFIG_CRAMFS_XIP */
} else
pgdata = kmap(page);
memset(pgdata + bytes_filled, 0, PAGE_CACHE_SIZE - bytes_filled);
@@ -502,7 +808,8 @@ static int cramfs_readpage(struct file *
}
static const struct address_space_operations cramfs_aops = {
- .readpage = cramfs_readpage
+ .readpage = cramfs_readpage,
+ .get_xip_page = cramfs_get_xip_page,
};
/*
@@ -518,11 +825,11 @@ static const struct file_operations cram
.readdir = cramfs_readdir,
};
-static const struct inode_operations cramfs_dir_inode_operations = {
+static struct inode_operations cramfs_dir_inode_operations = {
.lookup = cramfs_lookup,
};
-static const struct super_operations cramfs_ops = {
+static struct super_operations cramfs_ops = {
.put_super = cramfs_put_super,
.remount_fs = cramfs_remount,
.statfs = cramfs_statfs,
@@ -531,16 +838,31 @@ static int cramfs_get_sb(struct file_sys
static int cramfs_get_sb(struct file_system_type *fs_type,
int flags, const char *dev_name, void *data, struct vfsmount *mnt)
{
+ if ((strstr((char *)data, "iomem=")) ||
+ (strstr((char *)data, "physaddr=")))
+ return get_sb_nodev(fs_type, flags, data, cramfs_fill_super,
+ mnt);
+
return get_sb_bdev(fs_type, flags, dev_name, data, cramfs_fill_super,
- mnt);
+ mnt);
+}
+
+void cramfs_kill_super(struct super_block *sb)
+{
+ struct cramfs_sb_info *sbi = CRAMFS_SB(sb);
+
+ if (LINEAR_CRAMFS(sbi)) {
+ kill_anon_super(sb);
+ } else {
+ kill_block_super(sb);
+ }
}
static struct file_system_type cramfs_fs_type = {
.owner = THIS_MODULE,
.name = "cramfs",
.get_sb = cramfs_get_sb,
- .kill_sb = kill_block_super,
- .fs_flags = FS_REQUIRES_DEV,
+ .kill_sb = cramfs_kill_super,
};
static int __init init_cramfs_fs(void)
@@ -550,6 +872,7 @@ static int __init init_cramfs_fs(void)
rv = cramfs_uncompress_init();
if (rv < 0)
return rv;
+
rv = register_filesystem(&cramfs_fs_type);
if (rv < 0)
cramfs_uncompress_exit();
diff -r 5114d0f3b003 -r 83e92f97054c include/linux/cramfs_fs.h
--- a/include/linux/cramfs_fs.h Tue Jun 05 11:37:21 2007 -0700
+++ b/include/linux/cramfs_fs.h Wed Jun 06 03:27:31 2007 -0700
@@ -84,6 +84,16 @@ struct cramfs_super {
| CRAMFS_FLAG_WRONG_SIGNATURE \
| CRAMFS_FLAG_SHIFTED_ROOT_OFFSET )
+#define LINEAR_CRAMFS(x) \
+(((x)->linear_phys_addr != 0) || ((x)->linear_virt_addr))
+#define CRAMFS_INODE_IS_XIP(x) ((x)->i_mode & S_ISVTX)
+
+#ifdef ioremap_cached
+#define CRAMFS_REMAP ioremap_cached
+#else
+#define CRAMFS_REMAP ioremap
+#endif
+
/* Uncompression interfaces to the underlying zlib */
int cramfs_uncompress_block(void *dst, int dstlen, void *src, int srclen);
int cramfs_uncompress_init(void);
diff -r 5114d0f3b003 -r 83e92f97054c include/linux/cramfs_fs_sb.h
--- a/include/linux/cramfs_fs_sb.h Tue Jun 05 11:37:21 2007 -0700
+++ b/include/linux/cramfs_fs_sb.h Wed Jun 06 03:27:31 2007 -0700
@@ -10,6 +10,8 @@ struct cramfs_sb_info {
unsigned long blocks;
unsigned long files;
unsigned long flags;
+ unsigned long linear_phys_addr;
+ void __iomem *linear_virt_addr;
};
static inline struct cramfs_sb_info *CRAMFS_SB(struct super_block *sb)
@@ -17,4 +19,16 @@ static inline struct cramfs_sb_info *CRA
return sb->s_fs_info;
}
+#ifdef CONFIG_UML
+/* These are just to help the UML version build */
+static inline void __iomem * ioremap(unsigned long offset, unsigned long size)
+{
+ return 0;
+}
+
+static inline void iounmap(volatile void __iomem *addr)
+{
+}
#endif
+
+#endif
diff -r 5114d0f3b003 -r 83e92f97054c init/do_mounts.c
--- a/init/do_mounts.c Tue Jun 05 11:37:21 2007 -0700
+++ b/init/do_mounts.c Wed Jun 06 03:27:31 2007 -0700
@@ -343,6 +343,15 @@ static int __init mount_nfs_root(void)
}
#endif
+static int __init mount_cramfs_linear_root(void)
+{
+ create_dev("/dev/root", ROOT_DEV);
+ if (do_mount_root("/dev/root","cramfs",root_mountflags,
+ root_mount_data) == 0)
+ return 1;
+ return 0;
+}
+
#if defined(CONFIG_BLK_DEV_RAM) || defined(CONFIG_BLK_DEV_FD)
void __init change_floppy(char *fmt, ...)
{
@@ -375,6 +384,11 @@ void __init change_floppy(char *fmt, ...
void __init mount_root(void)
{
+ if (ROOT_DEV == MKDEV(0, 0)) {
+ if (mount_cramfs_linear_root())
+ return;
+ printk (KERN_ERR "VFS: Unable to mount linear cramfs root.\n");
+ }
#ifdef CONFIG_ROOT_NFS
if (MAJOR(ROOT_DEV) == UNNAMED_MAJOR) {
if (mount_nfs_root())
I might be a little late in the discussion, but I somehow missed this
before. Please don't add this xip support to cramfs, because the
whole point of cramfs is to be a simple _compressed_ filesystem,
and we really don't want to add more complexity to it. Please
use something like the existing ext2 xip mode instead of add support
to romfs using the generic filemap methods.
Jared Hulbert wrote:
> (2) failed with the following messages. (This wasn't really busybox.
> It was xxd, not statically link, hence the issue with ld.so)
Could you try to figure what happend to subject page before? Was it
subject to copy on write? With what flags has this vma been mmaped?
thanks,
Carsten
On Wed, 2007-06-06 at 12:33 +0100, Christoph Hellwig wrote:
> I might be a little late in the discussion, but I somehow missed this
> before. Please don't add this xip support to cramfs, because the
> whole point of cramfs is to be a simple _compressed_ filesystem,
> and we really don't want to add more complexity to it. Please
> use something like the existing ext2 xip mode instead of add support
> to romfs using the generic filemap methods.
Too late :) The XIP cramfs patch is widely used in the embedded Linux
community and has been used for years. It fulfills a need for a small
XIP Flash file system. Hence our interest in getting it or some
variation into the mainline kernel.
On Wed, Jun 06, 2007 at 08:17:43AM -0700, Richard Griffiths wrote:
> Too late :) The XIP cramfs patch is widely used in the embedded Linux
> community and has been used for years. It fulfills a need for a small
> XIP Flash file system. Hence our interest in getting it or some
> variation into the mainline kernel.
That's not a reason to put it in as-is. Maybe you should have showed
up here before making the wrong decision to hack this into cramfs.
On 6/6/07, Christoph Hellwig <[email protected]> wrote:
> I might be a little late in the discussion, but I somehow missed this
> before. Please don't add this xip support to cramfs, because the
> whole point of cramfs is to be a simple _compressed_ filesystem,
> and we really don't want to add more complexity to it.
I estimate something on the order 5-10 million Linux phones use
something similar to these patches. I wonder if there are that many
provable users of of the simple cramfs. This is where the community
has taken cramfs.
Nevertheless, I understand your point. I wrote AXFS in part because
the hacks required to do XIP on cramfs where ugly, hacky, and complex.
Please review the latest patch in the thread, it's just a draft but
the changes required are not very complex now, especially in light of
the filemap_xip.c APIs being used. It just happens not to work, yet.
> Please
> use something like the existing ext2 xip mode instead of add support
> to romfs using the generic filemap methods.
What?? You mean like use xip_file_mmap() and implement
get_xip_page()? Did you read my latest patch?
On 6/6/07, Christoph Hellwig <[email protected]> wrote:
> On Wed, Jun 06, 2007 at 08:17:43AM -0700, Richard Griffiths wrote:
> > Too late :) The XIP cramfs patch is widely used in the embedded Linux
> > community and has been used for years. It fulfills a need for a small
> > XIP Flash file system. Hence our interest in getting it or some
> > variation into the mainline kernel.
>
> That's not a reason to put it in as-is. Maybe you should have showed
> up here before making the wrong decision to hack this into cramfs.
>
Please read the entire thread before passing judgements like this.
The hacks evolved over the last 8 years, and are really handy. We're
just trying to figure the best way to share them..
Christoph Hellwig wrote:
> I might be a little late in the discussion, but I somehow missed this
> before. Please don't add this xip support to cramfs, because the
> whole point of cramfs is to be a simple _compressed_ filesystem,
> and we really don't want to add more complexity to it. Please
> use something like the existing ext2 xip mode instead of add support
> to romfs using the generic filemap methods.
The clear advantage of using cramfs on embedded platforms over using the
ext2 stuff is, that one can choose per-file whether it should be
compressed or xip.
The real key is, to put both our ext2 stuff and the cramfs xip on a
common infrastructure. They should use the same file operations and
adress space operations for xip files rather then replicating each
others bugs.
If cramfs shall be kept simple, it might be time to fork that file
system. I don't see that need arise from the proposed solution. It can
become clean and sane with a little work on it. Look at the xip
extensions for ext2 for example, they don't bloat the filesystem too much.
so long,
Carsten
On Wed, Jun 06, 2007 at 09:07:16AM -0700, Jared Hulbert wrote:
> I estimate something on the order 5-10 million Linux phones use
> something similar to these patches. I wonder if there are that many
> provable users of of the simple cramfs. This is where the community
> has taken cramfs.
This is what a community disjoint to mainline development has hacked
cramfs in their trees into. Not a good rationale. This whole
"but we've always done it" attitute is a little annoying, really.
FYI: Cartsten had an xip fs for s390 aswell, and that evolved into
the filemap.c bits after a lot of rework an quite a few round of
review.
> Nevertheless, I understand your point. I wrote AXFS in part because
> the hacks required to do XIP on cramfs where ugly, hacky, and complex.
I can't find a reference to AXFS anywhere in this thread.
> > Please
> >use something like the existing ext2 xip mode instead of add support
> >to romfs using the generic filemap methods.
>
> What?? You mean like use xip_file_mmap() and implement
> get_xip_page()? Did you read my latest patch?
Yes. This is the highlevel way to go, just please don't hack it into
cramfs.
On Wed, Jun 06, 2007 at 06:15:04PM +0200, Carsten Otte wrote:
> The clear advantage of using cramfs on embedded platforms over using the
> ext2 stuff is, that one can choose per-file whether it should be
> compressed or xip.
> The real key is, to put both our ext2 stuff and the cramfs xip on a
> common infrastructure. They should use the same file operations and
> adress space operations for xip files rather then replicating each
> others bugs.
> If cramfs shall be kept simple, it might be time to fork that file
> system. I don't see that need arise from the proposed solution. It can
> become clean and sane with a little work on it. Look at the xip
> extensions for ext2 for example, they don't bloat the filesystem too much.
ext2 is a multi-purpose block based filesystem and you add support for
another bit of storage. cramfs is very specialized to storing read-only
compressed data on block devices. The embedded people already use them
on flash which is a little dumb, but now we add even more cludge for
a non-block based access.
The right way to architect xip for flash-based devices is to implement
a generic get_xip_page for mtd-based devices and integrate that into
an existing flash filesystem or write a simple new flash filesystem
tailored to that use case.
> On Wed, Jun 06, 2007 at 09:07:16AM -0700, Jared Hulbert wrote:
> > I estimate something on the order 5-10 million Linux phones use
> > something similar to these patches. I wonder if there are that many
> > provable users of of the simple cramfs. This is where the community
> > has taken cramfs.
>
> This is what a community disjoint to mainline development has hacked
> cramfs in their trees into. Not a good rationale. This whole
> "but we've always done it" attitute is a little annoying, really.
It is that disjointedness we are trying to address.
> FYI: Cartsten had an xip fs for s390 aswell, and that evolved into
> the filemap.c bits after a lot of rework an quite a few round of
> review.
Right. So now we leverage this filemap_xip.c in cramfs. Why is this a problem?
> > Nevertheless, I understand your point. I wrote AXFS in part because
> > the hacks required to do XIP on cramfs where ugly, hacky, and complex.
>
> I can't find a reference to AXFS anywhere in this thread.
No, it's not here. There's a year old thread referencing it.
> > > Please
> > >use something like the existing ext2 xip mode instead of add support
> > >to romfs using the generic filemap methods.
> >
> > What?? You mean like use xip_file_mmap() and implement
> > get_xip_page()? Did you read my latest patch?
>
> Yes. This is the highlevel way to go, just please don't hack it into
> cramfs.
Right, so this latest patch _does_ implement get_xip_page() and
xip_file_mmap(). Why not hack it into cramfs?
> The embedded people already use them
> on flash which is a little dumb, but now we add even more cludge for
> a non-block based access.
Please justify your assertion that using cramfs on flash is dumb.
What would be not dumb? In an embedded system with addressable Flash
the linear addressing cramfs is simple and elegant solution.
Removing support for block based access would drastically reduce the
complexity of cramfs. The non-block access bits of code are trivial
in comparison. Specifically which part of my patch represents
unwarranted, unfixable cludge?
> The right way to architect xip for flash-based devices is to implement
> a generic get_xip_page for mtd-based devices and integrate that into
> an existing flash filesystem or write a simple new flash filesystem
> tailored to that use case.
There is often no need for the complexity of the MTD for a readonly
compressed filesystem in the embedded world. I am intrigued by the
suggestion of a generic get_xip_page() for mtd-based devices. I fail
to see how get_xip_page() is not highly filesystem dependant. How
might a generic one work?
On 6/6/07, Carsten Otte <[email protected]> wrote:
> Jared Hulbert wrote:
> > (2) failed with the following messages. (This wasn't really busybox.
> > It was xxd, not statically link, hence the issue with ld.so)
> Could you try to figure what happend to subject page before? Was it
> subject to copy on write? With what flags has this vma been mmaped?
>
> thanks,
> Carsten
The vma->flags = 1875 = 0x753
This is:
VM_READ
VM_WRITE
VM_MAYREAD
VM_MAYEXEC
VM_GROWSDOWN
VM_GROWSUP
VM_PFNMAP
I assume no struct page exists for the pages of this file. When
vm_no_page was called it seems it failed on a pte check since there is
no backing page structure.
On Wed, Jun 06, 2007 at 11:40:58AM -0700, Jared Hulbert wrote:
> > The embedded people already use them
> >on flash which is a little dumb, but now we add even more cludge for
> >a non-block based access.
>
> Please justify your assertion that using cramfs on flash is dumb.
> What would be not dumb? In an embedded system with addressable Flash
> the linear addressing cramfs is simple and elegant solution.
Have to agree with Jared here, cramfs is a perfectly sensible thing to
use on many platforms.
Adding the ability to make particular files XIP on those platforms is
also quite reasonable.
The alternative would be to add a whole new filesystem to the kernel
(effectively obsoleting cramfs) just to add XIP support or to compile
in a second filesystem (ext2 w/XIP) just for a few files.
Keeping cramfs as a simple example filesystem is really not all that
worthwhile, given it's not much of an example.
--
Mathematics is the supreme nostalgia of our time.
--- Jared Hulbert <[email protected]> wrote:
> The vma->flags = 1875 = 0x753
>
> This is:
> VM_READ
> VM_WRITE
> VM_MAYREAD
> VM_MAYEXEC
> VM_GROWSDOWN
> VM_GROWSUP
> VM_PFNMAP
>
There was a mistake in Jared's previous post in this
thread. The vm_flags were already in hex, i.e. 0x1875
The settings were:
VM_READ
VM_EXEC
VM_MAYREAD
VM_MAYWRITE
VM_MAYEXEC
VM_DENYWRITE
VM_EXECUTABLE
A possible problem source is that VM_PFNMAP is not set.
Thus when vm_normal_page is called there is no associated
struct page.
Justin Treon
____________________________________________________________________________________
Park yourself in front of a world of choices in alternative vehicles. Visit the Yahoo! Auto Green Center.
http://autos.yahoo.com/green_center/
Christoph Hellwig wrote:
> The right way to architect xip for flash-based devices is to implement
> a generic get_xip_page for mtd-based devices and integrate that into
> an existing flash filesystem or write a simple new flash filesystem
> tailored to that use case.
I've had a few beer long discussion with Joern Engel and David
Woodhouse on this one. To cut a long discussion short: the current XIP
infrastructure is not sufficient to be used on top of mtd. We'd need
some extenstions:
- on get_xip_page() we'd need to state if we want the reference
read-only or read+write
- we need a put_xip_page() to return references
- and finally we need a callback for the referece, so that the mtd
driver can ask to get its reference back (in order to unmap from
userland when erasing a block)
While I fully agree, that a flash filesystem using xip would be very
desireable, the proposed cramfs extension is a totaly different beast
that has its own value to me: the ability to select per file whether
xip or compression is more efficient.
On Wed, Jun 06, 2007 at 11:26:38AM -0700, Jared Hulbert wrote:
> >FYI: Cartsten had an xip fs for s390 aswell, and that evolved into
> >the filemap.c bits after a lot of rework an quite a few round of
> >review.
>
> Right. So now we leverage this filemap_xip.c in cramfs. Why is this a
> problem?
Using filemap_xip.c is of course good. Hacking cramfs up to contain
totally different codepathes is not.
> >> What?? You mean like use xip_file_mmap() and implement
> >> get_xip_page()? Did you read my latest patch?
> >
> >Yes. This is the highlevel way to go, just please don't hack it into
> >cramfs.
>
> Right, so this latest patch _does_ implement get_xip_page() and
> xip_file_mmap(). Why not hack it into cramfs?
cramfs used to be a block based filesystem. You hack in support to
work directly on physical addresses. Your create a lot of different
codepathes that are only used for one of the usecases.
Instead of just talking here I've hacked up a small demo on how this
should be approached. I've taken cramfs with your modification and
remove all the support for block devices. In addition we now could
rip out the mutex serializing all reads and clean up the way xip
support is hooked in. The result is a slightly smaller and cleaner
filesystem that even more important doesn't require pulling in
the whole block layer which is especially important for embedded
devices at the lower end of the scala. This is just a proof of
concept and thus is still named cramfs, no docs or copyrights etc
are updated. It's still incomplete aswell because we'd probably
want a better selection of files to use xip for, and we'd need
a real API to talk to the lowlevel driver for xip instead of the
physical address hack. I still think it'd be even better to just
hook xip support into jffs or logfs because they give you a full
featured flash filesystem for all needs without the complexity
of strictly partitioning between xip-capable and write parts
of your storage.
The code is at http://verein.lst.de/~hch/cramfs-xip.tar.gz.
On Thu, Jun 07, 2007 at 07:07:54PM +0200, Carsten Otte wrote:
> I've had a few beer long discussion with Joern Engel and David
> Woodhouse on this one. To cut a long discussion short: the current XIP
> infrastructure is not sufficient to be used on top of mtd. We'd need
> some extenstions:
> - on get_xip_page() we'd need to state if we want the reference
> read-only or read+write
> - we need a put_xip_page() to return references
> - and finally we need a callback for the referece, so that the mtd
> driver can ask to get its reference back (in order to unmap from
> userland when erasing a block)
And we'll need that even when using cramfs. There's not way we'd
merge a hack where the user has to specify a physical address on
the mount command line.
> While I fully agree, that a flash filesystem using xip would be very
> desireable, the proposed cramfs extension is a totaly different beast
> that has its own value to me: the ability to select per file whether
> xip or compression is more efficient.
You'd of course want all that for a full flash filesystem aswell.
On Thu, Jun 07, 2007 at 08:37:07PM +0100, Christoph Hellwig wrote:
> The code is at http://verein.lst.de/~hch/cramfs-xip.tar.gz.
And for thus just wanting to take a quick glance, this is the
diff vs an out of tree cramfs where uncompress.c and cramfs_fs_sb.h
are merged into inode.c:
--- ./inode.c 2007/06/07 11:52:32 1.1
+++ ./inode.c 2007/06/07 14:06:04
@@ -36,6 +36,7 @@
unsigned long blocks;
unsigned long files;
unsigned long flags;
+ void __iomem *linear_virt_addr;
};
static inline struct cramfs_sb_info *CRAMFS_SB(struct super_block *sb)
@@ -43,12 +44,20 @@
return sb->s_fs_info;
}
+#define CRAMFS_INODE_IS_XIP(x) \
+ ((x)->i_mode & S_ISVTX)
+
+static const struct file_operations cramfs_xip_fops;
static const struct super_operations cramfs_ops;
static const struct inode_operations cramfs_dir_inode_operations;
static const struct file_operations cramfs_directory_operations;
static const struct address_space_operations cramfs_aops;
+static const struct address_space_operations cramfs_xip_aops;
+
+static struct backing_dev_info cramfs_backing_dev_info = {
+ .ra_pages = 0, /* No readahead */
+};
-static DEFINE_MUTEX(read_mutex);
static z_stream stream;
@@ -94,19 +103,31 @@
/* Struct copy intentional */
inode->i_mtime = inode->i_atime = inode->i_ctime = zerotime;
inode->i_ino = CRAMINO(cramfs_inode);
+
+ if (CRAMFS_INODE_IS_XIP(inode))
+ inode->i_mapping->backing_dev_info = &cramfs_backing_dev_info;
+
/* inode->i_nlink is left 1 - arguably wrong for directories,
but it's the best we can do without reading the directory
contents. 1 yields the right result in GNU find, even
without -noleaf option. */
if (S_ISREG(inode->i_mode)) {
- inode->i_fop = &generic_ro_fops;
- inode->i_data.a_ops = &cramfs_aops;
+ if (CRAMFS_INODE_IS_XIP(inode)) {
+ inode->i_fop = &cramfs_xip_fops;
+ inode->i_data.a_ops = &cramfs_xip_aops;
+ } else {
+ inode->i_fop = &generic_ro_fops;
+ inode->i_data.a_ops = &cramfs_aops;
+ }
} else if (S_ISDIR(inode->i_mode)) {
inode->i_op = &cramfs_dir_inode_operations;
inode->i_fop = &cramfs_directory_operations;
} else if (S_ISLNK(inode->i_mode)) {
inode->i_op = &page_symlink_inode_operations;
- inode->i_data.a_ops = &cramfs_aops;
+ if (CRAMFS_INODE_IS_XIP(inode))
+ inode->i_data.a_ops = &cramfs_xip_aops;
+ else
+ inode->i_data.a_ops = &cramfs_aops;
} else {
inode->i_size = 0;
inode->i_blocks = 0;
@@ -122,42 +143,11 @@
struct inode *inode = iget5_locked(sb, CRAMINO(cramfs_inode),
cramfs_iget5_test, cramfs_iget5_set,
cramfs_inode);
- if (inode && (inode->i_state & I_NEW)) {
+ if (inode && (inode->i_state & I_NEW))
unlock_new_inode(inode);
- }
return inode;
}
-/*
- * We have our own block cache: don't fill up the buffer cache
- * with the rom-image, because the way the filesystem is set
- * up the accesses should be fairly regular and cached in the
- * page cache and dentry tree anyway..
- *
- * This also acts as a way to guarantee contiguous areas of up to
- * BLKS_PER_BUF*PAGE_CACHE_SIZE, so that the caller doesn't need to
- * worry about end-of-buffer issues even when decompressing a full
- * page cache.
- */
-#define READ_BUFFERS (2)
-/* NEXT_BUFFER(): Loop over [0..(READ_BUFFERS-1)]. */
-#define NEXT_BUFFER(_ix) ((_ix) ^ 1)
-
-/*
- * BLKS_PER_BUF_SHIFT should be at least 2 to allow for "compressed"
- * data that takes up more space than the original and with unlucky
- * alignment.
- */
-#define BLKS_PER_BUF_SHIFT (2)
-#define BLKS_PER_BUF (1 << BLKS_PER_BUF_SHIFT)
-#define BUFFER_SIZE (BLKS_PER_BUF*PAGE_CACHE_SIZE)
-
-static unsigned char read_buffers[READ_BUFFERS][BUFFER_SIZE];
-static unsigned buffer_blocknr[READ_BUFFERS];
-static struct super_block * buffer_dev[READ_BUFFERS];
-static int next_buffer;
-
-
/* Returns length of decompressed data. */
static int cramfs_uncompress_block(void *dst, int dstlen, void *src,
int srclen)
@@ -194,78 +184,11 @@
*/
static void *cramfs_read(struct super_block *sb, unsigned int offset, unsigned int len)
{
- struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping;
- struct page *pages[BLKS_PER_BUF];
- unsigned i, blocknr, buffer, unread;
- unsigned long devsize;
- char *data;
+ struct cramfs_sb_info *sbi = CRAMFS_SB(sb);
if (!len)
return NULL;
- blocknr = offset >> PAGE_CACHE_SHIFT;
- offset &= PAGE_CACHE_SIZE - 1;
-
- /* Check if an existing buffer already has the data.. */
- for (i = 0; i < READ_BUFFERS; i++) {
- unsigned int blk_offset;
-
- if (buffer_dev[i] != sb)
- continue;
- if (blocknr < buffer_blocknr[i])
- continue;
- blk_offset = (blocknr - buffer_blocknr[i]) << PAGE_CACHE_SHIFT;
- blk_offset += offset;
- if (blk_offset + len > BUFFER_SIZE)
- continue;
- return read_buffers[i] + blk_offset;
- }
-
- devsize = mapping->host->i_size >> PAGE_CACHE_SHIFT;
-
- /* Ok, read in BLKS_PER_BUF pages completely first. */
- unread = 0;
- for (i = 0; i < BLKS_PER_BUF; i++) {
- struct page *page = NULL;
-
- if (blocknr + i < devsize) {
- page = read_mapping_page_async(mapping, blocknr + i,
- NULL);
- /* synchronous error? */
- if (IS_ERR(page))
- page = NULL;
- }
- pages[i] = page;
- }
-
- for (i = 0; i < BLKS_PER_BUF; i++) {
- struct page *page = pages[i];
- if (page) {
- wait_on_page_locked(page);
- if (!PageUptodate(page)) {
- /* asynchronous error */
- page_cache_release(page);
- pages[i] = NULL;
- }
- }
- }
-
- buffer = next_buffer;
- next_buffer = NEXT_BUFFER(buffer);
- buffer_blocknr[buffer] = blocknr;
- buffer_dev[buffer] = sb;
-
- data = read_buffers[buffer];
- for (i = 0; i < BLKS_PER_BUF; i++) {
- struct page *page = pages[i];
- if (page) {
- memcpy(data, kmap(page), PAGE_CACHE_SIZE);
- kunmap(page);
- page_cache_release(page);
- } else
- memset(data, 0, PAGE_CACHE_SIZE);
- data += PAGE_CACHE_SIZE;
- }
- return read_buffers[buffer] + offset;
+ return sbi->linear_virt_addr + offset;
}
static void cramfs_put_super(struct super_block *sb)
@@ -282,11 +205,12 @@
static int cramfs_fill_super(struct super_block *sb, void *data, int silent)
{
- int i;
struct cramfs_super super;
unsigned long root_offset;
struct cramfs_sb_info *sbi;
struct inode *root;
+ unsigned long phys_addr;
+ char *p;
sb->s_flags |= MS_RDONLY;
@@ -295,38 +219,59 @@
return -ENOMEM;
sb->s_fs_info = sbi;
- /* Invalidate the read buffers on mount: think disk change.. */
- mutex_lock(&read_mutex);
- for (i = 0; i < READ_BUFFERS; i++)
- buffer_blocknr[i] = -1;
+ p = strstr(data, "physaddr=");
+ if (!p)
+ goto out_kfree;
+
+ phys_addr = simple_strtoul(p + 9, NULL, 0);
+ if (phys_addr & (PAGE_SIZE-1)) {
+ printk(KERN_ERR "cramfs: physical address 0x%lx for linear"
+ "cramfs isn't aligned to a page boundary\n",
+ phys_addr);
+ goto out_kfree;
+ }
+
+ if (phys_addr == 0) {
+ printk(KERN_ERR "cramfs: physical address for linear cramfs"
+ "image can't be 0\n");
+ goto out_kfree;
+ }
+
+ printk(KERN_INFO "cramfs: checking physical address 0x%lx for linear"
+ "cramfs image\n", phys_addr);
+
+ /* Map only one page for now. Will remap it when fs size is known. */
+ sbi->linear_virt_addr = ioremap(phys_addr, PAGE_SIZE);
+ if (!sbi->linear_virt_addr) {
+ printk(KERN_ERR "cramfs: ioremap of the linear cramfs image"
+ "failed\n");
+ goto out_kfree;
+ }
/* Read the first block and get the superblock from it */
memcpy(&super, cramfs_read(sb, 0, sizeof(super)), sizeof(super));
- mutex_unlock(&read_mutex);
/* Do sanity checks on the superblock */
if (super.magic != CRAMFS_MAGIC) {
/* check at 512 byte offset */
- mutex_lock(&read_mutex);
memcpy(&super, cramfs_read(sb, 512, sizeof(super)), sizeof(super));
- mutex_unlock(&read_mutex);
if (super.magic != CRAMFS_MAGIC) {
if (!silent)
printk(KERN_ERR "cramfs: wrong magic\n");
- goto out;
+ goto out_iounmap;
}
}
/* get feature flags first */
if (super.flags & ~CRAMFS_SUPPORTED_FLAGS) {
printk(KERN_ERR "cramfs: unsupported filesystem features\n");
- goto out;
+ goto out_iounmap;
}
/* Check that the root inode is in a sane state */
if (!S_ISDIR(super.root.mode)) {
printk(KERN_ERR "cramfs: root is not a directory\n");
- goto out;
+ goto out_iounmap;
}
root_offset = super.root.offset << 2;
if (super.flags & CRAMFS_FLAG_FSID_VERSION_2) {
@@ -347,21 +292,34 @@
(root_offset != 512 + sizeof(struct cramfs_super))))
{
printk(KERN_ERR "cramfs: bad root offset %lu\n", root_offset);
- goto out;
+ goto out_iounmap;
}
/* Set it all up.. */
sb->s_op = &cramfs_ops;
root = get_cramfs_inode(sb, &super.root);
if (!root)
- goto out;
+ goto out_iounmap;
sb->s_root = d_alloc_root(root);
if (!sb->s_root) {
iput(root);
- goto out;
+ goto out_iounmap;
}
+
+ /* Remap the whole filesystem now */
+ iounmap(sbi->linear_virt_addr);
+ sbi->linear_virt_addr = ioremap(phys_addr, sbi->size);
+ if (!sbi->linear_virt_addr) {
+ printk(KERN_ERR "cramfs: ioremap of the linear cramfs image"
+ " failed\n");
+ goto out_iounmap;
+ }
+
return 0;
-out:
+
+ out_iounmap:
+ iounmap(sbi->linear_virt_addr);
+ out_kfree:
kfree(sbi);
sb->s_fs_info = NULL;
return -EINVAL;
@@ -382,6 +340,13 @@
return 0;
}
+static int cramfs_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE))
+ return -EINVAL;
+ return xip_file_mmap(file, vma);
+}
+
/*
* Read a cramfs directory entry.
*/
@@ -414,7 +379,6 @@
mode_t mode;
int namelen, error;
- mutex_lock(&read_mutex);
de = cramfs_read(sb, OFFSET(inode) + offset, sizeof(*de)+256);
name = (char *)(de+1);
@@ -427,7 +391,6 @@
memcpy(buf, name, namelen);
ino = CRAMINO(de);
mode = de->mode;
- mutex_unlock(&read_mutex);
nextoffset = offset + sizeof(*de) + namelen;
for (;;) {
if (!namelen) {
@@ -458,7 +421,6 @@
unsigned int offset = 0;
int sorted;
- mutex_lock(&read_mutex);
sorted = CRAMFS_SB(dir->i_sb)->flags & CRAMFS_FLAG_SORTED_DIRS;
while (offset < dir->i_size) {
struct cramfs_inode *de;
@@ -481,7 +443,6 @@
for (;;) {
if (!namelen) {
- mutex_unlock(&read_mutex);
return ERR_PTR(-EIO);
}
if (name[namelen-1])
@@ -495,7 +456,6 @@
continue;
if (!retval) {
struct cramfs_inode entry = *de;
- mutex_unlock(&read_mutex);
d_add(dentry, get_cramfs_inode(dir->i_sb, &entry));
return NULL;
}
@@ -503,7 +463,6 @@
if (sorted)
break;
}
- mutex_unlock(&read_mutex);
d_add(dentry, NULL);
return NULL;
}
@@ -522,23 +481,19 @@
u32 start_offset, compr_len;
start_offset = OFFSET(inode) + maxblock*4;
- mutex_lock(&read_mutex);
if (page->index)
start_offset = *(u32 *) cramfs_read(sb, blkptr_offset-4, 4);
compr_len = (*(u32 *) cramfs_read(sb, blkptr_offset, 4) - start_offset);
- mutex_unlock(&read_mutex);
pgdata = kmap(page);
if (compr_len == 0)
; /* hole */
else if (compr_len > (PAGE_CACHE_SIZE << 1))
printk(KERN_ERR "cramfs: bad compressed blocksize %u\n", compr_len);
else {
- mutex_lock(&read_mutex);
bytes_filled = cramfs_uncompress_block(pgdata,
PAGE_CACHE_SIZE,
cramfs_read(sb, start_offset, compr_len),
compr_len);
- mutex_unlock(&read_mutex);
}
} else
pgdata = kmap(page);
@@ -550,13 +505,34 @@
return 0;
}
+static struct page *cramfs_get_xip_page(struct address_space *mapping,
+ sector_t offset, int create)
+{
+ struct inode *inode = mapping->host;
+ struct cramfs_sb_info *sbi = CRAMFS_SB(inode->i_sb);
+ unsigned long address;
+
+ address = PAGE_ALIGN((unsigned long)
+ (sbi->linear_virt_addr + OFFSET(inode)));
+
+ /* FIXME -- This shouldn't be hard coded */
+ address += (offset * 512);
+
+ return virt_to_page(address);
+}
+
static const struct address_space_operations cramfs_aops = {
- .readpage = cramfs_readpage
+ .readpage = cramfs_readpage,
};
-/*
- * Our operations:
- */
+static const struct address_space_operations cramfs_xip_aops = {
+ .get_xip_page = cramfs_get_xip_page,
+};
+
+static const struct file_operations cramfs_xip_fops = {
+ .read = xip_file_read,
+ .mmap = cramfs_mmap,
+};
/*
* A directory can only readdir
@@ -580,16 +556,14 @@
static int cramfs_get_sb(struct file_system_type *fs_type,
int flags, const char *dev_name, void *data, struct vfsmount *mnt)
{
- return get_sb_bdev(fs_type, flags, dev_name, data, cramfs_fill_super,
- mnt);
+ return get_sb_nodev(fs_type, flags, data, cramfs_fill_super, mnt);
}
static struct file_system_type cramfs_fs_type = {
.owner = THIS_MODULE,
.name = "cramfs",
.get_sb = cramfs_get_sb,
- .kill_sb = kill_block_super,
- .fs_flags = FS_REQUIRES_DEV,
+ .kill_sb = kill_anon_super,
};
static int __init init_cramfs_fs(void)
On 6/7/07, Christoph Hellwig <[email protected]> wrote:
> On Thu, Jun 07, 2007 at 08:37:07PM +0100, Christoph Hellwig wrote:
> > The code is at http://verein.lst.de/~hch/cramfs-xip.tar.gz.
>
> And for thus just wanting to take a quick glance, this is the
> diff vs an out of tree cramfs where uncompress.c and cramfs_fs_sb.h
> are merged into inode.c:
Cool. I notice you removed my UML hacks... Why?
I just don't get one thing. This is almost a duplicate of
cramfs-block. Why would we prefer a fork with a lot of code
duplication to adding a couple alternate code paths in cramfs-block?
Also keep in mind there are several reasons why you might want to have
block access to to a XIP built cramfs image. I am unpersuaded that
this fork approach is fundamentally better.
On 6/7/07, Christoph Hellwig <[email protected]> wrote:
> On Thu, Jun 07, 2007 at 07:07:54PM +0200, Carsten Otte wrote:
> > I've had a few beer long discussion with Joern Engel and David
> > Woodhouse on this one. To cut a long discussion short: the current XIP
> > infrastructure is not sufficient to be used on top of mtd. We'd need
> > some extenstions:
> > - on get_xip_page() we'd need to state if we want the reference
> > read-only or read+write
> > - we need a put_xip_page() to return references
> > - and finally we need a callback for the referece, so that the mtd
> > driver can ask to get its reference back (in order to unmap from
> > userland when erasing a block)
>
> And we'll need that even when using cramfs. There's not way we'd
> merge a hack where the user has to specify a physical address on
> the mount command line.
Why not? For the use case in question the user usually manually
burned the image to a physical address before hand. Many of these
system don't have MTD turned on for this Flash, they don't need it
because they don't write to this Flash once the system is up.
> that even more important doesn't require pulling in
> the whole block layer which is especially important for embedded
> devices at the lower end of the scala.
Good point. That is a big oversight. Though I would prefer to handle
that in the same fs rather than fork.
> I still think it'd be even better to just
> hook xip support into jffs or logfs because they give you a full
> featured flash filesystem for all needs without the complexity
> of strictly partitioning between xip-capable and write parts
> of your storage.
This is nirvana. But it is not the goal of the patches in question.
In fact there are several use cases that don't need and don't value
the writeability and don't need therefore the overhead. It is a long
term goal never the less.
On Thu, Jun 07, 2007 at 02:11:44PM -0700, Jared Hulbert wrote:
> >that even more important doesn't require pulling in
> >the whole block layer which is especially important for embedded
> >devices at the lower end of the scala.
>
> Good point. That is a big oversight. Though I would prefer to handle
> that in the same fs rather than fork.
If if were actually talking about complex filesystem I'd agree. But
the cramfs xip patch posted here touches about 2/3 of the number of
lines that cramfs has in total. And cramfs is not exactly the best
base to start with..
> >I still think it'd be even better to just
> >hook xip support into jffs or logfs because they give you a full
> >featured flash filesystem for all needs without the complexity
> >of strictly partitioning between xip-capable and write parts
> >of your storage.
>
> This is nirvana. But it is not the goal of the patches in question.
> In fact there are several use cases that don't need and don't value
> the writeability and don't need therefore the overhead. It is a long
> term goal never the less.
With the filemap_xip.c helpers adding xip support to any filesystem
is pretty trivial for the highlevel filesystem operations. The only
interesting bit is the lowlevel code (the get_xip_page method and
the others Carsten mentioned), but we need to do these lowlevel
code in a generic and proper way anyway.
I'll try to hack up an xip prototype for jffs2 next week.
> I've had a few beer long discussion with Joern Engel and David
> Woodhouse on this one. To cut a long discussion short: the current XIP
> infrastructure is not sufficient to be used on top of mtd. We'd need
> some extenstions:
> - on get_xip_page() we'd need to state if we want the reference
> read-only or read+write
> - we need a put_xip_page() to return references
> - and finally we need a callback for the referece, so that the mtd
> driver can ask to get its reference back (in order to unmap from
> userland when erasing a block)
Yes. And one more thing. We can't assume every page in a file is XIP or not.
However, I still can't get even the existing get_xip_page() to work
for me so we are getting ahead of ourselves;) Looking back on this
thread I realize I haven't confirmed if my cramfs_get_xip_page() gets
a page struct. I assume that is my problem? The UML find_iomem()
probably returns psuedo iomem with page structs. While ioremap() does
not return with page struct backed memory.
> If if were actually talking about complex filesystem I'd agree. But
> the cramfs xip patch posted here touches about 2/3 of the number of
> lines that cramfs has in total.
Fair enough. But look at the complexity rather than number of lines.
It adds tedium to the cramfs_fill_super and one extra level of
indirection to a hand full of ops like mmap() and cramfs_read(). But
the changes to the real meat of cramfs, cramfs_readpage(), are limited
to the XIP changes, which I want on block devices anyway.
So if we did fork cramfs I would submit a simple patch to cramfs for
XIP support on block devices and I would submit a patch for a new
filesystem, cramfs-linear. Cramfs-linear would have an exact copy of
1/3 of the cramfs code such as cramfs_readpage(), it would use the
same headers, and it would use the same userspace tools.
This fork is what the community wants? Speak up!
>And cramfs is not exactly the best base to start with..
This is a moot point, there is a significant installed base issue.
There are lots of cramfs-linear-xip based systems in existance with
can't be easily ported to newer kernel because of a lack of support.
> > This is nirvana. But it is not the goal of the patches in question.
> > In fact there are several use cases that don't need and don't value
> > the writeability and don't need therefore the overhead. It is a long
> > term goal never the less.
>
> With the filemap_xip.c helpers adding xip support to any filesystem
> is pretty trivial for the highlevel filesystem operations. The only
> interesting bit is the lowlevel code (the get_xip_page method and
> the others Carsten mentioned), but we need to do these lowlevel
> code in a generic and proper way anyway.
It's not that trivial. The filesystem needs to meet several
requirements such as, having data nodes that are page aligned.
Anytime any changes are made to any page in the underlying Flash block
or if the Flash physical partition goes out of read mode you've got to
hide that from userspace or otherwise deal with it. A filesystem that
doesn't understand these subtle hardware requirements would either not
work at all, have lots of deadlock issues, or at least have terrible
performance problems. Nevertheless I supose a simple, but invasive,
hack could likely produce a worthwhile proof of concept.
I think this is worthy of it's own thread....
> I'll try to hack up an xip prototype for jffs2 next week.
Very cool. I can't wait to see what you have in mind. But remember
this doesn't solve the problem of the huge installed base of
cramfs-linear-xip images.
Gee I think it seems logfs would be a better choice. Jffs2 and
ubifs(jffs3) for that matter combine node and node header in series
which means your data nodes aren't aligned to page boundarys. Logfs
nodes could be more easily aligned.
Christoph Hellwig wrote:
> On Thu, Jun 07, 2007 at 07:07:54PM +0200, Carsten Otte wrote:
>> I've had a few beer long discussion with Joern Engel and David
>> Woodhouse on this one. To cut a long discussion short: the current XIP
>> infrastructure is not sufficient to be used on top of mtd. We'd need
>> some extenstions:
>> - on get_xip_page() we'd need to state if we want the reference
>> read-only or read+write
>> - we need a put_xip_page() to return references
>> - and finally we need a callback for the referece, so that the mtd
>> driver can ask to get its reference back (in order to unmap from
>> userland when erasing a block)
>
> And we'll need that even when using cramfs. There's not way we'd
> merge a hack where the user has to specify a physical address on
> the mount command line.
Nay, the proposed solution is read only. From what I understood of the
discussion, it does'nt need that. The entire flash memory can be read
at the same time. Things start to get interresting when you want to
write, then you need to care about erase blocks and such.
>> While I fully agree, that a flash filesystem using xip would be very
>> desireable, the proposed cramfs extension is a totaly different beast
>> that has its own value to me: the ability to select per file whether
>> xip or compression is more efficient.
>
> You'd of course want all that for a full flash filesystem aswell.
True.
so long,
Carsten
On Fri, Jun 08, 2007 at 09:17:18AM +0200, Carsten Otte wrote:
> Christoph Hellwig wrote:
> >On Thu, Jun 07, 2007 at 07:07:54PM +0200, Carsten Otte wrote:
> >>I've had a few beer long discussion with Joern Engel and David
> >>Woodhouse on this one. To cut a long discussion short: the current XIP
> >>infrastructure is not sufficient to be used on top of mtd. We'd need
> >>some extenstions:
> >>- on get_xip_page() we'd need to state if we want the reference
> >>read-only or read+write
> >>- we need a put_xip_page() to return references
> >>- and finally we need a callback for the referece, so that the mtd
> >>driver can ask to get its reference back (in order to unmap from
> >>userland when erasing a block)
> >
> >And we'll need that even when using cramfs. There's not way we'd
> >merge a hack where the user has to specify a physical address on
> >the mount command line.
> Nay, the proposed solution is read only. From what I understood of the
> discussion, it does'nt need that. The entire flash memory can be read
> at the same time. Things start to get interresting when you want to
> write, then you need to care about erase blocks and such.
We can have a simpler variant as a start if we really want. But we
need to pass it through the mtd layer. There is a reason we have this
thing called devices drivers, and we don't want to add knowledge of
ioremap to the filesystems and have users find out physical addresses
of their flash to pass it as mount option.
On Thu, Jun 07, 2007 at 01:34:12PM -0700, Jared Hulbert wrote:
> >And we'll need that even when using cramfs. There's not way we'd
> >merge a hack where the user has to specify a physical address on
> >the mount command line.
>
> Why not? For the use case in question the user usually manually
> burned the image to a physical address before hand. Many of these
> system don't have MTD turned on for this Flash, they don't need it
> because they don't write to this Flash once the system is up.
Then add a small device layer for it. Remember that linux is not all
about hacked up embedded devices that get shipped once and never
touched again. If we put something in mainline we want proper layering,
and putting ioremap (which doesn't even exist on some architectures) into
a filesystem and having user remember memory addresses and pass them
to mount doesn't fit there.
Christoph Hellwig wrote:
> And for thus just wanting to take a quick glance, this is the
> diff vs an out of tree cramfs where uncompress.c and cramfs_fs_sb.h
> are merged into inode.c:
Mkay, I am convinced. Either cramfs should be based on a block device
that does have direct_access() for xip, or the code needs to be
forked. Looks like a sane approach to me.
so long,
Carsten
Christoph Hellwig wrote:
> We can have a simpler variant as a start if we really want. But we
> need to pass it through the mtd layer. There is a reason we have this
> thing called devices drivers, and we don't want to add knowledge of
> ioremap to the filesystems and have users find out physical addresses
> of their flash to pass it as mount option.
I see your poing with regard to layering, and I agree to it.
In order to do ioremap/iounmap at least we need a counterpart
put_xip_page thing to do iounmap in that path. Our dcss segments on
390 did not raise that requirement: they had a permanent kernel mapping.
The quiz question to me is: why don't we establish a permanenet
mapping of the entire thing from mount() to unmount(). That eliminates
the need to do iomap/iounmap, eliminates the need to have
put_xip_page, and eliminates to care about what layer would do this.
Would work for cramfs, won't work for read+write flash filesystems.
On Fri, Jun 08, 2007 at 09:50:17AM +0200, Carsten Otte wrote:
> I see your poing with regard to layering, and I agree to it.
> In order to do ioremap/iounmap at least we need a counterpart
> put_xip_page thing to do iounmap in that path. Our dcss segments on
> 390 did not raise that requirement: they had a permanent kernel mapping.
> The quiz question to me is: why don't we establish a permanenet
> mapping of the entire thing from mount() to unmount(). That eliminates
> the need to do iomap/iounmap, eliminates the need to have
> put_xip_page, and eliminates to care about what layer would do this.
> Would work for cramfs, won't work for read+write flash filesystems.
Jared's patch currently does ioremap on mount (and no iounmap at all).
That mapping needs to move from the filesystem to the device driver.
Christoph Hellwig wrote:
> Jared's patch currently does ioremap on mount (and no iounmap at all).
> That mapping needs to move from the filesystem to the device driver.
The device driver needs to do ioremap on open(), and iounmap() on
release. That's effectively what our block driver does.
On Fri, Jun 08, 2007 at 09:59:20AM +0200, Carsten Otte wrote:
> Christoph Hellwig wrote:
> >Jared's patch currently does ioremap on mount (and no iounmap at all).
> >That mapping needs to move from the filesystem to the device driver.
> The device driver needs to do ioremap on open(), and iounmap() on
> release. That's effectively what our block driver does.
Yes, exactly.
On Thu, 7 June 2007 22:15:19 +0100, Christoph Hellwig wrote:
>
> With the filemap_xip.c helpers adding xip support to any filesystem
> is pretty trivial for the highlevel filesystem operations. The only
> interesting bit is the lowlevel code (the get_xip_page method and
> the others Carsten mentioned), but we need to do these lowlevel
> code in a generic and proper way anyway.
>
> I'll try to hack up an xip prototype for jffs2 next week.
I'd absolutely love to see this code! However, it may be a little
harder than you think.
XIP is basically limited to NOR flash. And the NOR interface is to
simply map all flash to some physical memory area and interpret
reads/writes to/from this area in a special way. Internally the flash
chip is implementing a state machine that is controlled by writes and
defines the data read.
Easiest case: read_array state. In this state, every address will
return the flash content associated to that address on read. Pretty
much any write will cause a transition to a different state, though.
Any state other than read_array will return flash registers on any read.
In other words, writing to flash prohibits XIP, at least temporarily. A
read-write xip flash filesystem needs to tear down any xip mapping
before writing and have the flash chip return to read_array mode before
setting up a new mapping.
If the underlying device consists of several chips or of chips that
implement seperate state machines for different areas of the chip (some
Intel chips), only mappings belonging to the chip/area currently being
written are affected. That should make quite a nice optimizations,
although it can be ignored for a first shot.
Jörn
--
Joern's library part 12:
http://physics.nist.gov/cuu/Units/binary.html
On Thu, 7 June 2007 15:59:00 -0700, Jared Hulbert wrote:
>
> Gee I think it seems logfs would be a better choice. Jffs2 and
> ubifs(jffs3) for that matter combine node and node header in series
> which means your data nodes aren't aligned to page boundarys. Logfs
> nodes could be more easily aligned.
That used to be the case before I did a format change. Compression
pretty much requires a format very similar to JFFS2. The alternative
would be _huge_ complexity.
However, it would be possible to create specialized XIP segments with
the old page-aligned format. Supporting two kinds of segment format is
still simpler than supporting compression with the old format.
Jörn
--
With a PC, I always felt limited by the software available. On Unix,
I am limited only by my knowledge.
-- Peter J. Schoenster
On 6/8/07, Christoph Hellwig <[email protected]> wrote:
> On Thu, Jun 07, 2007 at 01:34:12PM -0700, Jared Hulbert wrote:
> > >And we'll need that even when using cramfs. There's not way we'd
> > >merge a hack where the user has to specify a physical address on
> > >the mount command line.
> >
> > Why not? For the use case in question the user usually manually
> > burned the image to a physical address before hand. Many of these
> > system don't have MTD turned on for this Flash, they don't need it
> > because they don't write to this Flash once the system is up.
>
> Then add a small device layer for it. Remember that linux is not all
> about hacked up embedded devices that get shipped once and never
> touched again.
Remember that linux is not all about big iron machines with lots of
processors and gigabytes of RAM :)
I concede your layer point, ioremap() doesn't belong in the filesystem.
On 6/8/07, Christoph Hellwig <[email protected]> wrote:
> On Fri, Jun 08, 2007 at 09:59:20AM +0200, Carsten Otte wrote:
> > Christoph Hellwig wrote:
> > >Jared's patch currently does ioremap on mount (and no iounmap at all).
> > >That mapping needs to move from the filesystem to the device driver.
> > The device driver needs to do ioremap on open(), and iounmap() on
> > release. That's effectively what our block driver does.
>
> Yes, exactly.
Okay so we need some driver that opens/closes this ROM. This has been
done from the dcss block device but that doesn't make sense for most
embedded systems. The MTD allows for this with point(),unpoint().
That should work just fine. It does introduce the MTD as a dependancy
which is unnecessary in many systems, but it will work now.
On Fri, Jun 08, 2007 at 09:05:32AM -0700, Jared Hulbert wrote:
> Okay so we need some driver that opens/closes this ROM. This has been
> done from the dcss block device but that doesn't make sense for most
> embedded systems. The MTD allows for this with point(),unpoint().
> That should work just fine. It does introduce the MTD as a dependancy
> which is unnecessary in many systems, but it will work now.
The Linux solution to this problem would be to introduce an option for
mtd write support. That way the majority of the code doesn't heave to
be compiled for the read-only case but you still get a uniform interface.
> On Fri, Jun 08, 2007 at 09:05:32AM -0700, Jared Hulbert wrote:
> > Okay so we need some driver that opens/closes this ROM. This has been
> > done from the dcss block device but that doesn't make sense for most
> > embedded systems. The MTD allows for this with point(),unpoint().
> > That should work just fine. It does introduce the MTD as a dependancy
> > which is unnecessary in many systems, but it will work now.
>
> The Linux solution to this problem would be to introduce an option for
> mtd write support. That way the majority of the code doesn't heave to
> be compiled for the read-only case but you still get a uniform interface.
You mean make an MTD-light interface possible?
On Fri, Jun 08, 2007 at 09:11:47AM -0700, Jared Hulbert wrote:
> >On Fri, Jun 08, 2007 at 09:05:32AM -0700, Jared Hulbert wrote:
> >> Okay so we need some driver that opens/closes this ROM. This has been
> >> done from the dcss block device but that doesn't make sense for most
> >> embedded systems. The MTD allows for this with point(),unpoint().
> >> That should work just fine. It does introduce the MTD as a dependancy
> >> which is unnecessary in many systems, but it will work now.
> >
> >The Linux solution to this problem would be to introduce an option for
> >mtd write support. That way the majority of the code doesn't heave to
> >be compiled for the read-only case but you still get a uniform interface.
>
> You mean make an MTD-light interface possible?
I wouldn't call it that. The interface should stay the same except that
write operations are not supported. In mtd_info you'd just have
point/unpoint and read/read_oob but no write operations and all the backing
code wouldn't be compiled in aswell.
On Fri, 8 June 2007 17:15:26 +0100, Christoph Hellwig wrote:
>
> I wouldn't call it that. The interface should stay the same except that
> write operations are not supported. In mtd_info you'd just have
> point/unpoint and read/read_oob but no write operations and all the backing
> code wouldn't be compiled in aswell.
Nitbit: Sooner or later the point/unpoint should get replaced by
something with page granularity. Something also needs to keep lists of
mapped pages and invalidate them whenever the device is written to.
That could be done in the filesystem or device driver. I believe the
device driver would be a better solution.
Jörn
--
Write programs that do one thing and do it well. Write programs to work
together. Write programs to handle text streams, because that is a
universal interface.
-- Doug MacIlroy
Jörn Engel wrote:
> Nitbit: Sooner or later the point/unpoint should get replaced by
> something with page granularity. Something also needs to keep lists of
> mapped pages and invalidate them whenever the device is written to.
> That could be done in the filesystem or device driver. I believe the
> device driver would be a better solution.
I think it needs to work like this:
- temporary references (for read/write syscalls and friends) get
retrieved via get_xip_page and returned again via to-be-invented
put_page/page_cache_release
- permanent references (for mapping to userland) get retrieved via
get_xip_page and don't get returned until unmap
- the device driver can access page->count via a helper function
provided by mm. This way, it can identify which pages are in use.
- In order to get references back, the device driver can call a
callback provided by the file system. A default implementation will go
to filemap_xip.c. This callback would use rmap to find all mappings,
and unmap the page via xip_file_unmap()[mm/filemap_xip.c].
The nice thing about this approach is: we use page->count and rmap,
both already exist and are perfectly suited for our purpose.
The downside: We need mem_map[] struct page entries behind all memory
segments. Nowerdays we can easily create those via vmem_map/sparsemem.
Opinions?
Carsten Otte wrote:
> - temporary references (for read/write syscalls and friends) get
> retrieved via get_xip_page and returned again via to-be-invented
> put_page/page_cache_release
Doh! Meant to state: returned again via stock
put_page/page_cache_release rather then inventing something new.
Whenever writes/erases to the device happen, the device driver would
need to call a function like
/**
* unmap_page_range - remove all mapping to the given range of an address space
* @mapping - the address space in question
* @start_index - index of the first page in the range
* @no_pages - number of pages to get unmapped
*
* Returns 0 on success or a negative errno value.
*/
int unmap_page_range(struct address_space *mapping, loff_t start_index,
loff_t no_pages);
or implement something equivalent itself. Your filesystem callback
looks like it would be just that, although I may be misreading you.
On Fri, 8 June 2007 21:04:17 +0200, Carsten Otte wrote:
> I think it needs to work like this:
> - temporary references (for read/write syscalls and friends) get
> retrieved via get_xip_page and returned again via to-be-invented
> put_page/page_cache_release
Either that or using standard mtd->read() and mtd->write() calls. I see
some advantages to mtd->write() in particular, as the device driver
needs some notification to trigger unmap_page_range() before the actual
write and chip state transitions happen. mtd->write() seems much easier
than something like
mtd->pre_write()
get_xip_page()
...
put_page()
mtd->post_write()
If get_xip_page() only has userland consumers all the locking can be
kept inside device drivers.
> - permanent references (for mapping to userland) get retrieved via
> get_xip_page and don't get returned until unmap
Yep.
> - the device driver can access page->count via a helper function
> provided by mm. This way, it can identify which pages are in use.
One of us is confused here. The driver would have to check page->count
for a large range of pages, usually the whole chip. And it would have
to tear down every single mapping before starting to write. Is that
possible and desirable to do with page->count? Unsure.
> - In order to get references back, the device driver can call a
> callback provided by the file system. A default implementation will go
> to filemap_xip.c. This callback would use rmap to find all mappings,
> and unmap the page via xip_file_unmap()[mm/filemap_xip.c].
Jörn
--
Time? What's that? Time is only worth what you do with it.
-- Theo de Raadt
> The downside: We need mem_map[] struct page entries behind all memory
> segments. Nowerdays we can easily create those via vmem_map/sparsemem.
>
> Opinions?
Frankly this is going to be mostly relevant on ARM architectures at
least at first. Maybe I'm missing something but I don't see that
sparemem is supported on ARM...
Jörn Engel wrote:
> Whenever writes/erases to the device happen, the device driver would
> need to call a function like
>
> /**
> * unmap_page_range - remove all mapping to the given range of an address space
> * @mapping - the address space in question
> * @start_index - index of the first page in the range
> * @no_pages - number of pages to get unmapped
> *
> * Returns 0 on success or a negative errno value.
> */
> int unmap_page_range(struct address_space *mapping, loff_t start_index,
> loff_t no_pages);
>
> or implement something equivalent itself. Your filesystem callback
> looks like it would be just that, although I may be misreading you.
No. That's how the callback could look alike.
> Either that or using standard mtd->read() and mtd->write() calls. I see
> some advantages to mtd->write() in particular, as the device driver
> needs some notification to trigger unmap_page_range() before the actual
> write and chip state transitions happen. mtd->write() seems much easier
> than something like
>
> mtd->pre_write()
> get_xip_page()
> ...
> put_page()
> mtd->post_write()
>
> If get_xip_page() only has userland consumers all the locking can be
> kept inside device drivers.
Hmmh. We won't need mtd->pre_write(), because the file system's
get_xip_page aop will have to ask mtd for the address anyway similar
to the way ext2_get_xip_page does call bdev_ops->direct_access.
If that call would pass the information whether the future access is
read-only or read+write, the device driver could do its housekeeping.
I think we should also cover put_page() plus mtd->post_write() inside
a put_xip_page() address space operation provided by the fs. This way
calls are balanced, and we have stuff in a single place rather then
duplicating code. The fs could rely on a generic implementation in
filemap_xip in case it does'nt need to do its own magic here.
>> - the device driver can access page->count via a helper function
>> provided by mm. This way, it can identify which pages are in use.
>
> One of us is confused here. The driver would have to check page->count
> for a large range of pages, usually the whole chip. And it would have
> to tear down every single mapping before starting to write. Is that
> possible and desirable to do with page->count? Unsure.
That point was indeed confusing. Let my try again:
The only way, how an initial reference to a page can be retrieved, is
from the driver by using a bdev_ops->direct_access alike call. The
driver has to be able to check, if all references have been returned.
That would be done in three steps:
1. stop handing out new references
2. use the unmap_page_range callback to get references for page table
entries back
3. check if other (temporary) references are still handed out, wait
until they get returned.
Using a helper function to look into page->count is a possible
implementation for #3. And because the page->count cannot get changed
while noone has a reference, I don't see a race condition in looping
over all pages and checking page->count. This implementation has an
advantage, if the device driver wants to make sure that a small page
range is not accessed.
If the device driver needs to ensure that there are no references to
any page on the enitre flash media, it might use a global counter to
count the sum of references and save looping over all pages.
so long,
Carsten
Justin Treon wrote:
> --- Carsten Otte <[email protected]> wrote:
>> The nice thing about this approach is: we use page->count and rmap,
>> both already exist and are perfectly suited for our purpose.
>> The downside: We need mem_map[] struct page entries behind all memory
>> segments. Nowerdays we can easily create those via vmem_map/sparsemem.
> Can you give an example of the proper way to use sparsemem in this case?
We use vmem_map on 390:
- __segment_load[arch/s390/mm/extmem.c] calls add_shared_memory() to
create struct page entries behind a segment
- segment_unload (same file) calls remove_shared_memory() to remove
parts of the mem_map array again
- arch/s390/mm/vmem.c implements both callbacks
As far as I know, this can also be archived with sparsemem.
Unfortunately I don't have an example for sparsemem at hand, I would
try to use sparse_add_one_section()[mm/sparse.c].
sorry,
Carsten
On Sat, 9 June 2007 09:55:15 +0200, Carsten Otte wrote:
> Jörn Engel wrote:
>
> >Either that or using standard mtd->read() and mtd->write() calls. I see
> >some advantages to mtd->write() in particular, as the device driver
> >needs some notification to trigger unmap_page_range() before the actual
> >write and chip state transitions happen. mtd->write() seems much easier
> >than something like
> >
> >mtd->pre_write()
> >get_xip_page()
> >...
> >put_page()
> >mtd->post_write()
> >
> >If get_xip_page() only has userland consumers all the locking can be
> >kept inside device drivers.
> Hmmh. We won't need mtd->pre_write(), because the file system's
> get_xip_page aop will have to ask mtd for the address anyway similar
> to the way ext2_get_xip_page does call bdev_ops->direct_access.
>
> If that call would pass the information whether the future access is
> read-only or read+write, the device driver could do its housekeeping.
>
> I think we should also cover put_page() plus mtd->post_write() inside
> a put_xip_page() address space operation provided by the fs. This way
> calls are balanced, and we have stuff in a single place rather then
> duplicating code. The fs could rely on a generic implementation in
> filemap_xip in case it does'nt need to do its own magic here.
That sounds like an awefully long critical section. Remember, while the
write is going on (the chip isn't in read_array mode) the cannot be any
xip pages used anywhere in the system. All existing mapping have to be
revoked and no new mappings created handed out while this goes on.
If we use a single call like mtd->write(), this critical section is
limited to one call and the device driver can keep it as short as
possible. Having two calls and allowing crappy filesystems to spend a
long time between them... ugh.
Another thing that just dawned upon me is that the device driver must be
in control of xip. You can partition a single chip and use two
filesystems on it. If one fs is xip-aware and the other is, say, jffs2,
only the device driver can revoke mapping when jffs2 writes to the chip.
> >>- the device driver can access page->count via a helper function
> >>provided by mm. This way, it can identify which pages are in use.
> >
> >One of us is confused here. The driver would have to check page->count
> >for a large range of pages, usually the whole chip. And it would have
> >to tear down every single mapping before starting to write. Is that
> >possible and desirable to do with page->count? Unsure.
> That point was indeed confusing. Let my try again:
You obviously have more experience with xip and quickly go into details
I don't understand yet. So before I try to decipher your explanation
again, let me give you a very high level picture. Something simple
enough for me to understand. :)
Fundamentally four kinds of functionality are required:
1 Handing out new xip pages for userspace.
2 Destroying xip pages for userspace.
3 Temporarily revoking any pages given out for current writes.
4 Handling page faults on revoked pages.
5 Reenabling revoked pages.
AFAICS, 1 and 2 are required for dcss xip as well. 3 is new and only
necessary for flash. 4 should be mostly old, 5 again is new.
Most likely 3 will require a TLB flush on most architectures plus some
provisions that page faults on the xip pages are handled by putting the
processes on a wait queue, done in 4. 5 will remove those provisions
again, wake everything on the waitqueue and allow userspace to proceed
accessing xip pages.
Roughly sane?
Jörn
--
To announce that there must be no criticism of the President, or that we
are to stand by the President, right or wrong, is not only unpatriotic
and servile, but is morally treasonable to the American public.
-- Theodore Roosevelt, Kansas City Star, 1918
> Nick Piggin wrote:
> > The question is, why is that not enough (I haven't looked at these
> > patches enough to work out if there is anything more they provide).
> I think, it just takes trying things out. From reading the code, I
> think this should work well for the filemap_xip code with no struct page.
> Also, we need eliminate nopage() to get rid of the struct page.
> Unfortunately I don't find time to try this out for now, and on 390 we
> can live with struct page for the time being. In contrast to the
> embedded platforms, the mem_mep array gets swapped out to disk by our
> hypervisor.
Can you help me understand the comment about nopage()? Do you mean
set xip_file_vm_ops.nopage to NULL?
Jared Hulbert wrote:
>> Nick Piggin wrote:
>> > The question is, why is that not enough (I haven't looked at these
>> > patches enough to work out if there is anything more they provide).
>> I think, it just takes trying things out. From reading the code, I
>> think this should work well for the filemap_xip code with no struct page.
>> Also, we need eliminate nopage() to get rid of the struct page.
>> Unfortunately I don't find time to try this out for now, and on 390 we
>> can live with struct page for the time being. In contrast to the
>> embedded platforms, the mem_mep array gets swapped out to disk by our
>> hypervisor.
>
> Can you help me understand the comment about nopage()? Do you mean
> set xip_file_vm_ops.nopage to NULL?
Yes, but not without replacement. Today, the page table entry for xip
mappings is created like this:
__handle_mm_fault calls handle_pte_fault, which calls do_no_page [all
in mm/memory.c]. do_no_page does call the ->nopage operation, and
creates a page table entry. The ->nopage operation is set to
xip_file_nopage() [mm/filemap_xip.c], which calls the get_xip_page
address space operation. For ext2, this is implemented in
ext2_get_xip_page (fs/ext2/xip.c).
In this process, the struct page entry in mem_map is looked up in
ext2_get_xip_page(), returned to xip_file_nopage, then returned to
do_no_page.
An alternative approach, which does not need to have struct page at
hand, would be to use the nopfn vm operations struct. That one would
have to rely on get_xip_pfn. The current path would then be
deprecated. If you're interrested in using the later for xip without
struct page, I would volounteer to go ahead and implement this?
so long,
Carsten
> An alternative approach, which does not need to have struct page at
> hand, would be to use the nopfn vm operations struct. That one would
> have to rely on get_xip_pfn.
Of course! Okay now I'm begining to understand.
> The current path would then be deprecated.
Why? Wouldn't both paths be valid options?
> If you're interrested in using the later for xip without
> struct page, I would volounteer to go ahead and implement this?
I'm very interested in this.
I'm not opposed to using struct page, but I'm confused as to how to
start that. As I understand it, which is not well, defined a
CONFIG_DISCONTIGMEM region to cover the Flash memory would add that to
my pool of RAM. That would be 'bad', right? I don't see how to
create the page structs and set this memory aside as different.
Jared Hulbert wrote:
>> An alternative approach, which does not need to have struct page at
>> hand, would be to use the nopfn vm operations struct. That one would
>> have to rely on get_xip_pfn.
> Of course! Okay now I'm begining to understand.
Sorry, but I think I was educated yesterday that ->fault() (only in
-mm) is where we'd be heading. Nopfn is deprecated. But conceptually,
this does'nt change things.
>> The current path would then be deprecated.
> Why? Wouldn't both paths be valid options?
The new path via fault works in both cases with and without struct
page behind. No need to keep the old one as far as I see.
>> If you're interrested in using the later for xip without
>> struct page, I would volounteer to go ahead and implement this?
> I'm very interested in this.
Good. Let me see if I can come up with a patch on this.
> I'm not opposed to using struct page, but I'm confused as to how to
> start that. As I understand it, which is not well, defined a
> CONFIG_DISCONTIGMEM region to cover the Flash memory would add that to
> my pool of RAM. That would be 'bad', right? I don't see how to
> create the page structs and set this memory aside as different.
I fear I am not the right person to answer that question. In the good
old days before discontigmem/sparse mem/vmem map where invented we
used to have a hack for that in arch/. Heiko then decided my hack is a
mess and came up with a good solution to the problem.
so long,
Carsten
On Fri, Jun 15, 2007 at 11:22:21AM +0200, Carsten Otte wrote:
> Jared Hulbert wrote:
> >>If you're interrested in using the later for xip without
> >>struct page, I would volounteer to go ahead and implement this?
> >I'm very interested in this.
> Good. Let me see if I can come up with a patch on this.
That would be good.
> >I'm not opposed to using struct page, but I'm confused as to how to
> >start that. As I understand it, which is not well, defined a
> >CONFIG_DISCONTIGMEM region to cover the Flash memory would add that to
> >my pool of RAM. That would be 'bad', right? I don't see how to
> >create the page structs and set this memory aside as different.
>
> I fear I am not the right person to answer that question. In the good
> old days before discontigmem/sparse mem/vmem map where invented we used
> to have a hack for that in arch/. Heiko then decided my hack is a
> mess and came up with a good solution to the problem.
I didn't say your old code was a hack. It just didn't work together with
the vmem map approach, so it had to be converted. Otherwise there wouldn't
be a 1:1 kernel mapping or struct pages for your memory region.
If you can write code that doesn't need any struct pages that would make
life a bit easier, since we wouldn't need any pseudo memory hotplug code
that just adds struct pages.
We would still need to add the kernel mapping though.
Nick Piggin wrote:
> Carsten Otte wrote:
>> The current xip stack relies on having struct page behind the memory
>> segment. This causes few impact on memory management, but occupies
>> some more memory. The cramfs patch chose to modify copy on write in
>> order to deal with vmas that don't have struct page behind.
>> So far, Hugh and Linus have shown strong opposition against copy on
>> write with no struct page behind. If this implementation is acceptable
>> to the them, it seems preferable to me over wasting memory. The xip
>> stack should be modified to use this vma flag in that case.
>
> I would rather not :P
>
> We can copy on write without a struct page behind the source today, no?
> What is insufficient for the XIP code with the current COW?
I've looked at the -mm version of mm/memory.c today, with intend to
try out VM_PFNMAP for our xip mappings and replace nopage() with fault().
The thing is, I believe it does'nt work for us:
* The way we recognize those mappings is through the rules set up
* by "remap_pfn_range()": the vma will have the VM_PFNMAP bit set,
* and the vm_pgoff will point to the first PFN mapped: thus every
* page that is a raw mapping will always honor the rule
*
* pfn_of_page == vma->vm_pgoff + ((addr - vma->vm_start) >>
PAGE_SHIFT)
This is, as far as I can tell, not true for our xip mappings. Ext2 may
spread the physical pages behind a given file all over its media. That
means, that the pfns of the pages that form a vma may be more or less
random rather than contiguous. The common memory management code
cannot tell whether or not a given page has been COW'ed.
Did I miss something?
so long,
Carsten
> If you can write code that doesn't need any struct pages that would make
> life a bit easier, since we wouldn't need any pseudo memory hotplug code
> that just adds struct pages.
That was my gut feel too. However, it seems from Carsten and J?rn
discussion of read/write XIP on Flash (and some new Phase Change)
memories that having the struct pages has a lot of potential benefits.
Wouldn't it also allow most of the mm routines to remain unchanged.
I just worry that it would be difficult to set apart these non
volitatile pages that can't be written too directly.
> We would still need to add the kernel mapping though.
But that's handled by ioremap()ing it right?
On 6/15/07, Carsten Otte <[email protected]> wrote:
> Nick Piggin wrote:
> > Carsten Otte wrote:
> >> The current xip stack relies on having struct page behind the memory
> >> segment. This causes few impact on memory management, but occupies
> >> some more memory. The cramfs patch chose to modify copy on write in
> >> order to deal with vmas that don't have struct page behind.
> >> So far, Hugh and Linus have shown strong opposition against copy on
> >> write with no struct page behind. If this implementation is acceptable
> >> to the them, it seems preferable to me over wasting memory. The xip
> >> stack should be modified to use this vma flag in that case.
> >
> > I would rather not :P
> >
> > We can copy on write without a struct page behind the source today, no?
> > What is insufficient for the XIP code with the current COW?
>
> I've looked at the -mm version of mm/memory.c today, with intend to
> try out VM_PFNMAP for our xip mappings and replace nopage() with fault().
> The thing is, I believe it does'nt work for us:
> * The way we recognize those mappings is through the rules set up
> * by "remap_pfn_range()": the vma will have the VM_PFNMAP bit set,
> * and the vm_pgoff will point to the first PFN mapped: thus every
> * page that is a raw mapping will always honor the rule
> *
> * pfn_of_page == vma->vm_pgoff + ((addr - vma->vm_start) >>
> PAGE_SHIFT)
>
> This is, as far as I can tell, not true for our xip mappings. Ext2 may
> spread the physical pages behind a given file all over its media. That
> means, that the pfns of the pages that form a vma may be more or less
> random rather than contiguous. The common memory management code
> cannot tell whether or not a given page has been COW'ed.
> Did I miss something?
I agree, the conditions imposed by the remap_pfn_range() don't work.
Jared Hulbert wrote:
>> We would still need to add the kernel mapping though.
>
> But that's handled by ioremap()ing it right?
That's right. Ioremap will give you a permanent kernel space mapping
until you do iounmap() again.