Hi,
I am using ext4 as root filesystem of my TQMa28-based board with 2GB eMMC.
In case of a power failure I have to clean up the filesystem in 1.5 to 2
seconds, that's how long the caps can sustain the power.
I pass the following rootflags on the kernel cmdline: data=journal,commit=1
In my user space application I open important files with O_SYNC.
Is there something else I can or should do to avoid data corruption?
I can detect when the power fails over a GPIO line, so I close open file
descriptors in one important application but doing a "normal"
poweroff/shutdown takes too long.
What would you do if you had 1.5 seconds until the power is gone?
Maybe a read-only rootfs and a separate small data partition?
Thanks for your help.
Best regards,
Clemens
On Fri, 3 Oct 2014, Clemens Gruber wrote:
> Date: Fri, 03 Oct 2014 15:09:31 +0200
> From: Clemens Gruber <[email protected]>
> To: [email protected]
> Subject: Fast ext4 cleanup to avoid data loss after power failure
>
> Hi,
>
> I am using ext4 as root filesystem of my TQMa28-based board with 2GB eMMC.
> In case of a power failure I have to clean up the filesystem in 1.5 to 2
> seconds, that's how long the caps can sustain the power.
What exactly is the problem you're trying to solve? Does it concern a
specific application?
>
> I pass the following rootflags on the kernel cmdline: data=journal,commit=1
> In my user space application I open important files with O_SYNC.
So what do you expect to happen if the power failure happens in the
middle of a write to the eMMC?
>
> Is there something else I can or should do to avoid data corruption?
>
> I can detect when the power fails over a GPIO line, so I close open file
> descriptors in one important application but doing a "normal"
> poweroff/shutdown takes too long.
That will help a little bit, but it's not reliable at all. Again,
what do you expect to happen when the power goes off in the middle of a
write to the eMMC?
>
> What would you do if you had 1.5 seconds until the power is gone?
I'd avoid the need to deal with this at all. The file system
(journal) itself will protect you from metadata corruption (file
system corruption). But the application has to protect its own
important files for data consistency (data=journal will not help
you, nor will commit=1).
The usual and simple way for the application to deal with this is to
use a temporary file, fsync the changes to make sure that everything
hit the disk, and then atomically rename the file to replace the
original. That way your file will always be in a consistent state. It
will either have the new content or the old one, not a mix of both.
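As a rough illustration of the temp file, fsync, rename sequence described above, here is a minimal Python sketch. The function name atomic_write and the ".tmp-" prefix are made up for this example. The final fsync on the parent directory is not part of the description above; it is the extra step mentioned later in this thread, included here on the assumption that the rename itself should also be durable.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace `path` with `data` (bytes): readers see the old or the new content, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live on the same filesystem as the target,
    # because rename() is only atomic within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()              # push Python's buffer into the kernel
            os.fsync(f.fileno())   # ask the kernel to push it to the device
        os.replace(tmp_path, path)  # rename(2): atomically replaces the original
    except BaseException:
        # Something failed before the rename completed: drop the temp file.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # fsync the parent directory so the new directory entry is durable too.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Note that mkstemp creates the file with mode 0600, so an application that needs other permissions would have to set them before the rename.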
>
> Maybe a read-only rootfs and a separate small data partition?
Well, if you do not need to write to the rootfs, why do you need to deal
with data corruption?
Regards,
-Lukas
>
> Thanks for your help.
>
> Best regards,
> Clemens
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
On 10/03/2014 04:08 PM, Lukáš Czerner wrote:
> What exactly is the problem you're trying to solve? Does it concern a
> specific application?
It's software to control and configure dispensing equipment in a bar.
The problem is that power is lost frequently and the only warning is the
mentioned GPIO about 1.5 to 2 seconds in advance, then the caps are drained.
This happens very often and I have to prevent it from damaging the
filesystem.
I did not mention it before, because I was not sure if it is relevant,
but I am running Linux 3.17-rc7 on the board.
> So what do you expect to happen if the power failure happens in the
> middle of a write to the eMMC?
With the 1.5 second delay, I'd like to stop the application, before that
happens.
> I'd avoid the need to deal with this at all. The file system
> (journal) itself will protect you from metadata corruption (file
> system corruption). But the application has to protect its own
> important files for data consistency (data=journal will not help
> you, nor will commit=1).
>
> The usual and simple way for the application to deal with this is to
> use a temporary file, fsync the changes to make sure that everything
> hit the disk, and then atomically rename the file to replace the
> original. That way your file will always be in a consistent state. It
> will either have the new content or the old one, not a mix of both.
Thank you, this approach sounds good! I will change the application
accordingly.
So the only necessary step when the GPIO triggers is to quit the
applications writing to the eMMC. If I use that write, fsync, rename
strategy, I guess I could even SIGKILL them.
And you would keep the default values for the commit and data flag,
because fsync flushes the buffers for data and metadata anyway?
Forcing fsck for the next boot, when I detect that a power-failure is
imminent, is also not necessary, right?
>
>>
>> Maybe a read-only rootfs and a separate small data partition?
>
> Well, if you do not need to write to the rootfs, why do you need to deal
> with data corruption?
At the moment, my root filesystem is writable. I would need to change a
lot, so I'd rather find a solution to keep it writable and avoid
corruption / ensure durability.
Regards,
Clemens
On 10/3/14 10:39 AM, Clemens Gruber wrote:
> On 10/03/2014 04:08 PM, Lukáš Czerner wrote:
>> What exactly is the problem you're trying to solve? Does it concern a
>> specific application?
>
> It's software to control and configure dispensing equipment in a bar.
> The problem is that power is lost frequently and the only warning is the
> mentioned GPIO about 1.5 to 2 seconds in advance, then the caps are drained.
> This happens very often and I have to prevent it from damaging the
> filesystem.
>
> I did not mention it before, because I was not sure if it is relevant,
> but I am running Linux 3.17-rc7 on the board.
>
>> So what do you expect to happen if the power failure happens in the
>> middle of a write to the eMMC?
>
> With the 1.5 second delay, I'd like to stop the application, before that
> happens.
>
>> I'd avoid the need to deal with this at all. The file system
>> (journal) itself will protect you from metadata corruption (file
>> system corruption). But the application has to protect its own
>> important files for data consistency (data=journal will not help
>> you, nor will commit=1).
>>
>> The usual and simple way for the application to deal with this is to
>> use a temporary file, fsync the changes to make sure that everything
>> hit the disk, and then atomically rename the file to replace the
>> original. That way your file will always be in a consistent state. It
>> will either have the new content or the old one, not a mix of both.
>
> Thank you, this approach sounds good! I will change the application
> accordingly.
> So the only necessary step when the GPIO triggers is to quit the
> applications writing to the eMMC. If I use that write, fsync, rename
> strategy, I guess I could even SIGKILL them.
http://lwn.net/Articles/457667/
is a good overview of data persistence best practices, FWIW.
-Eric
So long as you aren't dilly-dallying, 1.5 seconds is a huge amount of
time, provided the application doesn't have to write a huge amount of
state. What I would do is give the application a one-second time
budget to shut down, then use the FIFREEZE ioctl to lock the file
system into a consistent state, and then wait for the end to come.
More generally, it's a good idea to seriously control how much stuff
you are writing to the eMMC, and to question whether any of it is
necessary, and if it isn't, to ruthlessly cut it out.
In addition to writing everything to a temporary file, fsyncing
the changes, and then using an atomic rename to replace the original
file, an additional design pattern you can use is an application-level
journal. Just append each thing that needs to be saved to an
application journal file: "dispensed one shot of whisky"; "entering
light pour mode for happy hour", etc., with an fsync() after each write
to the application journal. This minimizes the amount of writing you
need for each application update, and then periodically you can dump
all of the state out to the temp file, rename it, and then truncate the
application journal.
If you crash, then it's simply a matter of replaying the application
journal log into your application state when you start up again.
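A minimal Python sketch of such an application-level journal might look like the following. The function names (append_record, replay, checkpoint), the one-JSON-object-per-line format, and the ".tmp" suffix are all assumptions made for this illustration, not anything prescribed above.

```python
import json
import os


def append_record(journal_path, record):
    """Append one record as a JSON line and fsync before returning."""
    line = (json.dumps(record) + "\n").encode("utf-8")
    fd = os.open(journal_path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        os.write(fd, line)  # records are assumed small enough for one write
        os.fsync(fd)
    finally:
        os.close(fd)


def replay(journal_path):
    """Return every complete record; stop at a torn or corrupt final line."""
    records = []
    try:
        with open(journal_path, "rb") as f:
            for raw in f:
                if not raw.endswith(b"\n"):
                    break  # last write was cut off by the power failure
                try:
                    records.append(json.loads(raw))
                except ValueError:
                    break
    except FileNotFoundError:
        pass  # no journal yet: nothing to replay
    return records


def checkpoint(snapshot_path, journal_path, state):
    """Dump the full state via temp file + rename, then truncate the journal."""
    tmp_path = snapshot_path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(json.dumps(state).encode("utf-8"))
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, snapshot_path)
    # Make the rename durable before throwing the journal away.
    dir_fd = os.open(os.path.dirname(os.path.abspath(snapshot_path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
    fd = os.open(journal_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)
```

At startup the application would load the snapshot and then apply replay(journal_path) on top of it. If power is lost between the rename and the truncate, the journal still holds records that are already in the snapshot, so in this sketch records would need to be safe to apply twice or carry a sequence number.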
One important thing to remember is that most eMMC are not protected
against power failure. So if you are writing to the eMMC flash when
the power finally fails, the flash translation metadata can get
corrupted, and you can lose all of your data, or some of your data,
but it will not be under your control at all.
Hence my recommendation to give your application a one second time
budget to quiesce itself, and then to use FIFREEZE --- that way you
don't have to optimize your system daemons to have a fast shutdown
sequence. Or you can have your emergency shutdown program send a kill
-9 to all processes except itself and init, and then unmount the file
system --- but FIFREEZE might be easier. :-)
Cheers,
- Ted
P.S. And keeping a read-only root and only having writable state in a
small writable partition is also a good idea. For bonus points, keep
*two* copies of the read-only root, so you can update one of the
roots, reboot into it, and if you can successfully reboot into it, only
then do you update the other read-only root.
Thank you very much for your thorough response, Ted! It helped me out a
lot. Also big thanks to Lukáš and Eric!
At the moment I am changing every application state update to an atomic
"fwrite to tempfile, fsync (or fdatasync) tempfile, rename, fsync parent
dir".
As the amount of writes I need for application updates is usually small
(1-3 fwrites), I decided to update the application state file without an
extra journal log for now. But if the amount of necessary writes grows,
then I'll introduce a journal log file to minimize the writes needed for
each application state update.
Or would you consider adding an application level journal anyway?
So after following this design pattern for the application and locking
the filesystem with FIFREEZE before the power fails, does it matter what
flags I set for the ext4 filesystem?
Should I stay with the default settings data=ordered and commit=5?
Regards,
Clemens