[Lunar-bugs] [Lunar Linux 0000482]: massive filesystem corruption with sw raid + mount init script

Wed Apr 27 19:23:51 CEST 2011

A NOTE has been added to this issue. 
====================================================================== 
http://bugs.lunar-linux.org/view.php?id=482 
====================================================================== 
Reported By:                wdp
Assigned To:                sofar
====================================================================== 
Project:                    Lunar Linux
Issue ID:                   482
Category:                   lunar
Reproducibility:            always
Severity:                   minor
Priority:                   urgent
Status:                     assigned
lvu installed moonbase:     20110425.13 
Core Tools:                 Theedge 
lvu installed [lunar|theedge]: 20110419 
====================================================================== 
Date Submitted:             2011-04-27 13:44 CEST
Last Modified:              2011-04-27 19:23 CEST
====================================================================== 
Summary:                    massive filesystem corruption with sw raid + mount init script
Description: 
If you use software RAID for your root device (in my example, software
RAID 5) and the array crashed in a way that requires a resync (see cat
/proc/mdstat), the resync of the array will be started as soon as you boot
up - at the same time, an fsck of the previously crashed ext3/ext4
filesystem will try to repair it.

Resyncing the RAID array and fscking it at the same time results in massive
filesystem corruption. We need to make sure to _not_ fsck while the RAID is
resyncing (or to pause the resync until fsck is done).
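
For illustration only, here is a minimal sketch of the kind of check the
description asks for: parse the text of /proc/mdstat and report which arrays
are in the middle of a resync. It is written in Python rather than as a
change to the actual mount init script, and the function names
(arrays_being_synced, safe_to_fsck) and the SAMPLE text are assumptions made
up for this example, not part of Lunar.

```python
import re

# Progress line md prints under a busy array, e.g.
# "  [==>....]  resync = 12.6% (...)", or "resync=DELAYED" / "resync=PENDING"
# when the operation is queued but not running yet.
_BUSY = re.compile(
    r"\b(resync|recovery|reshape|check)\s*=\s*(?:([\d.]+)%|DELAYED|PENDING)"
)
# Array header line, e.g. "md0 : active raid5 sdc1[2] sdb1[1] sda1[0]".
_ARRAY = re.compile(r"^(md\S*)\s*:")


def arrays_being_synced(mdstat_text):
    """Return {array: (operation, percent_or_None)} for arrays that are busy."""
    busy = {}
    current = None
    for line in mdstat_text.splitlines():
        header = _ARRAY.match(line)
        if header:
            current = header.group(1)
            continue
        progress = _BUSY.search(line)
        if progress and current is not None:
            percent = float(progress.group(2)) if progress.group(2) else None
            busy[current] = (progress.group(1), percent)
    return busy


def safe_to_fsck(mdstat_text, array="md0"):
    """True if the named array shows no resync/recovery activity."""
    return array not in arrays_being_synced(mdstat_text)


# Hypothetical /proc/mdstat contents for a 3-disk RAID 5 that is resyncing.
SAMPLE = """\
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sdc1[2] sdb1[1] sda1[0]
      1953519872 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
      [==>..................]  resync = 12.6% (123456789/976759936) finish=127.5min speed=111111K/sec

unused devices: <none>
"""

if __name__ == "__main__":
    print(arrays_being_synced(SAMPLE))
    print("safe to fsck md0:", safe_to_fsck(SAMPLE))
```

On a real system the text would come from reading /proc/mdstat; the sample
string is only there so the sketch can be run anywhere.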
====================================================================== 

---------------------------------------------------------------------- 
 (0001097) sofar (administrator) - 2011-04-27 19:23
 http://bugs.lunar-linux.org/view.php?id=482#c1097 
---------------------------------------------------------------------- 
invalid bug, please follow my explanation:

your data is represented to fsck in the form of one disk, the /dev/md0
device.

fsck itself operates on /dev/md0. That means that any request or write that
it does while /dev/md0 is `syncing` will force the 'md' layer to resync
those blocks first, before telling fsck what is in the blocks.

no matter what block you want to access, the 'md' layer will always
represent the data properly and sync those blocks on the two platters
involved as soon as they are touched in one way or another.

so, the corruption isn't caused by fsck running. as a matter of fact, any
reconstruction operation can continue while you reformat the /dev/md0
array, or install new software or...

unless

the data on the array is already corrupted. rebuilding the array is using
broken information and the raid array does not have enough information to
decide which blocks are okay and which ones are not (after all, raid:1 has
no checksumming), and so all it's doing is making things more broken.

and that's what fsck sees - a terribly broken raid disk.

you basically have a corrupted array - parts of data got written to one
side of the array and not the other. the rebuild failed to preserve the
data and fsck sees that.

whether fsck runs during or after will not make a difference. 
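
As a rough illustration of how one might look at the array state described
in this note, the sketch below reads two md attributes that the kernel
exposes in sysfs: sync_action (idle, resync, recover, check, ...) and
mismatch_cnt (the number of sectors found inconsistent by the last
check/repair run). The function name md_state and the sysfs_root parameter
are assumptions made for this example. A non-zero mismatch count only says
the members disagree; it does not say which copy is correct, which is the
point made above.

```python
from pathlib import Path


def md_state(array="md0", sysfs_root="/sys"):
    """Return (sync_action, mismatch_cnt) for an md array.

    sysfs_root is a parameter only so the function can be pointed at a
    test directory; on a real system it is /sys.
    """
    md_dir = Path(sysfs_root) / "block" / array / "md"
    # Current operation on the array: "idle", "resync", "recover", "check", ...
    sync_action = (md_dir / "sync_action").read_text().strip()
    # Count left behind by the most recent check/repair pass.
    mismatch_cnt = int((md_dir / "mismatch_cnt").read_text().strip())
    return sync_action, mismatch_cnt


if __name__ == "__main__":
    action, mismatches = md_state("md0")
    print("sync_action:", action, "mismatch_cnt:", mismatches)
```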

Issue History 
Date Modified    Username       Field                    Change               
====================================================================== 
2011-04-27 13:44 wdp            New Issue                                    
2011-04-27 13:44 wdp            Status                   new => assigned     
2011-04-27 13:44 wdp            Assigned To               => sofar           
2011-04-27 13:44 wdp            lvu installed moonbase    => 20110425.13     
2011-04-27 13:44 wdp            Core Tools                => Theedge         
2011-04-27 13:44 wdp            lvu installed [lunar|theedge] => 20110419       

2011-04-27 19:23 sofar          Note Added: 0001097                          
======================================================================