[Lunar-bugs] [Lunar Linux 0000482]: massive filesystem corruption with sw raid + mount init script
Lunar bug reports list
lunar-bugs at lunar-linux.org
Wed Apr 27 19:23:51 CEST 2011
A NOTE has been added to this issue.
======================================================================
http://bugs.lunar-linux.org/view.php?id=482
======================================================================
Reported By: wdp
Assigned To: sofar
======================================================================
Project: Lunar Linux
Issue ID: 482
Category: lunar
Reproducibility: always
Severity: minor
Priority: urgent
Status: assigned
lvu installed moonbase: 20110425.13
Core Tools: Theedge
lvu installed [lunar|theedge]: 20110419
======================================================================
Date Submitted: 2011-04-27 13:44 CEST
Last Modified: 2011-04-27 19:23 CEST
======================================================================
Summary: massive filesystem corruption with sw raid + mount init script
Description:
If you use software RAID for your root device (in my example, software RAID 5)
and the array crashed in a way that requires a resync (see cat /proc/mdstat),
the resync of the array starts as soon as you boot up. At the same time, an
fsck of the previously crashed ext3/ext4 filesystem tries to repair the
filesystem.
Resyncing the RAID array and fscking it at the same time results in massive
filesystem corruption. We need to make sure _not_ to fsck while the array is
resyncing (or to hold off the resync until fsck is done).
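The check requested above could be sketched roughly as follows. This is a
minimal sketch in Python, not the Lunar mount init script: the function name
arrays_syncing and the sample text are assumptions made up for illustration,
and the code only looks for the resync/recovery/reshape/check status lines
that the md driver prints in /proc/mdstat.

```python
import re

# An array stanza in /proc/mdstat starts at column 0, e.g. "md0 : active raid5 ...".
_ARRAY_RE = re.compile(r"^(md\w+)\s*:")
# Status lines look like "resync = 12.6% ..." or "resync=DELAYED".
_SYNC_RE = re.compile(r"\b(resync|recovery|reshape|check)\s*=")


def arrays_syncing(mdstat_text):
    """Return {array_name: action} for every array that reports a sync action."""
    result = {}
    current = None
    for line in mdstat_text.splitlines():
        header = _ARRAY_RE.match(line)
        if header:
            current = header.group(1)
            continue
        if not line.strip():
            # A blank line ends the current array stanza.
            current = None
            continue
        status = _SYNC_RE.search(line)
        if status and current is not None:
            result[current] = status.group(1)
    return result


# Hypothetical sample in the usual /proc/mdstat layout (values are invented).
SAMPLE_MDSTAT = """\
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sdc1[2] sdb1[1] sda1[0]
      1953519872 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
      [==>..................]  resync = 12.6% (123456/976759936) finish=127.5min speed=101234K/sec

unused devices: <none>
"""

print(arrays_syncing(SAMPLE_MDSTAT))  # {'md0': 'resync'}
```

On a real system the text would come from reading /proc/mdstat; a script
could then decide whether to delay fsck when the result is non-empty.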
======================================================================
----------------------------------------------------------------------
(0001097) sofar (administrator) - 2011-04-27 19:23
http://bugs.lunar-linux.org/view.php?id=482#c1097
----------------------------------------------------------------------
invalid bug, please follow my explanation:
your data is presented to fsck in the form of one disk, the /dev/md0
device.
fsck itself operates on /dev/md0. That means that any read or write that
it issues while /dev/md0 is `syncing` will force the 'md' layer to resync
those blocks first, before telling fsck what is in the blocks.
no matter what block you want to access, the 'md' layer will always
present the data properly and sync those blocks on the disks involved as
soon as they are touched in one way or another.
so, the corruption isn't caused by fsck running. as a matter of fact, any
reconstruction operation can continue while you reformat the /dev/md0
array, or install new software or...
unless
the data on the array is already corrupted. the rebuild is then working from
broken information, and the raid array does not have enough information to
decide which blocks are okay and which ones are not (after all, raid:1 has no
checksumming), so all it's doing is making things more broken.
and that's what fsck sees - a terribly broken raid disk.
you basically have a corrupted array - parts of data got written to one
side of the array and not the other. the rebuild failed to preserve the
data and fsck sees that.
whether fsck runs during or after the resync will not make a difference.
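the point about a mirror without checksums can be illustrated with a toy
model. This is only a sketch under stated assumptions, not md code: the
function resync_mirror, the two lists standing in for disks, and the block
labels are all invented, and the model simply assumes the resync copies from
the first disk to the second.

```python
def resync_mirror(primary, secondary):
    """Toy resync: overwrite each differing block on secondary with primary's copy.

    Returns the indices of the blocks that were rewritten. Nothing here can
    tell which copy was the correct one; the primary always wins.
    """
    rewritten = []
    for i, block in enumerate(primary):
        if secondary[i] != block:
            secondary[i] = block
            rewritten.append(i)
    return rewritten


# Intended contents after the interrupted writes would have been A2, B2, C2.
disk0 = ["A2", "B1", "C2"]  # the write of B2 never reached this disk
disk1 = ["A2", "B2", "C1"]  # the write of C2 never reached this disk

changed = resync_mirror(disk0, disk1)

print(changed)         # [1, 2]
print(disk0 == disk1)  # True
print(disk1)           # ['A2', 'B1', 'C2']
```

after the toy resync both sides agree again, but block 1 holds the stale B1
and the newer B2 is gone. an fsck run at any point sees whatever that
process left behind.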
Issue History
Date Modified Username Field Change
======================================================================
2011-04-27 13:44 wdp New Issue
2011-04-27 13:44 wdp Status new => assigned
2011-04-27 13:44 wdp Assigned To => sofar
2011-04-27 13:44 wdp lvu installed moonbase => 20110425.13
2011-04-27 13:44 wdp Core Tools => Theedge
2011-04-27 13:44 wdp lvu installed [lunar|theedge] => 20110419
2011-04-27 19:23 sofar Note Added: 0001097
======================================================================