On Wed, Oct 28, 2009 at 10:46 PM, Dwayne Reid <dwayner@planet.eon.net> wrote:
> I recall recent discussion (and past discussions) about the problem
> of multiple writes causing eventual read failures of static eeprom
> values (static as in they don't change often or at all).

If the data doesn't change, don't initiate an EEPROM write.  Only
write it on change.  This should eliminate this concern.
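
A minimal sketch of "only write on change", in C.  The EEPROM here is
just a RAM array so the example runs on a PC; ee_read(), ee_write()
and ee_update() are hypothetical names, and on a real part the first
two would wrap whatever EEPROM routines your compiler provides.

```c
#include <stdint.h>

/* Simulated EEPROM so the sketch runs on a PC; on a real part these two
   helpers would wrap the device's own EEPROM read/write routines. */
#define EE_SIZE 256
static uint8_t  ee_mem[EE_SIZE];
static unsigned ee_write_count;          /* counts physical writes */

static uint8_t ee_read(uint8_t addr)
{
    return ee_mem[addr];
}

static void ee_write(uint8_t addr, uint8_t value)
{
    ee_mem[addr] = value;
    ee_write_count++;
}

/* Write only if the stored byte differs; returns 1 if a write happened. */
static int ee_update(uint8_t addr, uint8_t value)
{
    if (ee_read(addr) == value)
        return 0;                        /* unchanged: skip the write */
    ee_write(addr, value);
    return 1;
}
```

Calling ee_update() twice with the same value costs one physical
write, not two.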

However, in high reliability scenarios, five copies is perhaps
overkill, and the way you are doing the write may be prone to problems
related to power-off conditions (i.e., since all five bytes are being
written you may run into issues regarding the validity of data if the
write is in process when the unit loses power).

You must separate your writes in both time and space - they should not
be immediately adjacent, and the writes should not occur at the same
time.

Further, you would be better off adopting a CRC and a counter for each
redundant value (or redundant datablock).  This gives you confidence
the value is correct (error checking), and that it's the latest value
stored (coherence).
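
One possible layout for such a record, as a sketch.  The one-byte
value, the one-byte counter and the CRC-8 polynomial (0x07, initial
value 0) are my assumptions for illustration, not a requirement; the
names ee_record_t, record_make() and record_valid() are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* One redundant copy: the value, a change counter, and a CRC over both. */
typedef struct {
    uint8_t value;
    uint8_t counter;
    uint8_t crc;
} ee_record_t;

/* Bitwise CRC-8, polynomial 0x07, initial value 0x00 (an assumed choice). */
static uint8_t crc8(const uint8_t *data, size_t len)
{
    uint8_t crc = 0;
    while (len--) {
        crc ^= *data++;
        for (int bit = 0; bit < 8; bit++)
            crc = (uint8_t)((crc & 0x80) ? (crc << 1) ^ 0x07 : crc << 1);
    }
    return crc;
}

/* CRC covers the value and the counter, in that order. */
static uint8_t record_crc(const ee_record_t *r)
{
    uint8_t buf[2] = { r->value, r->counter };
    return crc8(buf, sizeof buf);
}

/* Build a record ready to be written to EEPROM. */
static ee_record_t record_make(uint8_t value, uint8_t counter)
{
    ee_record_t r = { value, counter, 0 };
    r.crc = record_crc(&r);
    return r;
}

/* Nonzero if the stored CRC matches the stored value and counter. */
static int record_valid(const ee_record_t *r)
{
    return record_crc(r) == r->crc;
}
```

A single flipped bit in either the value or the counter makes
record_valid() return 0, which is the error-checking half; the
counter is the coherence half.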

For example:

In one high reliability, safety critical industry the most critical
information is kept in three locations, everything else is stored in
only two locations.  Each time the value is changed the counter is
incremented, the CRC calculated over both the value and the counter,
and the first redundant copy is written.  If anything fails during
this write, the second copy still has the older value, and the CRC
ensures that this one won't be used.  Once the first copy is finished
writing, a routine verifies it.  Then the counter is incremented, the
CRC recalculated, and the second copy is written and verified.  Again,
if anything fails during this write then the first copy is known good,
and the counter verifies that it's the latest.  If a third copy is
made, then the same process happens all over again.
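
The write sequence above can be sketched roughly like this.  It is a
PC simulation under my own assumptions: two copies, a one-byte value
and counter, CRC-8 with polynomial 0x07, and a test hook
(ee_fail_after) that fakes a power loss partway through a write.  All
names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct { uint8_t value, counter, crc; } ee_record_t;

#define NUM_COPIES 2
static ee_record_t ee_copy[NUM_COPIES];  /* simulated EEPROM slots */
static int ee_fail_after = -1;  /* bytes until fake power loss, -1 = never */

static uint8_t crc8(const uint8_t *data, size_t len)
{
    uint8_t crc = 0;
    while (len--) {
        crc ^= *data++;
        for (int bit = 0; bit < 8; bit++)
            crc = (uint8_t)((crc & 0x80) ? (crc << 1) ^ 0x07 : crc << 1);
    }
    return crc;
}

static int record_ok(const ee_record_t *r)
{
    uint8_t buf[2] = { r->value, r->counter };
    return crc8(buf, sizeof buf) == r->crc;
}

/* Byte-by-byte write; stops for good once the fake power loss hits. */
static void ee_write_record(int slot, const ee_record_t *r)
{
    const uint8_t *src = (const uint8_t *)r;
    uint8_t *dst = (uint8_t *)&ee_copy[slot];
    for (size_t i = 0; i < sizeof *r; i++) {
        if (ee_fail_after == 0)
            return;
        if (ee_fail_after > 0)
            ee_fail_after--;
        dst[i] = src[i];
    }
}

/* Write each copy in turn, verifying before touching the next one.
   Returns 0 on success, -1 if a copy failed read-back verification. */
static int store_value(uint8_t value, uint8_t last_counter)
{
    uint8_t counter = last_counter;
    for (int slot = 0; slot < NUM_COPIES; slot++) {
        ee_record_t r;
        uint8_t buf[2];
        counter++;                   /* new counter for every copy written */
        buf[0] = value;
        buf[1] = counter;
        r.value = value;
        r.counter = counter;
        r.crc = crc8(buf, sizeof buf);
        ee_write_record(slot, &r);
        if (memcmp(&ee_copy[slot], &r, sizeof r) != 0)
            return -1;               /* stop: the other copy is untouched */
    }
    return 0;
}
```

If power is lost during the first copy, that copy fails its CRC and
the second copy still holds the older, valid value.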

On boot up both (or all three) values are checked to see if the CRC is
ok.  For those items where the CRC is ok, the counters are checked to
see which one is the latest, and that value is used.
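
A sketch of the boot-time check, again under assumed details (one-byte
value and counter, CRC-8 with polynomial 0x07).  The wrap-around
handling in counter_newer() is my addition: an 8-bit counter will roll
over eventually, and this comparison assumes the copies are never more
than 127 increments apart.  select_copy() and the other names are
hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { uint8_t value, counter, crc; } ee_record_t;

static uint8_t crc8(const uint8_t *data, size_t len)
{
    uint8_t crc = 0;
    while (len--) {
        crc ^= *data++;
        for (int bit = 0; bit < 8; bit++)
            crc = (uint8_t)((crc & 0x80) ? (crc << 1) ^ 0x07 : crc << 1);
    }
    return crc;
}

static ee_record_t record_make(uint8_t value, uint8_t counter)
{
    uint8_t buf[2] = { value, counter };
    ee_record_t r = { value, counter, 0 };
    r.crc = crc8(buf, sizeof buf);
    return r;
}

static int record_ok(const ee_record_t *r)
{
    uint8_t buf[2] = { r->value, r->counter };
    return crc8(buf, sizeof buf) == r->crc;
}

/* Nonzero if counter a is newer than b, allowing for 8-bit wrap-around. */
static int counter_newer(uint8_t a, uint8_t b)
{
    uint8_t diff = (uint8_t)(a - b);
    return diff != 0 && diff < 0x80;
}

/* Returns the index of the newest copy with a good CRC, or -1 if none. */
static int select_copy(const ee_record_t *copies, int n)
{
    int best = -1;
    for (int i = 0; i < n; i++) {
        if (!record_ok(&copies[i]))
            continue;                /* bad CRC: never use this copy */
        if (best < 0 || counter_newer(copies[i].counter, copies[best].counter))
            best = i;
    }
    return best;
}
```

A return of -1 means every copy failed its CRC, which is the case
where a hard-coded fallback value is needed.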

During normal operation, another periodic routine goes over all the
eeprom checking CRCs and comparing values in redundant copies.  It
simply sets a flag if something goes wrong, and the module may report
the error for user replacement before the other locations go bad.
Depending on the project requirements it may either 1) not attempt to
'fix' or re-write anything, and merely report the inconsistencies it
finds, or 2) if a CRC is bad, or values don't match, take the latest
good value and attempt a re-write of the bad copy.  Option 2 is harder
than option 1, because you then have to store more information in
EEPROM to limit the number of times you attempt to re-write the data.
Further, if the EEPROM is going bad
it's often better to baby it and avoid writes, so fixing it may
actually exacerbate the problem.
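
A sketch of the report-only variant (option 1) of that periodic check.
The flag names, the record layout and the CRC-8 polynomial (0x07) are
assumptions for illustration; the routine only reads, it never
re-writes anything.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { uint8_t value, counter, crc; } ee_record_t;

#define EE_ERR_CRC      0x01   /* at least one copy failed its CRC */
#define EE_ERR_MISMATCH 0x02   /* copies with good CRCs disagree on the value */

static uint8_t crc8(const uint8_t *data, size_t len)
{
    uint8_t crc = 0;
    while (len--) {
        crc ^= *data++;
        for (int bit = 0; bit < 8; bit++)
            crc = (uint8_t)((crc & 0x80) ? (crc << 1) ^ 0x07 : crc << 1);
    }
    return crc;
}

static ee_record_t record_make(uint8_t value, uint8_t counter)
{
    uint8_t buf[2] = { value, counter };
    ee_record_t r = { value, counter, 0 };
    r.crc = crc8(buf, sizeof buf);
    return r;
}

static int record_ok(const ee_record_t *r)
{
    uint8_t buf[2] = { r->value, r->counter };
    return crc8(buf, sizeof buf) == r->crc;
}

/* Check all redundant copies of one item.  Returns 0 if every copy has
   a good CRC and all agree, otherwise a bitmask of EE_ERR_* flags. */
static uint8_t ee_scrub(const ee_record_t *copies, int n)
{
    uint8_t flags = 0;
    int first_good = -1;
    for (int i = 0; i < n; i++) {
        if (!record_ok(&copies[i])) {
            flags |= EE_ERR_CRC;
            continue;
        }
        if (first_good < 0)
            first_good = i;
        else if (copies[i].value != copies[first_good].value)
            flags |= EE_ERR_MISMATCH;
    }
    return flags;
}
```

The caller would latch any nonzero result into whatever fault
reporting the module already has.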

This removes most problems with power down, bad eeprom cells, etc.  It
doesn't really perform EEPROM wear leveling, which is something you
must consider if you believe you're going to be doing more than 1k
writes to any given cell in the EEPROM over the life of the unit.
Even though the EEPROM is rated to 10k, 100k, or more, that's an MTBF
- statistically calculated, and any one cell can easily fail well
before then and still be within the statistical curve of the
rating.  Wear leveling has to have a bit more thought put into it in
regards to infrequently changing values vs frequently changing values,
and whether redundant copies are actually spaced apart in EEPROM.

Simply copying the values 5 times and checking for correlation is a
very poor method for error and coherence checking. CRC for error
checking, and counters for coherence are a _significantly_ better
option.

But if you need to stick with the 5 bytes for whatever reason (perhaps
this is overkill for your situation), make sure the writes are done in
separate operations, and space them out throughout the EEPROM if
possible.  In the PIC you mention it doesn't matter, but on other
devices if there are multiple EEPROM 'blocks' then put separate copies
into different blocks of the EEPROM.

Also, you'll need to have a hard-coded fallback value if all the
copies fail their CRC checks, and your device is required to continue
operation (either full or partial) given complete EEPROM failure.  You
may need to do this in your scenario if three or four cells have
failed (and only 2 or no values match).
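
For the plain five-copy scheme, the fallback could look something like
the sketch below.  The threshold of three matching copies and the
FALLBACK_VALUE of 0x20 are placeholders I made up; pick whatever is
safe for your application.  vote_or_fallback() is a hypothetical name.

```c
#include <stdint.h>

#define NUM_COPIES     5
#define MIN_AGREE      3      /* assumed threshold: a strict majority of 5 */
#define FALLBACK_VALUE 0x20   /* hypothetical safe default kept in code space */

/* Returns the value held by at least MIN_AGREE copies, or FALLBACK_VALUE
   when no value reaches that count.  If used_fallback is not a null
   pointer it is set to 1 or 0 so the caller can report the failure. */
static uint8_t vote_or_fallback(const uint8_t copies[NUM_COPIES],
                                int *used_fallback)
{
    for (int i = 0; i < NUM_COPIES; i++) {
        int agree = 0;
        for (int j = 0; j < NUM_COPIES; j++)
            if (copies[j] == copies[i])
                agree++;
        if (agree >= MIN_AGREE) {
            if (used_fallback)
                *used_fallback = 0;
            return copies[i];
        }
    }
    if (used_fallback)
        *used_fallback = 1;
    return FALLBACK_VALUE;
}
```

With only two matching copies, or none, the routine hands back the
hard-coded default and tells the caller it did so.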

I hope this is useful.

-Adam