=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
    Date: Tue, 25 Jan 2000  09:30:13
    From: Nikolai Golovchenko 
      To: pic microcontroller discussion list 
 Subject: Re: EEPROM endurance/error correction
--------------------------------------------------------------------------------

On Monday, January 24, 2000 Roland Andrag wrote:
> Hello everyone!

> A thread fairly similar to the one I'm about to (hopefully) start came up a
> few weeks ago, but I want to pose the same question in a more open way. So
> here goes:
> What is the best way to check for/detect failure of an EEPROM location when
> the endurance limit is reached, and then move on to another location? My
> chain of thought leads me to something like:
> 1. Have a pointer (in EEPROM) pointing to where the variable of interest is
> repeatedly stored;

  Bad idea. If you assume that the EEPROM can fail, then a pointer
  stored in it can fail too. You have to know that the area of EEPROM
  you read has no errors. A simple CRC check can help.

> 2. Store the variable a couple of times in successive locations (say three
> times);

  AFAIK, EEPROM fails because of writes. So chances are that all these
  successive locations will reach their write limit at about the same
  time.

> 3. When reading, if all the stored values (all three in line 2) do not
> agree, use the value given by the majority;
> 4. Once all three positions do not agree, move on to three new locations and
> update the pointer.
> At first I considered not mentioning this chain of thought so as not to
> influence anyone elses ideas, but did so since I would like comments on it.
> So if you have a different/better idea, please mention it!

> Thanks,
>  Roland

  So I think that you have to break the EEPROM into several banks, with
  only one bank used at a time until it is no longer usable because of
  EEPROM write failure. Then another bank is selected, and so on. When
  all banks are bad (or, better, earlier) the PIC has to signal for
  maintenance. This is similar to what Dwaine Reed described a while
  ago. The current bank is selected on power-up and the bank number is
  stored in RAM.

  Each bank should have two parts (like two FATs). Each part will have
  a simple CRC (XORing all data bytes will do). Having two parts
  protects against power-down during writes.
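
  A minimal sketch of that check code in C, only to make the idea
  concrete. The part layout (DATA_LEN data bytes followed by one check
  byte) and the names xor_check and part_is_good are assumptions of
  this sketch, not something fixed by the scheme.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DATA_LEN 4   /* data bytes per part (assumed for this sketch) */

/* XOR of all data bytes: the simple check code suggested above. */
uint8_t xor_check(const uint8_t *data, size_t len)
{
    uint8_t c = 0;
    assert(data != NULL);
    for (size_t i = 0; i < len; i++)
        c ^= data[i];
    return c;
}

/* A part is DATA_LEN data bytes followed by one check byte.
   It is good when the stored check byte matches the data. */
int part_is_good(const uint8_t *part)
{
    return xor_check(part, DATA_LEN) == part[DATA_LEN];
}
```

  Note that with a plain XOR a part of all zeros checks as good, so
  blank parts need separate handling.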
  
  This scheme consists of the write stage and power-up initialization.

  1) WRITE STAGE.
  There are two problems that have to be dealt with: (a) write error
  due to EEPROM failure and (b) power-down during write.

  As the manual says, a "write error" can be detected through the WRERR
  flag in EECON1. The flag gets set when a write is interrupted by a
  reset (MCLR or watchdog). As I understand it, this flag is useful for
  dealing with brownout, and a watchdog reset can signal that the write
  operation is taking too long (very vague, because the watchdog timing
  depends heavily on temperature and device). So WRERR gives no useful
  information in cases (a) and (b).

  Instead, the written data should be verified. If the data don't match,
  there is an error and the bank should be cancelled. To cancel a bank,
  make sure that neither part's CRC code matches its data, so that the
  power-up routine can tell that the bank is bad.
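
  A sketch of the read-back verify, written so that it runs on a PC.
  The array ee_mem and the helpers ee_read and ee_write are
  hypothetical stand-ins for the real EEPROM access routines, and
  stuck_addr only simulates a worn-out cell for testing.

```c
#include <assert.h>
#include <stdint.h>

#define EE_SIZE 64

/* Host-side stand-in for the data EEPROM (an assumption of this sketch).
   stuck_addr simulates one worn-out cell that no longer accepts writes. */
static uint8_t ee_mem[EE_SIZE];
static int stuck_addr = -1;

static uint8_t ee_read(uint8_t addr)
{
    assert(addr < EE_SIZE);
    return ee_mem[addr];
}

static void ee_write(uint8_t addr, uint8_t value)
{
    assert(addr < EE_SIZE);
    if ((int)addr != stuck_addr)
        ee_mem[addr] = value;
}

/* Write one byte and read it back.
   Returns 1 if the cell now holds the value, 0 if the verify failed. */
int ee_write_verified(uint8_t addr, uint8_t value)
{
    ee_write(addr, value);
    return ee_read(addr) == value;
}
```

  A return value of 0 is the signal to cancel the current bank.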

  (a) Steps for a write:
      1) Write the needed data into the first part of the current bank.
      2) Verify each byte. If any byte doesn't match, cancel the bank,
      copy the other (still good) part into a new bank, switch to that
      bank, and repeat from 1.
      3) Compute the CRC and write it to the part.
      4) Copy the first part into the second one.
  (b) If power-down occurs at any time during (a), at least one part's
  CRC will still be right, and the other may be wrong.
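
  The write steps could be sketched roughly as follows. This is a
  PC-side simulation under assumptions of mine: the sizes, the layout
  (two parts of DATA_LEN data bytes plus one check byte per bank), the
  simulated EEPROM array, and every function name are hypothetical.
  A bank is cancelled here by storing the inverted check byte in both
  parts, which is one possible way to do it.

```c
#include <assert.h>
#include <stdint.h>

#define DATA_LEN   4                    /* data bytes per part (assumed) */
#define PART_LEN   (DATA_LEN + 1)       /* data plus one check byte */
#define BANK_LEN   (2 * PART_LEN)       /* two parts per bank */
#define NUM_BANKS  3
#define EE_SIZE    (NUM_BANKS * BANK_LEN)

/* Host-side stand-in for the EEPROM; stuck_addr simulates a worn-out cell. */
static uint8_t ee_mem[EE_SIZE];
static int stuck_addr = -1;
static uint8_t cur_bank;                /* current bank number, kept in RAM */

static uint8_t ee_read(uint8_t a) { assert(a < EE_SIZE); return ee_mem[a]; }
static void ee_write(uint8_t a, uint8_t v)
{
    assert(a < EE_SIZE);
    if ((int)a != stuck_addr)
        ee_mem[a] = v;
}

static uint8_t part_addr(uint8_t bank, uint8_t part)
{
    return (uint8_t)(bank * BANK_LEN + part * PART_LEN);
}

/* XOR of the data bytes currently stored in a part. */
static uint8_t part_xor(uint8_t base)
{
    uint8_t c = 0;
    for (uint8_t i = 0; i < DATA_LEN; i++)
        c ^= ee_read((uint8_t)(base + i));
    return c;
}

/* Copy the data of one part to another and give the copy a fresh check byte. */
static void copy_part(uint8_t from, uint8_t to)
{
    for (uint8_t i = 0; i < DATA_LEN; i++)
        ee_write((uint8_t)(to + i), ee_read((uint8_t)(from + i)));
    ee_write((uint8_t)(to + DATA_LEN), part_xor(to));
}

/* Cancel a bank: store a deliberately wrong check byte in both parts. */
static void cancel_bank(uint8_t bank)
{
    for (uint8_t p = 0; p < 2; p++) {
        uint8_t base = part_addr(bank, p);
        ee_write((uint8_t)(base + DATA_LEN), (uint8_t)~part_xor(base));
    }
}

/* Steps 1 to 4. Returns 1 on success, 0 when no usable bank is left. */
int store(const uint8_t *data)
{
    for (;;) {
        uint8_t p0 = part_addr(cur_bank, 0);
        int ok = 1;
        for (uint8_t i = 0; i < DATA_LEN; i++) {        /* steps 1 and 2 */
            ee_write((uint8_t)(p0 + i), data[i]);
            if (ee_read((uint8_t)(p0 + i)) != data[i])
                ok = 0;
        }
        if (ok) {
            ee_write((uint8_t)(p0 + DATA_LEN), part_xor(p0));   /* step 3 */
            copy_part(p0, part_addr(cur_bank, 1));              /* step 4 */
            return 1;
        }
        cancel_bank(cur_bank);
        if (cur_bank + 1 >= NUM_BANKS)
            return 0;                   /* time to signal for maintenance */
        /* Carry the old, still good second part over to the next bank. */
        copy_part(part_addr(cur_bank, 1), part_addr((uint8_t)(cur_bank + 1), 1));
        cur_bank++;
    }
}
```

  The sketch verifies only the data bytes, as in step 2; a real
  routine would probably also verify the check byte and the copy.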

  2) POWER-UP INITIALIZATION.
  The purpose of this routine is to find the current bank and restore
  it if there was a power-down during a write.
  A bad bank will have bad CRC codes for both of its parts. A good bank
  will have at least one good CRC. If only one CRC is good for the bank,
  copy that part into the other to restore the bank.
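
  A rough sketch of such a power-up scan, again as a PC-side
  simulation. It assumes that banks are used in ascending order, so
  the current bank is the first one with a good part; that rule, the
  layout, the NO_BANK value, and all names are assumptions of the
  sketch.

```c
#include <assert.h>
#include <stdint.h>

#define DATA_LEN   4                    /* data bytes per part (assumed) */
#define PART_LEN   (DATA_LEN + 1)       /* data plus one check byte */
#define BANK_LEN   (2 * PART_LEN)       /* two parts per bank */
#define NUM_BANKS  3
#define EE_SIZE    (NUM_BANKS * BANK_LEN)
#define NO_BANK    0xFF                 /* returned when every bank is bad */

static uint8_t ee_mem[EE_SIZE];         /* host-side stand-in for the EEPROM */

static uint8_t ee_read(uint8_t a) { assert(a < EE_SIZE); return ee_mem[a]; }
static void ee_write(uint8_t a, uint8_t v) { assert(a < EE_SIZE); ee_mem[a] = v; }

static uint8_t part_addr(uint8_t bank, uint8_t part)
{
    return (uint8_t)(bank * BANK_LEN + part * PART_LEN);
}

/* A part is good when the XOR of its data equals its check byte. */
static int part_is_good(uint8_t base)
{
    uint8_t c = 0;
    for (uint8_t i = 0; i < DATA_LEN; i++)
        c ^= ee_read((uint8_t)(base + i));
    return c == ee_read((uint8_t)(base + DATA_LEN));
}

/* Copy a whole part, data and check byte. */
static void copy_part(uint8_t from, uint8_t to)
{
    for (uint8_t i = 0; i < PART_LEN; i++)
        ee_write((uint8_t)(to + i), ee_read((uint8_t)(from + i)));
}

/* Find the current bank and restore it if a write was interrupted.
   Returns the bank number, or NO_BANK when maintenance is needed. */
uint8_t find_bank(void)
{
    for (uint8_t b = 0; b < NUM_BANKS; b++) {
        uint8_t p0 = part_addr(b, 0), p1 = part_addr(b, 1);
        int g0 = part_is_good(p0), g1 = part_is_good(p1);
        if (g0 && !g1)
            copy_part(p0, p1);          /* interrupted while copying to part 2 */
        else if (!g0 && g1)
            copy_part(p1, p0);          /* interrupted while writing part 1 */
        if (g0 || g1)
            return b;
    }
    return NO_BANK;
}
```

  With these sizes an erased device (all 0xFF) shows every bank as
  bad, so a one-time format of the first bank would be needed before
  first use.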

I haven't had a chance yet to implement all this, since for my
applications it was safe to assume that the EEPROM doesn't fail within
the device lifetime.

Bye,
 Nikolai