

# PoS

# Radiation tolerance of an SRAM based FPGA used in a large tracking detector

Ketil Røed<sup>\*†</sup> CERN CH-1211, Genève 23, Switzerland E-mail: ketil.roeed@cern.ch

Johan Alme, Dominik Fehlker, Matthias Richter, Kjetil Ullaland, Dieter Röhrich University of Bergen, P.O. Box 7803, 5020 Bergen, Norway

Håvard Helstrup Bergen University College, P.O. Box 7030, Nygårdsgaten 112, 5020 Bergen, Norway

A reconfiguration solution has been implemented in order to correct Single Event Upsets (SEUs) in the configuration memory of an FPGA. This FPGA is in charge of data readout of a large tracking detector and will operate in a radiation exposed environment. It is therefore important to reduce the probability of experiencing functional failures caused by SEUs. Correct implementation of the reconfiguration solution is validated by irradiation and fault injection tests. The results of these tests are presented along with tests that show how this solution can be combined with an additional mitigation approach to effectively reduce the functional failure rate.

9th International Conference on Large Scale Applications and Radiation Hardness of Semiconductor Detectors, RD09 September 30-October 2, 2009 Florence, Italy

\*Speaker.

http://pos.sissa.it/

<sup>†</sup>Thanks to Jon Wikne and Evind Olsen at the Oslo Cyclotron for valuable help and support during the irradiation test campaign.

## 1. Introduction

The ALICE experiment [1] is an experiment at the Large Hadron Collider (LHC) where high energy beams of particles (Lead-Lead and proton-proton) will be collided. These collisions give rise to a high production rate of primary particles, which in turn produce secondaries through hadronic and electromagnetic cascades in the structural elements of ALICE [1]. The result is particle fluxes that may pose a reliability risk to the front-end electronics in the main tracking detector of ALICE, the Time Projection Chamber (TPC). An important node in the TPC readout electronics is the Readout Control Unit [2] (RCU). It uses a Xilinx Virtex-II Pro 7 Field Programmable Gate Array (FPGA) for data readout, hereafter called the RCU main FPGA. A major drawback of Static Random Access Memory (SRAM) based FPGAs is their susceptibility to radiation induced effects [3], in particular Single Event Upsets (SEUs).

An SEU is defined by the JEDEC standard [4] as a soft error caused by the transient signal induced by a single energetic particle strike. Essentially an SEU refers to any type of memory cell whose content or value has been changed into an erroneous state due to an ionizing radiation event. In an FPGA, a function is implemented by mapping it into a matrix of programmable logic blocks controlled by SRAM memory cells. An SEU in any of these SRAM cells can lead to a variety of undesirable effects and consequently cause a malfunction in the operation of the FPGA. This can potentially interrupt the readout of detector data [5], which should be avoided if possible.

### 1.1 Radiation environment and failure rate prediction

The SEU rate of an FPGA can be calculated using equation 1.1,

$$R_{SEU} = N_{bit} \cdot \sigma_{(SEU, bit)} \cdot \phi_{Hadrons > Eth}, \tag{1.1}$$

where $N_{bit}$ is the number of configuration bits in the FPGA, $\sigma_{(SEU,bit)}$ is the SEU cross section per bit, typically for a neutron or proton kinetic energy above 60 MeV [6], and $\phi_{Hadrons>Eth}$ is the hadron flux above a given kinetic energy threshold. This somewhat simplified picture is valid because the radiation environment in ALICE mainly consists of energetic hadrons [7]. At these energies hadrons show approximately the same effectiveness in causing SEUs [8].

Depending on how a given memory cell is utilized by the system, an SEU may or may not result in a detectable malfunction of that system, hereafter referred to as a functional or operational failure. In fact, only 1 out of every 10-40 configuration memory bits is utilized in a typical design. When studying the impacts of SEUs, Xilinx refers to this number as the single event upset probability impact (SEUPI). Dividing the SEU rate by the SEUPI number consequently gives the expected rate of functional failures. This scaling factor, which is highly dependent on the implemented design, can be derived through accelerated beam tests or fault injection as described in section 3. In cases where the SEUPI scaling is unknown, Xilinx recommends a conservative factor of 10.

Simulations in [7] show fluxes ranging from approximately 100-200 hadrons/(cm<sup>2</sup>s) according to the location of the different RCUs. Summing over all the 216 RCUs, it is expected that 42 SEUs may occur during a four hour data taking run. Applying a conservative SEUPI scaling factor of 10, one can therefore expect to see 1 functional failure every hour.
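The arithmetic behind this estimate can be sketched in a few lines of Python. The function names are illustrative and do not come from the paper. The failure estimate uses only the figures quoted above (42 SEUs per four hour run, SEUPI factor 10). The inputs to the `seu_rate` call are placeholders chosen for illustration, in particular the bit count, which is an assumed round number and not the actual configuration size of the device.

```python
def seu_rate(n_bit, sigma_bit_cm2, flux_per_cm2_s):
    """Equation 1.1: SEU rate in SEUs/s for a single FPGA."""
    return n_bit * sigma_bit_cm2 * flux_per_cm2_s


def functional_failures(n_seu, seupi=10):
    """Scale an SEU count by the SEUPI factor (10 is the conservative default)."""
    return n_seu / seupi


# Figures quoted in the text: 42 SEUs summed over all RCUs in a 4 h run.
seus_per_run = 42
run_hours = 4
failures_per_run = functional_failures(seus_per_run)
failures_per_hour = failures_per_run / run_hours
print(f"{failures_per_run:.1f} failures per run, {failures_per_hour:.2f} per hour")

# Placeholder inputs (assumptions, not values from the paper):
# 4e6 configuration bits, 3e-14 cm2/bit, 200 hadrons/(cm2 s).
example_rate = seu_rate(4_000_000, 3e-14, 200)
print(f"example rate: {example_rate:.1e} SEUs/(FPGA s)")
```

With a SEUPI factor of 10 the 42 SEUs per run translate into roughly four functional failures per run, which is the "1 functional failure every hour" quoted above.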

## 2. System overview

Because the RCU main FPGA is in charge of data readout for the TPC detector, it is important that this FPGA is kept operational during a data taking run. A reconfiguration network has therefore been implemented to repair SEUs in the configuration memory of the RCU main FPGA through the use of partial reconfiguration [11]. This method makes it possible to read back a subset of the configuration memory and check it for corrupted bit locations. If an erroneous bit value is detected, it is corrected by rewriting the correct value to this bit location. The smallest portion of data that can be read from or written to the configuration memory in one operation is called a frame. Therefore, the reconfiguration solution described above is referred to as Frame by frame Readback, Verification and Correction (FRVC). An important feature of this solution is that it can run continuously without interrupting the operation of the RCU main FPGA. SEUs in the configuration memory of the RCU main FPGA can therefore be corrected during normal operation.

The RCU consists of the RCU motherboard, the Detector Control System board (DCS) and the Source Interface Unit (SIU). In total 216 nodes are present in the system. A simplified sketch of the RCU can be seen in figure 1(a) and more detailed information can be found in [2], [12] and [13].



**Figure 1:** (a) Conceptual schematic of the RCU motherboard. The data path of the system is marked with black arrows. As can be seen it passes through the Xilinx Virtex-II Pro FPGA. (b) Conceptual model showing the different modes of operation for the RCU system

### 2.1 Reconfiguration network

The main parts of the reconfiguration network are the DCS embedded computer, an Actel flash based support FPGA, and a flash memory device. In a radiation environment mainly dominated by energetic hadrons, memory devices based on flash technology are considered inherently SEU tolerant [14]. The support FPGA is in charge of all low level control of the reconfiguration network while the flash memory device stores all needed configuration files for the RCU main FPGA. These devices, including the Xilinx FPGA, are accessible from the DCS embedded computer through the DCS bus interface as shown in Fig. 1(a) and 1(b). The DCS bus is controlled by the DCS embedded computer and supports three modes of operation: flash mode, SelectMAP mode and normal mode.

In SelectMAP and flash modes, the control logic of the support FPGA is effectively bypassed. This enables a direct access from the DCS embedded computer to the configuration memory of the RCU main FPGA or to the flash memory. Flash mode is used to upload configuration files and individual data frames for storage on the flash memory device.

In normal mode the DCS bus operates in a memory mapped fashion. Commands can be issued from the DCS embedded computer to the configuration controller running on the support FPGA. Besides initial configuration on power up, the main task of the support FPGA is to carry out the FRVC process. During FRVC each configuration frame is read back from the RCU main FPGA and verified bit by bit with the original frame stored on the flash memory. If different, the support FPGA reconfigures only the frame containing corrupt data. Running one cycle of FRVC includes sequentially reading back all frames in the configuration memory once. FRVC can be run as a single cycle or in a continuous mode until stopped by DCS software.
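The FRVC loop described above can be illustrated with a small software model. This is a sketch only: the real procedure runs in the support FPGA against the SelectMAP interface and the flash memory, whereas here both are replaced by Python lists, and the function name `frvc_cycle` and the toy frame size are assumptions made for the example.

```python
def frvc_cycle(device_frames, golden_frames):
    """Run one FRVC cycle: read back, verify and correct every frame once.

    device_frames: list of bytearray, stand-in for the FPGA configuration memory.
    golden_frames: list of bytes, stand-in for the reference frames in flash.
    Returns (number of flipped bits found, indices of the rewritten frames).
    """
    seu_count = 0
    rewritten = []
    for index, golden in enumerate(golden_frames):
        readback = bytes(device_frames[index])       # read one frame
        if readback != golden:                       # verify against flash copy
            # Count differing bits by XOR-ing the two frames byte by byte.
            seu_count += sum(bin(a ^ b).count("1")
                             for a, b in zip(readback, golden))
            device_frames[index][:] = golden         # rewrite only this frame
            rewritten.append(index)
    return seu_count, rewritten


# Toy memory: 8 frames of 4 bytes each (real frame sizes are device specific).
golden = [bytes([i] * 4) for i in range(8)]
device = [bytearray(frame) for frame in golden]
device[2][1] ^= 0x10   # emulate one SEU in frame 2
device[5][3] ^= 0x01   # emulate one SEU in frame 5

first = frvc_cycle(device, golden)    # finds and repairs both upsets
second = frvc_cycle(device, golden)   # memory is clean again
print(first, second)
```

Only the two corrupted frames are rewritten, and the second cycle reports no upsets, which mirrors the single cycle and continuous modes described above.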

Scrubbing is an alternative to the FRVC process. Like FRVC it also reconfigures the RCU main FPGA without interrupting the operation of the device. However, compared to FRVC it reconfigures the full device in one operation and can neither address individual frames nor count the number of SEUs. FRVC is therefore the preferred correction mechanism.

### 2.2 Fault injection

While the main task of the support FPGA is to detect and correct any SEUs that occur in the configuration memory of the RCU main FPGA, the reconfiguration network can also be used to write incorrect data to the configuration memory. This is referred to as fault injection. It makes it possible to inject errors into the configuration memory in order to study the effects they have on the operation of the RCU main FPGA. All available bits in the configuration memory can be systematically flipped one by one. This can for instance be used to map the sensitive bits of a design, which in turn yields the SEUPI scaling factor described in section 1.1.
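A systematic bit by bit fault injection campaign can be sketched as follows. The model is deliberately simplified and every name in it is hypothetical: `design_fails` stands in for whatever functional check is run on the real hardware after each injection, and the toy "design" simply declares 8 of its 160 configuration bits to be in use.

```python
def map_sensitive_bits(golden_frames, design_fails):
    """Flip every configuration bit once and record the ones that break the design.

    design_fails(frames) is a hypothetical callback that checks the design
    with the given configuration and returns True on a malfunction.
    """
    sensitive = []
    total_bits = 0
    for f, frame in enumerate(golden_frames):
        for bit in range(len(frame) * 8):
            total_bits += 1
            faulty = [bytearray(fr) for fr in golden_frames]  # fresh copy
            faulty[f][bit // 8] ^= 1 << (bit % 8)             # inject one flip
            if design_fails(faulty):
                sensitive.append((f, bit))
    return sensitive, total_bits


# Toy configuration: 4 frames of 5 bytes (160 bits), all zero in the golden copy.
golden = [bytes(5) for _ in range(4)]
# Assumed set of (frame, bit) positions that the toy design actually uses.
used = {(0, 0), (0, 9), (1, 3), (1, 20), (2, 7), (2, 33), (3, 1), (3, 39)}


def toy_design_fails(frames):
    # The golden value of every bit is 0, so a used bit reading 1 is an upset.
    return any((frames[f][b // 8] >> (b % 8)) & 1 for f, b in used)


sensitive, total_bits = map_sensitive_bits(golden, toy_design_fails)
print(f"{len(sensitive)} sensitive bits out of {total_bits}: "
      f"scaling factor {total_bits / len(sensitive):.0f}")
```

The ratio of total bits to sensitive bits plays the role of the SEUPI scaling factor for this toy design.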

Due to the complexity of the readout electronics, irradiation testing is technically difficult when considering the system as a whole. Fault injection was therefore implemented as an alternative test that can be used to characterize the failure signatures of the final RCU main FPGA design. Additionally it also proved a valuable tool for the validation tests described in section 3.

## 3. Validation test results

To validate the implementation of the reconfiguration network and its potential effectiveness in reducing the functional failure rate, irradiation tests have been carried out in a 29 MeV proton beam at the Oslo Cyclotron (OCL), at the University of Oslo in Norway. Fault injection tests have also been carried out and compared to the irradiation test.

### 3.1 Measurement of configuration times

The configuration speed is essentially limited by the access time of the flash memory and the implementation details of the configuration controller. In [7] a worst case number of $2.4 \cdot 10^{-5}$ SEUs/(FPGA s) is reported for the expected SEU rate. To sufficiently reduce the probability for accumulation of SEUs, Xilinx [15] recommends that the reconfiguration rate should be at least an order of magnitude above the upset rate. As can be seen from table 1, a reconfiguration frequency of 6.6 Hz for the FRVC procedure is well within this recommendation.

| Operation              | Time   | Frequency |
|------------------------|--------|-----------|
| Initial configuration  | 113 ms | -         |
| Scrubbing              | 77 ms  | 13 Hz     |
| Read one frame         | 163 µs | -         |
| Write one frame        | 180 µs | -         |
| Read all frames (FRVC) | 150 ms | 6.6 Hz    |

**Table 1:** Measured times for the different configuration procedures. Note that the scrubbing time depends on the design, as the scrubbing file is in compressed format. Frequency is given for procedures that are meant to run continuously.
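The margin between the FRVC frequency and the worst case upset rate can be checked with a short calculation. The function name is illustrative. The two inputs are the 150 ms FRVC cycle time from table 1 and the worst case rate of $2.4 \cdot 10^{-5}$ SEUs/(FPGA s) quoted from [7].

```python
def scrub_margin(cycle_time_s, upset_rate_per_s):
    """Ratio between the reconfiguration frequency and the SEU rate."""
    return (1.0 / cycle_time_s) / upset_rate_per_s


frvc_cycle_time = 0.150    # s, "Read all frames (FRVC)" in table 1
worst_case_rate = 2.4e-5   # SEUs/(FPGA s), worst case from [7]

margin = scrub_margin(frvc_cycle_time, worst_case_rate)
upsets_per_cycle = worst_case_rate * frvc_cycle_time
print(f"reconfiguration frequency / upset rate = {margin:.1e}")
print(f"expected upsets per FRVC cycle = {upsets_per_cycle:.1e}")
```

The ratio is more than five orders of magnitude, far above the factor of 10 recommended in [15], so the chance of two upsets accumulating within one FRVC cycle is very small.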

### 3.2 SEU cross section measurements

Based on 61 individual measurements the SEU cross section was calculated to be $2.1 \cdot 10^{-14}$ cm<sup>2</sup>/bit. The result is shown in the histogram of figure 2. Beam fluxes on the order of $10^6 - 10^7$ particles/(cm<sup>2</sup> s) were used and measured using Thin Film Breakdown Counters (TFBC) [16][17]. During irradiation a scintillator, pre-calibrated with the TFBC, was used as relative beam monitor. The SEU cross section of the Xilinx Virtex-II Pro has also been measured by others and reported in literature to be $3.6 \cdot 10^{-14}$ cm<sup>2</sup>/bit for 63.3 MeV protons [18] and $2.98 \cdot 10^{-14}$ cm<sup>2</sup>/bit for an atmospheric (Hess) neutron spectrum (E<sub>kin</sub> > 10 MeV) [10]. This discrepancy is mainly explained by the low energy proton beam used at OCL. At 29 MeV the protons are close to the threshold for inducing non-elastic nuclear interactions. Additionally the proton kinetic energy is slightly attenuated as the protons have to penetrate the Copper lid and Silicon substrate of the FPGA before reaching the sensitive area. Even though the cross section at 29 MeV is lower than at 63.3 MeV, by somewhat less than a factor of 2, a 29 MeV beam is still more than capable of inducing acceptable upset rates for the validation tests.

**Figure 2:** Histogram showing distribution of the measured SEU cross sections.

#### 3.2.1 RCU main FPGA shift register test design

The FPGA test design used during both the irradiation and the fault injection tests was a basic shift register extended with a configurable Triple Modular Redundancy (TMR) solution. TMR is a commonly used mitigation for FPGAs where three identical copies of the logic operate in parallel [19][20]. A majority voter is placed at the output to identify the correct value. Configurable TMR means that the design is extended with a simple multiplexer that can either "turn off" TMR by forwarding the output directly from the primary shift register, or "turn on" TMR by forwarding the output from the voter.
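The behaviour of the voter and the output multiplexer can be modelled in a few lines. This is a behavioural sketch in Python rather than the HDL of the actual test design, and the function names are chosen for illustration.

```python
def majority(a, b, c):
    """Bitwise 2-out-of-3 majority vote."""
    return (a & b) | (a & c) | (b & c)


def tmr_output(primary, copy_b, copy_c, tmr_enabled):
    """Output multiplexer: voter result when TMR is on, primary chain otherwise."""
    return majority(primary, copy_b, copy_c) if tmr_enabled else primary


# One corrupted chain: the primary shift register delivers 0 instead of 1.
without_tmr = tmr_output(0, 1, 1, tmr_enabled=False)  # error reaches the output
with_tmr = tmr_output(0, 1, 1, tmr_enabled=True)      # error is outvoted
# Two corrupted chains defeat the voter.
double_fault = tmr_output(0, 0, 1, tmr_enabled=True)
print(without_tmr, with_tmr, double_fault)
```

The last case shows the limit of the scheme: the voter masks a fault in one replica only, so any single upset that affects two replicas at once still produces a wrong output.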

#### 3.2.2 Test strategy

The purpose of the irradiation and fault injection tests was twofold. During normal operation the main task of the reconfiguration network is to correct SEUs in the configuration memory of the RCU main FPGA. Thus, the first objective of the validation tests was to confirm that the reconfiguration network is capable of both detecting and correcting SEUs through the method of partial reconfiguration. The second objective was to demonstrate how the reconfiguration network can be used to reduce the expected failure rate of the RCU main FPGA design. A special test strategy was adopted as illustrated in figure 3. It serves the purpose of studying how three different mitigation scenarios will affect the operation of the RCU main FPGA design.

**Figure 3:** Flow diagram of the validation test procedure for the shift register design.

The validation test is divided into three periods of equal length. During the first period neither the reconfiguration nor the TMR mitigation option is enabled. As a consequence, SEUs will accumulate in the configuration memory. For the second, intermediate period, reconfiguration is activated by enabling continuous cycles of FRVC. During the very first cycle of FRVC, the accumulated SEUs are counted and corrected by the RCU support FPGA. Any errors in the readout of the shift register will consequently be corrected as well. If an SEU occurs during the second period, it may still cause a temporary failure in operation until it is corrected by the following cycle of FRVC. For the final period both reconfiguration and TMR are enabled at the same time. The outputs of the shift register are now fed through the majority voter and compared to the outputs of two identical shift registers. An operational failure in one of the shift register chains should now be masked out by the majority voter. During all periods a checkerboard data pattern is shifted through the shift register and continuously monitored for any unexpected values. A persistent difference in the data is registered as an operational failure caused by an SEU in the configuration memory.

#### 3.2.3 Results

During the validation tests SEUs or errors are forced into the configuration memory of the RCU main FPGA either through proton beam irradiation or fault injection. A special shift register output error plot is produced to study the behaviour of the shift register during the tests. Figures 4(a) and 4(b) show examples of this plot for the irradiation test and fault injection test respectively. An erroneous output is indicated by an entry (black dot) in the plot, where a persistent error will appear as a continuous black line. During the first period of 200 seconds the number of erroneous outputs increases with time. When the FRVC procedure is enabled after 200 seconds, the erroneous outputs are removed as the accumulated SEUs are corrected in the configuration memory. This onset of the reconfiguration procedure is clearly seen in figures 4(c) and 4(d) where the number of detected SEUs is plotted as a function of time. For the irradiation test case the scintillator counts also nicely demonstrate the linear dependence between the beam fluence and the number of SEUs. Even though the SEUs are continuously corrected, the output error plot still indicates failures in the shift register. This occurs because the checkerboard pattern is shifted through the shift register with a frequency of 100 Hz, which is faster than the reconfiguration frequency of about 6.6 Hz.

The effect of combining reconfiguration with a design level mitigation technique, TMR, is demonstrated for the last 200 seconds of the test period. Ideally the shift register should now show no signs of failure. The reason why some failures are still seen is explained mainly by the simplistic approach taken when implementing the TMR. Only a minimum of effort was put into optimizing the TMR design as the main purpose of the tests was to demonstrate correct behaviour of the reconfiguration solution. The TMR part of the design is placed and routed interwoven with the shift register test design. Additionally all three shift register chains share the same clock tree. A single configuration bit may therefore control resources connected to more than one replica of the shift register, hence resulting in the voter taking the wrong decision. Nevertheless, this simple approach made it possible to quantify the effect of the different mitigation approaches by comparing the average number of SEUs needed to cause a functional failure. The results are listed in tables 2 and 3 for the irradiation test and fault injection test respectively. During fault injection bit flips were injected in random locations for best possible comparison to the irradiation test results. Also, fault injection makes it possible to study the effect of the TMR alone since the number of injected bit flips is easily accounted for. First of all the results show how well fault injection reproduces the irradiation test results. Fault injection can therefore be used as an alternative test method for failure characterization of the final RCU main FPGA design. Furthermore, it can be seen that reconfiguration alone does not improve the failure probability. Only when reconfiguration is combined with an additional design level mitigation approach does the number of SEUs needed to cause a functional failure increase. The factor 4 improvement is expected to be further increased by a more optimized TMR design.
It is important to notice that the number of SEUs needed to cause a functional failure is highly dependent on the details of the implemented design. The results of the shift register design are not expected to represent the results of the final RCU main FPGA design. This can only be achieved through testing using the final RCU main FPGA design, which was not available at the time of these validation tests.

**Figure 4:** The plot shows example results from the irradiation test (left column) and the fault injection test (right column) of the RCU main FPGA shift register design. (a,b) error plot of the shift register output where a black dot indicates an erroneous output. (c) the normalized values of the scintillator used as a relative beam monitor and the SEU counts during the irradiation testing. (d) the SEU counts detected during fault injection. (e,f) the respective current consumption of the RCU during irradiation testing and fault injection. For the first 200 s the shift register is run without FRVC and TMR. From 200-400 s FRVC is enabled. From 400-600 s both FRVC and TMR are enabled. The slightly lower activity for the fault injection test is explained by the lower rate of 0.9 injected bit flips per second compared to 2.4 SEUs/s for the irradiation tests.

| Run Type      | SEUs    | FF   | Average SEU/FF |
|---------------|---------|------|----------------|
| No mitigation | 2628±51 | 71±8 | 37±4           |
| FRVC          | 2632±51 | 67±8 | 39±5           |
| FRVC + TMR    | 2599±51 | 14±4 | 186±50         |

**Table 2:** The average number of SEUs needed to induce a functional failure (FF) for three different mitigation scenarios. Uncertainties are given as counting statistics only.

| Run Type      | Injected bit flips | FF   | Average SEU/FF |
|---------------|--------------------|------|----------------|
| No mitigation | 95683              | 2600 | 37±1           |
| FRVC          | 98056              | 2600 | 38±1           |
| TMR           | 235337             | 2600 | 91±2           |
| FRVC + TMR    | 404391             | 2600 | 156±3          |

**Table 3:** The average number of injected bit flips needed to induce a functional failure (FF) for four different mitigation scenarios. Uncertainties are given as counting statistics only.
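The last column of table 2 can be reproduced from the raw counts. The sketch below assumes that both counts are Poisson distributed and that their relative uncertainties are added in quadrature. This propagation rule is an assumption made here, not something stated in the text, but it reproduces the quoted values.

```python
from math import sqrt


def seu_per_failure(n_seu, n_ff):
    """Average SEUs per functional failure with a counting-statistics uncertainty."""
    ratio = n_seu / n_ff
    # Relative uncertainties of the two counts (1/sqrt(N)) added in quadrature.
    sigma = ratio * sqrt(1.0 / n_seu + 1.0 / n_ff)
    return ratio, sigma


# (SEUs, functional failures) taken from table 2.
table2 = {
    "No mitigation": (2628, 71),
    "FRVC": (2632, 67),
    "FRVC + TMR": (2599, 14),
}

results = {run: seu_per_failure(*counts) for run, counts in table2.items()}
for run, (ratio, sigma) in results.items():
    print(f"{run:14s} {ratio:4.0f} ± {sigma:.0f}")
```

Applied to the irradiation data, the improvement from "No mitigation" to "FRVC + TMR" is about a factor of 5 with a large uncertainty, while the fault injection data in table 3 give a factor of about 4 with much better statistics.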

During irradiation testing, when accumulating SEUs, the current consumption of the RCU was observed to increase. Immediately after reconfiguration was enabled, it dropped down to its initial level as can be seen in figure 4(e). This is an effect also reported in [21] where it is explained to be due to internal contention by the accumulated upsets. Since not all the configuration bits may be used for a given design, a number of bits remains "unprogrammed". In case these bits are "programmed" by an SEU, some of them might for instance connect the clock tree to unused logic and induce more activity which again increases the current consumption. When corrected by a reconfiguration, this unwanted activity is removed and the current consumption is reduced to its initial level. If the current consumption did not decrease after a reconfiguration, this could indicate a Single Event Latch-up. This situation was not observed during the two test periods.

## 4. Conclusion

A reconfiguration solution has been implemented to correct SEUs in the configuration memory of an FPGA in charge of data readout for a large tracking detector. Correct operation of the reconfiguration network has been validated through both irradiation testing and software based fault injection testing. It was also shown that in order to reduce the functional failure rate of the FPGA, partial reconfiguration must be combined with a design level mitigation technique such as TMR. This may however significantly impact the usage of logic resources in the FPGA. A careful investigation is therefore needed to identify the critical parts of the design that should be protected. The fault injection tests were also capable of reproducing the irradiation test result. It is therefore likely that fault injection can be used to study failure signatures and predict the expected functional failure rate of the final RCU main FPGA design. In addition, fault injection can be used to measure the effect of an implemented mitigation technique and thereby contribute to improving the radiation tolerance of the final design.

## References

- [1] The ALICE Collaboration, K. Aamodt, et al. ALICE Experiment at the CERN LHC. *IOP Journal of Instrumentation*, JINST 3 S08002, 2008.
- [2] C. G. Gutiérrez, R. Campagnolo, A. Junique, L. Musa, J. Alme, J. Lien, B. Pommersche, M. Richter, K. Røed, D. Röhrich, K. Ullaland, and T. Alt. The ALICE TPC readout control unit. *Nuclear Science Symposium Conference Record*, 2005 IEEE, 1:575–579, Oct. 2005.
- [3] Xilinx, Inc. Correcting Single-Event Upsets in Virtex-4 Platform FPGA Configuration Memory, xapp989 v1.0 edition, March. 2008.
- [4] JEDEC STANDARD: Measurement and Reporting of Alpha Particle and Terrestrial Cosmic Ray-Induced Soft Errors in Semiconductor Devices. Technical report, JEDEC Solid State Technology Association, Arlington, VA 22201-3834, Revision of JESD89, Aug. 2001.
- [5] Johan Alme. Firmware Development and Integration for ALICE TPC and PHOS Front-end Electronics. PhD thesis, University of Bergen, Bergen, Norway, 2008.
- [6] M. Huhtinen and F. Faccio. Computational method to estimate single event upset rates in an accelerator environment. *Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment*, 450(1):155 – 172, 2000.
- [7] Ketil Røed. Single Event Upsets in SRAM FPGA based readout electronics for the Time Projection Chamber in the ALICE experiment. PhD thesis, University of Bergen, 2009.
- [8] H.H.K. Tang. Nuclear physics of cosmic ray interaction with semiconductor materials: Particle-induced soft errors from a physicists perspective. *IBM Journal of Research and Development*, 40(1):2162–2167, 1996.
- [9] J.J. Fabula. The NSEU response of static latch based FPGAs. Presented at the Military and Aerospace Programmable Logic Devices (MAPLD) conference, Apr 2003.
- [10] A. Lesea, S. Drimer, J.J. Fabula, C. Carmichael, and P. Alfke. The Rosetta experiment: atmospheric soft error rate testing in differing technology FPGAs. *Device and Materials Reliability, IEEE Transactions on*, 5(3):317–328, Sept. 2005.
- [11] Xilinx, Inc. Virtex-II Pro and Virtex-II Pro X FPGA User Guide, ug012 v4.2 edition, Nov. 2007.
- [12] M. Richter, J. Alme, T. Alt, S. Bablok, R. Campagnolo, U. Frankenfeld, C.G. Gutierrez, R. Keidel, Ch. Kofler, T. Krawutschke, D. Larsen, V. Lindenstruth, B. Mota, L. Musa, K. Røed, D. Röhrich, M.R. Stockmeier, H. Tilsner, and K. Ullaland. The control system for the front-end electronics of the ALICE time projection chamber. *Nuclear Science, IEEE Transactions on*, 53(3):980–985, June 2006.
- [13] L. Musa, J. Baechler, N. Bialas, R. Bramm, R. Campagnolo, C. Engster, F. Formenti, U. Bonnes, R. Esteve Bosch, U. Frankenfeld, P. Glassel, C. Gonzales, H.-A. Gustafsson, A. Jimenez, A. Junique, J. Lien, V. Lindenstruth, B. Mota, P. Braun-Munzinger, H. Oeschler, L. Osterman, R. Renfordt, G. Ruschmann, D. Röhrich, H.-R. Schmidt, J. Stachel, A.-K. Soltveit, and K. Ullaland. The ALICE TPC front end electronics. *Nuclear Science Symposium Conference Record, 2003 IEEE*, 5:3647–3651 Vol.5, Oct. 2003.
- [14] Actel Corporation. APA750 and A54SX32A LANSCE Neutron Test Report, white paper edition, Dec. 2003.
- [15] Xilinx, Inc. Correcting Single-Event Upsets Through Virtex Partial Configuration, xapp216 v1.0 edition, June 2000.

- [16] V.P. Eismont, A.V. Prokofiev, and A.N. Smirnov. Thin-film breakdown counters and their applications (review). *Radiation Measurements*, 25(1-4):151–156, 1995.
- [17] A.V. Prokofiev, A.N. Smirnov, and P-U Renberg. A Monitor of Intermediate-Energy Neutrons Based on Thin Film Breakdown Counters. Technical report, The Svedberg Laboratory and Department of Radiation Science, Uppsala, Sweden, 1999.
- [18] H. Quinn, P. Graham, J. Krone, M. Caffrey, and S. Rezgui. Radiation-induced Multi-Bit Upsets in SRAM-Based FPGAs. *Nuclear Science, IEEE Transactions on*, 52(6):2455–2461, Dec. 2005.
- [19] Fernanda Lima Kastensmidt. SEE Mitigation Strategies for Digital Circuit Design Applicable to ASIC and FPGAs. Nuclear and Space Radiation Effects Conference, Short Course Notebook, July 2007.
- [20] Philippe Adell and Greg Allan. Assessing and Mitigating Radiation Effects in Xilinx FPGAs. Technical Report JPL publication 08-9 2/08, Jet Propulsion Laboratory, California Institute of Technology, Pasadena, California, 2008.
- [21] E. Fuller, P. Blain, M. Caffrey, and C. Carmichael. Radiation Test Results of the Virtex FPGA and ZBT SRAM for Space Based Reconfigurable Computing. *Proc. MAPLD*, Sept. 1999.