

# Triple-Modular Redundancy Deployment Optimization in the Sensor Readout System of the CBM Micro Vertex Detector

Y. Zhao,<sup>a,1</sup> A. Himmi,<sup>a</sup> F. Morel,<sup>a</sup> C. Hu-Guo,<sup>a</sup> J. Baudot,<sup>a</sup> and Y. Hu<sup>a</sup>

<sup>a</sup> Institut Pluridisciplinaire Hubert Curien, 23 rue du Loess, 67037 Strasbourg, France. E-mail: yue.zhao@iphc.cnrs.fr

This paper describes the deployment and optimization of triple-modular redundancy (TMR) against single-event upsets (SEU) and single-event transients (SET) under tight design constraints. It covers the modelling of single-event effect (SEE) pulses with a TCAD mesh model, the TMR deployment strategies, and the verification methods. The simulation results show that the prototype system with optimized TMR deployment meets the reliability required by the design. The system can run for more than 5 years without critical errors, and its equivalent error rate in the working environment is lower than 10<sup>-9</sup>.

Topical Workshop on Electronics for Particle Physics (TWEPP2019) 02-06 September 2019 Santiago de Compostela, Spain

<sup>1</sup> Speaker

<sup>©</sup> Copyright owned by the author(s) under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0).

## 1. Introduction

The Micro Vertex Detector (MVD) is being designed with CMOS pixel sensors to meet the needs of the Compressed Baryonic Matter (CBM) experiment [1]. The CBM experiment will study gold-gold and proton-gold collision systems. The highly ionizing particles and ions produced in the collisions, such as gold, carbon and protons, may induce single-event effects (SEE) in the MVD sensors. SEEs may cause temporary or permanent functional errors in the circuit, such as single-event upset (SEU), single-event transient (SET), single-event latch-up (SEL) and single-event functional interrupt (SEFI).

The main design approach against SEU and SET is to add redundancy to the circuit. For data words, error detection and correction (EDAC) encoding ensures data accuracy. For logic circuits, Triple Modular Redundancy (TMR) is used to achieve high reliability. Redundancy is very effective against SEU and SET, but at the cost of power, area and circuit speed [2]. A circuit hardened with full TMR has at least three times the power and area of the original circuit. There is therefore a crucial need to balance the cost of redundancy against its benefits.
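The principle behind TMR can be illustrated with a minimal Python sketch of a bitwise 2-of-3 majority voter. This is only a behavioral illustration, not the voter cell used in MIMOSIS-1; the function names and the 8-bit example word are assumptions made for the example.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority of three redundant register copies."""
    return (a & b) | (b & c) | (a & c)


def flip_bit(word: int, bit: int) -> int:
    """Model an SEU as a single bit flip in one stored copy."""
    return word ^ (1 << bit)


golden = 0b1011_0010                 # arbitrary example word (assumption)
copies = [golden, golden, golden]    # the three redundant copies
copies[1] = flip_bit(copies[1], 4)   # an SEU hits one of the three copies

voted = majority_vote(*copies)
print(f"corrupted copy: {copies[1]:#010b}, voted output: {voted:#010b}")
```

A single upset is masked because the two intact copies outvote the corrupted one. The voter fails only if the same bit is upset in two copies before the error is corrected.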

The MIMOSIS-1 sensor is now being designed as the prototype sensor of the MVD for the CBM experiment. To meet the readout speed and high data compression specifications [3], a three-layer buffer structure is proposed as the readout system of MIMOSIS-1, shown in Figure 1. This three-layer buffer converts the readout from a fast burst data stream into a slow sequential data stream. The design uses the TowerJazz 180 nm process design kit.



Fig. 1. MIMOSIS-1 functional organization, with the three-layer buffer (RRU, SRRU and Elastic Buffer).

## 2. Design and Optimization of the TMR Circuit

#### 2.1 TMR deployment strategy

The readout system is divided by function into two parts: the control logic and the data buffer path. If a SEE occurs in the control logic, it usually disturbs the working state of the system, so the output contains a large amount of erroneous data. If a SEE occurs in the data path, it affects only a single data word. Consequently, this design focuses mainly on the control logic.

In digital circuit design, control logic is described as a finite-state machine (FSM) driven by an internal status counter. The status counter updates the status on each clock cycle based on the input signals and the current status value. This means that the circuit contains an assignment loop, as shown in Figures 2(a) and (b). If a SEE occurs in an assignment loop and the inputs do not change, the state is not restored automatically until the next reinitialization. This situation is common in a readout system. In contrast, if a SEE pulse occurs in sequential (flow) processing, as shown in Figure 2(c), the corrupted status is flushed during processing. The self-assignment registers, especially the status counters of FSMs, therefore need a high level of redundancy.
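The difference between an assignment loop and flow processing can be sketched with a toy cycle-based model in Python. The function names, the one-bit state and the three-stage pipeline depth are assumptions chosen for illustration and do not correspond to actual MIMOSIS-1 registers.

```python
def run_self_assign(cycles: int, upset_cycle: int) -> list[int]:
    """Register that reloads its own value on every clock (inputs unchanged)."""
    state, trace = 0, []
    for t in range(cycles):
        if t == upset_cycle:
            state ^= 1       # SEE flips the stored bit
        state = state        # next state = current state, so the error is held
        trace.append(state)  # 1 means the register holds a wrong value
    return trace


def run_flow(cycles: int, upset_cycle: int, depth: int = 3) -> list[int]:
    """Pipeline whose stages are overwritten by fresh upstream data each clock."""
    stages, trace = [0] * depth, []
    for t in range(cycles):
        stages = [0] + stages[:-1]  # shift: correct data enters the first stage
        if t == upset_cycle:
            stages[0] ^= 1          # SEE flips the first stage
        trace.append(sum(stages))   # number of corrupted stages in the pipeline
    return trace


print("self-assign:", run_self_assign(8, 2))
print("flow       :", run_flow(8, 2))
```

In this model the self-assigning register keeps the flipped bit until the end of the run, whereas the pipeline is clean again once the corrupted value has been shifted out after `depth` clocks.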



Fig. 2. Two types of assignment loop, (a) self-assign and (b) loop-assign, and (c) flow processing.

The time needed to recover from a SEE pulse is another criterion for deciding where to deploy TMR. Figure 3 shows an example of an FSM recovery mechanism. Part (a) is the normal schedule, in which the FSM starts from the idle state and returns to the idle state after steps A, B, C and D. Part (b) shows that if a SEE pulse occurs in the second step and drives the FSM into an indeterminate state, the FSM still returns to the idle state when the end signal arrives. The recovery time is then no longer than one processing cycle. If the FSM is protected by TMR, the SEE error is corrected on the next clock, as depicted in (c), and the recovery time is only one clock cycle.
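The two recovery mechanisms can be compared with a small behavioral model in Python. It is a sketch under stated assumptions: the state names follow the idle/A/B/C/D schedule described above, `"X"` is a hypothetical indeterminate state with no exit of its own, and the end signal is modelled as a forced return to idle at every multiple of the cycle length.

```python
from collections import Counter

SEQUENCE = ["IDLE", "A", "B", "C", "D"]  # one processing cycle
BAD = "X"                                # hypothetical indeterminate state


def next_state(state: str) -> str:
    """Nominal transition; the indeterminate state has no exit of its own."""
    if state == BAD:
        return BAD
    return SEQUENCE[(SEQUENCE.index(state) + 1) % len(SEQUENCE)]


def vote(copies: list[str]) -> str:
    """Majority value of the three state-register copies."""
    return Counter(copies).most_common(1)[0][0]


def corrupted_cycles(use_tmr: bool, upset_at: int, cycles: int = 10) -> int:
    """Count the clocks during which register copy 0 holds a wrong state."""
    copies = ["IDLE"] * 3  # without TMR only copies[0] is the real register
    corrupted = 0
    for t in range(cycles):
        reference = SEQUENCE[t % len(SEQUENCE)]
        if t > 0:  # clock edge
            if t % len(SEQUENCE) == 0:
                copies = ["IDLE"] * 3                    # end signal resets the FSM
            elif use_tmr:
                copies = [next_state(vote(copies))] * 3  # reload from the voted state
            else:
                copies = [next_state(c) for c in copies]
        if t == upset_at:
            copies[0] = BAD                              # SEE corrupts one register
        if copies[0] != reference:
            corrupted += 1
    return corrupted


print("no TMR:", corrupted_cycles(use_tmr=False, upset_at=1))
print("TMR   :", corrupted_cycles(use_tmr=True, upset_at=1))
```

For an upset in the second step, this model leaves the unprotected register wrong for 4 clocks (until the end signal) and the TMR-protected copy wrong for 1 clock, in line with the bounds stated above.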



Fig. 3. Sketch of an FSM recovery mechanism: (a) normal schedule, (b) reset by end signal, (c) recovered by TMR.

The readout procedure in MIMOSIS-1 operates frame by frame. The readout FSM is refreshed every frame, while the frame controller is not. The controller counts the frames and generates the start and end signals that divide the data into frames. If a SEE pulse occurs in the readout FSM, it affects only a single frame of data. In contrast, if a SEE pulse occurs in the frame controller, errors persist until the next reset. Therefore, the frame controller is given a higher level of redundancy than the readout FSM.

#### 2.2 Verification by Simulations

The reliability of the TMR deployment is evaluated by digital post-layout simulation. Simulating the netlist together with the parasitic parameters extracted from the layout reflects the actual performance of the circuit, including cell delays and timing constraints. The reaction of the circuit to a SEE pulse can then be observed by injecting an external disturbance. The pulses generated by SEEs are modelled in a TCAD tool as transient pulses on circuit nodes. The pulse duration depends primarily on the linear energy transfer (LET) and the incidence angle of the particle. The simulated schematic is shown in Figure 6: the NMOS transistor in the driver inverter is described by a TCAD mesh model, and the rest of the circuit by SPICE models. Heavy ions with different LETs strike the drain region, and the response of the circuit is monitored at node Vout, as shown in Figure 4. The pulse width grows logarithmically rather than linearly with LET. Figure 5 shows the pulse width for different load capacitances. These results provide the SEE pulse model used in the verification simulations.

From these simulation results, we model the SEE pulses as a function of LET and driver load ratio. In the RTL-level simulation, we generate a series of random SEE pulses. The LET follows the distribution of the incident particles, and the driver load ratio is taken from the prototype netlist. The pulses are injected through the simulator's "force" and "release" features. The readout results are collected and compared with the reference results to count the errors that occurred during the run. The simulation was run 100 times with different random seeds to increase the statistical significance of the results. The netlist is elaborated and the SEE vectors are generated with SEEG, a tool in the TMRG toolset [4]. The simulations run on NCsim in the Cadence Incisive Logic Verification toolset.
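The structure of such an injection campaign can be sketched in Python as follows. This is not the SEEG implementation: the node names, the uniform LET range, and the coefficients `a` and `b` of the logarithmic width model are placeholders invented for the example, and the output is a generic list of force/release events rather than actual simulator commands.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SeePulse:
    node: str        # netlist node hit by the particle (hypothetical names)
    start_ns: float  # time at which the node is forced to the wrong value
    width_ns: float  # duration after which the node is released


def pulse_width_ns(let: float, a: float = 0.1, b: float = 0.2) -> float:
    """Placeholder logarithmic width model; a and b are NOT values from the paper."""
    return a + b * math.log(let)


def make_campaign(nodes, n_pulses, sim_time_ns, seed):
    """Draw random pulses; the same seed always gives the same campaign."""
    rng = random.Random(seed)
    pulses = []
    for _ in range(n_pulses):
        let = rng.uniform(1.0, 60.0)  # assumed LET range, uniform for simplicity
        pulses.append(
            SeePulse(rng.choice(nodes), rng.uniform(0.0, sim_time_ns), pulse_width_ns(let))
        )
    return sorted(pulses, key=lambda p: p.start_ns)


def to_events(pulses):
    """Flatten pulses into time-ordered (time, 'force' | 'release', node) tuples."""
    events = []
    for p in pulses:
        events.append((p.start_ns, "force", p.node))
        events.append((p.start_ns + p.width_ns, "release", p.node))
    return sorted(events)


nodes = ["fsm.state[0]", "fsm.state[1]"]  # hypothetical node names
campaign = make_campaign(nodes, n_pulses=5, sim_time_ns=1000.0, seed=1)
for event in to_events(campaign):
    print(event)
```

Seeding the generator per run makes each of the repeated simulations reproducible, so a failing run can be replayed with the same SEE vectors.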



Fig. 4. Voltage shift versus time under different LET.

Fig. 5. Pulse width versus fanout under different LET.



Fig. 6. Schematic of the TCAD simulation.

Fig. 7. Iterations of design and verification.

Design and verification alternate on a module-by-module basis; the iterative design flow is shown in Figure 7. During the reliability verification of a module, the module under design is described at RTL level, while the other modules remain at the behavioral level. To keep the system error rate within the design requirements, the error rate of each component needs to be 1-2 orders of magnitude lower than the system requirement. At the end of the design, the system reliability is verified on the netlist with the parasitic parameters extracted after placement and routing.

## 3. Results and Conclusions

Table 1 compares the circuit before and after TMR deployment. Without optimization, the power and area after TMR are three times larger than before; they reach an acceptable level after optimization through module-level and system-level design and verification iterations. The MTTF of the circuit is measured as the number of SEEs that occur before the system fails. According to Table 1, the system without TMR is susceptible to SEEs and fails after fewer than 100 of them. With TMR, the system still runs after $10^7$ SEEs, which corresponds to more than 5 years of operation. The error rate is measured as the number of data errors per SEE. In the original design, the system is prone to failure and the error rate is too high to be quantified robustly. In the design with TMR deployed, the error rate is low (about 1 data error per 7 SEEs), and the equivalent error rate in the working environment is lower than $10^{-9}$.
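As a rough consistency check on these figures, and assuming that the correspondence between $10^7$ SEEs and more than 5 years refers to the average SEE rate in the working environment, the implied bound on that rate can be written out. The symbol $R_{\mathrm{SEE}}$ is introduced here only for this check.

$$
R_{\mathrm{SEE}} < \frac{10^{7}\ \text{SEEs}}{5\ \text{yr}} = 2\times10^{6}\ \text{SEEs/yr} \approx \frac{10^{7}\ \text{SEEs}}{1.58\times10^{8}\ \text{s}} \approx 0.06\ \text{SEEs/s}
$$

Similarly, the Mid error rate of the readout system after TMR in Table 1, 0.135 data errors/SEE, corresponds to $1/0.135 \approx 7.4$ SEEs per data error.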

| Metric                                     | TMR    | Super Region Readout Unit | Elastic Buffer    | Readout System    |
|--------------------------------------------|--------|---------------------------|-------------------|-------------------|
|                                            |        | 1                         | 1                 | 5                 |
| Area ($mm^2$)                              | Before | 1.75                      | 6.37              | 43.47             |
|                                            | After  | 2.20                      | 7.93              | 50.56             |
| MTTF (SEEs to failure) (Min/Mid/Max)       | Before | 3/13/48                   | 5/16/78           | 12/18/32          |
|                                            | After  | $>10^7$                   | $>10^7$           | $>10^7$           |
| Error rate (data errors/SEE) (Min/Mid/Max) | Before | >1000                     | >1000             | >1000             |
|                                            | After  | 0.015/0.085/0.185         | 0.048/0.055/0.059 | 0.050/0.135/0.219 |

Table 1. Comparison before and after TMR deployment.
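The area cost of the optimized deployment can be read directly from Table 1. The short Python sketch below computes the relative area overhead per block from the Before/After values; the dictionary layout and function name are chosen for illustration only.

```python
# Area in mm^2 from Table 1: (before TMR, after optimized TMR deployment)
area_mm2 = {
    "Super Region Readout Unit": (1.75, 2.20),
    "Elastic Buffer": (6.37, 7.93),
    "Readout System": (43.47, 50.56),
}


def overhead_percent(before: float, after: float) -> float:
    """Relative area increase in percent."""
    return 100.0 * (after - before) / before


overheads = {name: overhead_percent(*areas) for name, areas in area_mm2.items()}
for name, value in overheads.items():
    print(f"{name}: +{value:.1f} %")
```

The overheads come out at roughly 26 %, 24 % and 16 %, compared with the increase of at least 200 % that triplicating every cell would imply.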

The TMR deployment strategy is a low-cost method for designing high-reliability circuits. The SEE model and the verification method guide the design towards an optimum that meets the requirements. The circuit designed with this optimization is now in fabrication, and its reliability will be tested under particle irradiation. By iterating TMR design and verification, a complete system that is insensitive to SEE can be designed and optimized for a specific particle environment.

## References

- [1] M. Koziel, S. Amar-Youcef, N. Bialas, M. Deveaux, I. Fröhlich, Q. Li, J. Michel, et al., The prototype of the Micro Vertex Detector of the CBM Experiment, Nucl. Inst. Meth. A 732 (2013) 515.
- [2] O. Ruano, J. A. Maestro and P. Reviriego, A Methodology for Automatic Insertion of Selective TMR in Digital Circuits Affected by SEUs, IEEE Trans. Nucl. Sci. 56 (2009).
- [3] H. R. Schmidt, The silicon tracking system of the CBM experiment at FAIR, Nucl. Inst. Meth. A 936 (2019) 630.
- [4] S. Kulis, Single Event Effects Mitigation with TMRG Tool, J. Instrum. 12 (2017).