

# Highly-linear FPGA-based Data Acquisition System for Multi-channel SiPM Readout

Todd Townsend<sup>1</sup>, Yuxuan Tang, Jinghong Chen

Dept. of Electrical and Computer Engineering, University of Houston N308 Engineering Bldg 1, 4726 Calhoun Road, Houston, TX 77204-4005, USA E-mail: jttownsend@uh.edu, ytang11@uh.edu, jchen70@central.uh.edu

A 32-channel, 15 ps resolution, Kintex-7 FPGA-based data acquisition (DAQ) system for time-of-flight (TOF) and time-over-threshold (TOT) SiPM readout is demonstrated along with a comparison to previous works. Focusing on modern FPGA concerns such as clock skew and bin realignment, the implementation difficulties of FPGA-based TDCs are discussed, including bubble error, zero-length bins, inter-clock region nonlinearity, and chain overflow. Linearity of the TDC is improved by multichain averaging, with a comparison of 1, 2, and 4 chains pre- and post-calibration. Measurement results of the proposed TDC include an 11 ps mean bin size, a differential nonlinearity (DNL) of less than 4 ps, and an integral nonlinearity (INL) of less than 10 ps.

Topical Workshop on Electronics for Particle Physics (TWEPP 2019) 2-6 September 2019 Santiago de Compostela, Spain

<sup>1</sup>Speaker

© Copyright owned by the author(s) under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0).

## 1. Introduction

High-linearity, high-resolution DAQ systems for multi-channel readout of SiPMs are required in many high-energy physics experiments. SiPMs offer distinct advantages compared to photomultiplier tubes (PMTs), including insensitivity to magnetic fields, very fast response, low bias voltage, and miniaturization. FPGA advancements continue to narrow the performance gap between ASIC-based and FPGA-based Time-to-Digital Converters (TDCs) in DAQ applications. Although FPGAs are limited to a predefined fabric structure and architecture, linearization techniques such as wave union and multichain averaging are more easily implemented thanks to an abundance of carry chains and digital signal processing (DSP) blocks. Common implementation difficulties have plagued FPGA-based TDCs since the 1990s, including bubble error within the thermometer-to-binary encoder, inter-clock region nonlinearity, and calibration [1]. Modern FPGAs exhibit new challenges because advancements in fabric speed narrow the difference between gate delay and path delay, which causes clock skew [2]. In this paper, an implementation of a 32-channel DAQ with a 15 ps resolution TDC using a Xilinx Kintex-7 FPGA is presented along with an examination of each difficulty.

## 2. FPGA-based TDC Design and Methodology

Although ASIC-based TDCs remain superior to FPGA-based designs in both resolution (< 2 ps) and linearity (< 0.1 LSB), FPGA-based designs have the upper hand in design-time minimization, revision flexibility, and reduced initial investment [3]. For particle physics experiments, FPGA-based TDCs often meet or exceed the relatively relaxed resolution requirements (tens of picoseconds) and can meet the linearity requirements if care is taken to remove, cancel, or average out the imperfections of the FPGA fabric. TDC linearity is a primary concern for particle physics TOT and TOF measurements, since any nonlinearity could skew the results and cause errors in the resulting conclusions. For this reason, a process methodology is followed that starts from an initial basic design, identifies each source of nonlinearity, and eliminates each one independently. Figure 1 (a) details the initial basic TDC design with a "pure carry" tapped delay line (TDL) and a single register-and-latch stage. Among the 9 referenced designs from past literature, 4 are derived from the "pure carry" TDL due to its simplicity and improved linearity versus the Vernier TDL. For the TDL, Xilinx offers a "Carry4" look-up table (LUT) primitive which is portable across all Xilinx families and is typically utilized for sub-cycle arithmetic. As a pulse signal feeds in, the on-board clock acts as a stop pulse. Notice the basic pulse detector, which triggers from an asynchronous, single flip-flop, rising-edge detector and a coarse state count of 1. The 400 MHz clock is utilized for state-machine operation and for TDC coarse clock counting in 2.5 ns increments between pulses. A sync signal is utilized for resetting the coarse counter. Figure 1 (b) shows the resulting code-density test results with bin (or tapped delay) number on the X axis and bin count (or delay length) on the Y axis. A code-density test feeds an independent clock into the Pulse input, so each bin should theoretically fill up equally. Any difference from equal bins indicates nonlinearity. Notice the 4 primary sources of nonlinearity: detector design, which causes zero-length bins at the front end; *clock skew*, which causes evenly distributed zero bins throughout; encoder error from thermometer code to binary; and clock region crossing, where the tapped delay line crosses a clock region boundary and incurs a large additional delay compared with a typical tapped delay.
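The way a code-density histogram translates into bin widths, DNL, and INL can be sketched in a few lines of Python. This is a minimal illustration rather than the processing used for the measurements: the function name `code_density_linearity`, the choice to express DNL and INL in picoseconds relative to the mean bin width, and the 4-bin example histogram are all assumptions.

```python
def code_density_linearity(counts, window_ps):
    """Estimate bin widths, DNL and INL (all in ps) from code-density hits.

    counts    -- hits recorded in each TDL bin (hypothetical data)
    window_ps -- time span covered by all bins together, in ps
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("histogram is empty")
    # Hits from an uncorrelated clock are uniform in time, so a bin's share
    # of the hits is proportional to its share of the time window.
    widths = [window_ps * c / total for c in counts]
    ideal = window_ps / len(counts)       # mean bin width
    dnl = [w - ideal for w in widths]     # per-bin deviation from the mean
    inl = []
    running = 0.0
    for d in dnl:                         # INL is the running sum of DNL
        running += d
        inl.append(running)
    return widths, dnl, inl


# Hypothetical 4-bin histogram over a 40 ps window; bin 1 is a zero-length bin.
widths, dnl, inl = code_density_linearity([100, 0, 200, 100], 40.0)
```

In this toy histogram the empty bin shows up as a width of 0 ps, and the bin after it absorbs the missing hits, which is the signature of the zero-length bins discussed above.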



Figure 1: (a) Initial TDC design with 400 MHz coarse clock and single register-and-latch stage; (b) Initial TDC code-density results with independent components of nonlinearity

## 3. Linearity Improvements

Each of the identified components of nonlinearity in Figure 1 (b) can be improved with specific adjustments to the TDC design. Zero-length front-end bins are caused by a lagging enable signal provided to the latching stage of registers. This can be improved by shifting the enable or "valid" signal forward in relation to the input pulse. Figure 2 details a synchronized-enable pulse detector with the "valid" signal aligned with the pulse [4]. This approach also allows for a pipelined TDC implementation with reduced dead-time and an increased coarse clock rate. The measured average bin duration is 11 ps. Each clock region has a maximum tapped delay count of 200. The coarse clock can now be increased to 500 MHz (2 ns period) to avoid crossing the clock region boundary and still leave a healthy 10% margin of additional bins for process, voltage, and temperature drift.
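As a rough check of the clock-rate choice, the number of 11 ps bins needed to span one coarse clock period can be compared against the 200-tap limit of a clock region. The symbols $N_{400}$ and $N_{500}$, and the definition of the margin relative to the bins in use, are assumptions made for this illustration:

$$
\begin{aligned}
N_{400} &= \frac{2500\ \text{ps}}{11\ \text{ps}} \approx 227 > 200, \\
N_{500} &= \frac{2000\ \text{ps}}{11\ \text{ps}} \approx 182 < 200, \\
\text{margin} &= \frac{200 - 182}{182} \approx 10\%.
\end{aligned}
$$

Under these assumptions, a 400 MHz coarse clock would need more taps than one clock region provides, while a 500 MHz clock fits with roughly 18 spare bins.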

Clock skew and bubble error cause zero-length bins to be spread over the full TDL bin window, often resulting in an odd-even effect with every other bin at approximately zero length. Clock skew is a common phenomenon in < 40 nm fabric technology because the smaller delay cells become closer in duration (< 40 ps) to the path delay (< 30 ps) from the global clock to the register clock input [2]. Registers (or bins) which receive the latching clock (or TDC stop signal) early may latch a "0" prematurely, while a downstream register with a late-arriving clock latches a "1", causing a bubble. Figure 3 (a) shows an implementation view of the clock signal feeding into each Carry4 SLICEL register. This simplified representation may differ from the actual layout, but it illustrates the reason behind the non-homogeneous path lengths. Figure 3 (b) shows a post-implementation simulation of the average time from the global clock source to each SLICEL register clock input.
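The bubble mechanism can be reproduced with a small behavioral model in Python. This is a sketch under stated assumptions, not a model of the actual fabric: the function `sample_tdl`, the uniform 11 ps tap delay, and the 20 ps skew applied to every other register are hypothetical values chosen only to make the effect visible.

```python
def sample_tdl(edge_time_ps, tap_delay_ps, clock_arrival_ps):
    """Return the thermometer code latched by a tapped delay line.

    The pulse reaches tap i at edge_time_ps + (i + 1) * tap_delay_ps.
    Register i latches a 1 only if the pulse has already arrived when its
    own copy of the stop clock arrives at clock_arrival_ps[i].
    """
    code = []
    for i, t_clk in enumerate(clock_arrival_ps):
        t_edge_at_tap = edge_time_ps + (i + 1) * tap_delay_ps
        code.append(1 if t_edge_at_tap <= t_clk else 0)
    return code


# Ideal case: every register sees the stop clock at 50 ps.
ideal = sample_tdl(0.0, 11.0, [50.0] * 8)

# Skewed case (hypothetical): every other register sees the clock 20 ps early.
skewed = sample_tdl(0.0, 11.0, [30.0, 50.0] * 4)

print(ideal)   # [1, 1, 1, 1, 0, 0, 0, 0]
print(skewed)  # [1, 1, 0, 1, 0, 0, 0, 0]
```

In the skewed case the third register is clocked before the pulse reaches it and latches a 0, while the fourth register is clocked later and latches a 1, leaving a single-bit bubble in the thermometer code.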



Figure 2: Synchronized-enable pulse detector



Figure 3: (a) Xilinx representation of global clock path to each register clk input within SLICEL; (b) Average post-layout simulation results indicating path length from global clock to register clk inputs

Notice that register locations 1 and 3 within a SLICEL will always latch before the registers located at 2 and 4. A closer review of the measured bin lengths in Figure 1 (b) reveals this same odd-even effect, with every other odd bin at zero length. Using a runtime Integrated Logic Analyzer (ILA), the TDL thermometer code is reviewed on the fly to determine the degree of bubble error, that is, the number of zeros before the final "1". Only first-order error exists, indicating that a simple "if" statement may be added to the "Highest '1' (native)" encoder to decrement the binary output by 1 if the highest "1" has a "0" just prior [6]. This allows the missed bins to fill properly as desired. Double buffering of the first-stage registers is also implemented to improve metastability behavior.
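The decrement rule described above can be sketched in Python as follows. The actual encoder is implemented in FPGA logic; the function name `encode_thermometer`, the list-of-bits representation, and the convention that an all-zero code returns 0 are assumptions made for this illustration.

```python
def encode_thermometer(code):
    """Convert a thermometer code (list of 0/1, first tap first) to a bin number.

    A "highest 1" encoder reports the position of the last 1 in the code.
    If the bit just before that 1 is a 0 (a first-order bubble), the result
    is decremented by 1, following the rule described in the text.
    """
    highest = -1
    for i, bit in enumerate(code):
        if bit:
            highest = i
    if highest < 0:
        return 0                      # no hit captured (assumed convention)
    count = highest + 1               # naive "highest 1" result
    if highest >= 1 and code[highest - 1] == 0:
        count -= 1                    # first-order bubble correction
    return count


clean = encode_thermometer([1, 1, 1, 0, 0, 0, 0, 0])
bubbled = encode_thermometer([1, 1, 0, 1, 0, 0, 0, 0])
print(clean, bubbled)  # 3 3
```

Without the correction, the bubbled code would be reported as bin 4, so the bin below it would never be filled. With the correction, both codes map to bin 3.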

The final improvement is multichain averaging: the input feeds 4 TDLs simultaneously, and their TDC outputs are averaged to produce one final result. Figure 4 (a) illustrates the chain of 4 TDLs, and Figures 4 (b) and 4 (c) show the improvement from 1 chain to 4 for DNL and RMS.
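A minimal Python sketch of the averaging step is given here. The text does not state whether averaging is applied to raw codes or to calibrated times, so this example assumes each chain has its own calibration table of bin-centre times. The function name `multichain_time` and all table values are hypothetical.

```python
def multichain_time(codes, bin_centres_ps):
    """Average the fine-time estimates of parallel delay chains for one hit.

    codes          -- codes[k] is the bin number reported by chain k
    bin_centres_ps -- bin_centres_ps[k][code] is the calibrated time (ps)
                      of that bin in chain k (hypothetical calibration data)
    """
    if not codes:
        raise ValueError("at least one chain is required")
    times = [bin_centres_ps[k][code] for k, code in enumerate(codes)]
    return sum(times) / len(times)


# Hypothetical calibration tables for 4 chains with 3 bins each.
tables = [
    [5.0, 16.0, 27.0],
    [4.0, 14.0, 26.0],
    [6.0, 18.0, 28.0],
    [5.0, 16.0, 27.0],
]
fine_time = multichain_time([1, 1, 1, 1], tables)
print(fine_time)  # 16.0
```

Because the bin boundaries of the chains do not coincide, their individual errors partly cancel in the average, which is the intuition behind the DNL improvement from 1 chain to 4.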



Figure 4: (a) 4-chain parallel TDL; (b) DNL comparison of 1-chain, 2-chain, and 4-chain TDC; (c) RMS comparison of 1-chain vs 4-chain TDC with X axis offset due to significant INL in just 1 chain

## 4. Results

Post-calibration results, including the linearity optimization approaches described above, are highlighted in Table 1 with 9 referenced designs for comparison. This work achieves better INL and DNL than all of the referenced designs, along with a good average RMS resolution of 15 ps.

| Ref. | Method                              | Year | RMS Resolution | Tech. | Integral Nonlinearity (INL)      | Differential Nonlinearity (DNL) |
|------|-------------------------------------|------|----------------|-------|----------------------------------|-------------------------------|
|      | This work (multichain averaging)    | 2019 | 15ps           | 28nm  | -4.65ps, +9.59ps                 | -3.18ps, +3.46ps              |
| [7]  | Matrix of counters                  | 2017 | 7.4ps          | 65nm  | 11.6ps                           | 5.5ps                         |
| [8]  | Multichain averaging                | 2015 | 4.2ps          | 40nm  | -28.7ps, +18.2ps                 | -2.9ps, +11.72ps              |
| [9]  | Multi-phase clocks                  | 2012 | 625ps          | 65nm  | 31.25ps                          | 31.25ps                       |
| [10] | Ring oscillators                    | 2008 | 40ps           | 90nm  | <40ps                            | <40ps                         |
| [11] | Vernier TDL                         | 1997 | 129ps          | 650nm | 46ps                             | -144ps, +214ps                |
| [12] | Pure Carry TDL                      | 2009 | 17ps           | 65nm  | -51ps, +43.86ps                  | -17ps, +60.35ps               |
| [13] | Pure Carry TDL                      | 2013 | 15ps           | 65nm  | ±60ps                            | -15ps, +45ps                  |
| [14] | Ring oscillators                    | 2017 | 50ps           | 350nm | ±65ps                            | ±35.75ps                      |
| [1]  | Vernier TDL                         | 1997 | 200ps          | 650nm | <200ps                           | -94ps, +88ps                  |

Table 1: Final results compared to 9 previous works arranged by INL

## 5. Conclusion

The design of a highly linear, 32-channel FPGA TDC for use in a DAQ for SiPM readout has been described, with each component of nonlinearity identified. Linearity optimization is discussed with implementation of multichain averaging, bin realignment, avoidance of clock region crossing, and use of a synchronized-enable pulse detector. The implementation indicates that an FPGA-based TDC for TOT and TOF particle physics experimentation is possible if care is taken to minimize the inherent nonlinearity of the FPGA fabric.

## References

- [1] J. Kalisz, R. Szplet, J. Pasierbinski and A. Poniecki, "Field-Programmable-Gate-Array-Based Time-to-Digital Converter with 200-ps Resolution," *IEEE Trans. Instrum. Meas.*, vol. 46, no. 1, pp. 51-55, Feb. 1997.
- [2] Y. Wang and C. Liu, "A 3.9 ps Time-Interval RMS Precision Time-to-Digital Converter Using a Dual-Sampling Method in an UltraScale FPGA," *IEEE Transactions on Nuclear Science*, vol. 63, no. 5, pp. 2617-2621, 2016.
- [3] X. Liu, L. Ma, J. Xiang, N. Yan, H. Xie and X. Cai, "A Low Power TDC with 0.5ps Resolution for ADPLL in 40nm CMOS," IEEE, Shanghai, 2015.
- [4] H. Homulle and E. Charbon, "Basic FPGA TDC Design," TUDelft, 2015. [Online]. Available: https://cas.tudelft.nl/fpga\_tdc/TDC\_basic.html. [Accessed April 2019].
- [5] G. Cao, H. Xia and N. Dong, "An 18-ps TDC Using Timing Adjustment and Bin Realignment Methods in a Cyclone-IV FPGA," *Rev. Sci. Instrum.* 89, 054707, 2018.
- [6] Z. Jaworski, "Verilog HDL Model Based Thermometer-to-Binary Encoder with Bubble Error Correction," *MIXDES*, pp. 249-254, 23-25 June 2016.
- [7] M. Zhang, H. Wang and Y. Liu, "A 7.4 ps FPGA-Based TDC with a 1024-Unit Measurement Matrix," Sensors 2017, 17, 865, pp. 1-18, 2017.
- [8] Q. Shen, S. Liu, B. Qi, Q. An, S. Liao, P. Shang, C. Peng and W. Liu, "A 1.7 ps Equivalent Bin Size and 4.2 ps RMS FPGA TDC Based on Multichain Measurements Averaging Method," *IEEE Transactions on Nuclear Science*, vol. 62, no. 3, June 2015.
- [9] A. Balla, M. Beretta, P. Ciambrone, M. Gatta, F. Gonnella, L. Iafolla, M. Mascolo, R. Messi, D. Moricciani and D. Riondino, "Low Resource FPGA-based Time to Digital Converter," *To be submitted to IEEE Transaction on Instrumentation and Measurement*, pp. 1-7, 2012.
- [10] S. Junnarkar, P. O'Connor and R. Fontaine, "FPGA Based Self Calibrating 40 Picosecond Resolution, Wide Range Time to Digital Converter," 2008 IEEE Nuclear Science Symposium Conference Record, pp. 3434-3439, 2008.
- [11] J. Kalisz, R. Szplet, R. Pelka and A. Poniecki, "Single-chip Interpolating Time Counter with 200-ps Resolution and 43-s Range," *IEEE Trans. Instrum. Meas.*, vol. 46, no. 4, pp. 851-856, Aug. 1997.
- [12] C. Favi and E. Charbon, "A 17ps Time-to-Digital Converter Implemented in 65nm FPGA Technology," FPGA '09, pp. 22-24, 2009.
- [13] L. Zhao, X. Hu, S. Liu, J. Wang, Q. Shen, H. Fan and Q. An, "The Design of a 16-Channel 15 ps TDC Implemented in a 65 nm FPGA," *IEEE Transactions on Nuclear Science*, vol. 60, no. 5, pp. 3532-3536, 2013.
- [14] A. A. Muntean, "Design of a fully digital analog SiPM with sub-50ps time conversion," Delft University of Technology Masters Thesis, pp. 1-67, 2017.