Alma Mater Studiorum  $\cdot$  University of Bologna

School of Science Department of Physics and Astronomy Master Degree in Physics

# FPGA-based hardware demonstrator of a Hough Transform pattern recognition algorithm for the ATLAS Phase-II trigger upgrade

Supervisor:

Prof. Alessandro Gabrielli

**Co-supervisor:** 

Dr. Fabrizio Alfonsi

Submitted by: Alice Santarelli

Academic Year 2020/2021

Those who can imagine anything, can create the impossible. (Alan Turing)

# Abstract

The advent of the High-Luminosity (HL) phase of the Large Hadron Collider (LHC) at CERN will make it possible to study, with significantly improved sensitivity, known mechanisms predicted by the Standard Model as well as new, rarer processes that could be a sign of physics Beyond the Standard Model. In this new operational phase, the increase in luminosity will produce the larger datasets of proton-proton collisions that these studies demand. The LHC complex is planned to deliver in 2027 a luminosity of up to $7.5 \times 10^{34} \ cm^{-2} s^{-1}$, corresponding to $\mu = 200$ events per bunch crossing, with the ultimate goal of providing an integrated luminosity of up to $3000/4000 \ fb^{-1}$. To cope with these new conditions, the detectors placed at the four interaction points of the LHC require an upgrade. In particular, the higher number of signals produced inside the detectors would eventually make the trigger and readout electronics currently in use in these experiments obsolete; for this reason, new strategies for data acquisition and processing will be necessary. This Master Thesis discusses the Phase-II upgrade of the Trigger and Data Acquisition system of the ATLAS (A Toroidal LHC ApparatuS) detector. An R&D programme, which included the creation of two task forces, was launched in Spring 2021 with the aim of producing one engineered solution for track reconstruction at the Event Filter (EF) level. The Electronics group of the University of Bologna is taking part in the proposal of the heterogeneous commodity task force, which consists of the previous project called Hardware Tracking for the Trigger. The heterogeneous solution is based on a mixed commodity platform of classic processors and accelerators, where track reconstruction is expected to be performed by means of mathematical functions, for example through the implementation of a Hough Transform (HT) algorithm on the FPGA that will be part of the EF of ATLAS Phase-II. The R&D activities performed so far include the development of the firmware for the HT. The final goal of this Master Thesis is the creation of a hardware demonstrator able to test the ongoing firmware design. To this end, a new firmware architecture is used, which relies on a manageable Peripheral Component Interconnect Express (PCIe) transmission. The two firmware designs are integrated through a structure based on two first-in-first-out (FIFO) memories. In this way, the correct implementation of the ongoing Hough Transform firmware design is demonstrated.

# Contents

- Introduction
- Contents
- 1 CERN and LHC
  - 1.1 Structure of LHC
    - 1.1.1 LHC parameters
  - 1.2 LHC schedule
- 2 ATLAS
  - 2.1 Coordinate system
  - 2.2 Detector composition
    - 2.2.1 Inner Detector
    - 2.2.2 Calorimeters
    - 2.2.3 Muon Spectrometer
    - 2.2.4 Magnetic System
    - 2.2.5 Forward detectors
  - 2.3 Trigger and Data Acquisition system
- 3 ATLAS Phase-II upgrade
  - 3.1 Upgrade proposals
    - 3.1.1 Inner Tracker
    - 3.1.2 High Granularity Timing Detector
    - 3.1.3 Calorimeter
    - 3.1.4 Muon Spectrometer
  - 3.2 Trigger and the Data Acquisition
    - 3.2.1 Baseline Architecture
    - 3.2.2 Evolved Architecture
    - 3.2.3 Hardware Tracking for the Trigger
    - 3.2.4 A new baseline
- 4 Hough Transform for tracking
  - 4.1 TDAQ overview
    - 4.1.1 Commodity accelerators
  - 4.2 System overview and functional blocks
  - 4.3 Hough transform for particle tracking
    - 4.3.1 Hough transform for circles
    - 4.3.2 Implementation of the HT
    - 4.3.3 The accumulator
- 5 Demonstrator development for HT algorithm
  - 5.1 Hardware framework of Bologna
    - 5.1.1 Hardware HT algorithm
  - 5.2 Hardware device for data transmission
    - 5.2.1 Experimental Set-up
    - 5.2.2 Data transmission through PCIe
  - 5.3 Preliminary tests
    - 5.3.1 PCIe FIFOs
    - 5.3.2 FIRST and SECOND FIFOs
    - 5.3.3 Bits mapping
  - 5.4 Final tests
- A FPGA
- B PCIe
- C FIFO

# Chapter 1

# CERN and LHC

CERN, from the French "Conseil Européen pour la Recherche Nucléaire" (European Council for Nuclear Research), was founded in 1952 with the aim of establishing a fundamental physics research organization in Europe. The laboratory sits astride the Franco-Swiss border near Geneva. Today, physicists and engineers from around the world use it to probe the fundamental structure of the universe with the world's largest and most complex scientific instruments, which push the limits of technology.

## 1.1 Structure of LHC

The Large Hadron Collider (LHC) project started with the aim of designing a high energy physics collider able to deliver a centre-of-mass energy higher than that of the Large Electron-Positron (LEP) collider and of the Tevatron. The purpose was to investigate the nature of electroweak symmetry breaking and to search for physics beyond the Standard Model at the TeV scale. The construction of this particle accelerator was approved by the CERN Council in December 1994 [1]. The machine has obtained important achievements during its years of running; one of the most important was the discovery of the Higgs boson in 2012 [2], whose properties are under continuous study in order to confirm the Standard Model predictions and to search for new physics phenomena.

The LHC is located about 100 m underground near Geneva (Fig. 1.1), in the tunnel that previously hosted the LEP collider. It consists of a 26.7 km ring of superconducting magnets with a number of accelerating structures to boost the energy of the particles along the way. Unlike previous particle-antiparticle colliders, in which both beams share the same phase space in a single ring, the LHC is based on proton-proton (pp) collisions. The two counter-rotating proton beams are currently accelerated to a centre-of-mass energy of $\sqrt{s} = 13$ TeV, a value that will be increased to 14 TeV in the next Run upgrade. The LHC also collides heavy ions (A), in particular lead nuclei, at 5.5 TeV per nucleon pair.

Figure 1.1: The LHC underground position at the French-Swiss border [3][4].

In order to reach these energies, the beam passes through several acceleration stages, as illustrated in Fig. 1.2.

Protons are produced by stripping electrons from hydrogen supplied by small $H_2$ silos, and subsequently enter Linac2, where their energy is raised to 50 MeV. The circular Proton Synchrotron Booster (PSB) accelerates them to 1.4 GeV, and in the Proton Synchrotron (PS) they reach 25 GeV. The Super Proton Synchrotron (SPS) follows, where they acquire an energy of 450 GeV. In the final step the protons are transferred to the LHC, where each beam is accelerated to 6.5 TeV. For heavy ions [5], instead, a linear accelerator called Linac3 brings the lead ions to an energy of 4.5 MeV/n and the Low Energy Ion Ring (LEIR) accelerates them to 72 MeV/n. They then follow the same path as the protons, reaching 5.9 GeV/n in the PS and 177 GeV/n in the SPS before entering the LHC. In the last stage, the LHC accelerates the lead ions to 1.38 TeV/n. The acceleration of the charged particles is performed by a set of Radio Frequency (RF) cavities, whose task is also to compensate for the synchrotron energy loss. This phenomenon occurs when a charged particle is accelerated in a circular collider: it emits electromagnetic radiation, which should be limited as much as possible because it induces a relevant loss of energy. The RF cavities focus the packets of protons (bunches) along the beam pipe. The proton beams are kept on a circular track by a set of 1232 superconducting dipole magnets made with copper-clad niobium-titanium cables. Superconductivity is mandatory to reach the magnetic field needed to achieve the required centre-of-mass collision energy. To focus the protons perpendicularly to the beam pipe, a set of 858 quadrupole magnets is used; they are placed one next to another and perpendicularly with respect to the poles. Furthermore, other multi-pole magnets are used all over the LHC. In order to sustain the two-ring architecture of the LHC, twin-bore magnets consisting of two sets of coils are used. The system requires a low temperature of 2 K, which is achieved through the use of liquid helium.

Figure 1.2: The CERN accelerator system.

Proton beams can circulate for many hours inside the LHC under normal operating conditions. Unlike the LEP collider, which had eight crossing points, the LHC has four interaction points, which host four experiments that received official approval for their construction between 1996 and 1998. Each of these four experiments, all currently active at the LHC, has different physics goals:

- ATLAS (A Toroidal LHC ApparatuS) [6] is a multi-purpose experiment built for probing pp (and lead-lead) collisions. This thesis work is related to this detector and, for this reason, a more detailed description is provided in the next chapters.
- CMS (Compact Muon Solenoid) [7] is a multi-purpose experiment that was conceived, similarly to ATLAS, to study pp (and lead-lead) collisions. It is 21 m long and 15 m in diameter, with a weight of about 14,000 t. Although CMS has the same scientific goals as the ATLAS experiment, it uses different technical solutions and a different magnet-system design. The detector is built around a huge solenoid magnet that generates a field of 4 T. This structure surrounds an all-silicon pixel and strip tracker, a lead-tungstate scintillating-crystal electromagnetic calorimeter and a brass-scintillator sampling hadron calorimeter. The iron yoke of the flux return is equipped with four stations of muon detectors covering most of the $4\pi$ solid angle.
- LHCb (LHC-beauty) [8] is a dedicated apparatus for pp collisions, devoted to precision measurements of CP violation and rare decays of B hadrons. Instead of surrounding the entire collision point with an enclosed detector, as in ATLAS and CMS, the LHCb experiment uses a series of subdetectors to detect mainly particles thrown forwards by the collision in one direction. The first subdetector is mounted close to the collision point, with the others following one behind the other over a length of 20 m. Starting from the interaction point, the subdetectors are placed in this order: a tracker, a ring imaging Cherenkov detector (RICH), other trackers, another RICH, an electromagnetic calorimeter, a hadronic calorimeter and a muon detector.
- ALICE (A Large Ion Collider Experiment) [9] is a general-purpose heavy-ion detector which investigates the strong-interaction sector of the Standard Model, QCD. With a weight of 10,000 t, the detector is 26 m long, 16 m high and 16 m wide. It is designed to study strongly interacting matter and the quark-gluon plasma at extreme values of energy density and temperature in nucleus-nucleus collisions. The physics programme of this experiment includes not only lead-ion and proton running, but also collisions of lighter ions, lower-energy running and dedicated proton-nucleus runs. It is composed of 18 detectors surrounding the collision point, including a time projection chamber (TPC), a transition radiation chamber, a "time of flight" detector, electromagnetic and hadronic calorimeters and a muon spectrometer.

### 1.1.1 LHC parameters

After the whole pre-LHC chain, the protons reach an energy of 450 GeV and are injected in packets (bunches) into the LHC. Each bunch contains $\approx 1.2 \times 10^{11}$ protons, is $\sim 7.55$ cm long and is squeezed radially to 16.7 $\mu$m. The final beam consists of 2808 bunches of protons. The two beams travel with the same energy and speed along the same direction but in opposite senses. The number of proton-proton collisions per bunch crossing (pileup) should be as high as possible ($\mu \sim 15-50$ during Run 1 and Run 2, and $\mu \sim 150-200$ targeted for the High Luminosity LHC, presented in Section 1.2). The focusing inside the LHC also increases the number of collisions. Bunches cross every 25 ns; this is an important parameter, since it implies that all the detectors at the LHC need to conform to the 40 MHz collision frequency. The energy reached by the protons at the end of the acceleration chain can be calculated with the relativistic formulation, using a Lorentz factor of 7460. The LHC has been designed to reach a peak instantaneous luminosity of $10^{34} \ cm^{-2} \ s^{-1}$ ($10^{27} \ cm^{-2} \ s^{-1}$ for lead nuclei) and a centre-of-mass energy of 14 TeV (5.5 TeV for lead nuclei). The instantaneous luminosity expresses the collider performance, as well as the capability of the apparatus to generate physics events, based on the energy and density of the particles. It is defined as:

$$\mathcal{L} = f \frac{n_1 \cdot n_2}{4\pi \cdot \sigma_x \cdot \sigma_y} F \tag{1.1}$$

where $n_1$ and $n_2$ are the numbers of particles in the two colliding bunches, $f$ is the collision frequency of the bunches and $\sigma_x, \sigma_y$ are the transverse dimensions of the beam. Eq. 1.1 can also be expressed in terms of the number of bunches inside the ring ($n_b$), the number of particles per bunch ($N_b$), the revolution frequency of the bunches in the accelerator ($f_{rev}$, that is 11.2 kHz), the relativistic Lorentz factor ($\gamma_r$), the normalized transverse beam emittance ($\epsilon_n$, that is 3.75 $\mu$m), the beta function at the collision point ($\beta^*$, that is 0.55 m) and the geometric luminosity reduction factor due to the crossing angle of the two beams at the interaction point ($F$):

$$\mathcal{L} = \frac{1}{\sigma} \frac{dN}{dt} = \frac{N_b^2 \cdot n_b \cdot f_{rev} \cdot \gamma_r}{4\pi \cdot \epsilon_n \cdot \beta^*} F \tag{1.2}$$
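As a rough numerical check of Eq. 1.2, the following minimal sketch plugs in the parameter values quoted in this section. The function name and the default geometric factor $F = 1$ are assumptions made for illustration only; the actual value of $F$ is smaller than one and is not given in the text.

```python
import math


def peak_luminosity_cm2s(
    protons_per_bunch: float,
    n_bunches: int,
    f_rev_hz: float,
    gamma_r: float,
    emittance_n_m: float,
    beta_star_m: float,
    geometric_factor: float = 1.0,  # assumption: F is not quoted in the text
) -> float:
    """Evaluate Eq. 1.2 and return the luminosity in cm^-2 s^-1."""
    lumi_m2s = (
        protons_per_bunch**2 * n_bunches * f_rev_hz * gamma_r
        / (4.0 * math.pi * emittance_n_m * beta_star_m)
    ) * geometric_factor
    return lumi_m2s * 1.0e-4  # 1 m^-2 = 1e-4 cm^-2


# Parameter values quoted in Section 1.1.1
lumi = peak_luminosity_cm2s(
    protons_per_bunch=1.2e11,
    n_bunches=2808,
    f_rev_hz=11.2e3,
    gamma_r=7460.0,
    emittance_n_m=3.75e-6,
    beta_star_m=0.55,
)
print(f"Peak luminosity with F = 1: {lumi:.2e} cm^-2 s^-1")
```

With these inputs the result is of the order of $10^{34} \ cm^{-2} \ s^{-1}$, consistent with the design value quoted above; a geometric factor below one lowers it accordingly.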

The integrated luminosity is obtained by integrating the instantaneous luminosity over time,

$$L = \int_{\Delta t} \mathcal{L} \cdot dt \tag{1.3}$$

and, together with the cross section, gives the total number of events in a Run:  $N_e = L \cdot \sigma_e$ .
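The relation between Eq. 1.3 and the number of events can be sketched numerically as follows. The constant luminosity profile, the $10^7$ s running time and the 1 pb cross section are hypothetical values chosen only to illustrate the unit conversions; they are not taken from the text.

```python
CM2_PER_INV_FB = 1.0e39  # 1 fb^-1 = 1e39 cm^-2, since 1 fb = 1e-39 cm^2
FB_PER_PB = 1.0e3        # 1 pb = 1000 fb


def integrated_luminosity_fb(samples):
    """Trapezoidal integral of (time [s], luminosity [cm^-2 s^-1]) samples, in fb^-1."""
    total_cm2 = 0.0
    for (t0, l0), (t1, l1) in zip(samples, samples[1:]):
        total_cm2 += 0.5 * (l0 + l1) * (t1 - t0)
    return total_cm2 / CM2_PER_INV_FB


def expected_events(int_lumi_fb, cross_section_pb):
    """N_e = L * sigma_e, with L in fb^-1 and sigma_e in pb."""
    return int_lumi_fb * cross_section_pb * FB_PER_PB


# Hypothetical run: constant 1e34 cm^-2 s^-1 for 1e7 s of data taking
run = [(0.0, 1.0e34), (1.0e7, 1.0e34)]
int_lumi = integrated_luminosity_fb(run)
print(f"Integrated luminosity: {int_lumi:.1f} fb^-1")
print(f"Events for a 1 pb process: {expected_events(int_lumi, 1.0):.0f}")
```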

## 1.2 LHC schedule

The scientific program of the LHC, summarized in Fig. 1.3, spans over many years of operation and includes an ambitious series of future upgrades. In the first period of operation (Run1) the instantaneous luminosity reached was  $7.7 \times 10^{33} \ cm^{-2} s^{-1}$  and the center of mass energy spanned in a range from 900 GeV up to 8 TeV.



Figure 1.3: Timetable for the future years of LHC in terms of different phases. Runs are the working periods of the collider, including data taking by the experiments, while Long Shutdowns represent the stop periods due to upgrades required by the accelerator and detectors.


The bunch spacing was 50 ns, double the design specification. Energy and $\mathcal{L}$ were very promising at that time, both exceeding half of their design targets. With these parameters the Higgs boson was observed in 2012 and, at the beginning of 2013, Run 1 concluded. The Long Shutdown 1 (LS1) followed between 2013 and 2014, during which machine elements were consolidated. The magnet splices were repaired and the collimation scheme was upgraded in order to achieve the design beam energy and luminosity. Since 3 June 2015 the LHC has operated in Run 2 at a centre-of-mass energy of 13 TeV, and it progressively reached a luminosity of $\mathcal{L} = 1 \times 10^{34} \ cm^{-2} \ s^{-1}$ on 26 June 2016. Despite the reduced number of bunches (about 2200 cf. 2800 nominal), a peak luminosity of up to $1.2 \times 10^{34} \ cm^{-2} \ s^{-1}$ was obtained thanks to the reduced emittance from the injectors and a $\beta^*$ value of 40 cm (cf. 55 cm nominal value) at the high luminosity interaction points. The total integrated luminosity was about 35 $fb^{-1}$. The end of Run 2 proton physics marked the conclusion of an extremely successful data taking period. Approaching the year 2024, the LHC will hopefully further increase the peak luminosity. During the Long Shutdown 2 (LS2), between 2019 and 2020, Linac4 was connected into the injector complex and the injection beam energy of the Proton Synchrotron Booster was upgraded. Moreover, new cryogenic plants will subsequently be installed to separate the cooling of the superconducting radio frequency modules from the magnet cooling circuit. From 2022, during Run 3, the LHC design parameters should allow for an ultimate peak instantaneous luminosity of $L \sim 2.2 \times 10^{34} \ cm^{-2} \ s^{-1}$ (Phase-I operation) and for delivering an integrated luminosity of ~ 300 $fb^{-1}$. The end of this run is scheduled for 2024, and the expectation is that the statistical gain from running the accelerator without a significant luminosity increase beyond its design and ultimate values will become marginal.

The LHC has delivered major results during its years of operation, such as the discovery of the Higgs boson and high-precision measurements of physics at the electroweak scale. This has strengthened confidence in the potential of the LHC and motivated a major luminosity upgrade, which was approved in Brussels on 30 May 2013: 'Europe's top priority should be the exploitation of the full potential of the LHC, including the high luminosity upgrade of the machine and detectors with a view to collecting ten times more data than in the initial design, by around 2030' [10].

The running time necessary to halve the statistical error of a given measurement after 2020 will be more than ten years. Therefore, to maintain scientific progress and to exploit its full capacity, the LHC will need a decisive increase of its luminosity after 2020. In the Long Shutdown 3 (LS3), from 2024 to 2026, the LHC will undergo a major upgrade of its components, such as the low-$\beta$ quadrupole triplets and the introduction of crab cavities at the interaction regions. After LS3, Phase-II will begin: the High Luminosity LHC (HL-LHC), so called because its instantaneous luminosity will increase significantly. The main motivation for a high luminosity regime is the need for significantly larger data sets, which would make it possible to improve the sensitivity of current measurements and to perform completely new ones. In the first proposals the nominal levelled instantaneous luminosity was expected to reach a value of $L = 5 \times 10^{34} \ cm^{-2} \ s^{-1}$, corresponding to an average of roughly $\langle \mu \rangle = 140$ inelastic proton-proton collisions per beam crossing (pileup). Later, a new scenario with an ultimate levelled luminosity was introduced, with a peak of up to $L \sim 7.5 \times 10^{34} \ cm^{-2} \ s^{-1}$, corresponding to $\mu = 200$, delivering an accumulated integrated luminosity of around 3000 $fb^{-1}$ over the whole period of operation. The data collected by the HL-LHC would therefore be an order of magnitude more than previously. The HL-LHC was expected to start operation in the second half of 2026, but because of some delays the timetable has been slightly postponed. The final goal is to provide an ultimate integrated luminosity of up to 4000 $fb^{-1}$ after a period of about 12 years.
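The correspondence between instantaneous luminosity and pileup quoted in this section can be cross-checked with a minimal sketch. It assumes that every one of the 2808 nominal bunches collides at the revolution frequency of 11.2 kHz, and it uses an inelastic pp cross section of about 80 mb; the latter is an assumed round value that does not appear in the text.

```python
def mean_pileup(
    lumi_cm2s: float,
    sigma_inel_mb: float = 80.0,  # assumption: approximate inelastic pp cross section
    n_bunches: int = 2808,        # nominal number of bunches (Section 1.1.1)
    f_rev_hz: float = 11.2e3,     # revolution frequency (Section 1.1.1)
) -> float:
    """Average number of inelastic collisions per bunch crossing."""
    sigma_cm2 = sigma_inel_mb * 1.0e-27  # 1 mb = 1e-27 cm^2
    crossings_per_second = n_bunches * f_rev_hz
    return lumi_cm2s * sigma_cm2 / crossings_per_second


for lumi in (5.0e34, 7.5e34):
    print(f"L = {lumi:.1e} cm^-2 s^-1 -> <mu> ~ {mean_pileup(lumi):.0f}")
```

Under these assumptions the two HL-LHC scenarios give pileup values close to the quoted $\langle \mu \rangle = 140$ and $\mu = 200$; the residual difference depends on the cross section and on the number of colliding bunches actually used.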

# Chapter 2

# ATLAS

ATLAS [11] is a multi-purpose experiment in which over 3000 physicists from over 175 institutes collaborate. It is located at the so-called Point 1 at CERN, 100 m underground, and has a forward-backward symmetric cylindrical geometry with nearly $4\pi$ coverage in solid angle. The detector is 25 m high and 44 m long, and its overall weight is approximately 7000 t. ATLAS studies proton-proton and heavy-ion collisions at the LHC. This chapter introduces the architecture of the detector.

## 2.1 Coordinate system

The ATLAS detector and the particles emerging from the pp collisions are described using the coordinate system illustrated in this section.

As shown in Fig. 2.1, the nominal interaction point (IP) is defined as the origin of a 3D Cartesian frame of reference with coordinates (x, y, z). The z-axis defines the beam direction, while the x-y plane is transverse to the beam direction. In particular, the positive x-axis points towards the centre of the LHC ring and the positive y-axis points upwards. In addition, the positive z-axis defines side A of the detector, while the negative z-axis defines side C. The transverse plane can also be described with the $r - \phi$ coordinates. The azimuthal angle $\phi$ is measured from the positive x-axis, while the polar angle $\theta$ is the angle from the positive z-axis. The radial coordinate r describes the distance from the beam.

Figure 2.1: Common coordinate system used in the ATLAS experiment.

In general, instead of $\theta$, the pseudorapidity is used; it is a function of the angular position of the particle only and does not depend on its nature or energy:

$$\eta = -\ln \tan(\frac{\theta}{2}). \tag{2.1}$$

The pseudorapidity ranges from 0, along the y-axis, to infinity, along the z-axis. However, for massive objects such as jets the rapidity is used, which is Lorentz-invariant for transformations along the z-axis and is defined as:

$$y = \frac{1}{2} \ln\left[\frac{E+p_l}{E-p_l}\right], \tag{2.2}$$

where $p_l$ is the longitudinal component of the particle momentum, i.e. the component along the z-axis. The transverse momentum $p_T$, the transverse energy $E_T$ and the missing transverse energy $E_T^{miss}$ are defined in the x-y plane unless stated otherwise. Based on these considerations, it is possible to measure the angular distance between two particles in the pseudorapidity-azimuthal angle space:

$$\Delta R = \sqrt{\Delta \eta^2 + \Delta \phi^2}. \tag{2.3}$$

The importance of this new coordinate system $(\eta, \phi, z)$ is that it is Lorentz invariant under boosts along the z-axis.
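The quantities defined in Eqs. 2.1 to 2.3 can be evaluated with a few lines of code. The sketch below is illustrative only: the function names are hypothetical, and the wrapping of $\Delta \phi$ into $[-\pi, \pi)$ is an added convention that the text does not state explicitly.

```python
import math


def pseudorapidity(theta: float) -> float:
    """Eq. 2.1: eta = -ln tan(theta / 2), with theta in radians."""
    return -math.log(math.tan(theta / 2.0))


def rapidity(energy: float, p_long: float) -> float:
    """Eq. 2.2: y = 1/2 ln[(E + p_l) / (E - p_l)]."""
    return 0.5 * math.log((energy + p_long) / (energy - p_long))


def delta_r(eta1: float, phi1: float, eta2: float, phi2: float) -> float:
    """Eq. 2.3, with delta phi wrapped into [-pi, pi) (added convention)."""
    dphi = (phi1 - phi2 + math.pi) % (2.0 * math.pi) - math.pi
    return math.hypot(eta1 - eta2, dphi)


# For a massless particle (E = |p|) rapidity and pseudorapidity coincide.
theta = 0.5
p = 10.0
print(pseudorapidity(theta), rapidity(p, p * math.cos(theta)))
print(delta_r(0.0, 3.1, 0.0, -3.1))
```

The second printed line shows why the wrapping matters: two directions at $\phi = 3.1$ and $\phi = -3.1$ are close to each other on the circle, even though the naive difference of the angles is 6.2.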

| Detector component | Required resolution | $\eta$ coverage (measurement) | $\eta$ coverage (trigger) |
|---|---|---|---|
| Tracking | $\sigma_{p_T}/p_T = 0.05\% \, p_T \oplus 1\%$ | $\pm 2.5$ | |
| EM calorimeter | $\sigma_E/E = 10\%/\sqrt{E} \oplus 0.7\%$ | $\pm 3.2$ | $\pm 2.5$ |
| Hadronic calorimeter, barrel and end-cap | $\sigma_E/E = 50\%/\sqrt{E} \oplus 3\%$ | $\pm 3.2$ | $\pm 3.2$ |
| Hadronic calorimeter, forward | $\sigma_E/E = 100\%/\sqrt{E} \oplus 10\%$ | $3.1 < \lvert\eta\rvert < 4.9$ | $3.1 < \lvert\eta\rvert < 4.9$ |
| Muon spectrometer | $\sigma_{p_T}/p_T = 10\%$ at $p_T = 1 \ TeV$ | $\pm 2.7$ | $\pm 2.4$ |

Table 2.1: General performance goals of the ATLAS detector. Units for E and $p_T$ are in GeV. The muon-spectrometer performance is independent of the inner-detector system for high $p_T$.

## 2.2 Detector composition

The ATLAS structure is composed of different detectors, each of which covers a given pseudorapidity ($\eta$) range and has a specific purpose. The overall ATLAS detector layout is shown in Fig. 2.2 and its main performance goals are listed in Tab. 2.1.

The ATLAS Inner Detector (ID) is the first detector traversed by the particles produced in the pp interaction. The whole tracking system is surrounded by a thin superconducting solenoid, which provides a magnetic field of 2 T that makes momentum measurements possible. The magnet configuration has driven the design of the rest of the detector. In addition to the solenoid, it comprises three large superconducting toroid magnets (one in the barrel and two in the end-caps) that are arranged with an eight-fold azimuthal symmetry around the other two ATLAS detectors: the calorimeters. Moving outwards in radius, the electromagnetic calorimeter comes first; its goal is to measure the energy and the track of electrons, positrons and photons. These detectors, however, cannot measure the more energetic particles, the hadrons, and for this reason there is also the hadronic calorimeter, which measures their energy. Finally, only particles with a very low interaction cross section survive, mainly muons and neutrinos. The former are detected in the muon spectrometer, while the latter cannot be detected directly by ATLAS and are therefore studied with the missing-energy technique.

Figure 2.2: Composition of the ATLAS experiment detector with a recent image.
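The way a solenoidal field enables momentum measurements can be illustrated with the standard relation between the transverse momentum of a charged particle and the radius of its circular trajectory in the transverse plane, $p_T \, [\mathrm{GeV}] \approx 0.3 \, q \, B \, [\mathrm{T}] \, R \, [\mathrm{m}]$. This relation is textbook kinematics rather than something stated in this chapter, and the function names below are hypothetical.

```python
# c expressed in units such that pT [GeV] = K * q * B [T] * R [m]
K = 0.299792458


def curvature_radius_m(pt_gev: float, b_tesla: float = 2.0, charge: float = 1.0) -> float:
    """Radius of the transverse-plane trajectory for a given pT."""
    return pt_gev / (K * abs(charge) * b_tesla)


def pt_from_radius_gev(radius_m: float, b_tesla: float = 2.0, charge: float = 1.0) -> float:
    """Inverse relation: pT reconstructed from a measured radius."""
    return K * abs(charge) * b_tesla * radius_m


for pt in (1.0, 10.0, 100.0):
    print(f"pT = {pt:6.1f} GeV -> R = {curvature_radius_m(pt):8.2f} m in a 2 T field")
```

A particle with $p_T = 1$ GeV bends with a radius of about 1.7 m in the 2 T solenoid field, while the trajectory of a 100 GeV particle is almost straight over the size of the tracker, which is why the momentum resolution degrades at high $p_T$.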

### 2.2.1 Inner Detector

The Inner Detector [12] is the closest to the beam line, hence its technology must be very resistant to radiation. This apparatus is 6.2 m long with a diameter of 2.1 m, and its pseudorapidity coverage is $|\eta| < 2.5$. The ID was built for the early tracking stage of ATLAS and is composed of the Pixel Detector (PD) [13], the Semiconductor Tracker (SCT) [14] and the Transition Radiation Tracker (TRT) [15]. During 2014 a new detector, the Insertable B-Layer (IBL) [16], was added. The ATLAS ID, including the IBL detector and its envelope, is shown in Fig. 2.3, together with the 3-dimensional structure of the IBL detector with its services. The general characteristics of each sub-component are listed in Tab. 2.2.

The task of all these detectors is to provide a high-precision measurement of the particle tracks, which is performed through the inside-out and the outside-in algorithms. The first one uses three seeds in the silicon detectors (PD and SCT) in order to reconstruct the tracks of charged particles coming from primary interactions; a combinatorial Kalman-filter algorithm then adds the subsequent hits. In the outside-in algorithm, instead, the tracks of secondary particles are reconstructed starting from the hits in the TRT. Silicon hits, if present, are added with the combinatorial Kalman-filter algorithm. The efficiency of track reconstruction is measured using simulated events, and it varies as a function of $p_T$ and $\eta$.



Figure 2.3: Section of the Pixel Detector with the distances of the sub-detectors and layers from the LHC beam pipe [19].

#### The Pixel Detector

The Pixel Detector (PD) is a silicon-based detector which uses pixel technology. The PD has the highest granularity in ATLAS and it consists of three disks on each side of the interaction region and four barrel layers: Insertable B-Layer, B-Layer, Layer 1 and Layer 2. The Insertable B-Layer (IBL) [20] is the innermost tracking layer, installed during LS1 between the B-Layer and a new, smaller-radius beam pipe. This is the latest upgrade of the pixel detector. The layer was added to maintain the full ID tracking performance and robustness during Phase-I operation, despite the read-out bandwidth limitations of the Pixel layers at the expected peak luminosity and the accumulated radiation damage to the silicon sensors and front-end electronics. The IBL is designed to operate until the end of Phase-I, when a full tracker upgrade is planned for HL-LHC operation. It consists of 14 carbon-composite staves, providing full azimuthal ( $\phi$ ) hermeticity for high transverse momentum ( $p_T > 1$ GeV) particles and longitudinal coverage up to $|\eta| = 3$. Each stave supports 20 pixel sensor modules together with their electrical services and a cooling pipe. Every module is built from a pixel sensor with pixels of nominal size $250 \times 50 \ \mu m^2$, each electrically bonded to a channel of the read-out chip, the FE-I4B. With this chip, the IBL technology makes the detector more radiation hard and gives it a higher surface coverage.

| Detector                               | Hits per track | Element size              | Hit resolution $(\mu m)$ |
|----------------------------------------|----------------|---------------------------|--------------------------|
| PD, $\lvert\eta\rvert < 2.5$           |                |                           |                          |
| 4 barrel layers                        | 3              | $50 \times 400 \ \mu m^2$ | $10(R-\phi) - 115(z)$    |
| $3\times 2$ lateral disks              | 3              | $50 \times 400 \ \mu m^2$ | $10(R-\phi) - 115(z)$    |
| SCT, $\lvert\eta\rvert < 2.5$          |                |                           |                          |
| 4 barrel layers                        | 8              | $50 \ \mu m$              | $17(R-\phi) - 580(z)$    |
| $3\times 2$ lateral disks              | 8              | $50 \ \mu m$              | $17(R-\phi) - 580(z)$    |
| TRT, $\lvert\eta\rvert < 2.0$          |                |                           |                          |
| 83 barrel tubes                        | 30             | $d=4\ mm, l=144\ mm$      | 130/straw                |
| $9\times 2$ end-cap disks              | 30             | $d=4\ mm, l=37\ mm$       | 130/straw                |

Table 2.2: Main characteristics of the ID sub-detectors.
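As a rough cross-check of the pixel resolutions quoted in Table 2.2, the sketch below computes the "binary" resolution, pitch$/\sqrt{12}$, that a single pixel would give if hits were uniformly distributed across it and no charge sharing were used. This is a textbook estimate and not the procedure used by ATLAS; the helper name `binary_resolution_um` is hypothetical.

```python
import math

def binary_resolution_um(pitch_um: float) -> float:
    """RMS position error for a hit uniformly distributed across one pitch.

    Assumption: binary (hit / no-hit) readout, no charge sharing.
    """
    return pitch_um / math.sqrt(12.0)

# Pixel pitches taken from Table 2.2 (50 x 400 um^2).
for label, pitch in (("R-phi", 50.0), ("z", 400.0)):
    print(f"{label}: pitch {pitch:.0f} um -> {binary_resolution_um(pitch):.1f} um")
```

The estimate along z (about 115 $\mu m$) matches the table, while the estimate in $R-\phi$ (about 14 $\mu m$) is somewhat worse than the quoted 10 $\mu m$. The difference would be consistent with charge sharing between neighbouring pixels improving on the binary limit, although the text does not state this.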

#### The Semi-Conductor Tracker

The Semi-Conductor Tracker (SCT) is a 4-layer silicon microstrip detector. Each layer is formed by modules composed of two microstrip detectors glued together with a 40 mrad angle between them. This layout is used to obtain a better z-measurement. In the end-cap region the plane of the microstrip detector is perpendicular to the beam line, while in the barrel region it is parallel.

#### The Transition Radiation Tracker

The Transition Radiation Tracker (TRT) is the largest tracking detector of the ID and surrounds the previous two. It consists of about $5 \times 10^4$ cylindrical straw tubes, each with a positive wire at its centre and the inner wall at negative voltage. In the barrel region the tubes are parallel to the beam line, while in the end-cap region they are perpendicular to it. With a high number of hits per track, the straws together contribute to the measurement of the particle momentum. Every straw is filled with a mixture of Xenon (70%), $CO_2$ (27%) and $O_2$ (3%).

### 2.2.2 Calorimeters

The ATLAS calorimeter system, presented in Fig. 2.4, is composed of two sampling calorimeters, electromagnetic and hadronic, that measure the energy of the particles crossing them. The whole sub-detector is divided into several parts, according to the particle type. Each calorimeter consists of four parts: a barrel part, an extended barrel part, an end-cap part and a forward part. The whole system covers pseudorapidities up to $|\eta| = 4.9$ with complete $\phi$ coverage.
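To give a feeling for what the quoted pseudorapidity limits mean geometrically, the following sketch converts $|\eta|$ into the polar angle $\theta$ with respect to the beam line, using the standard definition $\eta = -\ln\tan(\theta/2)$. The function name `polar_angle_deg` is a hypothetical helper, and the $\eta$ values are the calorimeter boundaries quoted in this section.

```python
import math

def polar_angle_deg(eta: float) -> float:
    """Polar angle (degrees) from pseudorapidity, eta = -ln(tan(theta/2))."""
    return math.degrees(2.0 * math.atan(math.exp(-eta)))

# Boundaries quoted in the text for the calorimeter system.
for eta in (1.475, 2.5, 3.2, 4.9):
    print(f"|eta| = {eta:<5} -> theta = {polar_angle_deg(eta):.2f} deg")
```

With this definition $\eta = 0$ corresponds to $90^\circ$, and the $|\eta| = 4.9$ limit corresponds to an angle of less than one degree from the beam line.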

#### Electromagnetic calorimeter

The electromagnetic sampling calorimeter is a lead/liquid-argon (LAr) detector. Due to its radiation hardness and its good energy resolution, liquid argon is a good active medium, while lead is a good absorber. The track and electromagnetic energy measurements of electrons, positrons, photons and $\pi^0$ are provided by exploiting the electromagnetic showers produced inside it. Since about 99% of the shower energy is contained within at most 20 $X_0$, the EM calorimeter is 22 radiation lengths ( $X_0$ ) deep in the barrel region and more than 24 in the end-caps. The barrel region covers $|\eta| < 1.475$ and the end-cap region is composed of two coaxial wheels, where the outer one covers $1.375 < |\eta| < 2.5$ and the inner one covers $2.5 < |\eta| < 3.2$. The region $|\eta| < 2.5$ is segmented in three layers, where the first layer is finely granulated in $\eta$ to achieve a high photon-neutral pion separation. In the barrel region it is possible to discriminate photons and electrons between $\sim 5 \ GeV$ and $5 \ TeV$. The resolution achievable in the barrel and end-cap regions is

$$\frac{\sigma_E}{E} = \frac{9.4\%}{\sqrt{E(GeV)}} \oplus 0.1\%. \tag{2.4}$$

In general, an energy resolution of this kind can contain up to three terms: the first describes the stochastic behaviour, the second refers to the electronic noise of the read-out channels, and the last, constant term accounts for the temperature, the age of the detector, the radiation damage and other constant contributions. There is a "grey zone" in $1.37 < |\eta| < 1.52$ that is not used for precision measurements because of the barrel-end-cap transition, where the material reaches $7X_0$.

Figure 2.4: Scheme of ATLAS Calorimeter system.
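The behaviour of Eq. (2.4) can be illustrated with a few numbers. The sketch below is a minimal evaluation of that formula, reading $\oplus$ as a sum in quadrature (its usual meaning). The function name `em_resolution` and the sample energies are illustrative choices and are not taken from the text.

```python
import math

def em_resolution(energy_gev: float,
                  stochastic: float = 0.094,
                  constant: float = 0.001) -> float:
    """Relative resolution sigma_E/E from Eq. (2.4).

    The two terms are combined in quadrature, which is the assumed
    meaning of the "circled plus" symbol.
    """
    return math.hypot(stochastic / math.sqrt(energy_gev), constant)

# Sample energies chosen only for illustration.
for energy in (10.0, 100.0, 1000.0):
    print(f"E = {energy:6.0f} GeV -> sigma_E/E = {100 * em_resolution(energy):.2f} %")
```

The stochastic term dominates at low energy and falls as $1/\sqrt{E}$, so the constant term only becomes relevant at the highest energies.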

#### Hadronic calorimeter

The energy and the missing momentum of hadrons produced in pp collisions or formed in secondary decays are measured, instead, by the Hadronic Calorimeter (HCAL).

Unlike the electromagnetic calorimeter, it exploits the strong interaction which occurs when particles travel inside it, developing hadronic showers. It is formed by the Hadronic Tile Calorimeter (HTC), a scintillator-tile calorimeter, and by the Hadronic End-Cap Calorimeter (HEC) and the Forward Calorimeter (FCAL), which are both LAr calorimeters. The first one is composed of one central barrel and two smaller extended barrels, one at each side of the central cylinder, and covers $|\eta| < 1.7$. The covered interaction length is respectively 4.0, 1.4, and 1.8. Steel is used as the absorber while scintillating tiles are the active material. The energy resolution depends on the calorimeter composition:

$$\frac{\sigma_E}{E} = \frac{10\%}{E} + (1.2 \pm 0.1^{+0.5}_{-0.6}\%) \tag{2.5}$$

in the barrel and in the end-cap region,

$$\frac{\sigma_E}{E} = \frac{10\%}{E} + (2.5 \pm 0.4^{+1}_{-1.5}\%) \tag{2.6}$$

for the forward calorimeter.

### 2.2.3 Muon Spectrometer

Many physics processes in ATLAS have high- $p_T$ muon signatures, and the muon spectrometer, schematized in Fig. 2.5, plays an important role in the identification of these particles. The detector is designed to reach high precision and resolution, and it also provides a muon trigger that is independent from the rest of the apparatus. The measurement is based on the magnetic deflection of muon tracks in the large superconducting air-core toroid magnets. The muon spectrometer is divided into a barrel and an end-cap region, in which the toroid magnets are placed. Its chambers are divided into two groups, Precision Chambers and Trigger Chambers, which use four different detector technologies: the former are composed of Monitored Drift Tubes and Cathode Strip Chambers, the latter of Thin Gap Chambers and Resistive Plate Chambers. Muons with a transverse momentum below the threshold of about $3 \ GeV/c$ cannot be identified because they are completely absorbed before reaching the muon spectrometer. The pseudorapidity range of the whole system is $|\eta| < 2.7$ and the measured resolution of $p_T$ is about 20% at 1 TeV.

#### Monitored Drift Tubes and Cathode Strip Chambers

Monitored Drift Tubes (MDT) and Cathode Strip Chambers (CSC) measure the muon momentum. MDT chambers are drift chambers with two multi-layers of drift tubes, dedicated to the precise measurement of the z coordinate in the barrel region. Here the pseudorapidity coverage is $|\eta| < 2$. The hit position of the particle is reconstructed by measuring the drift time in single tubes. CSC are multi-wire chambers with strip cathodes for the measurement of muon momentum in the range $1.0 < |\eta| < 2.7$. The CSC are composed of parallel anode wires which are perpendicular to 1 mm wide strips of opposite polarity. They are placed close to the beam pipe in the innermost layer of the end-cap.

Figure 2.5: Overview of the ATLAS Muon Spectrometer

#### Thin Gap Chambers and Resistive Plate Chambers

Thin Gap Chambers (TGC) and Resistive Plate Chambers (RPC) provide the online trigger. The TGC, in the end-cap region, are very thin multi-wire chambers. The spatial resolution of these detectors is 4 mm in the radial direction and 5 mm in the $\phi$ coordinate. The anode-cathode spacing is smaller than the anode-anode spacing, leading to a drift time lower than 20 ns. The TGC are also used to improve the measurements along the $\phi$ coordinate obtained from the precision chambers. The RPC, in the barrel region, are gaseous parallel electrode-plate detectors, with a spatial resolution of 1 mm in two coordinates and a time resolution of 1.0 ns. This sub-detector works in the avalanche regime: when a charged particle passes inside the chamber, the primary ionization electrons are multiplied into avalanches by a high electric field.



Figure 2.6: Overview of the ATLAS magnet system.

### 2.2.4 Magnetic System

As already anticipated, ATLAS measures the momenta of charged particles through a system of superconducting magnets. It consists of a Central Solenoid (CS), placed between the ID and the calorimeter system, and three large air-core toroids (one in the barrel and two in the end-caps), which generate the magnetic field in the muon spectrometer. Fig. 2.6 shows the structure of this magnetic system. The 2 T magnetic field of the CS points in the positive z-axis direction. The CS is 5.3 m long, with a diameter of 2.4 m and a weight of 5 t. Its operating temperature of 4.5 K is maintained by a cryostat shared with the electromagnetic calorimeter barrel. The barrel toroid consists of 8 flat superconducting race-track coils, each 25.3 m long and 5 m wide. The 8 coils in the torus are kept in position by 16 support rings. Its total weight is 830 t. The toroid magnets produce a magnetic field of 3.9 T and are cooled down to 4.7 K by liquid helium. Two end-cap toroids are positioned inside the barrel toroid, at both ends, and provide the required 4.1 T magnetic field across a radial span of 1.5 m to 5 m. Each end-cap toroid has a weight of 240 t. The coil system of the end-cap toroid is rotated by an angle of 22.5° with respect to the barrel toroid coils. In this way, radial overlap between the two coil systems is provided and the bending power is optimized. The most important parameters for momentum measurements are:

$$I_1 = \frac{0.3}{p_T} \int_0^{l} B \sin(\theta)_{(d\vec{l},\vec{B})} \, dl, \tag{2.7}$$

$$I_2 = \frac{0.3}{p_T} \int_0^{l\sin(\theta)} \int_0^{\frac{r}{\sin(\theta)}} B \sin(\theta)_{(d\vec{l},\vec{B})} \, dl \, dr. \tag{2.8}$$

 $I_1$  describes the bending power field and  $I_2$  represents the total transverse deflection of the particle from its initial path. They are field integrals calculated on the azimuthal direction of the particle  $(l = r/sin(\theta))$  and on its radial trajectory.  $\theta$  is the longitudinal component of the angle between the track and the magnetic field.
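The factor 0.3 in Eqs. (2.7) and (2.8) is the usual unit-conversion constant of the relation $p_T\,[GeV/c] \simeq 0.3\, B\,[T]\, R\,[m]$ for a particle of unit charge moving in a uniform field. As a minimal sketch of its meaning, the code below computes the bending radius $R$ in the 2 T field of the Central Solenoid. The function name `bending_radius_m`, the sample momenta and the uniform-field approximation are assumptions made for illustration.

```python
def bending_radius_m(pt_gev: float, b_tesla: float, charge: int = 1) -> float:
    """Radius of curvature (m) of a charged track in a uniform field.

    Uses p_T [GeV/c] = 0.3 * |q| * B [T] * R [m]; the uniform field is
    an idealisation of the real solenoid field.
    """
    return pt_gev / (0.3 * abs(charge) * b_tesla)

B_SOLENOID_T = 2.0  # Central Solenoid field quoted in the text

# Sample transverse momenta chosen only for illustration.
for pt in (1.0, 10.0, 100.0):
    radius = bending_radius_m(pt, B_SOLENOID_T)
    print(f"p_T = {pt:5.1f} GeV/c -> R = {radius:7.2f} m")
```

The radius grows linearly with $p_T$, so high-momentum tracks are almost straight inside the tracker, which is why a precise position measurement is needed to resolve their curvature.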

### 2.2.5 Forward detectors

The ATLAS forward region is covered by a set of small sub-detectors: LUCID (Luminosity measurement using Cherenkov Integrating Detector) [19], ZDC (Zero-Degree Calorimeter) [20], AFP (ATLAS Forward Proton) [21] and ALFA (Absolute Luminosity For ATLAS) [22]. Fig. 2.7 shows their position along the beam line.



Figure 2.7: Infrastructure of the ATLAS Forward Detector.

#### LUCID

Luminosity measurement using Cherenkov Integrating Detector (LUCID) is a Cherenkov counter that monitors the luminosity delivered by the LHC accelerator. Two LUCID detectors are placed symmetrically in the two forward regions, at 17 m from the interaction point. Each one is made of 16 photomultiplier tubes and 4 quartz fiber bundles.

Charged particles produce Cherenkov light in the quartz windows as well as in the fiber bundles; the particles are detected when this light reaches the photomultipliers.

#### ZDC

The Zero-Degree Calorimeter (ZDC) detects forward neutrons, in both pp and heavy-ion collisions, for $|\eta| > 8.3$. The detector is placed at 140 m on both sides of ATLAS and is composed of an electromagnetic module (about 29 radiation lengths thick) and three hadronic modules, made of tungsten with an embedded matrix of quartz rods, attached to photomultiplier tubes.

#### AFP

The goal of the ATLAS Forward Proton (AFP) detector is to measure the momentum transfer and energy loss of protons emitted from the collision point in very forward directions. The AFP stations are placed along the beam line at 204 m and 217 m; they contain a 3D silicon tracker and, in the far stations, a time-of-flight detector.

#### ALFA

Absolute Luminosity For ATLAS (ALFA) is the furthest detector, located 237 m from the interaction point, on both sides of ATLAS. Each detector is made of staggered layers of square-shaped scintillating fibers, read out by photomultiplier tubes. It aims to measure elastic pp scattering at small angles. The detector can be moved very close to the beam without entering the machine vacuum because the set-up is installed in Roman Pot stations, vessels that are connected to the accelerator vacuum via bellows.

## 2.3 Trigger and Data Acquisition system

The Trigger and Data Acquisition (TDAQ) system [23], shown in Fig. 2.8, is a fundamental component of the ATLAS detector because it ensures optimal data-taking conditions. This thesis work is related to the ATLAS TDAQ system; in particular, Chapter 3 presents its future upgrade, made necessary by the high luminosity that will be achieved by the HL-LHC ( $\sim 10^8$ pp processes produced in 1 s).

Figure 2.8: Performance of the ATLAS Trigger System in 2015 [24].

The data of each collision event are moved from the detector readout electronics into front-end buffers at the bunch-crossing rate. Only some of these events are interesting and might lead to new discoveries. The entire set of events cannot be saved, because it would require a storage capacity not compatible with the technologies in use, such as hard disks and tapes; moreover, the costs of production and maintenance (hundreds of petabytes of data would be produced per year) would be too high. The Trigger and DAQ system therefore selects a few hundred events per second for recording to permanent storage for later study. To do this, the DAQ system has to transport and assemble the event data from the front-end buffers to the recording on disk. To reduce the flow of data to manageable levels, the ATLAS Run 2 trigger exploits an event selection system based on a multi-level trigger: Level-1 (L1), Level-2 (L2) and Event Filter (EF).

The L1 trigger is hardware-based and processes data from the calorimeters and the muon detectors; more precisely, the muon trigger data come from the RPC and the TGC chambers. The L1 trigger decision is taken by the Central Trigger Processor (CTP), which receives inputs from the L1 calorimeter (L1Calo) and L1 muon (L1Muon) triggers as well as from several other subsystems, such as the Minimum Bias Trigger Scintillators (MBTS) [25], the LUCID Cherenkov counter and the ZDC. This set of sub-detectors provides the trigger signatures: high- $p_T$ muons, electrons/photons, jets, $\tau$-leptons decaying into hadrons and missing transverse energy. The data passing the hardware selection are processed by the ReadOut Driver (ROD), which applies fragment building and the associated error detection, data checking, transformation and monitoring. The data are then received by the Read-Out System (ROS), which sends the information to the High-Level Trigger (HLT), composed of two subsystems: the Level-2 trigger and the Event Filter. The HLT is a processor farm exploiting 28k CPUs to rapidly investigate the Regions of Interest (RoI, in $\eta$ and $\phi$ ) identified by the L1. The rate of the dataflow here is reduced to approximately 50 kHz, with a decision time for each collision of 2 $\mu s$ from the collision itself. The hardware-programmable coincidence logic provides six muon $p_T$ thresholds for this part of the trigger, three in the 6-9 GeV range (low $p_T$ ) and three in the 9-35 GeV range (high $p_T$ ).

The L2 trigger is software-based and runs on a large farm of about 40k CPU cores. The RoI information from L1 can be used for regional reconstruction by the trigger algorithms. It reaches a rate below 5 kHz in less than 50 ms. Finally, the Event Filter is the last stage of the trigger chain. It makes it possible to reach 30 Hz in 4 s, which is the standard time of the off-line event reconstruction of the ATLAS TDAQ. The HLT achieves a further reduction to 0.4-1 kHz. After the events are accepted by the HLT, they are transferred to local storage at the experimental site and exported to the Tier-0 facility at CERN's computing centre for offline reconstruction.
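The effect of the multi-level selection can be summarised as a chain of rejection factors. The sketch below is only an illustration of that arithmetic: the 40 MHz input is the nominal LHC bunch-crossing rate (an assumption, since this section does not state it), the stage outputs are the approximate figures quoted above (50 kHz, 5 kHz and the upper HLT value of 1 kHz), and the names `rejection_factor` and `STAGES` are hypothetical.

```python
def rejection_factor(rate_in_hz: float, rate_out_hz: float) -> float:
    """Ratio between the input and output event rates of a trigger stage."""
    return rate_in_hz / rate_out_hz

BUNCH_CROSSING_HZ = 40e6  # assumption: nominal LHC bunch-crossing rate

# (stage name, approximate output rate in Hz) as quoted in the text.
STAGES = [("L1", 50e3), ("L2", 5e3), ("HLT output", 1e3)]

rate = BUNCH_CROSSING_HZ
for name, rate_out in STAGES:
    factor = rejection_factor(rate, rate_out)
    print(f"{name:>10}: {rate:12.0f} Hz -> {rate_out:8.0f} Hz (x{factor:.0f})")
    rate = rate_out

total = rejection_factor(BUNCH_CROSSING_HZ, rate)
print(f"overall reduction: x{total:.0f}")
```

Under these assumptions most of the reduction happens in the hardware level, which is why its decision has to be taken within a few microseconds.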

# Chapter 3

# ATLAS Phase-II upgrade

The High-Luminosity upgrade aims to expand the LHC research reach for physics beyond the Standard Model. After the end of Run 3 the accelerator will be pushed to and beyond its structural limits in terms of peak luminosity, pile-up and proton-proton collision energy. In order to keep up with these new conditions, the ATLAS experiment requires an upgrade in terms of sensitivity and precision of its detectors. The most important modifications planned for this so-called Phase-II involve the Trigger and Data Acquisition system, the Inner Tracker system, the Calorimeters (both the Liquid Argon and the Tile one) and the Muon Spectrometer.

## 3.1 Upgrade proposals

Physics research of the ATLAS detector during Phase-II will include:

- Precision measurements of the properties of the Higgs boson (e.g. its coupling to fermions or its self-coupling);
- Precision Standard Model measurements (e.g. top mass and cross-section);
- Searches for physics Beyond the Standard Model (e.g. Supersymmetry or long-lived particles);
- Flavour physics (e.g. rare B-meson decays);
- Heavy-Ion Physics.

The HL-LHC will represent an extremely challenging environment for the ATLAS experiment. All the modifications required to sustain this run period will be made during the Long Shutdown 3 (LS3), which aims at a step forward in sensors, hardware, firmware, software and strategies, in order to reach values of the LHC parameters one order of magnitude higher than those originally planned. For example, the total integrated luminosity expected is at least 3000 $fb^{-1}$, compared to the 300 $fb^{-1}$ of Run 3. As a consequence, by around 2030 the data collected during the HL-LHC run will be ten times that of the initial design. To endure this new operational phase, the ATLAS experiment will require an upgrade as well. The performance of the detector in the new HL-LHC scenario, up to the technical limits of the LHC capabilities, is studied through simulations, which also ensure that the Phase-II detector upgrade is able to take advantage of the ultimate luminosity. The ATLAS collaboration first described its initial plan for the Phase-II upgrade of the detector in the Letter of Intent (LoI) in 2012. Since then, the collaboration has been improving and refining the initial proposals, considering that the development of the upgrade will depend on the actual maximum luminosity reached and on the mean number of interactions per bunch crossing.

The most important upgrades planned by the ATLAS collaboration involve the Trigger and Data Acquisition (TDAQ) and the Inner Tracker (ITk) system, which are also the most expensive components of the whole detector. Other changes will be carried out in the Calorimeters (both the Liquid Argon and the Tile one) and in the Muon Spectrometer (MS). Some of the sub-detectors will be completely replaced but, in other cases, such as the newest sub-detectors and those which can sustain the high pile-up, only the readout electronics will be substituted. The design of the new sub-detectors is driven by the instantaneous luminosity of $7.5 \times 10^{34} \text{ cm}^{-2} \text{ s}^{-1}$, which allows an integrated luminosity of $3000/4000 \text{ fb}^{-1}$ with a pile-up of 200. The detection strategy will be analogous to that of the current experiment, with a similar set-up and with detectors placed at the same distance from the interaction point. The $|\eta|$ coverage will change slightly: the Inner Tracker will extend up to $|\eta| = 4$, the Muon Spectrometer will receive new RPCs covering $|\eta| < 1$, and the new High Granularity Timing Detector is planned to cover $2.4 < |\eta| < 4.0$. The following sections focus on the most relevant ATLAS Phase-II upgrades planned for this new working period of the HL-LHC.
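The link between the instantaneous luminosity and the pile-up of about 200 can be checked with a short estimate, $\mu = L\,\sigma_{inel} / (n_b\, f_{rev})$. The sketch below is an order-of-magnitude illustration only: the inelastic cross-section (about 80 mb), the number of colliding bunches (2748) and the revolution frequency (11245 Hz) are assumed typical values and are not taken from this thesis; `mean_pileup` is a hypothetical helper.

```python
MB_TO_CM2 = 1e-27  # 1 millibarn expressed in cm^2

def mean_pileup(lumi_cm2_s: float,
                sigma_inel_mb: float = 80.0,    # assumed inelastic pp cross-section
                n_bunches: int = 2748,          # assumed number of colliding bunches
                f_rev_hz: float = 11245.0) -> float:  # assumed LHC revolution frequency
    """Mean number of inelastic pp interactions per bunch crossing."""
    interactions_per_s = lumi_cm2_s * sigma_inel_mb * MB_TO_CM2
    crossings_per_s = n_bunches * f_rev_hz
    return interactions_per_s / crossings_per_s

mu = mean_pileup(7.5e34)  # luminosity quoted in the text
print(f"estimated pile-up: {mu:.0f} interactions per bunch crossing")
```

With these assumed inputs the estimate comes out close to the value of 200 quoted in the text.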



Figure 3.1: Active areas of the ITk in the ATLAS coordinate system, with the radius from the beam pipe on the vertical axis and the z coordinate, parallel to the beam line, on the horizontal axis [26].

### 3.1.1 Inner Tracker

The increased luminosity anticipated in the previous chapters, along with the associated data rate and accumulated radiation damage, will make the current ATLAS Inner Detector inoperable. The Inner Tracker (ITk) is a new all-silicon tracking detector with the same function as the current Inner Detector (ID) of Section 2.2.1, and it will represent the core of the new ATLAS detector. The general requirement for the ITk is to deliver tracking performance equal to or better than that of the current ID, despite an average pile-up of up to $\langle \mu \rangle = 200$ events. This can be achieved through an "inclined layout" with coverage up to $|\eta| = 4$. The ITk is composed of the Strip Detector and the Pixel Detector. The former, on the outside, will have four barrel layers and six petal-design end-cap disks covering $|\eta| < 2.7$. The Pixel Detector, instead, will have five flat layers, five inclined layers and five end-cap layers, together covering $|\eta| < 4$. The planned ITk layout is shown in Fig. 3.1, while the core scheme of ATLAS obtained through a software simulation is illustrated in Fig. 3.2.

This detector is designed to measure the transverse momentum and direction of isolated particles (particularly muons and electrons), and makes it possible to reconstruct the vertices of pile-up events and to identify the vertex associated with the hard interaction. Moreover, the ITk will be able to reconstruct and identify, with high efficiency and purity, secondary vertices in b-jets and the decays of $\tau$-leptons, including impact parameter information. As already anticipated, the detector has to withstand an environment in which the integrated radiation dose is ten times higher than in previous LHC conditions. Indeed, considering the instantaneous luminosity and the pile-up at which the HL-LHC operates, the innermost technology should tolerate a radiation dose of 9.9 MGy.

Figure 3.2: Scheme of the ITk detector of ATLAS reconstructed by simulations.

New technologies are used to ensure that the system can survive this harsh radiation environment. Moreover, the new read-out scheme, based on a front-end chip fabricated in CMOS technology, allows the implementation of a track trigger, which contributes to the major improvements in the online trigger capabilities.

### 3.1.2 High Granularity Timing Detector

The High Granularity Timing Detector (HGTD) [27] is a new detector proposed to increase the precision of luminosity measurements. Its expected position is shown in Fig. 3.3. This position is strategic because it makes it possible both to measure the online luminosity bunch by bunch during HL-LHC running and to enhance the high-precision sampling of the integrated luminosity. The consequent improvement in the luminosity resolution is fundamental for the survey of the Higgs couplings. The HGTD will increase the spatial and time performance of the ITk, with a 30 ps time resolution for minimum ionizing particles going through the innermost detector. Between the HGTD and the end-cap/forward calorimeters, a 50 mm moderator is planned to protect the HGTD and the ITk from back-scattered neutrons. The custom front-end ASIC, ALTIROC, is being developed and is planned to be bump-bonded to the silicon sensor, which is currently under study. The advantage of this ASIC consists in its high time and spatial resolution and in its radiation hardness. Other important operations are: counting the number of hits registered in the sensor; 40 MHz transmission, to allow unbiased, bunch-by-bunch measurements of the luminosity; coping with the minimum-bias trigger. Each HGTD end-cap will integrate one hermetic vessel, two instrumented double-sided layers mounted on two cooling/support disks, and two moderator pieces, inside and outside the hermetic vessel. The detecting region of the HGTD will cover the range $2.4 < |\eta| < 4.0$.

Figure 3.3: 3D view of the new HGTD detector and its position in the future ATLAS structure.
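To give a feeling for what a 30 ps time resolution means, it can be translated into the distance that a particle moving at the speed of light travels in that time. The sketch below performs this conversion; the function name `timing_to_distance_mm` is hypothetical, and treating the particles as moving exactly at $c$ is an approximation.

```python
SPEED_OF_LIGHT_M_S = 299_792_458.0  # exact value by definition of the metre

def timing_to_distance_mm(sigma_t_ps: float) -> float:
    """Distance (mm) covered at the speed of light during sigma_t (ps)."""
    return SPEED_OF_LIGHT_M_S * sigma_t_ps * 1e-12 * 1e3

sigma_mm = timing_to_distance_mm(30.0)  # HGTD time resolution quoted in the text
print(f"30 ps corresponds to about {sigma_mm:.1f} mm at the speed of light")
```

A resolution of 30 ps therefore corresponds to a path difference of roughly 9 mm.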

### 3.1.3 Calorimeter

The read-out electronics and the low-voltage powering system of the ATLAS Liquid Argon Calorimeter (LAr) will be updated [28] in order to overcome technology limitations and obsolescence. Along with the modifications made during Phase-I, the LAr will gain readout performance from new boards in the on-detector region: the FEB2 (Front-End Board 2) and the Calibration Board. The first will manage the analog processing and the second will inject calibration signals. On the off-detector side a new board, the LAr Signal Processor Board (LSPB), will be used to transmit data to the DAQ structure. Its goal will be to digitize the FEB2 information and apply digital filtering to the signals of each LAr calorimeter cell. Custom ASICs will be the technology used for the FEB2, while the LSPB will be based on FPGAs.

Figure 3.4: Calorimeter system scheme in the ATLAS Phase-II detector. The Long Barrel (LB) and the Extended Barrel (EB) are both divided in A and C.

In the central region of the hadronic calorimeter, instead, the Tile Calorimeter (TileCal) [29] will keep the same position and goal described in Section 2.2.2. Fig. 3.4 shows the scheme and position of the Tile Calorimeter in the scenario of ATLAS Phase-II. The sub-detector will capture roughly 30% of the jet energy and, as in the previous runs, it will be of crucial relevance in jet and missing energy measurements, jet substructure, electron isolation and triggering.

### 3.1.4 Muon Spectrometer

The Muon Spectrometer (MS) [30] requires, for Phase-II, a significant improvement in the performance and precision of the track reconstruction by the triggering system. The current muon spectrometer provides a Level-1 (L1) hardware muon trigger based on coincidences of hits in different detector layers. A software confirmation is performed by the high-level trigger, which exploits refined $p_T$ measurements from the precision chambers. For the Phase-II upgrade the muon spectrometer will provide a finer-granularity trigger based on the Monitored Drift Tubes (MDT). This improves the sharpness of the transverse momentum threshold at L1 and, if the L0 latency allows it, also at L0. Concerning L0, its coverage and redundancy will increase, thanks to the addition of new Resistive Plate Chamber (RPC) detectors placed in the barrel region ( $|\eta| < 1$ ). Moreover, for the Phase-II upgrade, the original on-chamber electronics of the MDT may be partially or completely replaced. The final configuration of the upgraded Muon Spectrometer is shown in Fig. 3.5. The trigger electronics is upgraded both in the barrel and in the end-cap spectrometers and, in order to improve the trigger selectivity, a replacement of the trigger chambers in the forward region ( $2.0 < |\eta| < 2.4$ ) is expected. The new ATLAS requirements of the L0/L1 trigger system also apply to the readout electronics: the current electronics is not able to sustain the rate of 400 kHz and must be replaced with an improved one. Finally, another important reason for the upgrade of the trigger electronics is the need for an improved selectivity for high- $p_T$ tracks, which calls for a better spatial resolution in the bending direction ( $\eta$ ).

Figure 3.5: Scheme of the active areas and layout of the MS Phase-II.

In the Reference scenario, the muon acceptance will grow thanks to the inclusion of a very forward muon tagger, attached to the New Small Wheel (NSW) shielding disk and covering the range $2.6 < |\eta| < 4.0$. The main modifications for this scenario involve the barrel and end-cap detectors. In the barrel region, new RPCs are added together with small-diameter MDTs (sMDT) in the small sectors of the Barrel Inner layer (BI), where the existing on-detector electronics will be replaced. In the end-cap region, the MDT front-end readout will be replaced and a very forward tagger is included to take advantage of the very-forward ITk tracking, which in the Reference scenario has the largest extension. These modifications are required for the new run period because, for instance, the present MDT read-out electronics cannot cope with the expected trigger rate. The new trigger system will also have to maintain a high level of efficiency, which consists of finding high- $p_T$ tracks while keeping the rate of fake triggers low. Moreover, still in the end-cap region, the L0 trigger electronics of all the chambers (excluding the NSW) will be replaced. Indeed, the current read-out electronics of the trigger chambers does not supply information on the pulse height of the signals, so position measurements cannot be performed by charge interpolation.

## 3.2 Trigger and the Data Acquisition

The HL-LHC upgrade also represents a significant challenge for the Trigger and Data Acquisition (TDAQ) system of ATLAS [31]. To carry out the physics research mentioned in Section 3.1, a highly efficient event selection for the Higgs boson and new physics is necessary. Given the large increase in luminosity, exceptional performance is required from the trigger and data acquisition system, which must sustain a higher maximum rate and a longer latency. As for most of the components of the ATLAS detector, the upgrade of the TDAQ system has gone through different evolution stages in terms of designs and technologies compared to the initial plan.

A general architecture layout is given in Fig. 3.6, which shows the three main subsystems that compose the ATLAS TDAQ: the Level-0 Trigger, the DAQ (Readout and Dataflow subsystems) and the Event Filter. In the Reference scenario, during LS2 the front-end electronics of all the existing ATLAS detector systems are replaced, excluding the systems already upgraded during LS1 for Phase-I. The architecture that will be chosen has passed through three different designs: the baseline, the evolved and an alternative. The first is a single-level trigger (Level-0) which was planned to evolve with the inclusion of an L1Track level; it differs from the alternative in the technology exploited:

- a hardware-based system (i.e., HTT), made of custom-designed Associative Memory (AM) Application Specific Integrated Circuits (ASICs) for pattern recognition and Field Programmable Gate Arrays (FPGAs) for track reconstruction and fitting;
- a software system, mostly Central Processing Unit (CPU)-based servers with or without accelerators (e.g., GPGPUs).

Figure 3.6: Design of the baseline L0-only of the TDAQ Phase-II architecture. The upgrade project is divided into three main systems: Level-0 Trigger, DAQ (Readout and Dataflow subsystems) and Event Filter.

In the initial plan of the Technical Design Report (TDR) of the TDAQ Phase-II, the baseline for EF tracking is the so-called Hardware Tracking for the Trigger (HTT). The latency requirement suggested an evolved configuration of this hardware tracking system with the inclusion of a new trigger level, Level-1 (L1), which would have processed regional data from the strips and the outer Pixel Detector layers. Currently, the option to evolve to an L0/L1 system has been dropped, thanks to the improved resource efficiency of the software, and the latency constraints have been removed with it. The decision is therefore to define a new baseline. Since the methods developed for track reconstruction in both the TDR baseline and the evolved scenarios represent viable options for the new EF tracking implementation, they are presented in more detail in the following. Section 3.2.3 describes the pattern matching method of the baseline, which relies on the AM ASIC technology, while the next chapter describes the method developed for the evolved design, which exploits the Hough Transform algorithm implemented on FPGAs.

## 3.2.1 Baseline Architecture

The baseline architecture initially selected for the TDAQ system consisted of three subsystems: the Level-0 trigger, the DAQ (Readout and Dataflow) and the Event Filter. The sub-parts of the trigger and the baseline design of the TDAQ architecture are shown in Fig. 3.7. The hardware-based L0 trigger system is composed of four subsystems:

- The Level-0 Calorimeter Trigger (L0Calo) is based on the Phase-I L1Calo system with a new forward Feature EXtractor (fFEX), to ensure an efficient electron identification in the region $3.2 < |\eta| < 4.0$. Electron, $\tau$-lepton and jet candidates are reconstructed from coarse-granularity data, which are also used to calculate the missing transverse energy $E_T^{miss}$.
- The Level-0 Muon Trigger (L0Muon) is completely upgraded compared to the Phase-I system. The sub-system employs the upgraded logic of the barrel and endcap sectors and the NSW trigger processors to reconstruct muons in the barrel, forward and end-cap regions.
- The Global Trigger is a subsystem of the L0 trigger system that will run offline-like algorithms on full-granularity calorimeter data. It will also identify topological signatures that can include a wide variety of four-vector combinations involving sums, angles and invariant masses.
- The Central Trigger Processor (CTP) subsystem considers the trigger menu configuration, prescale factors and dead-time requirements to form the final Level-0 decision, which is transmitted as an L0A signal after applying flexible pre-scales and vetoes to the trigger items.

Figure 3.7: Design of the TDAQ Phase-II baseline architecture, where the trigger is in purple. The single-level hardware trigger L0 is composed of the L0Calo and L0Muon sub-systems; it passes through the Global Trigger and eventually reaches the CTP. The L0 trigger dataflow is represented by the black dotted arrows, while the full black lines represent the readout dataflow, which starts only if the detectors and FELIX receive the L0 accept signal (dashed purple line). Readout is in green, dataflow in yellow and Event Filter in red [31].

The MUCTPI (Muon CTP Interface) provides an interface between the barrel and end-cap components of the L0Muon system on one hand, and the Global Trigger and CTP on the other. It identifies muon candidates that are counted twice in the L0Muon system (overlap removal) and calculates multiplicities for various transverse-momentum thresholds. The L0Calo and L0Muon/MUCTPI subsystems send their selected objects to the Global Trigger, including spatial locations, reconstructed energy/momentum values and discriminant variables. These objects are then combined with the results of the Global Trigger calorimeter processing to refine the e/gamma, tau, muon and jet selections.

Following the Level-0 trigger decision, the resulting data are transmitted over custom point-to-point serial links to the Readout subsystem within the DAQ system. The first element of the Readout is the Front-End Link eXchange (FELIX) subsystem, which provides a common interface between the detector-specific custom point-to-point serial links and the commodity multi-gigabit data network downstream. Data are received over the network by the Data Handlers, where detector-specific processing (such as formatting and/or monitoring) can be implemented before the Dataflow subsystem, in which data are buffered. The Readout subsystem is designed to handle a 1 MHz event rate, for a total bandwidth of 5.2 TB/s. To sustain this 1 MHz input rate, a large Event Filter (EF) processor farm is required, and the commodity-CPU-based event processing would have been supported by the Hardware Tracking for the Trigger (HTT) subsystem, which was designed to provide fast hardware-based track reconstruction for the TDR baseline TDAQ. Regional tracking by the regional HTT (rHTT) allows a fast initial rejection in the EF of background processes in single high-$p_T$ lepton and multi-object triggers, reducing the rate to around 400 kHz. This system is specified to operate at 1 MHz and to use up to 10% of the ITk data, by selecting tracking modules in regions based on the results of the Level-0 trigger system. Software-based reconstruction follows to achieve further rejection. This is aided by global tracking at around 100 kHz using the global HTT (gHTT), in this case with the intent to produce tracks closer to offline quality, suitable for b-jet tagging, $E_T^{miss}$ soft-term calculation, soft jets and pile-up suppression. Events selected by the EF are finally transferred to the permanent storage of the ATLAS offline computing system. The raw output event size is expected to be 6 MB and the total trigger output rate 10 kHz; thus, the total bandwidth out of the system is 60 GB/s.
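The bandwidth figures quoted above follow from multiplying the event size by the event rate. The short Python sketch below reproduces this arithmetic. The function name `bandwidth_bytes_per_s` is illustrative, decimal prefixes (1 MB = $10^6$ bytes) are assumed, and the average event size at the Readout input is not stated in the text: it is only inferred here from the 5.2 TB/s and 1 MHz figures.

```python
def bandwidth_bytes_per_s(event_size_bytes: float, rate_hz: float) -> float:
    """Average bandwidth needed to move events of a given size at a given rate."""
    return event_size_bytes * rate_hz


MB, GB, TB = 1e6, 1e9, 1e12  # decimal prefixes (assumption)

# Event Filter output: 6 MB events at a 10 kHz output rate.
ef_output = bandwidth_bytes_per_s(6 * MB, 10e3)
print(f"EF output bandwidth: {ef_output / GB:.0f} GB/s")

# Readout input: 5.2 TB/s at 1 MHz implies this average event size (inferred).
readout_event_size = 5.2 * TB / 1e6
print(f"Implied average event size at readout: {readout_event_size / MB:.1f} MB")
```

The same function can be reused for any other stage of the dataflow once its event size and rate are known.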

## 3.2.2 Evolved Architecture

The single-level hardware trigger of the TDR baseline TDAQ system had two critical aspects with a major impact on the performance of the system: the hadronic trigger rate and the inner Pixel Detector layer occupancy must not exceed their expected values. To address the issues that could arise if they did, the TDAQ hardware-based trigger infrastructure was designed to "evolve". This solution relied on the inclusion of a new trigger level, Level-1 (L1), which would have processed regional data from the strips and the outer Pixel Detector layers. This two-level architecture included an L0 trigger rate of 2-4 MHz with 10 $\mu s$ latency, followed by an L1 trigger rate of 600-800 kHz with a latency of up to 35 $\mu s$. The ITk Pixel Detector would then have required a higher readout rate of 4 MHz for layers 2 to 4, with the data selection for L1 performed off-detector in the Readout system, based on Global Trigger information. The ITk Strip Detector could reach 4 MHz at L0 without changes. The architecture of the evolved system is shown in Fig. 3.8.

Figure 3.8: Design of the TDAQ Phase-II evolved architecture. In purple, the L0 trigger is composed of the L0Calo and L0Muon sub-systems, the Global Trigger and the L0CTP. Readout is in green, dataflow in yellow and Event Filter in red. In light blue, the L1 trigger is composed of L1Track and L1CTP. The black dotted lines represent the L0 and L1 dataflow. The full black lines correspond to the readout dataflow at 1 MHz, while the dashed black lines represent the readout dataflow at 800 or 600 kHz [31].

In the evolved case, all parts of ATLAS would have been affected, with different solutions:

- The hardware trigger was split into a two-level hardware trigger system, where the HTT would have been the primary rate-reduction stage of L0, keeping the EF farm size affordable;
- A Region of Interest Engine (RoIE) was added to the Global Trigger in order to calculate the Regions of Interest (RoIs) depending on L0;
- The data from the ITk strip and ITk outer pixel layers were used in the Readout system to select the relevant data for L1Track. Similarly to the rHTT of the baseline system, the Readout system was reconfigured to be able to reconstruct tracks within 6 $\mu s$ of latency, but only for tracks with $p_T > 4$ GeV. It would have reconstructed tracks in the RoIs using the same hardware and firmware as the HTT components, but with an additional latency of 10 $\mu s$;
- Right after the track reconstruction, the resulting trigger objects were combined with calorimeter and muon-based trigger objects in the Global Trigger;
- The CTP formed the L1 decision.

In summary, the main differences between the evolved and baseline scenarios are the L1Track, the L1CTP and the shift of the rHTT from the EF to the L1Track. In this design, the hardware-based track reconstruction was planned to be implemented in the L1 trigger system, through the reconfiguration of a part of the HTT. The main characteristics of the trigger chain are therefore almost the same, except that in the evolved system, right after the L1, the whole data rate of 1 MHz is shared between the full and regional detector readout. The regional readout gets priority over the full readout, with detector readout rates of 200-400 kHz and 600-800 kHz, respectively.

## 3.2.3 Hardware Tracking for the Trigger

The Hardware Tracking for the Trigger (HTT) was the hardware-based system designed by the ATLAS collaboration to reconstruct tracks inside the TDAQ during HL-LHC running. The current decision of the new Event Filter Tracking (EFT) project, which was launched after HTT, is to commit to a different, commodity-based solution. This follows from the earlier choice, made within HTT, to create two task forces to investigate the technologies suggested for both the baseline and evolved scenarios.

For this reason, this section describes the HTT system for both options. The HTT is divided into rHTT (regional tracking), which finds tracks in Regions of Interest (RoIs), and gHTT (global tracking), which reconstructs tracks over the entire ITk coverage. For the baseline solution of Section 3.2.1, gHTT provides global tracking with $p_T > 1$ GeV, while rHTT gives regional tracking with $p_T > 2$ GeV. In the evolved scenario of Section 3.2.2, instead, the regional tracking is provided by the L1Track, for tracks with $p_T > 4$ GeV.

The HTT system performs the track reconstruction in two main steps. Initially, clusters from an eight-layer subset of the ITk layers are matched to predefined patterns using custom-designed AM ASICs, the same technology used in Run 1. This step is the same for both the regional and global reconstruction. In a second step, clusters which are compatible with particle tracks are sent to a fast track fitter implemented in FPGAs.

The output consists of candidate tracks together with the associated $\chi^2$ of the track fit. A second-stage fit is eventually performed by the gHTT system, using clusters from the rest of the ITk layers once they are recovered through the first-stage fit.

The latency requirement on the HTT depends on how long data can be buffered and how fast the ITk can be read out. In the baseline scenario there is no latency requirement on the HTT because data can be buffered in the EF for seconds. In the evolved trigger scheme, on the other hand, data have to be buffered in the readout electronics and the ITk has to be read out at a much higher rate. Based on these considerations, the latency requirement for the evolved option is 6 $\mu s$ on the regional tracking. This has important consequences for the technology chosen by HTT for track reconstruction and fitting, namely a hardware-based system made of AM ASICs and FPGAs. In fact, a system based on processors and GPUs has a longer latency and would require larger buffers, and thus more memory. Moreover, in hardware logic the silicon use is tailored to the specific application, resulting in a lower power consumption compared to general-purpose CPUs and GPUs.

Due to the risks connected to the development of a new ASIC, such as design deadlines, performance and power budget, a new solution was investigated for the evolved scenario as an alternative to the AM pattern matching technology: a cluster filtering method for track reconstruction based on the Hough transform algorithm. This method is described in more detail in the next chapter, since it is also under study as an option for the final EF tracking implementation.

#### Associative Memories

Associative Memory (AM) ASICs rely on a technology called content-addressable memory (CAM). The task of this kind of memory is to compare the input data with the stored data and return the address of the entry that best matches it (if one exists). The AM ASICs exploit this function to compare a set of ITk clusters with predefined track patterns. This pattern matching method for track reconstruction was developed for the HTT baseline scenario, and one of the goals of the Task Force dedicated to the optimised custom architecture was to re-optimize the pattern banks. The pattern matching is illustrated in Fig. 3.9, and a general idea of how it can be performed in the new ITk is presented in this section. The inputs of the HTT are hits from the eight ITk pixel layers; they first need to be grouped into clusters by a clustering algorithm and then converted into groups of consecutive ITk silicon strip or pixel channels, called "super strips". A track traversing the detector traces out a pattern formed by one super strip in each layer, each associated with a unique SuperStrip Identifier (SSID). A single pattern is therefore a sequence of eight SSIDs in different layers of the detector. In the hardware implementation, the AM ASICs store a large bank of pre-computed, simulated patterns called roads. These patterns represent physical regions of the detector defined by the physics events of interest. Each pattern in the bank is stored at an address in the memory, and the patterns are generated using large samples of simulated muon tracks with track parameters that cover the RoI. AM ASICs programmed with a pattern bank then match the patterns against the super strips, giving as output the address of the best matching pattern. CAMs often have so-called "don't care" bits, or ternary bits, used to merge two or more similar patterns into one, which frees up space to include more patterns and increases the track finding efficiency. "Don't care" bits match independently of the input.

For each matched hit pattern, the track parameters and the fit quality are computed in an FPGA from the corresponding full-resolution hits. While rHTT tracks obtained from the eight ITk layers are complete at this point, for gHTT each track candidate goes through a second stage of processing: the track candidate is extrapolated to the remaining layers, and a full track fit is then produced in a further stage. The AM chips use a subset of eight ITk layers to create a bank of patterns with eight super strips each. To cover an RoI of $0.2 \times 0.2$ in $\Delta \eta \times \Delta \phi$, approximately one million patterns are needed for tracks with $p_T > 2$ GeV.

As already mentioned in the previous section, in the baseline solution, the AM ASICs use a subset of eight layers of the ITk to construct a bank of patterns, hence in each pattern there will be eight super strips.

An example of how the pattern matching method works is shown in Fig. 3.9. Considering only four layers, adjacent strips or pixels are combined into super strips. For a particle traversing the detector, the track pattern consists of one super strip in each layer. A large set of simulated tracks thus allows a bank of patterns to be created. The AM chips try to match an input pattern, and the output they produce is the address of the stored pattern that best matches it.
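The matching logic described above can be mimicked in software with a few lines of Python. This is only a toy sketch under stated assumptions, not the AM ASIC logic: the names `to_superstrip`, `match_patterns` and `DONT_CARE`, the bank contents and the hit lists are all hypothetical, a whole-layer wildcard stands in for the ternary "don't care" bits, and the function returns every compatible pattern address rather than a single best match. The geometry follows the four-layer example, with six strips per layer grouped into three super strips.

```python
from typing import List, Optional, Sequence

DONT_CARE = None  # wildcard entry: matches whatever fired in that layer


def to_superstrip(strip: int, strips_per_superstrip: int = 2) -> int:
    """Map a strip index to the coarser super strip ID (SSID) containing it."""
    return strip // strips_per_superstrip


def match_patterns(
    hits_per_layer: Sequence[Sequence[int]],
    bank: Sequence[Sequence[Optional[int]]],
    strips_per_superstrip: int = 2,
) -> List[int]:
    """Return the bank addresses of all patterns compatible with the hits."""
    # Set of super strips that fired in each layer.
    fired = [
        {to_superstrip(s, strips_per_superstrip) for s in layer_hits}
        for layer_hits in hits_per_layer
    ]
    matched = []
    for address, pattern in enumerate(bank):
        # A pattern matches if every layer entry is a wildcard or has fired.
        if all(
            ssid is DONT_CARE or ssid in fired[layer]
            for layer, ssid in enumerate(pattern)
        ):
            matched.append(address)
    return matched


# Toy bank of five patterns over four layers (hypothetical values).
bank = [
    (0, 0, 1, 1),
    (0, 1, 1, 2),
    (1, 1, DONT_CARE, 2),
    (2, 2, 2, 2),
    (2, 1, 1, 0),
]

# One hit strip per layer: strips 0, 2, 3, 5 fall in super strips 0, 1, 1, 2.
hits = [[0], [2], [3], [5]]
roads = match_patterns(hits, bank)
print("Matched pattern addresses:", roads)
```

In hardware all stored patterns are compared with the input in parallel, whereas this sketch loops over the bank; the loop only illustrates the comparison that each memory location performs.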

Figure 3.9: Example of the AM ASIC pattern matching in four detector layers, each composed of a single module with six strips combined into three super strips; the pattern bank contains five possible patterns. The X symbols represent the "don't care" bits [32].

#### Track fitting

The clusters that pass the matching in the AM are subsequently sent to FPGAs where a track fitter is implemented. Track fitting is performed in two stages. Initially, the FPGA in the Pattern Recognition Mezzanine (PRM) takes the full-resolution hits from the selected roads. Then the track parameters, $p_T, \phi_0, \phi, \eta, d_0$ and $z_0$, and the $\chi^2$ of the fit are extracted. To compute the track parameters, the track fitter uses a linear interpolation:

$$p_i = \sum_{j=1}^{N} C_{ij} x_j + q_i, \tag{3.1}$$

where $x_j$ are the full-resolution local cluster coordinates and $C_{ij}$, $q_i$ are fit constants specific to each so-called sector, which consists of one module in each of the eight ITk layers. The constants of a sector are determined from a large sample of simulated muon tracks, which have the same parameter ranges and distributions as the ones used in generating the patterns. The goodness of the fit is evaluated through the linearized $\chi^2$ method:

$$\chi^2 = \sum_{i=1}^{N-5} \left(\sum_{j=1}^N A_{ij} x_j + k_i\right)^2, \tag{3.2}$$

where the additional constants $A_{ij}$ and $k_i$, obtained in the same way as the previous ones, are necessary for each sector. As already mentioned, in this linearized form the $\chi^2$ can be computed quickly with FPGA technology. Both the fitting constants and the constants used to compute the quality of the fit have to be stored in the internal FPGA memory, in external memories, or in both. Since the constants for the fit parameters and for the goodness of fit need to be retrieved from memory for each sector in the event, the fitting hardware could run into a bottleneck. To give a quantitative idea, fitting a region of $\eta \times \phi = 0.2 \times 0.2$ requires several thousand sectors. This corresponds to about forty million coefficients that need to be stored in external memories of the PRM or in the internal FPGA memory. A single fit is estimated to take 1 ns and the latency budget of L1Track can accommodate about 1000 fits; the number of fits is therefore an important quantity to study.
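Since Eq. 3.1 and Eq. 3.2 only involve multiplications and additions with precomputed constants, they can be evaluated with a few lines of code. The Python sketch below is a minimal illustration, not the HTT firmware: the function name `linear_track_fit` and all numerical constants are invented for the example, and a toy sector with $N = 3$ coordinates, two parameters and one $\chi^2$ term is used instead of the eight-layer case.

```python
from typing import List, Sequence, Tuple


def linear_track_fit(
    x: Sequence[float],
    C: Sequence[Sequence[float]],
    q: Sequence[float],
    A: Sequence[Sequence[float]],
    k: Sequence[float],
) -> Tuple[List[float], float]:
    """Evaluate Eq. 3.1 and Eq. 3.2 for one set of sector constants."""
    # Eq. 3.1: p_i = sum_j C_ij * x_j + q_i
    params = [
        sum(c_ij * x_j for c_ij, x_j in zip(row, x)) + q_i
        for row, q_i in zip(C, q)
    ]
    # Eq. 3.2: chi2 = sum_i (sum_j A_ij * x_j + k_i)^2
    chi2 = sum(
        (sum(a_ij * x_j for a_ij, x_j in zip(row, x)) + k_i) ** 2
        for row, k_i in zip(A, k)
    )
    return params, chi2


# Toy sector constants (hypothetical values, for illustration only).
C = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
q = [0.5, -1.0]
A = [[1.0, -2.0, 1.0]]  # vanishes for equally spaced coordinates
k = [0.0]

good_params, good_chi2 = linear_track_fit([1.0, 2.0, 3.0], C, q, A, k)
bad_params, bad_chi2 = linear_track_fit([1.0, 2.0, 4.0], C, q, A, k)
print(good_params, good_chi2)
print(bad_params, bad_chi2)
```

With these toy constants the first set of coordinates satisfies the constraint and gives a vanishing $\chi^2$, while the displaced third coordinate in the second set produces a non-zero value, which is what a $\chi^2$ cut would act on.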

In the second stage, for each track the FPGA in the Track-Fitting Mezzanine (TFM) calculates the five helix parameters and the $\chi^2$ using the same equations as in the first stage, Eq. 3.1 and Eq. 3.2. In addition to all the hits from the detector layers not used by the PRM, the TFM also receives the eight-layer tracks from six PRM cards. The TFM implements two functions: the *extrapolator*, which finds the hits on the additional silicon layers near the PRM track, and the *track fitter*, which fits the hits on the PRM track in combination with each hit on the other layers, applying a $\chi^2$ cut.

The final stage removes duplicates of the same track and eventually sends the surviving track candidates to the Event Filter.

## 3.2.4 A new baseline

The decision of ATLAS to exclude the possible evolution to an L0/L1 trigger system for the Phase-II upgrade has removed the low-latency requirement. This choice is due to the software tracking improvements, which result from both software and detector layout enhancements. Currently, the software tracking is seven times faster than expected in the ATLAS TDAQ TDR [31], and it can be a viable backup option. Three designs were considered after the exclusion of the evolved scenario: an optimised custom architecture, a heterogeneous commodity architecture and a software-based architecture for tracking. The first one exploits the relaxation of the latency requirements to simplify the hardware design. It re-optimizes the AM pattern banks and considers FPGAs as an alternative for pattern recognition. The software option, instead, builds on optimised dedicated software for the Alternative EF Tracking. It relies on already existing hardware, the EF CPU farm; additional costs and power, however, must be taken into account. The heterogeneous commodity architecture is a mixed commodity platform of classic processors and accelerators. It builds on the initial assessment of the Hough Transform on FPGAs by the Alternative EF Tracking.

The project for the selection of the option was launched in Spring 2021 and two ad-hoc Task Forces (TF) were formed: one for the optimised custom architecture and the other for the heterogeneous commodity architecture. The already existing organization for the software tracking was expected to continue to develop its option. The goal of each task force was the production of one engineered solution to prove the feasibility of its specific approach. Then, the different alternatives were compared in terms of technical feasibility, estimated tracking performance, operational procedures, opportunities for improvement, risks and resource requirements. The custom-based solution currently has no clear competitive advantage compared to a commercial solution. Conversely, it carries a significantly higher risk, which is inherent to all custom developments and systems. It has therefore been concluded that the tracking functionality in the Event Filter of the new baseline will be implemented on commercial hardware, and that the TDAQ should commit to a commodity-based solution for EF tracking at HL-LHC. To follow and evaluate commodity computing technologies, and to further develop and optimize efficient algorithms for commodity platforms (CPUs and accelerators), an ambitious program is being carried out by the Event Filter Tracking project. A variety of high-performance accelerator technologies, system architectures and implementation languages will thus be investigated. The results of these studies will contribute to the choice of the final EF tracking technology.

## Chapter 4

## Hough Transform for tracking

The architecture of the Trigger and Data Acquisition system of ATLAS Phase-II has passed through different designs since the initial Technical Design Report. Focusing on the implementation of the tracking functionality at the Event Filter level, the current choice converges towards a baseline significantly different from the initial plan. For this purpose, two new task forces were formed to investigate a variety of high-performance accelerator technologies and system architectures for hardware-based and software-based tracking implementations. As the final solution, a heterogeneous system relying on commercial hardware is being studied by the Event Filter Tracking project, which is currently assumed to perform track reconstruction with the Hough Transform algorithm.

## 4.1 TDAQ overview

The Trigger and Data Acquisition (TDAQ) system of ATLAS Phase-II [31] will be designed to cope with the integrated luminosity of up to $3000/4000 \ fb^{-1}$ that will be reached at the end of Run 4 and with the new ITk design. In addition to the physics goals mentioned in Section 3.1, the HL-LHC dataset will also provide an important opportunity to search for long-lived particles (LLPs) in final states expected by Beyond Standard Model (BSM) theories. The increased luminosity is relevant for these searches, since the corresponding background is low. However, LLP searches are often trigger-limited due to their unusual final states. Therefore, the Event Filter (EF) should provide large-radius tracking (LRT), which focuses on tracks with high impact parameters such as those resulting from the decays of LLPs. The requirements for EF tracking, as already mentioned in Section 3.2.1, are presented in the TDAQ TDR [31]: the EF shall be designed to work at a luminosity of $\mathcal{L} = 7.5 \times 10^{34} \ cm^{-2} \ s^{-1}$, with a maximum input rate of 1 MHz and a maximum output rate of 10 kHz. The final decision on how track reconstruction will be accomplished by the TDAQ system of ATLAS during Phase-II has not yet been made; the chosen system will be implemented and installed during Long Shutdown 3 (LS3).

The architecture of the upgraded TDAQ has passed through different designs: from the one declared both in the Letter of Intent in 2012 and in the Scoping Document in 2015, with two "custom-hardware" trigger levels, which allowed data streaming off-detector either after an initial trigger decision or at the full 40 MHz bunch crossing rate, to a baseline architecture with a single-level hardware trigger that features a maximum rate of 1 MHz and 10 $\mu s$ latency. Subsequently, the system was planned to evolve to a two-level L0/L1 trigger. The Hardware Tracking for the Trigger (HTT) system was initially chosen as the hardware-based technology for the Event Filter (EF) tracking in both scenarios. It was initially planned to exploit Associative Memory (AM) Application Specific Integrated Circuits (ASICs), the same technology used in Run 1. However, considering the risks connected to their development, a different solution was proposed, relying on a tracking method based on the implementation of the Hough Transform (HT) algorithm on Field Programmable Gate Arrays (FPGAs). The recent software tracking improvements have led to the rejection of the evolved L0/L1 scenario, which relaxes the latency constraints, but the implementation of the HT for tracking is still under consideration. The commodity-based solution for EF tracking now under study is a Central Processing Unit (CPU)-based or accelerator-based approach, which aims to perform track reconstruction with lower costs and higher performance. In particular, the heterogeneous commodity architecture exploits both technologies. The new Event Filter Tracking (EFT) project, the successor of the HTT, plans to implement the HT as the tracking algorithm, and its proposal is compatible with the heterogeneous architecture.

This is the solution that this thesis studies, with the purpose of contributing to the proposal for an engineered solution for EF tracking.

### 4.1.1 Commodity accelerators

The ATLAS Alternative EF Tracking Working Group presented in 2019 a promising alternative implementation of EF tracking using commercially-available FPGA Peripheral Component Interconnect Express (PCIe) cards housed in rack-mounted servers [32]. The heterogeneous systems integrate various computational units such as multi-core CPUs, GPUs and FPGAs to perform the required computations more quickly, thus with lower latency. In addition, with this kind of system, higher performance can be achieved with lower power consumption, to satisfy electrical power and cooling constraints, as well as rack-space limitations.

The application of FPGAs as server accelerators in data centers is relatively new compared to their 30-year history in the electronics industry. For server use, this technology is packaged as acceleration cards that connect to a PCIe slot in the motherboard of a server. These kinds of commercial accelerators are widely available. In general, CPU-based applications benefit from their use: latency decreases due to the high level of parallelism inherent in their architectures, and power consumption decreases as well. As already discussed, latency is not an issue for the EF tracking application, since the buffering capabilities in the EF are sufficient to effectively remove any real latency constraints from the system. On the other hand, power is a limited resource for the EF. Although the CPU-based EF is comfortably within the total power budget, the use of accelerators as part of the system can provide additional flexibility to expand the EF farm.

## 4.2 System overview and functional blocks

The new project following HTT describes the reconstruction flow in terms of the functional blocks of Fig. 4.1, and its goal is to demonstrate the feasibility of the system.

Figure 4.1: Functional blocks of the heterogeneous commodity system. In yellow, events will be processed with the FPGA firmware implementation, while in green the precision track fit would be implemented in software. The choice of the boundary between the FPGA and CPU may still evolve with further study [32].

The architecture is still under study in order to fully optimize the design, which will finally be compared with the other solutions in terms of performance. The input of the system is the data coming from the ITk, where each event is assumed to amount to 1.6 MB from the pixel detector and 0.5 MB from the strip detector. The total amount of data from the ITk is thus 2.1 MB per event. Lightweight CPU-based load-balancing software would route a particular (full) event to an available FPGA, whose firmware decodes the data of the entire event and clusters the hits in each layer. After this, the FPGA performs other functions to prefilter and/or gang the ITk hit clusters to form "spacepoints" and/or "stubs", followed by a pattern recognition step that analyzes a subset of hit clusters or spacepoints to identify track-like hit combinations. Finally, thanks to an initial track fit, the firmware is able to remove duplicate and fake tracks.

Pattern recognition is the most resource-intensive stage and, as already anticipated, the algorithm expected to be implemented is the HT. The output from the FPGA is a set of track candidates with the associated hit clusters, along with any additional ones. These data would be passed to a precision Kalman filter, developed for the fast ITk track reconstruction, for a final precision fit. The possibility of implementing an extrapolation and/or a second-stage track fit on the FPGA has not yet been explored in detail; it would bring both benefits and drawbacks. It would require additional resources, but the quality of the tracks passed to the CPU would improve. Hence, the purpose of the new EFT project, the current successor of HTT, is to analyze through additional studies the overall optimization of a heterogeneous system, in view of the proposal of a solution for the EF tracking system of ATLAS Phase-II.

## 4.3 Hough transform for particle tracking

This thesis aims to test the implementation of the HT algorithm on an FPGA device for the case study of the heterogeneous commodity system of ATLAS Phase-II. The HT for pattern recognition was initially introduced as an "almost plug-and-play" alternative to the AM ASIC solution within the evolved HTT structure. Since the storage of the pattern banks mentioned in Section 3.2.3 requires a huge amount of local memory, this alternative design was proposed and needs to be tested.

The Hough Transform [33] (HT) algorithm is a tracking method already used in particle physics in general pattern recognition. It was first described in 1959 and applied in the photographic analysis of bubble chambers plates [34]. Recently it is exploited in the high energy physics field, mostly because of the benefits that brings in the computational capacities of Graphics Processing Units (GPUs) and FPGAs [35] [36]. HT is used to extract lines, straight or curved, usually from digitized images, or in general from granular matrices. In Fig. 4.2 is shown a simple example of its application. This type of algorithm implementation in a FPGA device represents an advantage because it is suitable for the ATLAS tracking structures and also has a high level of parallelization. HT for the ATLAS tracker trigger aims to detect high-momentum tracks. The track of a charged particle in the transverse plane (x-y plane) of the ATLAS tracker has the shape of a circular arc described by the transverse momentum  $p_T$  and its initial angular direction  $\phi_0$ . The high hardware performances of such an algorithm, implemented on electronic boards, is tested in this thesis work. The advantage is that its latency time increases linearly with respect to the number of hits, while for combinatorial algorithms the latency time grows much more rapidly with the number of hits. In addition, the HT is much more tolerant to "missing" hits or hits which do not perfectly match for example with a given default



Figure 4.2: Example of Hough transform for the identification of circles. The green, red and blue dots, in the left plot, are transformed into the corresponding coloured lines in the parameter space of the right plot. The intersection of these lines gives the coordinates of the circle centre in the image.

pattern (because of limited resolution). Its implementation is expensive for a limited number of input hits, but it remains convenient because its performance improves significantly with a large set of hits.

## 4.3.1 Hough transform for circles

This section is dedicated to the derivation of the HT, useful in the physics context of this research: the motion of charged particles inside a magnetic field. In the presence of any electromagnetic field, charged particles are subject to the Lorentz force:

$$\mathbf{F} = q\mathbf{E} + q\mathbf{v} \times \mathbf{B},\tag{4.1}$$

where q is the electric charge of the particle,  $\mathbf{E}$  is the electric field,  $\mathbf{v}$  is the velocity of the particle and  $\mathbf{B}$  is the magnetic field. In the case of the ITk, the electric field can be considered negligible and the magnetic field is uniform and directed along z ( $\mathbf{B} = B\hat{\mathbf{z}}$ ). Choosing cylindrical coordinates so that  $\mathbf{v} = v\hat{\phi}$ , the Lorentz force becomes:

$$\mathbf{F} = qvB\hat{\mathbf{r}}.\tag{4.2}$$

Hence, if the momentum of the particle remains constant, its trajectory will be circular and the force can be described through a radial acceleration:

$$\mathbf{F} = \frac{p_T v}{r} \hat{\mathbf{r}},\tag{4.3}$$

where r is the radius of the circle and  $p_T$  is the transverse component of the relativistic momentum of the particle. Equating Eq. 4.2 and Eq. 4.3 and dropping the vector notation, the momentum can be written as  $p_T = qBr$  which, with q in units of the elementary charge e,  $p_T$  in GeV/c, B in T and r in m, becomes:

$$p_T = \frac{ceqBr}{e} \cdot 10^{-9} \approx 0.3qBr. \tag{4.4}$$

The equation above represents the relationship between the transverse momentum of a particle and the radius of the circle it traces in a uniform magnetic field. In order to reconstruct the track of the particle, it is necessary to determine all the possible radii of the circles which pass through at least two points (i.e. the hits that the tracker has registered). In cartesian coordinates, the equation of a circle is:

$$(x-a)^2 + (y-b)^2 = r^2, \tag{4.5}$$

where (a, b) are the coordinates of the center. Writing them in polar coordinates as  $(r\cos\theta, r\sin\theta)$ , evaluating the equation for a pair of points  $(x_1, y_1)$  and  $(x_2, y_2)$  and subtracting the two results gives:

$$r(\theta) = \frac{1}{2} \frac{(y_1^2 - y_2^2) + (x_1^2 - x_2^2)}{(y_1 - y_2)\sin\theta + (x_1 - x_2)\cos\theta}.\tag{4.6}$$

Eq. 4.6 can be simplified by solving for 1/r, using polar coordinates for the points  $(x_i, y_i) \rightarrow (r_i \cos\phi_i, r_i \sin\phi_i)$  and substituting  $\theta = \phi_0 + \frac{3\pi}{2}$ , where  $\phi_0$  is the azimuthal angle at the closest approach of the track to the beam line (i.e. the angle in the x-y plane of ATLAS with which the particle enters the tracker). Using trigonometric identities:

$$\frac{1}{2r(\phi_0)} = \frac{r_1 \sin(\phi_0 - \phi_1) - r_2 \sin(\phi_0 - \phi_2)}{r_1^2 - r_2^2}.\tag{4.7}$$

Recalling Eq.4.4, it is possible to obtain:

$$0.15 \frac{qB}{p_T} = \frac{r_1 \sin(\phi_0 - \phi_1) - r_2 \sin(\phi_0 - \phi_2)}{r_1^2 - r_2^2},\tag{4.8}$$

which is the equation used to parametrize  $p_T$  as a function of  $\phi_0$  in the Hough transform.

If this equation is applied to pairs of generic points, it is not constrained to any vertex. A vertex constraint can be imposed by fixing one of the two points at the origin (placing point 2 at the origin means imposing  $r_2 = 0$ ); keeping the coordinates of the remaining point, one obtains:

$$0.15 \frac{qB}{p_T} = \frac{r_1 \sin(\phi_0 - \phi_1)}{r_1^2}.\tag{4.9}$$
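As a numerical cross-check of the derivation, the following Python sketch builds two points on a circle that leaves the origin with direction  $\phi_0$  and verifies that the two-point expression of Eq. 4.7 and the vertex-constrained expression of Eq. 4.9 both return 1/(2r). The function names, the radius and the arc angles are illustrative assumptions and are not taken from the firmware or from the ATLAS software.

```python
import math


def pt_from_radius(radius_m, b_tesla=2.0, charge=1.0):
    """Eq. 4.4: p_T [GeV/c] ~ 0.3 * q [e] * B [T] * r [m]."""
    return 0.3 * charge * b_tesla * radius_m


def point_on_track(radius, phi0, arc):
    """Polar coordinates (r_i, phi_i) of a point on a circle of given radius
    that leaves the origin with direction phi0, after sweeping the angle
    `arc` around the centre. The centre sits at theta = phi0 + 3*pi/2."""
    theta = phi0 + 1.5 * math.pi
    cx, cy = radius * math.cos(theta), radius * math.sin(theta)
    beta = theta + math.pi - arc          # arc = 0 gives the origin
    x = cx + radius * math.cos(beta)
    y = cy + radius * math.sin(beta)
    return math.hypot(x, y), math.atan2(y, x)


def inv_2r_pair(p1, p2, phi0):
    """Right-hand side of Eq. 4.7 for two points in polar coordinates."""
    (r1, f1), (r2, f2) = p1, p2
    num = r1 * math.sin(phi0 - f1) - r2 * math.sin(phi0 - f2)
    return num / (r1 ** 2 - r2 ** 2)


def inv_2r_vertex(p, phi0):
    """Right-hand side of Eq. 4.9 (second point fixed at the origin)."""
    r1, f1 = p
    return math.sin(phi0 - f1) / r1


if __name__ == "__main__":
    R, PHI0 = 1.5, 0.4                    # hypothetical radius [m] and angle [rad]
    a = point_on_track(R, PHI0, 0.2)
    b = point_on_track(R, PHI0, 0.5)
    print(inv_2r_pair(a, b, PHI0), inv_2r_vertex(a, PHI0), 1 / (2 * R))
    print(pt_from_radius(R))
```

Both expressions agree with 1/(2r) up to floating-point rounding, which is the property the accumulator relies on when curves from the same track intersect.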

In the ATLAS experiment the magnitude of the magnetic field is B = 2 T and the radius is commonly measured in mm instead of m. Eq. 4.9 then becomes:

$$\frac{qA}{p_t} = \frac{\sin(\phi_0 - \phi)}{r},\tag{4.10}$$

where  $A = 3 \times 10^{-4} \ GeV \ mm^{-1} \ c^{-1} \ e^{-1}$  and the index 1 has been dropped. The equation can be simplified further for small values of  $\phi_0 - \phi$  using the first-order Taylor expansion  $\sin(\phi_0 - \phi) \approx \phi_0 - \phi$ .
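A minimal Python sketch of Eq. 4.10 is given below. It evaluates the curve of a cluster in the  $(\phi_0, qA/p_T)$  space and compares the exact expression with its first-order approximation. The radii, the track parameters and the helper *cluster\_on\_track* are hypothetical and are only used to generate self-consistent inputs; they do not reproduce the ITk geometry.

```python
import math

A = 3e-4  # GeV mm^-1 c^-1 e^-1, valid for B = 2 T and r in mm (Eq. 4.10)


def hough_qapt(r_mm, phi, phi0, linear=False):
    """Value of qA/p_T given by Eq. 4.10 for a cluster (r, phi) at phi0.
    With linear=True the sine is replaced by its argument."""
    delta = phi0 - phi
    return (delta if linear else math.sin(delta)) / r_mm


def cluster_on_track(r_mm, pt_gev, phi0, charge=1):
    """Azimuth of the cluster left at radius r by a track (p_T, phi0)
    coming from the origin, obtained by inverting Eq. 4.10."""
    return phi0 - math.asin(charge * A * r_mm / pt_gev)


if __name__ == "__main__":
    PT, PHI0 = 1.0, 0.4                       # hypothetical track: 1 GeV/c
    for r in (300.0, 500.0, 700.0, 1000.0):   # illustrative radii in mm
        phi = cluster_on_track(r, PT, PHI0)
        exact = hough_qapt(r, phi, PHI0)
        approx = hough_qapt(r, phi, PHI0, linear=True)
        print(r, exact, approx, approx / exact - 1.0)
```

For every radius the exact curve passes through the same point  $(\phi_0, qA/p_T)$ , while the linear form overestimates  $qA/p_T$  by an amount that grows with r and decreases with  $p_T$ .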



Figure 4.3: The left figure represents one quadrant of the transverse plane of the tracker. Black dots are clusters along the signal track and black crosses are those not associated with it in the range  $0.3 < \phi_0 < 0.5$ .  $\phi_0$  present in the figure is the azimuthal angle of the track. The accumulator is the figure on the right and contains the transformed line of the cluster in the transverse plane.

## 4.3.2 Implementation of the HT

The feasibility of a hardware implementation of the HT algorithm was studied by Mikael Mårtensson in his PhD thesis [36]. Through the Hough transform of Eq. 4.10, a cluster position, corresponding to a point  $(r, \phi)$ , is turned into a curve in the track parameter space spanned by  $qA/p_T$  and  $\phi_0$ , commonly known as the accumulator. Fig. 4.3 illustrates the application of the algorithm to the track of a particle. Each gray arc represents a layer of the tracker. In each layer it is possible to distinguish the so-called clusters: black dots are clusters aligned along a signal track, while black crosses are those not associated with any particular track. A cluster corresponds to a track hit reconstructed from the charge distribution in the "hit" pixels, which gives a higher spatial precision than a hit that is simply registered by several adjacent pixels. In practice, clusters coming from the same track are transformed into lines via the HT, and their intersection in the accumulator represents a possible match for a particle track with the corresponding parameters. Clusters not coming from the same track, or coming from a track outside the parameter range considered, form randomly crossing curves in the accumulator. This procedure has to include the whole set of operating tracking layers (i.e. a parameter space is built for each set of tracking layers), which means that track candidates have to be found by matching clusters across the different layers. To clarify this concept, the following section gives a more detailed description of how the accumulator works.

## 4.3.3 The accumulator

The accumulator has a central role in the HT and its implementation, because the whole process can be reduced to the creation of this object. The main operations performed on it are its filling and the selection of track candidates from it. The accumulator is implemented as a two-dimensional histogram in which each bin contains a boolean value for each discrete detector layer and a list of all clusters going through that bin. The boolean is *true* if one or more clusters in that detector layer go through the bin and *false* otherwise. These booleans are referred to as layer bits.
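The bin content described above can be sketched in Python as follows. This is only a toy data structure under stated assumptions: the class name, the use of an integer bitmask for the layer bits and the dictionary storage are illustrative choices and do not describe the firmware implementation.

```python
class Accumulator:
    """Toy two-dimensional accumulator. Each bin stores the layer bits
    (one bit per detector layer) and the list of clusters crossing it."""

    def __init__(self, n_phi0, n_qapt):
        self.n_phi0, self.n_qapt = n_phi0, n_qapt
        self.layer_bits = {}   # (i, j) -> integer bitmask of layers
        self.clusters = {}     # (i, j) -> list of cluster identifiers

    def add(self, i, j, layer, cluster_id):
        """Register that a cluster of a given layer goes through bin (i, j)."""
        if not (0 <= i < self.n_phi0 and 0 <= j < self.n_qapt):
            raise IndexError("bin outside the accumulator")
        self.layer_bits[(i, j)] = self.layer_bits.get((i, j), 0) | (1 << layer)
        self.clusters.setdefault((i, j), []).append(cluster_id)

    def layer_count(self, i, j):
        """Number of layer bits set to true in bin (i, j)."""
        return bin(self.layer_bits.get((i, j), 0)).count("1")

    def candidates(self, threshold):
        """Bins whose layer count reaches the threshold."""
        return [b for b in self.layer_bits if self.layer_count(*b) >= threshold]


if __name__ == "__main__":
    acc = Accumulator(4, 4)
    acc.add(1, 2, layer=0, cluster_id=10)
    acc.add(1, 2, layer=0, cluster_id=11)   # same layer: bit already set
    acc.add(1, 2, layer=3, cluster_id=12)
    print(acc.layer_count(1, 2), acc.clusters[(1, 2)])
```

Two clusters from the same layer leave the layer count unchanged but are both kept in the cluster list, which is what allows the clusters of a selected bin to be retrieved later.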

Fig. 4.4 shows an example of an accumulator filled with the clusters registered in the whole set of tracking layers for a single muon event. The count of layer bits set to *true* in a bin, which represents the number of layers with clusters in it, is important because it gives the number of parametrized curves intersecting there. In practice, finding a good track candidate means applying a threshold on the count of the layer bits and taking the intersection point of the parametrized curves. In the example plot, the yellow area represents the region of the parameter space in which the highest number of layers have registered a cluster compatible with the same hypothetical track. However, with the inclusion of pile-up, it is harder to discriminate the muon signal from the background. This is shown in Fig. 4.5, where the same muon signal was embedded in minimum bias events with 200 proton-proton interactions. The huge number of overlapping clusters makes it harder for a threshold-based decision to find the matches that correspond to real track candidates.

This happens because the RoI considered is small in  $\Delta \eta_0 \times \Delta \phi_0$ , but wide in z, since the longitudinal spread of the interacting proton beams is 300 mm. To improve this situation, the RoI can be split into smaller slices along the z axis, as shown in Fig. 4.6. The occupancy can be reduced with the inclusion of



Figure 4.4: Accumulator filled with the clusters from a single muon track. The intensity of the color represents the number of layers with a cluster in the bin.



Figure 4.5: The clusters from a single muon track and minimum bias event, which correspond to 200 proton-proton interactions, filled the accumulator. As can be seen, it is difficult to discriminate track signals within different layers.

accumulators for each added slice.

The splitting boundaries in the z-r plane are represented with:

$$z_{\min,n}(r) = z_0 + n\Delta z + r \sinh \eta_{\min}, \tag{4.11}$$

$$z_{\max,n}(r) = z_0 + (n+1)\Delta z + r \sinh \eta_{\max}, \tag{4.12}$$

where n = 0, 1, ..., N − 1 is the split index,  $[\eta_{\min}, \eta_{\max}]$  is the  $\eta$  range,  $[z_0, z_0 + N\Delta z]$  is the z range and r is the radial coordinate of the detector layer. This splitting technique improves the track-finding efficiency and the rejection of unwanted clusters, and it is well suited to a parallel environment such as an FPGA. However, since nearby splits overlap, the same hit can show up in multiple accumulators, and this has to be taken into account. Finally, a selection of bins in the accumulator can be seen as a track candidate with a set of clusters associated with it. The criterion used is to select single bins in which the number of clusters in unique layers is above a given threshold. Another simple option is to apply a second threshold to the neighboring bins, in order to cross-check for the presence of clusters in the adjacent bins, while still selecting the central bin. Several possibilities for the selection are still under test. A more elaborate selection could potentially give a better signal efficiency and background rejection, but the system may still change before the final implementation.
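The splitting of Eqs. 4.11 and 4.12 can be illustrated with a short Python sketch. It assumes N slices indexed n = 0, ..., N − 1; the numerical values of  $z_0$ ,  $\Delta z$  and of the  $\eta$  range are placeholders chosen only to cover a 300 mm wide region, and the function names are hypothetical.

```python
import math


def slice_bounds(n, r, z0, dz, eta_min, eta_max):
    """Boundaries (z_min, z_max) of slice n at radius r, Eqs. 4.11 and 4.12."""
    z_min = z0 + n * dz + r * math.sinh(eta_min)
    z_max = z0 + (n + 1) * dz + r * math.sinh(eta_max)
    return z_min, z_max


def slices_for_hit(r, z, z0, dz, n_slices, eta_min, eta_max):
    """Indices of all slices whose boundaries contain the hit (r, z)."""
    selected = []
    for n in range(n_slices):
        z_min, z_max = slice_bounds(n, r, z0, dz, eta_min, eta_max)
        if z_min <= z <= z_max:
            selected.append(n)
    return selected


if __name__ == "__main__":
    # Placeholder configuration: 6 slices of 50 mm starting at z0 = -150 mm.
    cfg = dict(z0=-150.0, dz=50.0, n_slices=6, eta_min=0.1, eta_max=0.3)
    print(slices_for_hit(r=0.0, z=10.0, **cfg))    # on the beam line
    print(slices_for_hit(r=300.0, z=0.0, **cfg))   # at a larger radius
```

Because  $\sinh \eta_{\max} > \sinh \eta_{\min}$ , the slices widen with r, so a hit at large radius is generally assigned to more than one accumulator.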

The results of this final selection are subsequently transmitted to the global trigger processor in the DAQ chain. Finally, upon the L1 acceptance of the signal, the Global Trigger transmits the dataflow to the readout sub-system.

The application of the HT algorithm in this thesis work is related to the ATLAS performance.

Fig.4.7 shows the accumulator with the binning needed for one RoI. Each bin stores the information of the clusters coming from different layers. The search of a road is made by checking 5 bins along  $\phi_0$  concurrently: the central bin must have 8 clusters from the 8 layers, the left and right bins at least seven, the left-left and right-right at least six and all the clusters for a specific bin must come from different layers. For example, if three



Figure 4.6: Shape of the region in which the RoI is split along z, to reduce occupancy in the Hough accumulators. The boundaries are defined by Eqs. 4.11 and 4.12. Since nearby splits can overlap, the same hit can show up in multiple accumulators.



Figure 4.7: Representation of the bin constraints required to declare a bin of the accumulator a road candidate. The five counters must increase only if clusters from different layers come across that bin [37].

clusters come from the layers 0, 1 and 2 and three clusters all come from layer 4, their total number is four and not six, because the second and third clusters from layer 4 are not counted. This feature is necessary to improve signal efficiency and background rejection. The number of bins of the accumulator, along with the five-bin road search, are the features required to implement the road extraction and make it comparable with the physics performance of the AM ASIC solution.
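The five-bin road condition and the unique-layer counting can be sketched in Python as follows. The thresholds (8, 7, 6) are those quoted in the text; the function names and the choice of rejecting bins closer than two positions to the edge of the row are assumptions made for this illustration.

```python
def unique_layers(cluster_layers):
    """Number of distinct layers among the clusters of one bin."""
    mask = 0
    for layer in cluster_layers:
        mask |= 1 << layer            # repeated layers set the same bit
    return bin(mask).count("1")


def is_road(counts, i, centre=8, near=7, far=6):
    """Five-bin condition along phi0 for bin i of one qA/p_T row.
    `counts` holds the unique-layer count of each phi0 bin."""
    if i < 2 or i > len(counts) - 3:  # assumption: no roads at the edges
        return False
    return (counts[i] >= centre
            and counts[i - 1] >= near and counts[i + 1] >= near
            and counts[i - 2] >= far and counts[i + 2] >= far)


if __name__ == "__main__":
    # Example from the text: layers 0, 1, 2 and three clusters from layer 4.
    print(unique_layers([0, 1, 2, 4, 4, 4]))
    print(is_road([6, 7, 8, 7, 6], 2), is_road([6, 7, 8, 6, 6], 2))
```

Counting layers through a bitmask makes the duplicate suppression automatic, since a second cluster from an already counted layer does not change the mask.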

## Chapter 5

# Demonstrator development for HT algorithm

The electronic group of the University of Bologna is developing a firmware design that exploits the Hough Transform (HT) algorithm as the tracking method at the Event Filter of the TDAQ of ATLAS Phase-II. The purpose of this Master Thesis is to create the first hardware demonstrator able to test the feasibility of this firmware design. To achieve this, a firmware structure that relies on the PCIe protocol for data transmission has been developed. The final goal is to create an environment compatible with the ATLAS Trigger and Data Acquisition system, exploiting test-vectors provided by the ATLAS software. The data used to perform preliminary tests on the ongoing HT firmware describe the clusters from the eight layers of the Inner Tracker. In this chapter, the most important results are discussed.

## 5.1 Hardware framework of Bologna

In the previous chapters an overview of the topics of this dissertation was presented, from the general HL-LHC plans down to the TDAQ upgrade of ATLAS Phase-II. In more detail, different options for the new TDAQ particle tracking subsystem are under evaluation in terms of cost and efficiency, and this work is related to the commercial hardware proposal of the Event Filter Tracking (EFT) regarding the trigger



Figure 5.1: Scheme of the whole chain for the algorithm functioning test. Inside the red dashed line there is the frame of the Bologna group. The HT hardware results must be compared with those of the simulations of the ATLAS software for the generation of roads [38].

online. The idea at the basis of this option is to exploit the Hough Transform (HT) algorithm as the tracking method to reconstruct roads and to implement it on hardware accelerators. This is the solution that the electronic group of the University of Bologna is testing, developing a firmware design for the HT able to perform road reconstruction from externally received cluster data. The data used are test-vectors that can be either generated by a simulation software or taken from a simulated physics data file produced by the ATLAS "official" software. To verify the correctness of the hardware algorithm, once the preliminary work is complete, the results of the system developed in Bologna will have to be compared with those of the software provided by the ATLAS collaboration.

Fig. 5.1 illustrates the framework of this project: the whole structure foresees clusters and roads provided by the ATLAS collaboration. In practice, the correct functioning of the data transmission and of the hardware tracking algorithm can be tested by comparing the extracted roads with the simulated roads file (based on theoretical predictions made with the simulated clusters file).



Figure 5.2: Block diagram of the University of Bologna HT firmware [32].

### 5.1.1 Hardware HT algorithm

An overview of the HT firmware design [38] developed by the electronic group of Bologna is given in this section. The main block diagram of the firmware structure that will be implemented on an FPGA device is shown in Fig. 5.2, with the macro components embedded inside the architecture. The blocks into which the firmware is divided are optimized to process the highest number of tasks in parallel. In detail, the input of the firmware consists of 8 clusters ( $r_i$ ,  $\phi_i$  with i = 1, 2, ..., 8), one from each layer of the ITk, and the output consists of sixteen clusters, one per road, according to the road parallelization. The clock frequency for the logic should be 250 MHz and the firmware was initially designed to respect a latency of 175 ns, where the latency is defined as the time from the last input cluster to the first output cluster of a road. Therefore the 8-cluster input at each 250 MHz clock cycle must be  $8 \times (12 + 16)$  bits wide. To transform a point in the ATLAS coordinate system into a line in the HT coordinate system, clusters must be processed according to the HT formulas:

$$\frac{qA}{p_T} = \frac{\phi_0 - \phi}{r},\tag{5.1}$$

$$\phi_0 = \phi + \left(\frac{qA}{p_T}r\right). \tag{5.2}$$

The choice of the formula to apply to the input clusters depends on the granularity of the HT parameter space  $(qA/p_T, \phi_0)$ . In more detail, a discriminating value  $r_{threshold} = \Delta \phi_0/(\Delta qA/p_T)$  is defined, where  $\Delta \phi_0$  and  $\Delta qA/p_T$  are the bin widths: for each layer, if the radius r of the cluster is larger (smaller) than  $r_{threshold}$ , the formula used is Eq. 5.1 (Eq. 5.2). This is due to the 1/r (or r) slope of the line built with the HT. If  $r_{threshold}$  lies completely outside the radial range of every ITk layer, a fixed formula can be applied to each layer in the firmware and no online discrimination is necessary; if instead it lies within the range of a layer, that layer requires the discrimination.
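The selection rule above can be written as a short Python sketch. The bin widths and the layer radii used in the example are placeholders, and the function names are hypothetical; the sketch only reproduces the comparison with  $\Delta \phi_0/(\Delta qA/p_T)$  described in the text.

```python
def r_threshold(dphi0, dqapt):
    """Discriminating radius: ratio of the phi0 and qA/p_T bin widths."""
    return dphi0 / dqapt


def formula_for_cluster(r, dphi0, dqapt):
    """Return "5.1" when qA/p_T is computed for each phi0 bin (large r),
    "5.2" when phi0 is computed instead (small r)."""
    return "5.1" if r > r_threshold(dphi0, dqapt) else "5.2"


def layer_needs_online_choice(r_inner, r_outer, dphi0, dqapt):
    """True if the threshold falls inside the radial range of a layer,
    so that the formula must be chosen cluster by cluster."""
    return r_inner < r_threshold(dphi0, dqapt) < r_outer


if __name__ == "__main__":
    DPHI0, DQAPT = 1e-3, 1e-5     # placeholder bin widths (threshold ~100 mm)
    print(formula_for_cluster(300.0, DPHI0, DQAPT))
    print(layer_needs_online_choice(80.0, 120.0, DPHI0, DQAPT))
    print(layer_needs_online_choice(200.0, 400.0, DPHI0, DQAPT))
```

When no layer returns true for *layer\_needs\_online\_choice*, the formula can be fixed per layer at design time, which is the case considered in the rest of this section.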

From now on, the case considered is the one in which the formula used for all the eight layers is  $\frac{qA}{p_T} = \frac{\phi_0 - \phi}{r}$ , i.e.  $r > r_{threshold} = \Delta \phi_0 / (\Delta qA/p_T)$  for every layer. The process starts with the calculation of the  $qA/p_T$  values for all the  $\phi_0$  bins in all ITk layers in parallel, with one cluster per layer arriving as input at each clock cycle. The results are then used in parallel to update the accumulator. These operations are pipelined until all the clusters of an event have been processed. As an example, supposing an accumulator of 1200 × 64 bins ( $\phi_0$  and  $\frac{qA}{p_T}$  respectively), 1200 calculations are performed for each input cluster from the eight ITk layers, one for each  $\phi_0$  bin of the accumulator. Thus, at every clock period the accumulator must be updated with  $1200 \times 8 = 9600$  new values, which represent the points forming the eight lines (one per cluster). Considering for example an input event made of 500 sets of eight clusters, the system must load 4000 clusters distributed over the eight layers. When this operation is finished, the event converted into the Hough space is loaded in the accumulator; the system latency count then starts, together with the road search, which consists of applying the procedure shown in Fig. 4.7. For each central bin that satisfies it, the "road candidate" parameters  $\phi_{0,road}$  and  $qA/p_{T,road}$  are extracted and the process of searching back which input clusters have contributed to that road starts. The latter can be accomplished by running the Hough formula again and testing,



Figure 5.3: Scheme of the HT firmware logic [38].

for all the 4000 input clusters with  $(r, \phi)$  parameters, which ones have generated the  $qA/p_{t,road}$  given the  $\phi_{0,road}$ . In the end, clusters per road must be sent as output, as already explained, in parallel on up to 16 high-speed lanes. The design of this specific case, divided into several blocks, is shown in Fig. 5.3.
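The filling and the back-search described above can be emulated in software with the following Python sketch. It is a serial toy model, not the parallel firmware: the grid is much smaller than the 1200 × 64 example, the binning parameters and the cluster format (layer, r,  $\phi$ ) are assumptions, and Eq. 5.1 is applied to all layers.

```python
def qapt_bin(r, phi, phi0, qapt_min, qapt_step):
    """Index of the qA/p_T bin given by Eq. 5.1 for one cluster and one phi0."""
    return int(((phi0 - phi) / r - qapt_min) // qapt_step)


def fill_accumulator(clusters, phi0_centres, qapt_min, qapt_step, n_qapt):
    """Apply Eq. 5.1 to every cluster for every phi0 bin and set the
    corresponding layer bit. Returns {(i_phi0, j_qapt): layer bitmask}."""
    acc = {}
    for layer, r, phi in clusters:
        for i, phi0 in enumerate(phi0_centres):
            j = qapt_bin(r, phi, phi0, qapt_min, qapt_step)
            if 0 <= j < n_qapt:
                acc[(i, j)] = acc.get((i, j), 0) | (1 << layer)
    return acc


def clusters_of_road(clusters, phi0_road, j_road, qapt_min, qapt_step):
    """Back-search: re-run Eq. 5.1 at phi0_road for every input cluster
    and keep those that fall in the qA/p_T bin of the road."""
    return [c for c in clusters
            if qapt_bin(c[1], c[2], phi0_road, qapt_min, qapt_step) == j_road]


if __name__ == "__main__":
    phi0_centres = [0.3 + 0.002 * k for k in range(101)]   # placeholder grid
    qmin, qstep, nq = -6.4e-4, 2e-5, 64
    # One hypothetical track (phi0 = 0.4, qA/p_T = 2.1e-4) plus one noise cluster.
    track = [(l, 300.0 + 100.0 * l, 0.4 - 2.1e-4 * (300.0 + 100.0 * l))
             for l in range(8)]
    clusters = track + [(3, 600.0, 0.1)]
    acc = fill_accumulator(clusters, phi0_centres, qmin, qstep, nq)
    best = max(acc, key=lambda b: bin(acc[b]).count("1"))
    print(best, bin(acc[best]))
    print(len(clusters_of_road(clusters, phi0_centres[best[0]], best[1], qmin, qstep)))
```

In the firmware the inner loop over the  $\phi_0$  bins is executed in parallel in a single clock cycle, whereas here it is an ordinary loop; the arithmetic per bin is the same.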

## 5.2 Hardware device for data transmission

In the off-detector region, the first element of the TDAQ of ATLAS Phase-II is the custom hardware board developed by the FELIX group. This device will receive the on-detector signals arriving over optical fibers and transfer them to several components of the TDAQ until they reach the Event Filter. According to the proposal of the electronic group of the University of Bologna for the current Event Filter Tracking R&D activities, data will be processed at this level by a physical board running the HT algorithm, implemented according to the design of Section 5.1.1. The data to be processed by the FPGA are sent from a CPU, and the final information produced is transmitted back to the same CPU, which transfers it over a high-bandwidth link to other CPU farms for further processing. My personal work consisted of the creation of a firmware architecture, compatible with the ATLAS TDAQ, which emulates data transmission and processing in this part of the TDAQ chain. Test-vectors and output results used for testing the HT firmware feasibility are sent in both directions between a CPU and the FPGA device, in which the HT is implemented, through



Figure 5.4: Scheme of the dataflow of the demonstrator. Test vectors representing the simulated physical layers of the ITk detector are sent through a PCIe standard bus from a CPU to the FPGA in which HT is implemented and go back to the CPU once processed. Red lines represent the output while blue lines the input.

the PCIe protocol [39]. This technology is fundamental, since it provides a high bandwidth together with reliable data transmission. A general scheme of the dataflow of the realized demonstrator is presented in Fig. 5.4.

The PCIe firmware structure used to perform the tests is a custom core called Wupper [40], developed by the FELIX group [41]. It is based on Direct Memory Access (DMA) and uses a user-logic FIFO as buffer memory. Wupper supports specific FPGAs; one of these is the Virtex-7 on the VC709 board [42], which is the one used in this thesis. Since data must be processed, it is necessary to build a structure on top of the PCIe firmware architecture that supervises the datastream in both directions according to the needs of the HT firmware design. In particular, the constraints concern the control of data loss at the input, due to the processing time of the algorithm, and the control of the correctness of the output data. The structure was realized with the Vivado® tool in the VHDL language and consists of two First-In-First-Out (FIFO) memories placed before and after the HT and integrated with the Wupper core. They provide the additional storage needed to avoid losing data while, for instance, the algorithm is still processing and cannot accept input clusters.

Therefore, the work performed in this Master Thesis also gave the opportunity to



Figure 5.5: Experimental set-up at the laboratories of the University of Bologna.

include a new structure in the already existing HT firmware, in order to adapt it to a PCIe firmware. Even if the set-up used will not be part of the EF of the ATLAS TDAQ Phase-II, the realized structure will be incorporated into the final architecture thanks to its generality and flexibility (by simply changing the constraints connected to the board used). Moreover, the tests performed with the initial set-up will make it possible to confirm the feasibility of the proposal and to obtain an initial estimate of the resources used, before the implementation in the future EF hardware.

#### 5.2.1 Experimental Set-up

One of the boards supported by *Wupper* is the VC709. This is the evaluation board for the Virtex®-7 FPGA and it provides a hardware environment for developing and evaluating designs targeting the Virtex-7 XC7VX690T-2FFG1761C FPGA. It offers several features common to many embedded processing systems, such as dual DDR3 small outline dual-inline memory module (SODIMM) memories, an 8-lane PCIe interface, general purpose I/O and a UART interface. A complete description of the board features can be found in Ref. [42]. An important characteristic is that the Virtex-7 FPGA on the VC709 is built on the 28 nm technology node, which identifies the manufacturing process of the transistors.

Fig. 5.5 shows the set-up used for the tests, which exploits the Test Stand PC used for the FELIX environment.

#### 5.2.2 Data transmission through PCIe

The Peripheral Component Interconnect Express (PCIe) [39] is a high-speed serial computer expansion bus standard, able to transfer large amounts of data at high speed. PCIe ensures reliable data transmission between devices. For this work a PCIe Gen 3 link with 8 lanes is selected, which allows a bandwidth of 985 MB/s per lane [43]. Its 128b/130b encoding scheme has a lower overhead, and therefore a higher payload fraction, than the 8b/10b encoding of the previous PCIe generations. The Wupper core is a firmware module built on the PCIe structure. It relies on Direct Memory Access (DMA) and uses a user-logic FIFO as buffer memory. Wupper provides an interface to a standard FIFO and runs at 250 MHz. In particular, the core is designed for the 256 and 512 bit wide AXI4-Stream interface [44] of specific FPGAs and boards, and can use different generations of PCIe (3 or 4). In this thesis, Wupper is used to realize the hardware demonstrator, which mainly exploits the "loopback" function to send data from the CPU to the FPGA and back to the CPU. The software tool used to perform this function is called *dma\_transfer*. However, two modifications are mandatory for the purposes of this thesis: test-vectors must be used as the transmitted data and, before going back to the CPU for subsequent analysis in the TDAQ chain, the data must be processed by the HT. For the first modification, the software tool provided with Wupper was modified to take as input the file containing the test-vectors, named *TestVectors.txt*. For the second modification, a structure which avoids losing input cluster data was built.
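The per-lane bandwidth quoted above can be reproduced with a few lines of Python. The sketch assumes the PCIe Gen 3 line rate of 8 GT/s per lane and counts only the 128b/130b encoding overhead; protocol overheads such as packet headers are ignored, so the numbers are upper limits. The variable and function names are illustrative.

```python
def pcie_lane_bandwidth(transfers_per_s=8e9, payload_bits=128, block_bits=130):
    """Usable bytes per second on one lane, counting only the line encoding."""
    return transfers_per_s * payload_bits / block_bits / 8


lane_bytes_per_s = pcie_lane_bandwidth()
link_bytes_per_s = 8 * lane_bytes_per_s          # 8-lane link of the VC709

# Throughput of a 256-bit wide FIFO interface clocked at 250 MHz.
fifo_bytes_per_s = 256 / 8 * 250e6

if __name__ == "__main__":
    print(f"per lane : {lane_bytes_per_s / 1e6:.1f} MB/s")
    print(f"8 lanes  : {link_bytes_per_s / 1e9:.2f} GB/s")
    print(f"FIFO side: {fifo_bytes_per_s / 1e9:.2f} GB/s")
```

The 256-bit interface at 250 MHz is thus of the same order as the 8-lane link, which is consistent with the choice of this word width for the Gen 3 configuration.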

The two FIFOs of Wupper are called *FromHostFIFO* and *ToHostFIFO* and, as their names suggest, the datastream direction for "loopback" is from *FromHostFIFO* to *ToHostFIFO*. Because of the way they are built, a correct datastream requires that, once they are filled, the data are read immediately, so that both of them are kept empty. This is a constraint under which the HT firmware developed up to now cannot work correctly, since it would lose some data while processing. Moreover, the output data of the HT are not always valid. These issues can be fixed by the addition of two FIFOs.

#### Creation of the two-FIFO structure

From the preliminary studies reported in Section 5.3, and considering how the ToHostFIFO and the FromHostFIFO are implemented inside Wupper, a correct datastream requires each of the two FIFOs to be emptied as soon as it is filled with a 256-bit word. This is done at each clock cycle until the end of the dataflow. It is therefore not possible to control their *read* enable flags, since doing so causes software issues. In addition, the HT described in Section 5.1.1 requires a variable processing time, which depends on the data used, and during this time no further data can be accepted. The HT firmware also needs to store the output data in order to perform new functions that will be integrated in the future. For these reasons the datastream must be controlled by exploiting the *HT\_valid* flag of the HT firmware of Section 5.1.1, which signals the input acceptance by the HT firmware. Moreover, not all output data are valid, so another control flag, *HT\_outdata\_valid*, is used by the firmware. Due to these constraints, a two-FIFO structure was developed and placed between *FromHostFIFO* and *ToHostFIFO*. In this way data can be stored before and after the HT processing, and the required task is fulfilled with high performance. The final structure of all the FIFOs is presented in Fig. 5.6 in terms of datastream and in Fig. 5.7 and Fig. 5.8 in terms of control flags, where the two new FIFOs are called FIRST\_FIFO and SECOND\_FIFO.

In general, in order to have a correct datastream (neither duplicating nor losing data), a FIFO must not be read when it is empty and must not be written when it is full. Applied to the two new FIFOs (without the HT and the Wupper FIFO constraints), this means that FIRST\_FIFO:

- must not be read when it is empty or when SECOND\_FIFO is full;
- must not be written when it is full or when FromHostFIFO is empty;

while SECOND\_FIFO:

- must not be read when it is empty or when ToHostFIFO is full;



Figure 5.6: Scheme of the final FIFOs structure reproduced with the focus on the dataflow. As the names of the Wupper FIFOs suggest, the datastream goes from FromHostFIFO to ToHostFIFO.



Figure 5.7: Scheme of the final FIFOs structure, without the HT integration, reproduced with the focus on the most important flags to control. Constraints on the FromHostFIFO and ToHostFIFO are considered.



Figure 5.8: Scheme of the final FIFOs structure, with the HT integration, reproduced with the focus on the most important flags to control. Constraints on the FromHostFIFO and ToHostFIFO are considered.

- must not be written when it is full or when FIRST\_FIFO is empty.

A better view of these behaviors is shown in the preliminary test of Fig. 5.10.
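The read and write rules listed above can be checked with a small behavioural model in Python. This is not the VHDL of the demonstrator: the FIFO depth, the periodic stall of the receiving side and all the names are assumptions, the HT stage is omitted and the Wupper constraint on the immediate reading of its FIFOs is not modelled.

```python
from collections import deque


class Fifo:
    """Minimal FIFO with the `empty` and `full` status flags."""

    def __init__(self, depth):
        self.depth = depth
        self.data = deque()

    @property
    def empty(self):
        return len(self.data) == 0

    @property
    def full(self):
        return len(self.data) >= self.depth


def transfer(src, dst):
    """Move one word from src to dst only if src is not empty and dst is
    not full (read enable of src = write enable of dst)."""
    if not src.empty and not dst.full:
        dst.data.append(src.data.popleft())
        return True
    return False


def run_loopback(words, depth=4, stall_every=3):
    """FromHost -> FIRST_FIFO -> SECOND_FIFO -> ToHost, with the receiving
    side stalled once every `stall_every` cycles. Returns (received, cycles)."""
    if stall_every < 2:
        raise ValueError("stall_every must be at least 2")
    from_host = Fifo(max(len(words), 1))
    from_host.data.extend(words)
    first, second = Fifo(depth), Fifo(depth)
    received, cycle = [], 0
    while len(received) < len(words):
        # Downstream stages are served first, so that each word advances
        # by at most one stage per cycle.
        if cycle % stall_every != 0 and not second.empty:
            received.append(second.data.popleft())
        transfer(first, second)
        transfer(from_host, first)
        cycle += 1
    return received, cycle


if __name__ == "__main__":
    out, cycles = run_loopback(list(range(20)))
    print(out == list(range(20)), cycles)
```

With these rules the words arrive in order, without losses or duplications, even when the receiving side is periodically unable to accept data.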

## 5.3 Preliminary tests

This section is dedicated to the preliminary tests carried out in order to build the hardware demonstrator on the Virtex-7 of the VC709. These tests study the behavior of the Wupper FIFOs, the reliability of the transmission with the structure developed in Section 5.2.2 and the mapping of the transferred bits. Once this information has been checked, the test with the integration of the HT firmware can be carried out. All tests rely on the customizable Integrated Logic Analyzer (ILA) IP core [45], accessed through the *Hardware Manager* of Vivado. The ILA shows the behavior of the signals online, in this case to study the on-board data acquisition. Flags and data signals were probed in each test during the transmission with the loopback function. The flags used are active-high, meaning that the high logic level ('1') represents the activation of the corresponding function of the FIFO (i.e. empty='1' indicates that the FIFO is empty, and similarly for *full, wr\_en, rd\_en*). Along with the study of the signals, to strengthen the conclusions, the software was modified to create an output file containing the data produced by the loopback operation. For each test, the final data are then compared with the expectations.

#### 5.3.1 PCIe FIFOs

The first test was carried out to verify the reliability of the PCIe transmission and to observe the behavior of the control flags of both the Wupper *ToHostFIFO* and *FromHostFIFO*. The signals that are active during the data transmission in the original configuration were studied with the ILA. Fig. 5.9 reports the results of the measurements, showing in particular how the *empty* signal of the *FromHostFIFO* controls its *read enable* flag, while the *full* signal of the *ToHostFIFO* controls its *write enable* flag. Managing the data transmission by introducing a new signal that controls *FromHostFIFO\_rd\_en* results in software issues. Therefore data in the FIFO are read immediately

[ILA waveform capture. Probed signals with the listed values: fromHostFifo\_empty = 1, fromHostFifo\_rd\_en = 0, fromHostFifo\_rst = 0, toHostFifo\_prog\_full = 0, toHostFifo\_wr\_en = 0, fromHostFifo\_dvalid = 0.]

Figure 5.9: ILA IP shows the signals probed. The trigger is set on FromHostFIFO empty at High level. This was done in order to detect the beginning of the data transmission after the command of *loopback*. It is important to notice that fromHostFIFO is never full.

after being written on it. This, as already said, is the reason why FIRST\_FIFO needs to be created.

The subsequent study targeted the behavior of the *ToHostFIFO* when the data are delayed by several clock periods (100 clock cycles in this test). During this test no software issues were found and the datastream was correct.

#### 5.3.2 FIRST and SECOND FIFOs

The two-FIFO structure was built and tested before the integration with the HT firmware, to verify the complete control of the datastream. The structure works correctly and is compatible with Wupper: as can be seen in Fig. 5.10, the *read* and *write* enable flags of the FIFOs behave as described in Section 5.2.2. In addition, the comparison between the input file sent (with 1024 \* 128 words of 64 bits) and the output file produced shows no differences, meaning that all data were correctly transmitted. In a test in which only one bit is active during the transmission of the 256-bit word, it was observed that, with this configuration, *FIRST\_FIFO* is filled (state of "*not empty*") 5 clock cycles after the word is sent. This is what Fig. 5.11 shows.
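The comparison between the file sent and the file produced by the loopback can be sketched as follows. This is only an illustration in Python: the software actually used is not shown in this section, so the function names and the assumed file layout (whitespace-separated hexadecimal words, as suggested by Fig. 5.12) are hypothetical.

```python
def parse_words(text):
    """Parse whitespace-separated hexadecimal words into integers."""
    return [int(token, 16) for token in text.split()]


def first_mismatch(sent, received):
    """Return the index of the first differing word, or None if the streams are equal."""
    for i, (a, b) in enumerate(zip(sent, received)):
        if a != b:
            return i
    if len(sent) != len(received):
        # One stream is a strict prefix of the other
        return min(len(sent), len(received))
    return None


# Illustrative data only (not the content of the real files)
sent = parse_words("8000000000000000 4000000000000000\n8000000000000000 4000000000000000")
received = list(sent)                      # ideal loopback: output equals input
print(first_mismatch(sent, received))      # None

corrupted = received[:2] + [0] + received[3:]
print(first_mismatch(sent, corrupted))     # 2
```

For reference, 1024 \* 128 words of 64 bits correspond to 131072 words, i.e. 1 MiB of data per transfer.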

#### 5.3.3 Bits mapping

Before starting the final test, it was also important to study how the test-vectors are transmitted as input to the HT component. Therefore, the mapping of the



Figure 5.10: Main signals studied to verify the datastream with FIRST\_FIFO and SECOND\_FIFO. Initially, FromHostFIFO is read; subsequently FIRST\_FIFO is written and then read. After this, SECOND\_FIFO is written and then read. In the end ToHostFIFO\_wr\_en rises. Notice that FromHostFIFO is never full and not always empty, meaning that data are transferred. The blue and purple traces show that the FIFOs are read whenever they are not empty. wr\_en2 is opposite to rd\_en. FIRST\_FIFO and SECOND\_FIFO are also never full; however, this depends on the control flags and on the amount of data sent. Should this no longer hold, the structure is still able to control the datastream because the signals are also controlled by prog\_full and prog\_full2.



Figure 5.11: The *empty* flag of *FIRST\_FIFO* leaves its active level for the first time during the data transmission.

| Line | Word 1           | Word 2           | Word 3           | Word 4           |
|------|------------------|------------------|------------------|------------------|
| 1    | 8000000000000000 | 4000000000000000 | 8000000000000000 | 4000000000000000 |
| 2    | 4000000000000000 | 8000000000000000 | 4000000000000000 | 8000000000000000 |
| 3    | 8000000000000000 | 4000000000000000 | 8000000000000000 | 4000000000000000 |
| 4    | 4000000000000000 | 8000000000000000 | 4000000000000000 | 8000000000000000 |
| 5    | 8000000000000000 | 4000000000000000 | 8000000000000000 | 4000000000000000 |
| 6    | 4000000000000000 | 8000000000000000 | 4000000000000000 | 8000000000000000 |
| 7    | 8000000000000000 | 4000000000000000 | 8000000000000000 | 4000000000000000 |
| 8    | 4000000000000000 | 8000000000000000 | 4000000000000000 | 8000000000000000 |
| 9    | ffffffffffffffff |                  |                  |                  |

Figure 5.12: Input file *TestVectors.txt* transmitted by PCIe to the two FIFOs system.

bits must be performed to find the signals which are active during the transmission of the data. This operation is needed because the HT firmware must receive specific inputs to start the acquisition and the processing of data. In this way each piece of information can be addressed to the correct signal of the firmware and the highest bandwidth can be exploited. For PCIe Gen3 x8, the internal FIFO interface provided by Wupper is 256 bits wide and, as already mentioned, is read or written at 250 MHz. The result is a theoretical throughput of 64 Gbit/s. Initially, the FromHost buffer in the server is filled with a 64-bit hexadecimal word, which is sent to the FPGA over DMA, passed through the new two-FIFO structure, and then immediately looped back into a second buffer. These considerations, along with a study of the *FromHostFIFO\_din* signals targeted by the tests, allow the mapping of the bits.
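The quoted throughput follows directly from the width of the FIFO interface and its clock frequency, under the assumption that one 256-bit word is transferred on every clock cycle:

$$
256\ \mathrm{bit} \times 250 \times 10^{6}\ \mathrm{s^{-1}} = 64 \times 10^{9}\ \mathrm{bit/s} = 64\ \mathrm{Gbit/s}
$$

This is an upper bound at the FIFO interface; any cycle in which the FIFO is not read or written lowers the effective rate.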

To summarize, Fig. 5.14 shows, as an example, the mapping obtained from the test targeting bits 254, 191, 126 and 63. With this mapping, 128 \* 1024 words of 64 bits can be transmitted and read correctly.
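The relation between the 64-bit words of the input file and the bits of the 256-bit FIFO word can be made explicit with a short sketch. It assumes, as stated in the caption of Fig. 5.13, that the first 64-bit word occupies the least significant bits; the function names are hypothetical and the code is only an illustration in Python, not part of the firmware or of the test software.

```python
WORD_BITS = 64
WORDS_PER_BEAT = 4  # 4 x 64 bit = one 256-bit FIFO word


def pack_256(words):
    """Pack four 64-bit words into one 256-bit integer, first word in the LSBs."""
    if len(words) != WORDS_PER_BEAT:
        raise ValueError("expected exactly four 64-bit words")
    value = 0
    for i, w in enumerate(words):
        if not 0 <= w < (1 << WORD_BITS):
            raise ValueError("word does not fit in 64 bits")
        value |= w << (i * WORD_BITS)
    return value


def words_for_bits(bit_positions):
    """Build the four 64-bit words that set the given bits of the 256-bit word."""
    words = [0] * WORDS_PER_BEAT
    for bit in bit_positions:
        if not 0 <= bit < WORD_BITS * WORDS_PER_BEAT:
            raise ValueError("bit index outside the 256-bit word")
        words[bit // WORD_BITS] |= 1 << (bit % WORD_BITS)
    return words


def set_bits(value):
    """Return the sorted indices of the bits set in value."""
    return [i for i in range(value.bit_length()) if (value >> i) & 1]


words = words_for_bits([254, 191, 126, 63])
print(" ".join(f"{w:016x}" for w in words))
# 8000000000000000 4000000000000000 8000000000000000 4000000000000000
print(set_bits(pack_256(words)))
# [63, 126, 191, 254]
```

Under this assumption, targeting bits 63, 126, 191 and 254 requires the words 8…, 4…, 8…, 4… in this order, which is compatible with the alternating pattern of the input file in Fig. 5.12.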

[Waveform screenshot] ILA capture of the *empty* flag and of single bits of the FromHostFIFO data bus (bits 255, 254, 196, 195, 194, 128, 127, 126, 64, 63, 62, 1 and 0).

Figure 5.13: Study of the *FromHostFIFO\_dout* signals in comparison with the data of the file *TestVectors.txt*. Each 256-bit word is composed of four 64-bit hexadecimal words. The first 64-bit word is acquired and placed in the least significant bits (LSBs) of the 256-bit word; the other 64-bit words are then acquired in the same way, one after the other through the pipeline, and inserted in the 256-bit word. In the end, the word is transmitted from *FromHostFIFO* to *ToHostFIFO*, and the same operation is repeated for all the data used.



Figure 5.14: Bits organization.



Figure 5.15: Data of the clusters are arranged in *TestVector.txt* according to the layer to which they are associated and the order in which they are transmitted is respectively 8, 7, 6, 5, 4, 3, 2, 1 representing in the firmware the index 7, 6, 5, 4, 3, 2, 1, 0.

### 5.4 Final tests

The final architecture was realized by integrating the HT firmware into the structure developed in Section 5.2.2, according to the results of the preliminary tests of Section 5.3. Based on the bits mapping of Section 5.3.3, it was initially decided to send as input to the HT firmware two clusters $(r, \phi)$ for each of the 8 layers, with the exception of layers 0 and 7 (for which three clusters are sent). The way in which the information of a layer is transmitted is schematized in Fig. 5.15. Given the available bandwidth and the fact that the 256-bit word sent in this case is known, it is possible to rearrange the bits in order to include also the information on the start and the end of the event (and also the so-called *HT\_valid* flag, which is in the high state while the HT is processing). To perform the integration, the *HT\_outdata\_valid* signal is used to control the *wr\_en2* signal, which is already controlled by its *full2* flag. In addition, data that were initially sent from *FIRST\_FIFO* to *SECOND\_FIFO* are, in these tests, sent to the HT as input clusters according to the mapping of Fig. 5.16.

In this way the implementation is completed and the resource requirements of the FPGA are fulfilled, as can be seen in Fig. 5.17, which reports the Project Summary. However, *timing* problems arise with this configuration. This kind



Figure 5.16: Scheme of bits mapping for the realization of the tests. Example for only one 256 bits word referred to all the eight layers.

| Section        | Item                               | Value            |
|----------------|------------------------------------|------------------|
| DRC Violations | Warnings                           | 35               |
| DRC Violations | Advisories                         | 8                |
| Timing         | Worst Negative Slack (WNS)         | -1.898 ns        |
| Timing         | Total Negative Slack (TNS)         | -8144.844 ns     |
| Timing         | Number of Failing Endpoints        | 19739            |
| Timing         | Total Number of Endpoints          | 365694           |
| Utilization    | LUT                                | 2.9%             |
| Utilization    | LUTRAM                             | 4%               |
| Utilization    | FF                                 | 25%              |
| Utilization    | BRAM                               | 3.0%             |
| Utilization    | DSP                                | 25%              |
| Utilization    | IO                                 | 2%               |
| Utilization    | GT                                 | 22%              |
| Utilization    | BUFG                               | 3.3%             |
| Utilization    | PCIe                               | 3.3%             |
| Power          | Total On-Chip Power                | 15.187 W         |
| Power          | Junction Temperature               | 42.2 °C          |
| Power          | Thermal Margin                     | 42.8 °C (35.7 W) |
| Power          | Effective ϑJA                      | 1.1 °C/W         |
| Power          | Power supplied to off-chip devices | 0 W              |
| Power          | Confidence level                   | Low              |

Figure 5.17: Project Summary showing the utilization and the timing. Failing Endpoints in the first test are 19739.



Figure 5.18: Project Summary with timing closed. As can be seen, the lower resource utilization is due to the reduced resource occupancy of the HT.

of issue is associated with the routing: signals travelling between different components of the FPGA can be delayed because of connections between distant areas of the chip, or because they cross different clock domains. To overcome this issue, flip-flops were added to the design to mitigate the signal delay and, in addition, the resource occupancy of the HT was reduced.
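A few figures of merit can be derived from the timing numbers of the Project Summary in Fig. 5.17. The Python sketch below is only an illustration: the function name is hypothetical, and the estimate of the maximum frequency assumes that the worst path is constrained by the 250 MHz (4 ns) clock, which the summary alone does not confirm.

```python
def timing_summary(wns_ns, tns_ns, failing, total, period_ns):
    """Derive simple figures of merit from a timing summary."""
    min_period_ns = period_ns - wns_ns  # a negative WNS lengthens the required period
    return {
        # Fraction of endpoints that miss timing
        "failing_fraction": failing / total,
        # Mean violation per failing endpoint (TNS sums the negative slacks)
        "mean_violation_ns": abs(tns_ns) / failing if failing else 0.0,
        # Shortest period the worst path could meet, and the matching frequency
        "min_period_ns": min_period_ns,
        "fmax_mhz": 1e3 / min_period_ns,
    }


# Values read from Fig. 5.17.
# ASSUMPTION: the worst path belongs to the 250 MHz (4 ns) clock domain.
report = timing_summary(wns_ns=-1.898, tns_ns=-8144.844,
                        failing=19739, total=365694, period_ns=4.0)
for key, value in report.items():
    print(f"{key}: {value:.4g}")
```

With these inputs, about 5.4% of the endpoints fail timing and, under the stated assumption, the worst path would only close at roughly 170 MHz, which gives an idea of how far the first implementation was from the 250 MHz target.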

In this way, as can be seen from Fig. 5.18, the implementation is completed without timing issues and the tests can be performed, since both structures, the HT and the two FIFOs (and hence also the Wupper FIFOs), are compatible. When executing the "loopback" function with the selected test-vectors, a software issue similar to that described in Section 5.3.1 arises, meaning that the datastream on the Wupper FIFOs does not work correctly. To better understand the nature of this problem, a switch is used to probe whether the issue is at the HT level or at the level of the FIFO structure. The study of the *FromHostFIFO\_wr\_en* signal shows that data are transferred correctly to the whole system, while the *HT\_outdata\_valid* signal does not behave correctly. In fact, Fig. 5.19 shows that this validation signal is always in the low logic state, meaning that no data are valid and therefore that the information is not sent to SECOND\_FIFO and ToHostFIFO.

To exclude the hypothesis of a wrong bits mapping, a new process was added to check whether the first word observed is exactly the one sent as input to the HT. Fig. 5.20 shows that the signal *data\_out* goes to the high logic state, meaning that the word is received by the HT firmware. This test confirms the correct behaviour of the firmware



Figure 5.19: View of the on-board signal *HT\_outdata\_valid*, which causes the software issues since the '0' state does not activate the stream of data to the other components of the design.



Figure 5.20: Test-vectors transmitted by the structure built. High level state of the signal probed means that the bits mapping was done correctly and then the structure built to support the data transmission to the HT behaves correctly.

built.

However, Fig. 5.21 shows that the signals connected to the start (dout(255)) and stop (dout(223)) of the events in the HT did not behave as expected (the test bench of Fig. 5.22 shows the correct behaviour), and therefore the input file was modified in order to include also the information of the valid signal. After this modification, the signals were probed with the HT running at 50 MHz. As a consequence, the *read* enable of the first FIFO and the *write* enable of the second (an operation made possible by their dual-clock nature) run at 50 MHz as well. This was done because, if the HT runs at 250 MHz, the ILA can be used to probe only one signal (due to timing issues). In this way the correct behaviour of the HT signals was demonstrated, as shown in Fig. 5.23. Since the "loopback" function cannot work at 50 MHz, it was decided to prove the feasibility of the current HT firmware implemented on hardware by analyzing with the ILA the signals output by the HT: the extracted clusters are probed and compared to those expected from the test bench of Fig. 5.27.

For these tests it must be taken into account that the "loopback" function cannot be performed because the frequency is too low; however, data can be processed by the HT

[Waveform screenshot] ILA capture of the signals dout(255), dout(223), HT\_valid, locked, road\_on, r\_0\_new[11:0], wr\_en2, rd\_en, locked\_clks\_out and r\_0\_0[11:0], among others. The values 44c, 42d, 44c appear on r\_0\_new and, later, on r\_0\_0.

Figure 5.21: Clusters are transmitted correctly to the HT, but *dout(255)* and *dout(223)* behave differently from the test bench.

[Waveform screenshot] Test bench view between 300 ns and 340 ns of the signals clkin, event\_start\_in, event\_end\_in, event\_valid and phi\_phi0[7:0][22:0].

Figure 5.22: The test bench shows that the start signal must be in the high logic state for one clock period, then the valid signal rises and, when the latter returns to the low logic state, the end signal rises for one clock period.
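The start/valid/end sequence described in the caption of Fig. 5.22 can be written as a small reference model. The Python sketch below uses hypothetical function names, and the exact clock offsets (valid rising on the clock after start, end rising on the clock after valid falls) are an assumption, since the caption does not fix them.

```python
def event_control(n_valid):
    """Return per-clock (start, valid, end) samples for one event.

    ASSUMED timing: start is high for one clock, valid is high for the
    following n_valid clocks, end is high for the one clock after valid falls.
    """
    if n_valid < 1:
        raise ValueError("an event needs at least one valid clock")
    start = [1] + [0] * (n_valid + 1)
    valid = [0] + [1] * n_valid + [0]
    end = [0] * (n_valid + 1) + [1]
    return list(zip(start, valid, end))


def follows_protocol(trace):
    """Check a list of (start, valid, end) samples against the assumed sequence."""
    starts = [i for i, (s, _, _) in enumerate(trace) if s]
    ends = [i for i, (_, _, e) in enumerate(trace) if e]
    valids = [i for i, (_, v, _) in enumerate(trace) if v]
    if len(starts) != 1 or len(ends) != 1 or not valids:
        return False
    # valid must be one uninterrupted pulse between start and end
    contiguous = valids == list(range(valids[0], valids[-1] + 1))
    return contiguous and valids[0] == starts[0] + 1 and ends[0] == valids[-1] + 1


print(event_control(2))
# [(1, 0, 0), (0, 1, 0), (0, 1, 0), (0, 0, 1)]
```

A checker of this kind could be applied to samples exported from the ILA to compare the on-board control bits with the test bench, provided the assumed offsets are first adapted to the real firmware.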



Figure 5.23: The HT receives the correct clusters as input, and the behaviour of the valid signals in input and output is as expected. The same operation was performed with a clock of 250 MHz and works in the same way, with the creation of a dedicated signal to probe.

[Waveform screenshot] ILA capture of the signals HT\_outdata\_valid, cl\_data\_out\_std[17:0], HT\_valid, r\_0\_0[11:0], cnt\_phi0\_window\_p[7:0], locked\_clks\_out, rd\_en, road\_on and wr\_en2. r\_0\_0 takes the values 44c, 42d, 44c in sequence.

Figure 5.24: The input clusters r and $\phi_0$ correspond to the test-vectors sent by PCIe to the HT firmware. The signals describing them are r\_0\_0 and cnt\_phi0\_window\_p.



Figure 5.25: The output cluster extracted by the HT is 7 bits wide, as expected. This can be observed by studying the *cl\_dataout\_std* signal.

and probed. First, Fig. 5.24 was used to check whether the test-vectors sent as input to the HT were exactly those of the *TestVector.txt* file. The study of the signal *cl\_dataout\_std* in Fig. 5.26 and Fig. 5.25 makes it possible to perform this test.

*HT\_outdata\_valid* and one output of the HT are compatible with the test bench, even if the values are different. This is because in the firmware the input clusters of the HT were generated internally instead of being sent from the software, owing to a lack of Wupper bandwidth. The other HT outputs were also checked. In detail, the *road\_on* signal must be in the high logic state if the searched road is found. In the case under study, the 2 bits of the two searched roads correspond to 4-60 and 4-61, and they are found, as Fig. 5.28 confirms. Moreover, to verify the synchronism between the output of the HT and *SECOND\_FIFO* at 50 MHz, it was checked whether *wr\_en2* and *HT\_outdata\_valid* are simultaneously in the high logic state, as Fig. 5.29 confirms. The PCIe firmware, in the future ATLAS environment, must run at 250 MHz, and therefore so must the HT. Running the HT firmware at 250 MHz can be more useful since the *ToHostFIFO*



Figure 5.26: Running the firmware at 50 MHz, it is possible to probe different signals without running into timing issues. *cl\_dataout\_std* is the signal that carries the information of the output cluster which, after the comparison with the results of the test bench, proves the feasibility of the HT firmware design.

[Waveform screenshot] Test bench view of the signals Phi0\_end\_d2[15:0], cl\_data\_out[15:0][17:0] (elements [15] to [0] expanded) and busy\_clu\_out\_or\_d2. The idle value of each cl\_data\_out element is 3ffff.

Figure 5.27: Results of the test bench performing the same analysis at 250 MHz. The cluster in output must be *cl\_data\_out*. In gray is shown the *HT\_outdata\_valid* signal that is 7 bits.



Figure 5.28: The 2 bits of *road\_on* indicate that the two roads searched are found.

[Waveform screenshot] ILA capture of the signals road\_on, HT\_valid, r\_0\_0[11:0], cnt\_phi0\_window\_p[7:0], locked\_clks\_out, rd\_en, wr\_en2, cl\_data\_out\_std[17:0] and HT\_outdata\_valid. In the captured window cl\_data\_out\_std takes the values 3ffff, 00000, 00001, 00002, 3ffff, 00000, 00001, 00002, 3ffff.

Figure 5.29: Synchronism between the output of the HT and the input of the second FIFO, studied by probing the *HT\_outdata\_valid* and *wr\_en2* signals.

can receive the extracted cluster more frequently. However, while at 50 MHz the information sent from the HT and the second FIFO are synchronized, in this case the synchronism is completely lost: *wr\_en2* is never in the high logic state, meaning that no data are written to this FIFO. Despite this issue, the HT implemented on hardware runs correctly at 50 MHz, but other tests must be performed to reproduce the ATLAS environment. This will be done in the coming months, together with the implementation of the whole structure on the board that will be part of the TDAQ of ATLAS Phase-II. This is a custom board called FELIX Phase-II, equipped with the VU9P FPGA and shown in Fig. 5.30. It was developed by the FELIX group, which is one of the groups in charge of the current TDAQ of ATLAS and which will be responsible for the FELIX block in the TDR of Section 3.2.1. Since the HT firmware is expected to require more resources in the future, the VU9P is suitable to test a future extension of the firmware, being the Wupper-supported FPGA with the largest amount of resources.
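Returning to the synchronism check between *HT\_outdata\_valid* and *wr\_en2*, the condition that was verified at 50 MHz and that fails at 250 MHz can be sketched as follows. The Python code is only an illustration: the function names are hypothetical, the gating of *wr\_en2* follows the description given in Section 5.4, and the sample traces are invented for the example rather than taken from the ILA captures.

```python
def second_fifo_wr_en(ht_outdata_valid, full2):
    """wr_en2 driven by the HT output valid and gated by the full2 flag."""
    return bool(ht_outdata_valid) and not bool(full2)


def synchronized(valid_trace, wr_en_trace):
    """True if wr_en2 is high on at least one sample and exactly when valid is high."""
    if len(valid_trace) != len(wr_en_trace):
        raise ValueError("traces must have the same number of samples")
    same = all(bool(v) == bool(w) for v, w in zip(valid_trace, wr_en_trace))
    return same and any(wr_en_trace)


# Invented samples, one per clock, for illustration only
valid = [0, 1, 1, 1, 0]
wr_en_ok = [second_fifo_wr_en(v, full2=0) for v in valid]
wr_en_lost = [False] * len(valid)   # wr_en2 never rises

print(synchronized(valid, wr_en_ok))     # True
print(synchronized(valid, wr_en_lost))   # False
```

The second case corresponds to the situation reported above for 250 MHz, where *wr\_en2* never reaches the high logic state and no data are written to the second FIFO.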



Figure 5.30: Custom board FELIX Phase-II with VU9P.

## Conclusions

The scope of this Master Thesis is the creation of a hardware demonstrator able to test the ongoing Hough Transform firmware design. According to the new proposal of the Event Filter Tracking, this is one of the algorithms under evaluation for the track reconstruction that will be part of the Event Filter of the ATLAS Phase-II Trigger and Data Acquisition system upgrade. My work contributes to the creation of a structure able to perform the very first tests on the firmware. In particular, this was made possible by the integration of the PCIe and the Hough Transform firmware designs developed until now. To overcome the issues connected to several constraints of both of them, another structure relying on two First-In-First-Out (FIFO) buffers was developed. Different tests were performed to verify the complete control of the datastream of the new structure and also to map the bits which represent the clusters of the eight layers of the Inner Tracker, in order to transfer them as input to the HT. The study of the on-board signals and of the output produced shows that the results are in agreement with the expectations. This means that the integration of the two designs has been successfully accomplished. The issues which arose during the first tests of the HT firmware were fixed by probing different signals and, in this way, the feasibility of the HT algorithm implementation on hardware was finally proved: the extracted roads are in agreement with the simulations. Therefore this work combines the needs of two firmware designs through the creation of a new structure that, exploiting the PCIe transmission, is able to prove the correct hardware implementation of the HT. This study can be relevant for the implementation of the HT in the future ATLAS TDAQ environment. The final goal of the Event Filter Tracking demonstrator is to emulate the data acquisition at the Event Filter TDAQ of ATLAS Phase-II, and the future plans consist of the use of test-vectors coming from the official ATLAS software to repeat the tests done until now. The final proposal will be set after the comparison of the data produced by the demonstrator with those simulated by ATLAS. In addition, the next plan is the implementation of the HT on the future ATLAS Phase-II TDAQ board, the FELIX Phase-II with the VU9P FPGA. Despite the flexibility of the structure developed, this represents a challenging operation due to the custom nature of the card. In conclusion, if the HT algorithm implemented on hardware shows good figures in terms of power consumption or cost with respect to a fully software solution, then it can be implemented on hardware by continuing our work on this initial test-stand.

## Appendix A

# FPGA

Field Programmable Gate Arrays (FPGAs) are programmable integrated circuits designed to be configured by a customer or a designer after manufacturing (hence the name *field-programmable*) using a hardware description language (HDL). FPGAs are composed of logic blocks, distributed in arrays and embedded in a general routing structure (hence the name *gate array*). A single FPGA integrated circuit (IC) contains millions of gates.

FPGAs can be programmed and re-programmed with different logic functions, allowing flexible reconfigurable computing. This differs from the common concept of software programming because, depending on the vendor, the programming is done by blowing interconnection fuses, by establishing links with anti-fuses, or by using static-RAM technology. These flexible semiconductor devices offer high hardware-timed speed and reliability. One of their most important characteristics is the high level of parallelization: they can process different tasks without competing for the same resources, because each independent processing operation is assigned to a dedicated section of the chip and can operate autonomously without affecting other logic blocks. Therefore the performance of one part of the application is not affected when more processing is added. One of the key metrics for an FPGA implementation is the clock frequency, i.e. the frequency of the clock signal used to perform sequential synchronized operations on the FPGA. In particular, the parallel execution of independent pieces of hardware logic allows them to run at different clock rates.

Figure A.1: Scheme of an FPGA. Logic cells are embedded in a general routing structure. The main component blocks are highlighted: the Configurable Logic Blocks (CLB), the Programmable Interconnects (PI) and the Input-Output Blocks (I/OB).

The main components of an FPGA are shown in Fig. A.1. They are:

- the Configurable Logic Blocks (CLB), which include the digital logic, inputs and outputs;
- the Programmable Interconnects (PI), which route signals between the logic blocks to implement the user logic;
- the Input-Output Blocks (I/OB), which are used by the circuit to interact with the outside world.

Routing consists of wire segments of varying lengths which can be interconnected via electrically programmable switches. When a circuit design is mapped onto the internal components of an FPGA, these cells, and in particular the connections among them, are treated as resources. Resources are therefore an important part of the FPGA specifications when selecting a device for a particular application: the number of configurable logic blocks, the number of fixed-function logic blocks (such as multipliers), and the size of memory resources (such as embedded block RAM) are specific to the type of FPGA chip. As a consequence, the programmable interconnects needed to implement a reconfigurable digital circuit, as well as the I/OBs, are also limited.

CLBs are the basic logic unit of an FPGA and they are made up of two basic components: flip-flops (FFs) and lookup tables (LUTs). FPGAs are organized in families depending on the way these elements are packaged together. FFs are single-bit registers able to synchronize logic and save logical states between clock cycles. On every clock edge, a FF latches the 1 or 0 (*true* or *false*) value on its input and holds that value constant until the next clock edge, behaving like a single-bit storage element. LUTs are combinatorial logic elements (logic gates) that can easily implement a truth table. From a circuit implementation perspective, a LUT can be formed simply from an N-to-one multiplexer and an N-bit memory. Another useful component is the Digital Signal Processing (DSP) block, a multiplier followed by an accumulator embedded in the FPGA fabric. This element can be used to multiply numbers with arbitrary bit width.
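The behaviour of the two CLB building blocks can be illustrated with a short software model. The following is only a minimal sketch: the class names `LUT` and `FlipFlop` and their methods are hypothetical names introduced for this example, and the model ignores all timing and routing details of a real device. It shows a LUT as an N-bit memory addressed through a multiplexer and a flip-flop as a single-bit element updated on each clock edge.

```python
from typing import List


class LUT:
    """Lookup table: an N-bit memory read through an N-to-one multiplexer."""

    def __init__(self, truth_table: List[int]) -> None:
        n = len(truth_table)
        if n == 0 or n & (n - 1):
            raise ValueError("truth table length must be a power of two")
        self.memory = list(truth_table)        # the N-bit configuration memory
        self.num_inputs = n.bit_length() - 1   # log2(N) multiplexer select lines

    def evaluate(self, inputs: List[int]) -> int:
        # The logic inputs act as select lines (inputs[0] is the least significant bit).
        if len(inputs) != self.num_inputs:
            raise ValueError("wrong number of inputs")
        address = sum((bit & 1) << i for i, bit in enumerate(inputs))
        return self.memory[address]


class FlipFlop:
    """Single-bit storage element that changes value only on a clock edge."""

    def __init__(self) -> None:
        self.q = 0

    def clock_edge(self, d: int) -> int:
        self.q = d & 1  # capture the input; q is held until the next edge
        return self.q


# A 2-input XOR stored as the truth table for addresses 00, 01, 10, 11.
xor_lut = LUT([0, 1, 1, 0])
ff = FlipFlop()
registered = [
    ff.clock_edge(xor_lut.evaluate([a, b]))
    for a, b in [(0, 0), (1, 0), (0, 1), (1, 1)]
]
print(registered)  # [0, 1, 1, 0]
```

Changing the contents of `memory` changes the logic function without changing the structure, which is the sense in which the fabric is reconfigurable.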

Memory resources are another important specification to consider when selecting an FPGA. User-defined RAM, embedded throughout the FPGA chip, is useful for storing data sets or passing values between parallel tasks. Depending on the FPGA family, it is possible to configure the on-board RAM in blocks of 16 or 36 kb. Implementing data sets as arrays of flip-flops quickly becomes expensive in terms of FPGA logic resources, whereas an embedded block RAM performs the same task more efficiently. In addition, digital signal processing algorithms often need to keep track of an entire block of data, or of the coefficients of a complex equation, and without on-board memory many processing functions do not fit within the configurable logic of an FPGA chip. On-board memory is also often used to smooth out the data stream through first-in-first-out (FIFO) memory buffers because, as already mentioned, parallel pieces of logic may run at different clock rates, which makes exchanging data between them difficult.

Another key metric for an FPGA implementation is the latency, which is the total time (typically expressed in units of clock periods) required for a single iteration of the algorithm to complete.
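As a hedged illustration of how a latency expressed in clock periods translates into time, the relation below uses symbols introduced only for this example: $N_{\mathrm{cycles}}$ for the number of clock periods, $f_{\mathrm{clk}}$ for the clock frequency and $T_{\mathrm{clk}}$ for the clock period. The numerical values are illustrative assumptions, not measurements of the demonstrator.

$$
t_{\mathrm{lat}} = N_{\mathrm{cycles}} \, T_{\mathrm{clk}} = \frac{N_{\mathrm{cycles}}}{f_{\mathrm{clk}}}
$$

For instance, assuming an algorithm that completes in $N_{\mathrm{cycles}} = 100$ clock periods at an assumed $f_{\mathrm{clk}} = 250~\mathrm{MHz}$, the clock period is $T_{\mathrm{clk}} = 4~\mathrm{ns}$ and the latency is $t_{\mathrm{lat}} = 400~\mathrm{ns}$.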

FPGA algorithm design is unique compared to programming a CPU: FPGAs can achieve a much higher number of operations per second at a relatively low power cost compared to CPUs and GPUs. However, such operations consume dedicated resources on-board the FPGA and cannot be dynamically remapped while running. The challenge for an optimal FPGA implementation is to balance FPGA resource usage with achieving the latency and throughput goals of the target algorithm.

Customizing an FPGA consists of creating a bitstream, the sequence of programmed memory bits that represents the configuration file loaded into the device and specifies how the components should be wired together. Hardware description languages (HDLs) such as VHDL and Verilog are low-level languages used to describe the architecture of the circuit on the FPGA.

## Appendix B

# PCIe

The Peripheral Component Interconnect Express (PCIe) [43] is a high-speed serial computer expansion bus standard, designed to replace the older PCI, PCI-X and AGP bus standards [39]. This technology is based on a point-to-point topology, with separate serial links connecting every device to the root complex (host), as shown in Fig. B.1. The older PCI bus, by contrast, uses a shared topology: in the presence of multiple masters, access to the bus is arbitrated and limited to one master at a time, in a single direction. The PCIe link between any two endpoints supports full-duplex communication, with no inherent limitation on concurrent access across multiple endpoints. In terms of bus protocol, PCI Express communication is encapsulated in packets. The task of packetizing and de-packetizing data and status-message traffic is handled by the transaction layer of the PCI Express port shown in Fig. B.2.

Figure B.1: Scheme representing the PCI Express topology. The device downstream ports are the white "junction boxes", while the gray ones represent upstream ports [46].

Figure B.2: PCIe device layers.

Compared to older standards, PCIe enhancements include: higher maximum system bus throughput, lower I/O pin count and smaller physical footprint, better performance scaling for bus devices, a more detailed error detection and reporting mechanism (Advanced Error Reporting, AER), and native hot-swap functionality. Another important characteristic is the encoding scheme, which determines how much of the raw link bandwidth is available for payload data. Physical PCI Express links may contain 1, 4, 8 or 16 lanes, and the lane count is written with an "x" prefix (for example, x8). Each lane is composed of two differential signaling pairs: one pair for receiving data and the other for transmitting. This means that, conceptually, each lane is used as a full-duplex byte stream, transporting data packets in eight-bit "byte" format simultaneously in both directions between the endpoints of a link. In a multi-lane link, the packet data is striped across lanes, and peak data throughput scales with the overall link width. The number of lanes used is automatically established during device initialization and can be restricted by either endpoint. Over the years, different generations of PCIe have been developed; they are characterized by their performance in terms of bandwidth, and their differences are summarized in Tab. B.1.

| Version | Bandwidth per lane (x1) |
|---------|-------------------------|
| 1       | $250~\mathrm{MB/s}$     |
| 2       | $500~\mathrm{MB/s}$     |
| 3       | $985~\mathrm{MB/s}$     |
| 4       | $1.97~\mathrm{GB/s}$    |
| 5       | $3.94~\mathrm{GB/s}$    |

Table B.1: Bandwidth per single lane of PCIe of different generations [43].
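The per-lane values of Table B.1 can be reproduced from the raw transfer rate and the line encoding of each generation. The sketch below is illustrative only: the function names and the dictionary layout are hypothetical, and the transfer rates (2.5, 5, 8, 16 and 32 GT/s) and encodings (8b/10b for the first two generations, 128b/130b afterwards) are the commonly quoted nominal values, not numbers taken from the text above. Protocol overhead above the encoding layer is ignored.

```python
# Nominal (transfer rate in GT/s, payload bits, encoded bits) per PCIe generation.
PCIE_GENERATIONS = {
    1: (2.5, 8, 10),      # 8b/10b encoding
    2: (5.0, 8, 10),      # 8b/10b encoding
    3: (8.0, 128, 130),   # 128b/130b encoding
    4: (16.0, 128, 130),
    5: (32.0, 128, 130),
}


def lane_bandwidth_mb_s(generation: int) -> float:
    """Usable bandwidth of a single lane, per direction, in MB/s."""
    rate_gt_s, payload_bits, encoded_bits = PCIE_GENERATIONS[generation]
    usable_bits_per_s = rate_gt_s * 1e9 * payload_bits / encoded_bits
    return usable_bits_per_s / 8 / 1e6


def link_bandwidth_mb_s(generation: int, lanes: int) -> float:
    """Peak bandwidth of a multi-lane link: it scales with the link width."""
    return lanes * lane_bandwidth_mb_s(generation)


for gen in sorted(PCIE_GENERATIONS):
    print(f"Gen{gen} x1: {lane_bandwidth_mb_s(gen):7.1f} MB/s")

# Illustrative example of lane scaling for a hypothetical Gen3 x8 link.
print(f"Gen3 x8: {link_bandwidth_mb_s(3, 8) / 1000:.2f} GB/s")
```

The drop in encoding overhead from 20% (8b/10b) to about 1.5% (128b/130b) explains why the third generation nearly doubles the per-lane bandwidth although its transfer rate is only 1.6 times that of the second.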

## Appendix C

# FIFO

First-in-first-out (FIFO) [47] is a type of buffer in which the data written first is also read out first. FIFOs can be implemented in software or in hardware. The choice between these two solutions depends on the application and the features desired: a software FIFO can easily be adapted to changes by modifying its program, while a hardware FIFO may demand a new board layout but is faster. Two electronic systems are connected to the input and output of a FIFO: one that writes the data into the FIFO and one that reads it. If the writing and reading systems must be synchronized to satisfy certain timing conditions, the FIFO is called a read/write FIFO. If, on the other hand, the writing system and the reading system can work out of synchronism, the FIFO is called a concurrent read/write FIFO.
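A software FIFO of the kind mentioned above can be sketched as a fixed-depth circular buffer with *full* and *empty* status flags. This is a minimal illustration, not the structure used in the demonstrator; the class name `RingFifo` and its methods are hypothetical names chosen for the example.

```python
class RingFifo:
    """Fixed-depth FIFO backed by a circular buffer."""

    def __init__(self, depth: int) -> None:
        if depth <= 0:
            raise ValueError("depth must be positive")
        self._slots = [None] * depth
        self._depth = depth
        self._read_index = 0   # position of the oldest stored word
        self._count = 0        # number of words currently stored

    @property
    def empty(self) -> bool:
        return self._count == 0

    @property
    def full(self) -> bool:
        return self._count == self._depth

    def write(self, word) -> bool:
        """Store a word; return False (word rejected) when the FIFO is full."""
        if self.full:
            return False
        write_index = (self._read_index + self._count) % self._depth
        self._slots[write_index] = word
        self._count += 1
        return True

    def read(self):
        """Return the oldest word; reading an empty FIFO is an error."""
        if self.empty:
            raise IndexError("read from empty FIFO")
        word = self._slots[self._read_index]
        self._read_index = (self._read_index + 1) % self._depth
        self._count -= 1
        return word


fifo = RingFifo(depth=4)
accepted = [fifo.write(word) for word in [10, 20, 30, 40, 50]]
drained = []
while not fifo.empty:
    drained.append(fifo.read())
print(accepted)  # [True, True, True, True, False]
print(drained)   # [10, 20, 30, 40]
```

The fifth write is rejected because the buffer is full, and the words come out in the order in which they were written.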

The Xilinx LogiCORE IP FIFO Generator core [47], used in this thesis, is a fully verified FIFO memory queue for applications that require in-order storage and retrieval. The core provides an optimized solution for all FIFO configurations and delivers maximum performance (up to 500 MHz) while using minimal resources. It is delivered through the Vivado® Design Suite, and the width, depth, status flags, memory type, and write/read port aspect ratios can be customized. The Native interface FIFO can be customized to use block RAM, distributed RAM or the built-in FIFO resources available in some FPGA families to create high-performance, area-optimized FPGA designs. AXI interface FIFOs are derived from the Native interface FIFO, as shown in Fig. C.2 and Fig. C.3. The AXI interface protocol uses a two-way *valid* and *ready* handshake mechanism. The information source uses the *valid* signal to show when valid data or control information is available on the channel. The information destination uses the *ready* signal to show when it can accept the data. An example timing diagram for write and read operations to the AXI4-Stream FIFO is shown in Fig. C.4, while Fig. C.5 reports an example timing diagram for write and read operations to the AXI memory mapped interface FIFO.

Figure C.1: FIFO general logic structure.

Figure C.2: Signal Diagram of Native Interface FIFOs.

Figure C.3: AXI FIFO Derivation.

Figure C.4: Timing Diagram of AXI4-Stream FIFO.

In these two timing diagrams, the *valid* signal is generated by the information source to indicate when the data is available. When the data can be accepted, the destination generates the *ready* signal. A transfer occurs only when both the *valid* and *ready* signals are high. Since AXI FIFOs are derived from Native interface FIFOs, their behavior is similar. On the write side, the *ready* signal is generated based on the availability of space in the FIFO: it is held high to allow writes to the FIFO and it is pulled low only when there is no space left in the FIFO to perform additional writes. On the read side, the *valid* signal is generated based on the availability of data in the FIFO: it is held high to allow reads to be performed from the FIFO, and it is pulled low only when there is no data available to be read from the FIFO.

Figure C.5: AXI Memory Mapped Interface FIFO Timing Diagram.
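The handshake rules described above can be checked with a small cycle-by-cycle model. The following is only a behavioural sketch under simplifying assumptions: a single clock domain, signals sampled at the start of each cycle, and a sink whose *ready* signal follows a repeating pattern. The function `simulate_handshake` and the names `s_valid`, `s_ready`, `m_valid` and `m_ready` are hypothetical and do not correspond to the port names of the Xilinx core.

```python
from collections import deque
from typing import List, Sequence, Tuple


def simulate_handshake(
    words: Sequence[int],
    depth: int,
    sink_ready_pattern: Sequence[bool],
    max_cycles: int = 1000,
) -> Tuple[List[int], List[Tuple[int, bool, bool, bool, bool]]]:
    """Move words from a source to a sink through a FIFO of the given depth."""
    pending = deque(words)   # words the source still has to send
    fifo = deque()           # current FIFO contents
    received: List[int] = []
    trace = []
    cycle = 0
    while (pending or fifo) and cycle < max_cycles:
        # Sample all handshake signals at the start of the cycle.
        s_valid = bool(pending)           # source has data to write
        s_ready = len(fifo) < depth       # FIFO has space left
        m_valid = bool(fifo)              # FIFO has data to read
        m_ready = sink_ready_pattern[cycle % len(sink_ready_pattern)]
        # A transfer happens only when valid and ready are both high.
        if m_valid and m_ready:
            received.append(fifo.popleft())
        if s_valid and s_ready:
            fifo.append(pending.popleft())
        trace.append((cycle, s_valid, s_ready, m_valid, m_ready))
        cycle += 1
    return received, trace


data = [1, 2, 3, 4, 5]
received, trace = simulate_handshake(data, depth=2, sink_ready_pattern=[False, False, True])
stalled_cycles = [c for c, s_valid, s_ready, _, _ in trace if s_valid and not s_ready]
print(received)                 # words arrive in the order they were written
print(len(stalled_cycles) > 0)  # the slow sink causes back-pressure on the source
```

With a sink that is ready only one cycle out of three, the FIFO fills up and the write-side *ready* signal goes low, stalling the source without any loss or reordering of data.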

The information signals are mapped to the *din* and *dout* buses of Native interface FIFOs. The width of the AXI FIFO is determined by concatenating all of the information signals of the AXI interface. The information signals include all AXI signals, with the exclusion of the *valid* and *ready* handshake signals. AXI FIFOs operate only in First-Word Fall-Through (FWFT) mode. This feature provides the ability to look ahead to the next word available from the FIFO without issuing a read operation. When data is available in the FIFO, the first word falls through the FIFO and appears automatically on the output data bus (*dout*). FWFT is useful in applications that require low-latency access to data and in applications that require throttling based on the contents of the read data. In digital designs, FIFOs are ubiquitous constructs required for data manipulation tasks such as clock domain crossing, low-latency memory buffering, and bus width conversion. More details about features, configuration and implementation are reported in the LogiCORE IP Product Guide of FIFO Generator v13.2 [47].
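The difference between a standard read and FWFT can be illustrated with two small behavioural models. This is a sketch only: the classes `StandardFifo` and `FwftFifo` are hypothetical, they ignore clocking and read latency, and they are not a model of the Xilinx core. In the standard model a word appears on `dout` only after a read is issued, whereas in the FWFT model the head of the queue is already visible on `dout` and a read merely acknowledges it.

```python
from collections import deque


class StandardFifo:
    """The output bus is updated only after a read operation is issued."""

    def __init__(self) -> None:
        self._queue = deque()
        self.dout = None  # holds the last word that was read

    def write(self, word) -> None:
        self._queue.append(word)

    def read(self):
        self.dout = self._queue.popleft()
        return self.dout


class FwftFifo:
    """First-Word Fall-Through: the oldest word is visible without a read."""

    def __init__(self) -> None:
        self._queue = deque()

    def write(self, word) -> None:
        self._queue.append(word)

    @property
    def valid(self) -> bool:
        return bool(self._queue)

    @property
    def dout(self):
        # The head word "falls through" to the output as soon as it is stored.
        return self._queue[0] if self._queue else None

    def read(self):
        # Acknowledge the word currently on dout and advance to the next one.
        return self._queue.popleft()


std = StandardFifo()
std.write(7)
before_read = std.dout      # None: nothing is presented until a read is issued
std.read()
after_read = std.dout       # 7

fwft = FwftFifo()
fwft.write(7)
fwft.write(9)
peeked = fwft.dout          # 7 is visible although no read was issued
if peeked > 5:              # throttle on the content of the look-ahead word
    fwft.read()
next_word = fwft.dout       # 9 is now at the output
```

The look-ahead makes it possible to decide whether to consume a word based on its content, which is the throttling use case mentioned above.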

# Bibliography

- [1] T. S. Pettersson and P. Lefèvre, The Large Hadron Collider: conceptual design, Tech. Rep. CERN-AC-95-05-LHC, Oct 1995. [Online]. Available: https://cds.cern.ch/record/291782
- [2] M. Krause, CERN: how we found the Higgs boson. Hackensack, NJ: World Scientific, Nov 2014, German edition with the title: Wo Menschen und Teilchen aufeinanderstossen: Begegnungen am CERN. [Online]. Available: https://cds.cern.ch/record/1748524
- [3] Investigation of Ageing effects and Image stability in Hybrid Photon Pixel detectors at the LHCb experiment CERN - Scientific Figure on ResearchGate.
- [4] Introduction to Particle Accelerators and their Limitations - Scientific Figure on ResearchGate.
- [5] Xabier Cid Vidal, Ramon Cid Manzano. Taking a closer look at LHC: LHC Pb collisions.
- [6] ATLAS Collaboration, "The ATLAS Experiment at the CERN Large Hadron Collider," JINST, vol. 3, p. S08003, 2008.
- [7] CMS Collaboration, "The CMS Experiment at the CERN LHC," JINST, vol. 3, p. S08004, 2008.
- [8] LHCb Collaboration, "The LHCb Detector at the LHC," JINST, vol. 3, p. S08005, 2008.

- [9] ALICE Collaboration, "The ALICE experiment at the CERN LHC," Journal of Instrumentation, vol. 3, no. 08, p. S08002, 2008. [Online]. Available: http://stacks.iop.org/1748-0221/3/i=08/a=S08002
- [10] https://project-hl-lhc-industry.web.cern.ch/content/project-schedule
- [11] G. Aad et al. The ATLAS Experiment at the CERN Large Hadron Collider. JINST 3, S08003 (2008).
- [12] ATLAS inner detector: Technical Design Report, 1, ser. Technical Design Report ATLAS. Geneva: CERN, 1997. [Online]. Available: https://cds.cern.ch/record/331063
- [13] N. Wermes and G. Hallewell, ATLAS pixel detector: Technical Design Report, ser. Technical Design Report ATLAS. Geneva: CERN, 1998. [Online]. Available: https://cds.cern.ch/record/381263
- [14] Y. Unno, "ATLAS silicon microstrip detector system (SCT)," Nucl. Instrum. Meth., vol. A511, pp. 58–63, 2003.
- [15] E. Abat et al., "The ATLAS Transition Radiation Tracker (TRT) proportional drift tube: Design and performance," JINST, vol. 3, p. P02013, 2008.
- [16] M. Capeans et al., "ATLAS Insertable B-Layer Technical Design Report," Tech. Rep. CERN-LHCC-2010-013. ATLAS-TDR-19, Sep 2010. [Online]. Available: https://cds.cern.ch/record/1291633
- [17] ATLAS Collaboration, Alignment of the ATLAS Inner Detector in Run-2, preprint, Jul 2020.
- [18] M. Capeans et al. ATLAS Insertable B-Layer Technical Design Report. Tech. rep. CERN-LHCC-2010-013. ATLAS-TDR-19. Sept. 2010. http://cds.cern.ch/record/1291633.
- [19] D. Caforio, "Luminosity measurement using Cherenkov Integrating Detector (LUCID) in ATLAS," in Astroparticle, particle and space physics, detectors and medical physics applications. Proceedings, 10th Conference, ICATPP 2007, Como, Italy, October 8-12, 2007, 2008, pp. 413–417.
- [20] P. Jenni, M. Nessi, and M. Nordberg, "Zero Degree Calorimeters for ATLAS," CERN, Geneva, Tech. Rep. LHCC-I-016. CERN-LHCC-2007-001, Jan 2007. [Online]. Available: https://cds.cern.ch/record/1009649
- [21] L. Adamczyk et al., "Technical Design Report for the ATLAS Forward Proton Detector," Tech. Rep. CERN-LHCC-2015-009. ATLAS-TDR-024, May 2015. [Online]. Available: https://cds.cern.ch/record/2017378
- [22] S. Jakobsen, P. Fassnacht, P. Hansen, and J. B. Hansen, "Commissioning of the Absolute Luminosity For ATLAS detector at the LHC," Dec 2013, presented 31 Jan 2014. [Online]. Available: https://cds.cern.ch/record/1637195
- [23] P. Jenni, M. Nessi, M. Nordberg, and K. Smith, ATLAS high-level trigger, data-acquisition and controls: Technical Design Report, ser. Technical Design Report ATLAS. Geneva: CERN, 2003. [Online]. Available: https://cds.cern.ch/record/616089
- [24] M. Aaboud et al., CERN-EP-2016-241, arXiv:1611.09661.
- [25] M. Dano Hoffmann, "Commissioning and Initial Run-2 Operation of the ATLAS Minimum Bias Trigger Scintillators," CERN, Geneva, Tech. Rep. ATL-DAQ-PROC-2015-020, Jun 2015. [Online]. Available: https://cds.cern.ch/record/2025200
- [26] ATLAS Collaboration. Phase-II upgrade Scoping Document. Tech. rep. CERN-LHCC-2015-020. LHCC-G-166. Geneva: CERN, Sept. 2015. Available: https://cds.cern.ch/record/2257755.
- [27] ATLAS Collaboration. Technical Design Report: A High-Granularity Timing Detector for the ATLAS Phase-II upgrade. CERN-LHCC-2020-007. ATLAS-TDR-031 (2020).

- [28] ATLAS Collaboration. Technical Design Report for the ATLAS Liquid Argon Calorimeter Phase-II Upgrade. CERN-LHCC-2017-018. ATLAS-TDR-027 (2017).
- [29] ATLAS Collaboration. Technical Design Report for the Phase-II Upgrade of the ATLAS Tile Calorimeter. CERN-LHCC-2017-019. ATLAS-TDR-028 (2018).
- [30] ATLAS Collaboration. Technical Design Report for the Phase-II Upgrade of the ATLAS Muon Spectrometer. CERN-LHCC-2017-017. ATLAS-TDR-026 (2017).
- [31] ATLAS Collaboration. Technical Design Report for the Phase-II Upgrade of the ATLAS Trigger and Data Acquisition System. CERN-LHCC-2017-020. ATLAS-TDR-029 (2018).
- [32] ATLAS Collaboration. Report of the Task Force Studying a Commodity-based Implementation of the Phase 2 TDAQ Architecture Exploiting Heterogeneous Computing, Event Filter Tracking Revision, ATLAS Doc.: 2591106 v.2, EDMS Id: 2591106 v.2
- [33] Allam Shehata Hassanein et al. A survey on Hough Transform, Theory, Techniques and Applications. IJCSI Vol 12, I 1, No 2, ISSN 1694-0784 (2015).
- [34] P.V.C. Hough, Machine Analysis of Bubble Chamber Pictures, Conf. Proc. C590914 (1959), 554.
- [35] L. Rinaldi et al., GPGPU for track finding in High Energy Physics, Proceedings, GPU Computing in High-Energy Physics (GPUHEP2014): Pisa, Italy, September 10-12, 2014. 2015, 17. arXiv:1507.03074 [physics.ins-det].
- [36] N. Pozzobon, F. Montecassiano, and P. Zotto, A novel approach to Hough Transform for implementation in fast triggers, Nucl. Instrum. Meth. A, 834 (2016), 81.
- [37] M. Mårtensson, "A search for leptoquarks with the ATLAS detector and hardware tracking at the High-Luminosity LHC". PhD thesis. Uppsala U., 2019.

- [38] F. Alfonsi, "Study and Optimization of Particle Track Detection via Hough Transform Hardware Implementation for the ATLAS Phase-II Trigger Upgrade". PhD thesis.
- [39] Mayhew, D., Krishnan, V. (August 2003). "PCI express and advanced switching: Evolutionary path to building next generation interconnects". 11th Symposium on High Performance Interconnects, 2003. Proceedings. pp. 21–29.
- [40] Frans Schreuder, Andrea Borga, Oussama el Kharraz Alami, Roel Blankers, "Wupper - a Xilinx Virtex-7 PCIe Engine", Official Wupper project (OpenCores version) http://opencores.org/project,virtex7\_pcie\_dma
- [41] Atlas FELIX project http://atlas-project-felix.web.cern.ch
- [42] XILINX, VC709 Evaluation Board for the Virtex-7 FPGA User Guide, UG887 (v1.6) March 11, 2019
- [43] PCI Express® Base Specification Revision 2.0, December 20, 2006
- [44] UG761: Xilinx AXI Bus documentation http://www.xilinx.com/support/documentation/ip\_documentation/axi\_ref\_guide/latest/ug761\_axi\_reference\_guide.pdf
- [45] XILINX, Integrated Logic Analyzer v6.2, LogiCORE IP Product Guide, Vivado Design Suite PG172 October 5, 2016.
- [46] Ravi Budruk (21 August 2007). "PCI Express Basics". PCI-SIG. Archived from the original (PDF) on 15 July 2014. Retrieved 15 July 2014.
- [47] XILINX, FIFO Generator v13.2, Vivado Design Suite PG057 October 4, 2017.