

### Zen and the Art of Performance Monitoring

Michael Chynoweth - Sr. Principal Engineer, Intel Corporation

Contributors: Joe Olivas, Patrick Konsor, Rajshree Chabukswar, Seth Abraham, Stas Bratanov



Intel Confidential - Do Not Forward

# Agenda

- End in Mind
  - Show some of the innovations we have in performance monitoring
  - Demonstrate, with examples, how those advancements resolve problems
- Topic 1: Delivering definitive paths to debug across all architectures
  - Top Down and how it helps avoid pitfalls
- Topic 2: Determining paths of execution, call stacks and timing
- Topic 3: Large amounts of data, small timeframe, minimal perturbation or blind spots



### Topic 1: Delivering Definitive Paths to Debug Power and Performance

- The Lines Between Segments Are Blurring
  - Customers ask for training on all segments: Quark®, Atom®, Core® and Xeon®
- Problem = Performance monitoring unit features are designed by experts
- Without a clear optimization path, our customers will go off on "tangents"
  - Example: A customer was concentrating on memory ordering "Nukes" in their performance analysis; using Top Down, we found Nukes were only 0.3% of execution





Memory Ordering "Nukes"

Memory Ordering "Fluffy, Harmless, Rainbow-Colored Bunnies"

Customers Should Be Pointed by Our Methodology and Tools Exactly Where to Look



# Defining One Starting Point For Core, Uncore and Power Across All SoCs



### Consistent Methodologies to Avoid Tangents: Example = Top Down Methodology for Debugging CPU Bottlenecks



### Top Down breaks the pipeline into 4 categories

- Front End Bound = Bound in Instruction Fetch -> Decode (Instruction Cache, ITLB)
- Back End Bound = Bound in Execute -> Commit (Example = Execute, load latency)
- Bad Speculation = When the pipeline incorrectly predicts execution (Examples: branch mispredict, memory ordering nuke)
- Retiring = Pipeline is retiring uops



#### Event NO\_ALLOC\_CYCLES.NOT\_DELIVERED counts when the Back End requests uops and the Front End cannot deliver

[Figure: pipeline diagram. **Front End**: Instruction Fetch, Instruction Decode, MSROM. **Back End**: Allocate, Rename, sched, Execute, Data Cache, Commit, Retire. Label: "10 cycle mispredict".]



# Why Do We Use Top Down to Drive Looking at Other Events?

| Stats                          | BayTrail | Calculation                                           |
|--------------------------------|----------|-------------------------------------------------------|
| Cycles Per Instruction (CPI)   | 2.9      | CPU_CLK_UNHALTED.CORE/INST_RETIRED.ANY                |
| Front End Bound Cost           | 0.1%     | NO_ALLOC_CYCLES.NOT_DELIVERED*1/CPU_CLK_UNHALTED.CORE |
| Microcode Sequencer Entry Cost | 57.0%    | MS_DECODED.MS_ENTRY*5/CPU_CLK_UNHALTED.CORE           |
| MSUOPS/UOP_RETIRED             | 65%      | UOPS_RETIRED.MS/UOPS_RETIRED.ALL                      |

Microcode Sequencer Entry Cost is 57% of all cycles?! Should I raise the alarm?
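
The calculations in the table above can be sketched in a few lines of Python. This is a minimal sketch, not a tool: the event names and multipliers come from the table, while the function name `top_down_stats` and the raw counts in `sample_counts` are hypothetical, chosen only so that the ratios match the BayTrail column.

```python
def top_down_stats(counts):
    """Apply the table's formulas to a dict of raw event counts."""
    cycles = counts["CPU_CLK_UNHALTED.CORE"]
    return {
        "cpi": cycles / counts["INST_RETIRED.ANY"],
        # Multiplier of 1: each counted cycle is one cycle not delivered.
        "front_end_bound": counts["NO_ALLOC_CYCLES.NOT_DELIVERED"] * 1 / cycles,
        # Multiplier of 5: the per-entry cost used in the table's formula.
        "ms_entry_cost": counts["MS_DECODED.MS_ENTRY"] * 5 / cycles,
        "ms_uop_fraction": counts["UOPS_RETIRED.MS"] / counts["UOPS_RETIRED.ALL"],
    }


# Hypothetical raw counts (NOT measured data), picked to reproduce the table.
sample_counts = {
    "CPU_CLK_UNHALTED.CORE": 2_900_000,
    "INST_RETIRED.ANY": 1_000_000,
    "NO_ALLOC_CYCLES.NOT_DELIVERED": 2_900,
    "MS_DECODED.MS_ENTRY": 330_600,
    "UOPS_RETIRED.MS": 650_000,
    "UOPS_RETIRED.ALL": 1_000_000,
}

stats = top_down_stats(sample_counts)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

The point of computing Front End Bound alongside the MS entry cost is that the second number only matters if the first one says the front end is actually starving the back end.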





# When Would the Microcode Sequencer Matter?

| Stats                          | BayTrail | Minimized<br>BayTrail | Calculation                                           |
|--------------------------------|----------|-----------------------|-------------------------------------------------------|
| Cycles Per Instruction (CPI)   | 2.9      | 4.6                   | CPU_CLK_UNHALTED.CORE/INST_RETIRED.ANY                |
| Front End Bound Cost           | 0.1%     | 63.6%                 | NO_ALLOC_CYCLES.NOT_DELIVERED*1/CPU_CLK_UNHALTED.CORE |
| Microcode Sequencer Entry Cost | 57.0%    | 41.4%                 | MS_DECODED.MS_ENTRY*5/CPU_CLK_UNHALTED.CORE           |

Looking at events in isolation is dangerous. Even though MS Entry cost is 57% of all cycles on BayTrail, Front End Bound is only 0.1%, so we know it is not impacting performance!

On Minimized BayTrail, Front End Bound is 64% of all cycles! The two MS issues (entry cost and 1/2 speed cost) explain ~97% of the FE bottleneck



# When Would the Microcode Sequencer Matter?

| Stats                              | BayTrail | Minimized<br>BayTrail | Calculation                                           |
|------------------------------------|----------|-----------------------|-------------------------------------------------------|
| Cycles Per Instruction (CPI)       | 2.9      | 4.6                   | CPU_CLK_UNHALTED.CORE/INST_RETIRED.ANY                |
| Front End Bound Cost               | 0.1%     | 63.6%                 | NO_ALLOC_CYCLES.NOT_DELIVERED*1/CPU_CLK_UNHALTED.CORE |
| Microcode Sequencer 1/2 Speed Cost | 0.0%     | 20.5%                 | UOPS_RETIRED.MS/(2*CPU_CLK_UNHALTED.CORE)             |
| Microcode Sequencer Entry Cost     | 57.0%    | 41.4%                 | MS_DECODED.MS_ENTRY*5/CPU_CLK_UNHALTED.CORE           |
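
As a quick check of the "~97%" figure, the two microcode sequencer costs in the Minimized column can be compared against the Front End Bound cost. The sketch below uses only the percentages from the table; the function name `share_explained` is hypothetical.

```python
def share_explained(front_end_bound_pct, *component_pcts):
    """Fraction of the Front End Bound cost covered by the listed components."""
    return sum(component_pcts) / front_end_bound_pct


# Minimized column: FE Bound 63.6%, MS 1/2 speed 20.5%, MS entry 41.4%.
explained = share_explained(63.6, 20.5, 41.4)
print(f"{explained:.1%}")  # 97.3%
```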



#### Top Down Let Us Identify Items We Did Not Understand



# Where Is It Going? We Would Like to Better Tag Indirect Impacts with Top Down

| Atom GFX 640 MHz<br>IA Frequency (MHz) | 1360  | 1600   | 2000   | 2400  | 2560  | Entire Data Set |
|----------------------------------------|-------|--------|--------|-------|-------|-----------------|
| Atom (GFX Enabled) FPS                 | 96.06 | 106.21 | 120.35 | 132.4 | 137.5 | 1.43            |
| Perfect Frequency<br>Scaling           |       | 1.18   | 1.25   | 1.20  | 1.07  | 1.88            |
| Actual Scaling                         |       | 1.11   | 1.13   | 1.10  | 1.04  | 1.43            |
| Frequency_Dep%                         |       | 60%    | 53%    | 50%   | 58%   | 49%             |
| Non_Frequency_Dep%                     |       | 40%    | 47%    | 50%   | 42%   | 51%             |

[Figure: "CHT & BDW IA Sensitivity with NullHW"; x-axis: IA Frequency (MHz)]

Latency to Memory Causing Problems

Source: Graphics Arch Lab
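
The slide does not state how Frequency_Dep% is derived. One formula that reproduces every value in the table is (actual scaling - 1) / (perfect scaling - 1), so the sketch below assumes it; treat both the formula and the function name `frequency_dependence` as assumptions rather than the lab's documented method.

```python
def frequency_dependence(freq_lo, freq_hi, fps_lo, fps_hi):
    """Share of the ideal frequency speedup that actually showed up in FPS.

    Assumed formula: (actual scaling - 1) / (perfect scaling - 1).
    """
    perfect = freq_hi / freq_lo
    actual = fps_hi / fps_lo
    return (actual - 1.0) / (perfect - 1.0)


# IA frequencies and FPS values taken from the table.
ia_freq_mhz = [1360, 1600, 2000, 2400, 2560]
fps = [96.06, 106.21, 120.35, 132.4, 137.5]

# Dependence for each adjacent frequency step, as whole percentages.
step_pct = [
    round(100 * frequency_dependence(ia_freq_mhz[i], ia_freq_mhz[i + 1],
                                     fps[i], fps[i + 1]))
    for i in range(len(fps) - 1)
]
# Dependence across the entire data set (lowest to highest frequency).
overall_pct = round(100 * frequency_dependence(ia_freq_mhz[0], ia_freq_mhz[-1],
                                               fps[0], fps[-1]))

print(step_pct, overall_pct)
```

Non_Frequency_Dep% is then simply 100 minus each of these values.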

| Module_Name                       | HotThread% | CPI  | OBSERVATIONS                                            | Issue_Summary                                    |
|-----------------------------------|------------|------|---------------------------------------------------------|--------------------------------------------------|
| Benchmark.exe:<br>Benchmark.exe   | 65.80%     | 2.89 | Retiring=19.44:FrontEnd=23.3:<br>BackEnd&BadSpec=57.27: | L2_MISS=19%_D:ICACHEMISSES=10%_D                 |
| Benchmark.exe:<br>igd10iumd32.dll | 13.11%     | 4.15 | _                                                       | ICACHEMISSES=34%_D:ITLB_MISSES=8%:L2_MISS=10%_D: |

- Benchmark binary has a large data cache footprint
- Graphics Driver has a large instruction cache footprint
- Graphics Driver and Benchmark Binary Battle Over Instruction + Data Real Estate

### What Are The Last Branch Records?

|             | 63       | 62       | 61          | 60:48    | 47:16            | 15:0             |
|-------------|----------|----------|-------------|----------|------------------|------------------|
| LBR_FROM_IP | SIGN_EXT | SIGN_EXT | SIGN_EXT    | SIGN_EXT | LBR FROM address | LBR FROM address |
| LBR_TO_IP   | SIGN_EXT | SIGN_EXT | SIGN_EXT    | SIGN_EXT | LBR TO address   | LBR TO address   |
| LBR_INFO    | MISPRED  | IN_TX    | TSX_ABORTED | Reserved | Reserved         | cycle-count (*)  |

LBR Overview:

- LBRs dynamically track the last N taken branches:
  - N now ranges from 8 to 32 taken branches
  - LBRs can be filtered by branch type
- How are they used today?
  - Recreating paths of execution
  - Assisting in obtaining basic block hit counts
    - Used to weight the cost of all paths of execution (function, branch, module)
  - Compilers are starting to use them for profile-guided feedback
    - Example = AutoFDO
- Most recent additions
  - Call stacks to any point of interest with LBR call stack
  - Cycle count

Pay attention, this one is brand new
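
To make the LBR_INFO layout concrete, the sketch below unpacks a raw 64-bit value into the fields shown in the table (MISPRED at bit 63, IN_TX at bit 62, TSX_ABORTED at bit 61, cycle count in bits 15:0). The `LbrInfo` type, the `decode_lbr_info` helper and the example value are hypothetical; reading the register itself is out of scope here.

```python
from typing import NamedTuple


class LbrInfo(NamedTuple):
    mispred: bool
    in_tx: bool
    tsx_aborted: bool
    cycle_count: int


def decode_lbr_info(raw: int) -> LbrInfo:
    """Split a raw LBR_INFO value into the fields from the table."""
    return LbrInfo(
        mispred=bool((raw >> 63) & 1),
        in_tx=bool((raw >> 62) & 1),
        tsx_aborted=bool((raw >> 61) & 1),
        cycle_count=raw & 0xFFFF,  # bits 15:0
    )


# Made-up value: a mispredicted branch with a cycle count of 37.
info = decode_lbr_info((1 << 63) | 37)
print(info)
```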

### LBR is Utilized to Recreate Hot Path of Execution



LBR Allows Visibility of Complete Transaction



### How Does Adding Timing Help?



# What is Intel® Processor Trace?

Intel® Processor Trace (Intel® PT) is a hardware feature that logs information about software execution with minimal impact on system execution

- Supports control flow tracing with <5% overhead
  - Decoder can determine the exact flow of software execution from the trace log
- Can store both cycle count and timestamp information



# Intel Processor Trace is Delivering New Capabilities

### **Zooming at Microsecond Granularities**



#### **Locking Debug**

Determining Contention on a Lock:

- RETRY\_LOCK = 10564
- GOT\_LOCK = 43542
- 10564/(43542 + 10564) = 0.1952 (or 20% contended)

[Screenshot: two trace searches. `B::Trace.FindAll , Address _retry_protection` finds 10564 records; `B::Trace.FindAll , Address V.RANGE("_already_owned")` finds 43542 records.]
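
The contention arithmetic above generalizes to any pair of hit counts pulled from a trace. A minimal sketch, with `contention_ratio` as a hypothetical helper name and the two counts taken from the slide:

```python
def contention_ratio(retry_hits, acquired_hits):
    """Fraction of lock attempts that had to retry."""
    return retry_hits / (retry_hits + acquired_hits)


# RETRY_LOCK = 10564 and GOT_LOCK = 43542, as counted in the trace.
ratio = contention_ratio(10564, 43542)
print(f"{ratio:.4f}")  # 0.1952
```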

#### **Exceptions Hurting Performance**



# SoC Sizing using Modeling with Instruction Traces

- Model Icache, ITLB, and pre-decode in software, with a range of sizes and configs
- Simulate over traces from target workloads
  - Instruction traces quick to capture
  - And (relatively) quick to simulate
- Enables easy estimation of cache behavior for a given workload
  - Accurate within a few percentage points for Icache, ITLB, and pre-decode
- Enables evaluation of different cache configs across a range of workloads
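
The modeling idea above can be sketched as a small set-associative LRU cache simulated over a list of instruction addresses. Everything here is an assumption for illustration: the class and function names are hypothetical, the replacement policy and 64-byte line size are choices made for the sketch, the trace is synthetic (a real flow would decode instruction addresses from Intel PT), and only the Icache is modeled, not the ITLB or pre-decode.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Set-associative cache with LRU replacement (illustrative model only)."""

    def __init__(self, size_bytes, ways, line_bytes=64):
        self.line_bytes = line_bytes
        self.ways = ways
        self.num_sets = size_bytes // (ways * line_bytes)
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.accesses = 0
        self.misses = 0

    def access(self, address):
        line = address // self.line_bytes
        lru = self.sets[line % self.num_sets]
        self.accesses += 1
        if line in lru:
            lru.move_to_end(line)  # mark as most recently used
            return True
        self.misses += 1
        lru[line] = None
        if len(lru) > self.ways:
            lru.popitem(last=False)  # evict the least recently used line
        return False

    @property
    def miss_rate(self):
        return self.misses / self.accesses if self.accesses else 0.0


def sweep(trace, configs):
    """Replay one trace through each (size_bytes, ways) configuration."""
    results = {}
    for size_bytes, ways in configs:
        cache = SetAssociativeCache(size_bytes, ways)
        for ip in trace:
            cache.access(ip)
        results[(size_bytes, ways)] = cache.miss_rate
    return results


# Synthetic trace: a loop touching 48 KiB of code, one fetch per line, 10 passes.
trace = [0x400000 + off for _ in range(10) for off in range(0, 48 * 1024, 64)]
results = sweep(trace, [(32 * 1024, 8), (64 * 1024, 8)])
for (size_bytes, ways), rate in results.items():
    print(f"{size_bytes // 1024} KiB, {ways}-way: miss rate {rate:.2f}")
```

With this synthetic loop, the 32 KiB configuration thrashes under LRU while the 64 KiB configuration only takes cold misses, which is the kind of sizing cliff a sweep over real traces is meant to expose.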

#### Intel Processor Trace Allows Modeling of Instruction Cache

#### **Instruction Cache Miss Rate**



### **Precise Events Are Incredibly Useful**



Precise Events Collect Eventing IP, Registers, Data Linear Address (some) and Do NOT Require a Performance Monitoring Interrupt to Collect


### Utilizing PEBS Triggering on Non-Precise Events



Capability to Collect PEBS on Non-Precise Events Allows For Less Overhead, Better IP Tagging and Works When Interrupts Are Masked



### Conclusions

- Shifting toward definitive ways to debug performance
  - Need help from tools to ensure this is all automated and to help innovate
- LBRs are now complemented with timing
  - Get exact timing to nanosecond granularity
- Intel Processor Trace is Augmenting LBRs
  - Being used for advanced debug
- Precise Event Based Sampling is being utilized on non-precise events
  - Allows for an extremely cheap methodology to collect events without performance monitoring interrupts, and for better tagging of issues