# The inever died: Automata theory for reversing modern CPUs



#### RootedCON - March 2020









#### About me

l'm Pepe Vila (a.k.a. cgvwzq)

PhD student at the IMDEA Software Institute Worked as security consultant and pentester

Intern at Facebook and Microsoft Research

I used to mess with browsers and JavaScript... ...but fell into the side channel's rabbit hole







#### Motivation

Remember last year's "Cache and syphilis"?



dafuq is this pattern :S

#### Motivation

Knowing the cache replacement policy useful for finding eviction sets,



but also for optimal eviction strategies in rowhammer, or high bandwidth covert channels

|         |            |   |                 | _               |                       |       |         |                 |          |
|---------|------------|---|-----------------|-----------------|-----------------------|-------|---------|-----------------|----------|
|         |            |   | CPU             |                 | CPU                   |       |         |                 |          |
| Latency | 4-cycles   | I | L1 Cache        |                 | L1 Cache              | 32KB  | 8 ways  | private per     |          |
|         | 12-cycles  |   | L2 Cache        |                 | L2 Cache              | 256KB | 4 ways  | _ physical core | Ca       |
|         | 41-cycles  |   | L3 Cache        |                 |                       | 8MB   | 16 ways | } shared        | Capacity |
|         | 150-cycles |   |                 | 1 - Main Memory |                       | 16    | GB      |                 |          |
|         | ļ          | [ | Hard Disk / SSD | ] [             | Network Card<br>(NIC) |       |         |                 |          |

(data from Kaby Lake i7-8550U CPU)



- Memory partitioned in **memory blocks** (64 bytes =  $2^6$ )
- Cache partitioned in equally sized cache sets (1024 = 2<sup>10</sup> = 256KB / (64 \* 4)
- Cache sets have capacity for N cache lines (also known as ways or associativity)



- Memory partitioned in **memory blocks** (64 bytes = 2<sup>6</sup>)
- Cache partitioned in equally sized cache sets (1024 = 2<sup>10</sup> = 256KB / (64 \* 4)
- Cache sets have capacity for N cache lines (also known as ways or associativity)



- Memory partitioned in **memory blocks** (64 bytes = 2<sup>6</sup>)
- Cache partitioned in equally sized cache sets (1024 = 2<sup>10</sup> = 256KB / (64 \* 4)
- Cache sets have capacity for N cache lines (also known as ways or associativity)



- Memory partitioned in **memory blocks** (64 bytes =  $2^6$ )
- Cache partitioned in equally sized cache sets (1024 = 2<sup>10</sup> = 256KB / (64 \* 4)
- Cache sets have capacity for N cache lines (also known as ways or associativity)



- Memory partitioned in **memory blocks** (64 bytes = 2<sup>6</sup>)
- Cache partitioned in equally sized cache sets (1024 = 2<sup>10</sup> = 256KB / (64 \* associativity)
- Cache sets have capacity for N cache lines (also known as ways or associativity)

block 0 CPU 1 256KBs Cache memory address 2 Tag Data 3 Associativity ... Tag Set Offset Set 0 10 6 Set 1

- Memory partitioned in **memory blocks** (64 bytes = 2<sup>6</sup>)
- Cache partitioned in equally sized cache sets (1024 = 2<sup>10</sup> = 256KB / (64 \* associativity)
- Cache sets have capacity for N cache lines (also known as ways or associativity)

block 0 CPU 1 256KBs Cache memory address 2 Tag Data 3 Associativity ... Tag Set Offset Set 0 10 6 = Set 1

- Memory partitioned in **memory blocks** (64 bytes = 2<sup>6</sup>)
- Cache partitioned in equally sized cache sets (1024 = 2<sup>10</sup> = 256KB / (64 \* associativity)
- Cache sets have capacity for N cache lines (also known as ways or associativity)



- Memory partitioned in **memory blocks** (64 bytes = 2<sup>6</sup>)
- Cache partitioned in equally sized cache sets (1024 = 2<sup>10</sup> = 256KB / (64 \* associativity)
- Cache sets have capacity for N cache lines (also known as ways or associativity)

Memory



- Memory partitioned in **memory blocks** (64 bytes = 2<sup>6</sup>)
- Cache partitioned in equally sized cache sets (1024 = 2<sup>10</sup> = 256KB / (64 \* associativity)
- Cache sets have capacity for N cache lines (also known as ways or associativity)

- Cache set partition exploits programs' spatial locality
- Replacement policy decides which blocks to evict exploiting programs' **temporal locality**
- What does a replacement policy look like?
  - First Input First Output (FIFO), Least Recently Used (LRU), Pseudo-LRU, etc.
  - These examples keep track of the order or **ages** of blocks, and evict oldest one
- More complex policies nowadays, but same idea: maintain some metadata or **control state**

#### Caches as Mealy machines

- Natural **abstraction** for an individual cache set
- Input alphabet = set of memory blocks, e.g. {a,b,c} mapping to the same cache set
- **Output** alphabet = {H, M} (hit or miss) for the observable result of accessing a given block
- Every **state** represents the content of the cache set plus its control state (or metadata)



Example: 2-way FIFO cache with 3 blocks {a,b,c}

#### Previous work



#### Previous work

|                                | Others | Abel & Reineke    | Rueda's MS    |
|--------------------------------|--------|-------------------|---------------|
| Automatic                      |        | YES               | YES           |
| Supported class<br>of policies |        | Permutation-based | Deterministic |
| On real hardware               | YES    | YES               |               |
| Scalability                    |        | YES               |               |
| Human readable                 |        |                   |               |
| Correctness                    |        | YES               |               |



#### Our approach



int missIdx (int[4] state)
for(int i = 0; i < 4; i = i + 1)
if(state[i] == 3)
return i;</pre>

#### Explanation

#### Our approach



#### Explanation

#### Previous work vs. our approach

|                                | Others     | Abel & Reineke    | Rueda's MS    | Our           |
|--------------------------------|------------|-------------------|---------------|---------------|
| Automatic                      | NO         | YES               | YES           | YES           |
| Supported class<br>of policies | Individual | Permutation-based | Deterministic | Deterministic |
| On real hardware               | YES        | YES               |               | YES           |
| Scalability                    | NO         | YES               |               | YES           |
| Human readable                 | NO         |                   |               | YES           |
| Correctness                    | NO         | YES               |               | YES           |

#### CacheQuery: a hardware interface



int missIdx (int[4] state)
for(int i = 0; i < 4; i = i + 1)
if(state[i] == 3)
return i;</pre>

#### Explanation

#### CacheQuery: a hardware interface

- Frees the user from **low-level details** like set mapping, timing, cache filtering, code generation, and system's interferences.
- Accepts sequences of **blocks** decorated with an **optional tag**: ? indicates access should be profiled, ! indicates that block should be invalidated, no tag means access.
- Support for macros:
  - @ expansion, \_ wildcard, power operator, etc.
  - E.g. For assoc=4: @ x \_? expands to
    - (a b c d) x [a b c d]?, which expands to
    - [a b c d x a?, a b c d x b?, a b c d x c?, a b c d x d?]
    - and returns {M, H, H, H}

#### CacheQuery: demo

- Disable system's noise
- REPL interactive session
- Target specific level and set
- Ask arbitrary queries



int missIdx (int[4] state)
for(int i = 0; i < 4; i = i + 1)
if(state[i] == 3)
return i;</pre>

#### Explanation

• Why not learn directly from the cache?

- **Redundancy** Replacement policy is agnostic of the specific content
- Policy's logic should depend only on the **control state** (metadata)
- Cache's content management increases automata complexity and learning cost

• We abstract the replacement policy from the cache content management!

Input:





- Example of **concrete cache automaton** for 2-ways LRU with fixed input alphabet {a,b,c} and output {H,M}
- Example of corresponding abstract policy automaton, using input alphabet {h(0), h(1), m()} and output {\_,0,1}
- 12 vs. 2 states → much easier to learn!
- reduction of (associativity+1)! in most cases



#### LearnLib: an automata learning framework



int missIdx (int[4] state)
for(int i = 0; i < 4; i = i + 1)
if(state[i] == 3)
return i;</pre>

#### Explanation

#### Automata learning

• Dana Angluin's L\* algorithm:

"Learning regular sets from queries and counterexamples" (1987)

- Student-Teacher protocol. Student asks 2 types of questions to the teacher:
  - membership Is a word 'w' in the target language 'U'? Yes / No
     interaction with SUL (System Under Learning)
  - equivalence Does the automaton accept language 'U'? Yes / counterexample
     needs access to a specification or oracle
- Find the **minimal automaton** for U with **polynomial cost** in the number of states of the automaton and the length of longest counterexample

- Teacher knows language U = {aa, bb} (alphabet Σ={a, b})
- Student asks if 'ε', 'a', and 'b' are in U and obtains the following **Observation Table**:



- Table entries: (s,e) = 1 *iff* uv  $\in$  U summarizes all membership queries
- From an observation table we can directly **construct an automaton** if table is
  - **closed**  $\forall$ t∈S.Σ ∃s∈S row(t) = row(s)
  - **consistent**  $\forall$  s<sub>1</sub>,s<sub>2</sub> s.t. row(s<sub>1</sub>) = row(s<sub>2</sub>) →  $\forall$  a ∈ Σ row(s<sub>1</sub>.a) = row(s<sub>2</sub>.a)



#### **Observation Table:**





#### Observation Table:





It is closed and consistent Hypothesis: empty language!

Teacher says NO and returns: ce = aa

We need to extend S with 'ce' and all its prefixes



#### **Observation Table:**

3

0

0

0

?

3

a

aa

b

ab

aaa

aab

#### perform a more membership queries

(source: https://www.csa.iisc.ac.in/~deepakd/atc-2015/L Star Algo.pdf)



#### Observation Table:

0

0

0

3

a

aa

b

ab

aaa

aab



(source: https://www.csa.iisc.ac.in/~deepakd/atc-2015/L Star Algo.pdf)



#### **Observation Table:**



#### now it is closed and consistent



we make a new hypothesis, but teacher says NO: ce = bb

(source: https://www.csa.iisc.ac.in/~deepakd/atc-2015/L Star Algo.pdf)



(source: https://www.csa.iisc.ac.in/~deepakd/atc-2015/L Star Algo.pdf)

#### **Observation Table:**



table is closed and consistent, let's see if hypothesis is correct not?



nope ce = babb

• With one more step, we finally find the automaton accepting **U** = {aa,bb}



- The algorithm ensures that on every hypothesis the **automaton is minimal**.
- Teacher can give arbitrarily long counterexamples.

# LearnLib handles all the learning

- LearnLib is an open source Java framework for automata learning developed at the TU Dortmund University - <u>https://learnlib.de/</u>
- Angluin's L\* algorithm has been extended to Mealy machines:
  - Membership queries replaced by output queries
  - Equivalence queries approximated by test sequences for conformance testing
  - **Reset sequence** is bootstrapping problem, we solve it with Flush+Refill

**WP-method:** test sequence selection - given an upper bound on the number of states of the System Under Learning (SUL), guarantees equivalence



int missIdx (int[4] state)
for(int i = 0; i < 4; i = i + 1)
 if(state[i] == 3)
 return i:</pre>

#### **Explanation**

- Automata models are great, but if we want to understand what is really happening...
- This is only LRU with associativity 4, a fairly simple policy.



Domain knowledge or high-level view of a replacement policy:

- Each block has an associated age
- **Promotion** rule decides how the ages are updated upon a hit
- **Replacement** rule decides which block is evicted upon a miss
- **Insertion** rule decides the age of a new block
- Normalization rule describes how to normalize ages after/before a hit or miss (e.g. in MRU reset used bit when all are set)

With that domain knowledge, we "sketch" a **template** of how replacement policies looks like:

```
hit (state, line) :: States×Lines → States
  state = promote(state, line)
  state = normalize(state, line)
  return state
```

```
miss (state) :: States → States×Lines
Lines idx = -1
state = normalize(state, idx)
idx = evict(state)
state[idx] = insert(state, idx)
state = normalize(state, idx)
return ⟨state, idx⟩
```

With that domain knowledge, we "sketch" a **template** of how replacement policies looks like:

```
hit (state, line) :: States×Lines → States
  state = promote(state, line)
  state = normalize(state, line)
  return state
```

```
miss (state) :: States → States×Lines
Lines idx = -1
state = normalize(state, idx)
idx = evict(state)
state[idx] = insert(state, idx)
state = normalize(state, idx)
return ⟨state, idx⟩
```

Specify the grammar of the functions. For instance:

```
promote (state, pos) :: States×Lines → States
States final = state
if (??{boolExpr(state[pos])})
final[pos] = ??{natExpr(state[pos])}
for(i in Lines)
if(i != pos ∧ ??{boolExpr(state[pos], state[i])})
final[i] = ??{natExpr(state[i])}
return final
```

With that domain knowledge, we "sketch" a template of how replacement policies looks like:

```
hit (state, line) :: States×Lines → States
  state = promote(state, line)
  state = normalize(state, line)
  return state
```

```
miss (state) :: States → States×Lines
Lines idx = -1
state = normalize(state, idx)
idx = evict(state)
state[idx] = insert(state, idx)
state = normalize(state, idx)
return ⟨state, idx⟩
```

Specify the grammar of the functions. For instance:

```
promote (state, pos) :: States×Lines → States
States final = state
if (??{boolExpr(state[pos])})
final[pos] = ??{natExpr(state[pos])}
for(i in Lines)
if(i != pos ∧ ??{boolExpr(state[pos], state[i])})
final[i] = ??{natExpr(state[i])}
return final
```

And encode the automaton's output and transition functions as **constraints**.

### **Case Studies**

- Learning from software simulated caches
- Learning from hardware
- Synthesizing Explanations



# Case Study: Learning from Software-Simulated Caches

- Support for a broader class of policies than previous work
- Scale up to larger associativities than previous work
- Number of states still grows exponentially with associativity :(

| Policy   | Assoc. | # States | Time           |
|----------|--------|----------|----------------|
| FIFO     | 2      | 2        | 0h0m0.14s      |
|          |        |          |                |
|          | 16     | 16       | 0h0m0.38s      |
| LRU      | 2      | 2        | 0h0m0.10s      |
|          | 4      | 24       | 0h0m0.22s      |
|          | 6      | 720      | 0h0m32.70s     |
| PLRU     | 2      | 2        | 0.10 s         |
|          | 4      | 8        | 0.22 s         |
|          | 8      | 128      | 1.46 s         |
|          | 16     | 32768    | 34 h 18 m 25 s |
| MRU      | 2      | 2        | 0h0m0.10s      |
|          | 4      | 14       | 0h0m0.16s      |
|          | 6      | 62       | 0h0m0.61s      |
|          | 8      | 254      | 0h0m8.82s      |
|          | 10     | 1022     | 0 h 5 m 58 s   |
|          | 12     | 4094     | 3 h 59 m 20 s  |
|          | 2      | 2        | 0h0m0.10s      |
| LIP      | 4      | 24       | 0h0m0.26s      |
|          | 6      | 720      | 0h0m31.97s     |
| SRRIP-HP | 2      | 12       | 0h0m0.16s      |
|          | 4      | 178      | 0h0m1.46s      |
|          | 6      | 2762     | 0 h 9 m 38 s   |
| SRRIP-FP | 2      | 16       | 0h0m0.19s      |
|          | 4      | 256      | 0h0m7.27s      |
|          | 6      | 4096     | 2h 30m 51s     |

**Table 2.** Learning policies from software-simulated caches (with 36 hours timeout). We omit FIFO's intermediate results.

### Case Study: Learning from Hardware

| CPU                     | Cache level | Assoc. | Slices | Sets per slice |
|-------------------------|-------------|--------|--------|----------------|
| i7-4790<br>(Haswell)    | L1          | 8      | 1      | 64             |
|                         | L2          | 8      | 1      | 512            |
|                         | L3          | 16     | 4      | 2048           |
| i5-6500<br>(Skylake)    | L1          | 8      | 1      | 64             |
|                         | L2          | 4      | 1      | 1024           |
|                         | L3          | 12     | 8      | 1024           |
| i7-8850U<br>(Kaby Lake) | L1          | 8      | 1      | 64             |
|                         | L2          | 4      | 1      | 1024           |
|                         | L3          | 16     | 8      | 1024           |

# Case Study: Learning from Hardware

#### Challenges:

- Not all sets implement the same policy (set-duelling) → we identify leader sets
- Not all leader sets are deterministic (probabilistic and adaptive policies) → :(
- L3 has too large associativities we use Intel's CAT to virtually reduce associativity
- Reset sequences not 100% reliable + required some manual adjustment

# Case Study: Learning from Hardware

| CPU                     | Level | Assoc.                       | Sets                                                         | States         | Policy | Reset Seq. |
|-------------------------|-------|------------------------------|--------------------------------------------------------------|----------------|--------|------------|
|                         | L1    | 8                            | 0 - 63                                                       | 128            | PLRU   | @@         |
| i7-4790                 | L2    | 8                            | 0 - 511                                                      | 128            | PLRU   | @          |
| (Haswell)               | To    | 11                           | 512 – 575 (only for slice 0)                                 | <del></del> .8 |        |            |
| L3                      | 16    | 768 – 831 (only for slice 0) |                                                              | <del></del> :  | 1      |            |
| :5 (500                 | L1    | 8                            | 0 - 63                                                       | 128            | PLRU   | @          |
| i5-6500<br>(Skylake)    | L2    | 4                            | 0 - 1023                                                     | 160            | New1   | DCBA@      |
| (Skylake)               | L3    | 4 <sup>†</sup>               | 0 33 132 165 264 297 396 429 528 561 660 693 792 825 924 957 | 175            | New2   | @          |
| T OFFOLL                | L1    | 8                            | 0 - 63                                                       | 128            | PLRU   | @          |
| i7-8550U<br>(Kaby Lake) | L2    | 4                            | 0 - 1023                                                     | 160            | New1   | DCBA@      |
| (Ruby Luke)             | L3    | 4 <sup>†</sup>               | 0 33 132 165 264 297 396 429 528 561 660 693 792 825 924 957 | 175            | New2   | @          |

**Table 4.** Results of learning policies from hardware caches. † indicates that the associativity has been virtually reduced using CAT. The 'Sets' column specifies the analyzed cache sets (unless otherwise specified, the findings apply to all slices).

# Case Study: Synthesizing Explanations

| Policy   | States | Time |
|----------|--------|------|
| FIFO     | 4      | 18ms |
| LRU      | 24     | 81ms |
| PLRU     | 8      | -    |
| LIP      | 24     | 4s   |
| MRU      | 14     | 40s  |
| SRRIP-HP | 178    | 105h |
| SRRIP-FP | 256    | 48h  |
| New1     | 160    | 9h   |
| New2     | 175    | 26h  |

# Case Study: Synthesizing Explanations

Description of Skylake/Kaby Lake L2's (New1):

Initial insertion on a flushed cache set:

int[4] s0 = {3,3,3,0};

```
int[4] hitState (int[4] state, int pos)
 int[4] final = state;
 // Promotion
 final[pos] = 0;
 // Is there a block with age 3?
 bit found = 0;
 for(int j = 0; j < 4; j = j + 1)
   if(!found)
     for(int i = 0; i < 4; i = i + 1)
       if(!found && final[i] == 3)
            found = 1;
   // If not, increase all blocks except promoted one
   if(!found)
     for(int i = 0; i < 4; i = i + 1)
       if(i != pos)
         final[i] = final[i] + 1;
  return final;
```

```
int[4] missState (int[4] state)
  int[4] final = state;
  int replace = missIdx(state);
  // Insertion
 final[replace] = 1;
  // Is there a block with age 3?
  bit found = 0:
  for(int j = 0; j < 4; j = j + 1)
    if(!found)
      for(int i = 0; i < 4; i = i + 1)
       if(!found && final[i] == 3)
         found = 1:
    // If not, increase all blocks except inserted one
    if(!found)
     for(int i = 0; i < 4; i = i + 1)
       if(replace != i)
         final[i] = final[i] + 1;
  return final:
```

```
// Replace first block with age 3 starting from the left
int missIdx (int[4] state)
for(int i = 0; i < 4; i = i + 1)
    if(state[i] == 3)
    return i;</pre>
```



# Case Study: Synthesizing Explanations

Description of Skylake/Kaby Lake L3's (New2):

Initial insertion on a flushed cache set:

int[4] s0 = {3,3,3,3};

```
int[4] hitState (int[4] state, int pos)
  int[4] final = state;
 // Promotion
  if (final[pos] > 1)
   final[pos] = 1;
  else
    final[pos] = 0;
  // Is there a block with age 3?
  bit found = 0;
  for(int j = 0; j < 4; j = j + 1)
   if(!found)
     for(int i = 0; i < 4; i = i + 1)
       if(!found && final[i] == 3)
            found = 1;
    // If not, increase all blocks
    if(!found)
      for(int i = 0; i < 4; i = i + 1)
          final[i] = final[i] + 1;
  return final;
```

```
int[4] missState (int[4] state)
  int[4] final = state;
  int replace = missIdx(state);
  // Insertion
 final[replace] = 1;
  // Is there a block with age 3?
  bit found = 0:
  for(int j = 0; j < 4; j = j + 1)
    if(!found)
      for(int i = 0; i < 4; i = i + 1)
       if(!found && final[i] == 3)
         found = 1:
   // If not, increase all blocks
    if(!found)
      for(int i = 0; i < 4; i = i + 1)
          final[i] = final[i] + 1;
  return final;
```

// Replace first block with age 3 starting from the left
int missIdx (int[4] state)
for(int i = 0; i < 4; i = i + 1)
 if(state[i] == 3)
 return i;</pre>

NEW



## Conclusions

- End-to-end solution for learning deterministic hardware replacement policies
- We are able to automatically infer human-readable descriptions
- We uncover 2 previously undocumented policies used in recent Intel processors
- All our contributions are independent and ready to use in alternative workflows



https://github.com/cgvwzq/cachequery







## Thank you for listening! Questions?





https://github.com/cgvwzq/cachequery









Adaptive Insertion Policies for High Performance Caching https://researcher.watson.ibm.com/researcher/files/us-moingureshi/papers-dip.pdf

Intel Ivy Bridge Cache Replacement Policy <a href="http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/">http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/</a>

Measurement-based Modeling of the Cache Replacement Policy <a href="http://embedded.cs.uni-saarland.de/publications/CacheModelingRTAS2013.pdf">http://embedded.cs.uni-saarland.de/publications/CacheModelingRTAS2013.pdf</a>

Learning Cache Replacement Policies using Register Automata <a href="https://uu.diva-portal.org/smash/get/diva2:678847/FULLTEXT01.pdf">https://uu.diva-portal.org/smash/get/diva2:678847/FULLTEXT01.pdf</a>

#### Extra material

### Extra: Adaptive Policies and Leader Sets

- We use thrashing sequences (e.g. @ M @?) on a per cache set basis to identify leader sets:
  - Haswell i7-4790:
    - sets 512 575 in slice 0 fixed policy susceptible to thrashing.
    - sets 768 831 in slice 0 fixed thrash resistant policy (seems not deterministic).
    - rest of sets follow the policy producing less misses.
  - Skylake i5-6500 and Kaby Lake i7-8550U:
    - sets whose indexes satisfy ((((set & 0x3e0) >> 5) ⊕ (set & 0x1f)) = 0x0) ∧ ((set & 0x2) = 0x0) fixed policy susceptible to thrashing (group 1)
    - rest of sets seem to use an adaptive policy
    - but sets whose indexes satisfy ((((set & 0x3e0) >> 5) ⊕ (set & 0x1f)) = 0x1f) ∧ ((set & 0x2) = 0x2) change differently (group 2), still WIP for this

group 1: 0 33 132 165 264 297 396 429 528 561 660 693 792 825 924 957 group 2: 31 62 155 186 279 310 403 434 527 558 651 682 775 806 899 930