This is section 8 of my lab code notes for my self-taught MIT6.S081 operating system course: Locks. The approximate duration of this lab: 14 hours.

Courses address: pdos.csail.mit.edu/6.S081/2020… Lab address: pdos.csail.mit.edu/6.S081/2020… My code address: github.com/Miigon/my-x… Commits: github.com/Miigon/my-x…

The code comments in this article were added while writing the blog, and the code in the original repository may be missing comments or not identical.

Lab 8: Locks

Code was redesigned to reduce lock contention and improve system parallelism on multicore machines.

Memory allocator (moderate)

Reduce kmem lock contention in the kalloc implementation by splitting kmem's free-memory linked list.

Principle and Analysis

In kalloc's original implementation, a single freelist links all free physical pages together, using the free pages themselves as the list nodes (thus avoiding the need for extra space). A page is removed from the list when it is allocated and put back when it is reclaimed.

// kernel/kalloc.c
struct {
  struct spinlock lock;
  struct run *freelist;
} kmem;
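(For reference, struct run is just a single pointer stored at the beginning of each free page, which is what lets the free pages themselves act as list nodes:)

// kernel/kalloc.c
struct run {
  struct run *next;
};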

Allocating physical pages

// kernel/kalloc.c
void *
kalloc(void)
{
  struct run *r;

  acquire(&kmem.lock);
  r = kmem.freelist; // Retrieve a physical page. The page entry itself is a physical page.
  if(r)
    kmem.freelist = r->next;
  release(&kmem.lock);

  if(r)
    memset((char*)r, 5, PGSIZE); // fill with junk
  return (void*)r;
}


Here, both allocating and freeing a physical page modify freelist. Because that modification takes multiple steps, a lock is needed to keep the list consistent across threads. The downside is that the single lock also prevents multiple threads from allocating memory concurrently, which limits parallelism.
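To make the need for the lock concrete, here is a hypothetical interleaving of two CPUs running the list-pop without holding the lock:

CPU0: r = kmem.freelist;           // both CPUs read the same head page
CPU1: r = kmem.freelist;
CPU0: kmem.freelist = r->next;
CPU1: kmem.freelist = r->next;     // writes the same value again

Both kalloc() calls now return the same physical page, which would later be used for two unrelated purposes and corrupt memory.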

Evidence of this is the frequent lock contention on the kmem lock:

$ kalloctest
start test1
test1 results:
--- lock kmem/bcache stats
lock: kmem: #fetch-and-add 83375 #acquire() 433015
lock: bcache: #fetch-and-add 0 #acquire() 1260
--- top 5 contended locks:
lock: kmem: #fetch-and-add 83375 #acquire() 433015
lock: proc: #fetch-and-add 23737 #acquire() 130718
lock: virtio_disk: #fetch-and-add 11159 #acquire() 114
lock: proc: #fetch-and-add 5937 #acquire() 130786
lock: proc: #fetch-and-add 4080 #acquire() 130786
tot= 83375
test1 FAIL

A note on optimization methodology: profile first. If a single big lock does not cause a measurable performance problem, a big lock is often sufficient. Optimize only when you are sure that the performance hotspot really is the lock. "Premature optimization is the root of all evil."

The way to fix such a hotspot is to turn the shared resource into an unshared one wherever possible. Lock-contention optimization generally follows two ideas:

  • Share only what must be shared (split resources shared by all CPUs into independent per-CPU resources).
  • Where sharing is unavoidable, spend as little time as possible inside the critical section (turn one big lock into several small locks, i.e. reduce the lock granularity).

The goal of this part of the lab is to give each CPU its own independent freelist, so that multiple CPUs can allocate physical pages concurrently without excluding one another, improving parallelism.

However, one CPU's freelist may run out of free pages, in which case pages still have to be "stolen" from other CPUs' freelists. A CPU's freelist is therefore not accessed exclusively by its own CPU: other CPUs may touch it while stealing pages, so each per-CPU freelist still needs its own lock. But a CPU running out of free pages is relatively rare, so overall performance is still much better than with a single big kmem lock. In the best case, when no pages are stolen across CPUs, there is no contention on these small locks at all.

Code implementation

// kernel/kalloc.c
struct {
  struct spinlock lock;
  struct run *freelist;
} kmem[NCPU]; // Assign a separate freelist to each CPU and protect it with a separate lock.

char *kmem_lock_names[] = {
  "kmem_cpu_0", "kmem_cpu_1", "kmem_cpu_2", "kmem_cpu_3",
  "kmem_cpu_4", "kmem_cpu_5", "kmem_cpu_6", "kmem_cpu_7",
};

void
kinit()
{
  for(int i = 0; i < NCPU; i++) { // Initialize all locks
    initlock(&kmem[i].lock, kmem_lock_names[i]);
  }
  freerange(end, (void*)PHYSTOP);
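  // note: freerange() frees pages via kfree(); since kinit() runs on CPU 0 only,
  // all free pages initially end up on CPU 0's freelist, and the other CPUs
  // get their pages later by "stealing" (see kalloc below).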
}

// kernel/kalloc.c
void
kfree(void *pa)
{
  struct run *r;

  if(((uint64)pa % PGSIZE) != 0 || (char*)pa < end || (uint64)pa >= PHYSTOP)
    panic("kfree");

  // Fill with junk to catch dangling refs.
  memset(pa, 1, PGSIZE);

  r = (struct run*)pa;

  push_off();
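  // cpuid() is only reliable while interrupts are off, so the per-CPU freelist
  // update is wrapped in push_off()/pop_off().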

  int cpu = cpuid();

  acquire(&kmem[cpu].lock);
  r->next = kmem[cpu].freelist;
  kmem[cpu].freelist = r;
  release(&kmem[cpu].lock);

  pop_off();
}

void *
kalloc(void)
{
  struct run *r;

  push_off();

  int cpu = cpuid();

  acquire(&kmem[cpu].lock);

  if(!kmem[cpu].freelist) { // no page left for this cpu
    int steal_left = 64; // steal 64 pages from other cpu(s)
    for(int i = 0; i < NCPU; i++) {
      if(i == cpu) continue; // don't steal from ourselves
      acquire(&kmem[i].lock);
      struct run *rr = kmem[i].freelist;
      while(rr && steal_left) {
        kmem[i].freelist = rr->next;
        rr->next = kmem[cpu].freelist;
        kmem[cpu].freelist = rr;
        rr = kmem[i].freelist;
        steal_left--;
      }
      release(&kmem[i].lock);
      if(steal_left == 0) break; // done stealing
    }
  }

  r = kmem[cpu].freelist;
  if(r)
    kmem[cpu].freelist = r->next;
  release(&kmem[cpu].lock);

  pop_off();

  if(r)
    memset((char*)r, 5, PGSIZE); // fill with junk
  return (void*)r;
}


Here, up to 64 pages are "stolen" from other CPUs when the local freelist runs out. The number 64 is arbitrary; in a real system it would be best to measure and pick a value that keeps the frequency of stealing as low as possible.

Run kalloctest again:

$ kalloctest
start test1
test1 results:
--- lock kmem/bcache stats
lock: kmem_cpu_0: #fetch-and-add 0 #acquire() 35979
lock: kmem_cpu_1: #fetch-and-add 0 #acquire() 195945
lock: kmem_cpu_2: #fetch-and-add 0 #acquire() 201094
lock: bcache: #fetch-and-add 0 #acquire() 1248
--- top 5 contended locks:
lock: proc: #fetch-and-add 22486 #acquire() 132299
lock: virtio_disk: #fetch-and-add 16002 #acquire() 114
lock: proc: #fetch-and-add 11199 #acquire() 132301
lock: proc: #fetch-and-add 5330 #acquire() 132322
lock: proc: #fetch-and-add 4874 #acquire() 132345
tot= 0
test1 OK
start test2
total free number of pages: 32499 (out of 32768)
.....
test2 OK

As you can see, lock contention on kmem has been reduced to 0 (from ~83375).

Buffer cache (hard)

If multiple processes use the file system intensively, they will likely contend for bcache.lock, which protects the disk block cache in kernel/bio.c. bcachetest creates several processes that repeatedly read different files in order to generate contention on bcache.lock;

bcache.lock suffers from serious contention when multiple processes use the file system at the same time. It protects the disk block cache, and its original purpose is to keep concurrent allocation and release of cached blocks consistent.

Principle and Analysis

Unlike kalloc, where an allocated physical page is owned by a single process, the block cache in bcache is genuinely shared among multiple processes (and therefore among multiple CPUs), since several processes can access the same block at the same time. So the kmem approach of pre-partitioning dedicated resources for each CPU does not work here.

As mentioned earlier:

Lock competition optimization generally has several ideas:

  • Share only what must be shared (split resources shared by all CPUs into independent per-CPU resources).
  • Where sharing is unavoidable, spend as little time as possible inside the critical section (turn one big lock into several small locks, i.e. reduce the lock granularity).

Since bcache is a "must share" case, the second idea applies: reduce the lock granularity and use a more fine-grained locking scheme to lower the probability of contention.

// kernel/bio.c
struct {
  struct spinlock lock;
  struct buf buf[NBUF];

  // Linked list of all buffers, through prev/next.
  // Sorted by how recently the buffer was used.
  // head.next is most recent, head.prev is least.
  struct buf head;
} bcache;

The original xv6 design keeps all cached blocks on one doubly linked list. Every attempt to get a block blockno traverses this list: if the target block is already cached it is returned directly; otherwise the least recently used buf with a reference count of zero is chosen, repurposed as the cache for this block, and returned.
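For reference, the original bget in kernel/bio.c looks roughly like this (quoted from the unmodified xv6 source; check your copy for exact details):

// kernel/bio.c (original xv6)
static struct buf*
bget(uint dev, uint blockno)
{
  struct buf *b;

  acquire(&bcache.lock);

  // Is the block already cached?
  for(b = bcache.head.next; b != &bcache.head; b = b->next){
    if(b->dev == dev && b->blockno == blockno){
      b->refcnt++;
      release(&bcache.lock);
      acquiresleep(&b->lock);
      return b;
    }
  }

  // Not cached.
  // Recycle the least recently used (LRU) unused buffer.
  for(b = bcache.head.prev; b != &bcache.head; b = b->prev){
    if(b->refcnt == 0) {
      b->dev = dev;
      b->blockno = blockno;
      b->valid = 0;
      b->refcnt = 1;
      release(&bcache.lock);
      acquiresleep(&b->lock);
      return b;
    }
  }
  panic("bget: no buffers");
}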

The improved scheme builds a hash table from blockno to buf and locks each bucket individually. Lock contention then occurs only when two processes access blocks that hash into the same bucket at the same time. When a bucket has no free buf left, a buf is taken from another bucket.

The idea is simple, but the concrete implementation needs to be careful about deadlocks. Many of the deadlock issues here are subtle and not caught by bcachetest, yet they could still trigger on a running system. I have seen quite a few other students' blogs whose code passes the test without these issues ever being noticed.

The deadlock problem

Consider our new design, which first defines the hash table bufmap in bcache and sets the lock for each bucket:

// kernel/bio.c
struct {
  struct buf buf[NBUF];
  struct spinlock eviction_lock;

  // Hash map: dev and blockno to buf
  struct buf bufmap[NBUFMAP_BUCKET];
  struct spinlock bufmap_locks[NBUFMAP_BUCKET];
} bcache;
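NBUFMAP_BUCKET and the hash from (dev, blockno) to a bucket are defined as in the complete code at the end of this post; 13 is a prime, which helps spread block numbers evenly across buckets:

// kernel/bio.c
// bucket number for bufmap
#define NBUFMAP_BUCKET 13
// hash function for bufmap
#define BUFMAP_HASH(dev, blockno) ((((dev)<<27)|(blockno))%NBUFMAP_BUCKET)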

bget(uint dev, uint blockno): first, the bucket that blockno hashes to is scanned for an existing cache; if none is found, all buckets are searched for the least-recently-used buf with no references, which is then moved (rehashed) into blockno's bucket and returned as the cache for blockno.

It would be easy to write code like this:

bget(dev, blockno) {
  key := hash(dev, blockno);

  acquire(bufmap_locks[key]); // Obtain the lock of the key bucket
  
  // Check whether the cache for blockNo exists. If yes, the value is returned. If no, continue
  if(b := look_for_blockno_in(bufmap[key])) {
    b->refcnt++
    release(bufmap_locks[key]);
    return b;
  }

  // Find an evictable buf b
  
  least_recently := NULL;
  
  for i := [0, NBUFMAP_BUCKET) { // Walk through all the buckets
    acquire(bufmap_locks[i]);    // Get the lock on bucket I

    b := look_for_least_recently_used_with_no_ref(bufmap[i]);
    // If you find a free block that has not been used for a longer time
    if(b.last_use < least_recently.last_use) {  
      least_recently := b;
    }

    release(bufmap_locks[i]);   // Release the lock on bucket I
  }

  b := least_recently;

  // Remove the cache that b originally stored (remove it from the original bucket)
  evict(b);

  // Add b to the new bucket
  append(bucket[key], b);

  release(bufmap_locks[key]); // Release the lock on the key bucket

  // Set the properties of b
  setup(b);
  return b;
}

The above code looks reasonable, but it has two problems: one produces incorrect results and the other causes a deadlock.

Problem 1: Incorrect results

The first problem is fairly obvious. During the eviction scan, each bucket's lock is acquired before the bucket is scanned and released right after. From the moment a bucket's lock is released (the release(bufmap_locks[i]) line), the least-recently-used free buf found there is no longer reliable: between that release and the removal of b from its original bucket (the evict(b) line), another CPU may issue a bget for the block b currently caches, making b's reference count non-zero. Evicting b at that point is not safe.

The fix is not complicated: while scanning the buckets, whenever a new least-recently-used free buf is found, do not release that bucket's lock; keep holding the lock of the bucket containing the current candidate until the eviction is complete.

The invariants maintained here are: "the buf found by the scan remains evictable until the eviction is complete" and "if a buf for a block exists in a bucket, that buf is always usable, and bget can return it directly".

bget(dev, blockno) {
  acquire(bufmap_locks[key]); // Obtain the key bucket lock
  
  // Check whether the cache for blockNo exists. If yes, the value is returned. If no, continue
  if(b := look_for_blockno_in(bufmap[key])) {
    b->refcnt++
    release(bufmap_locks[key]);
    return b;
  }

  // Cache does not exist, look for an evictable buf b
  
  least_recently := NULL;
  holding_bucket := -1;
  
  for i := [0, NBUFMAP_BUCKET) { // Walk through all the buckets
    acquire(bufmap_locks[i]);    // Get the lock on bucket I

    b := look_for_least_recently_used_with_no_ref(bufmap[i]);
    // If you find a free block that has not been used for a longer time (new least_recently)
    
    if(b.last_use >= least_recently.last_use) {
      release(bufmap_locks[i]);   // No new least_recently was found in the bucket

    } else {
      // b.last_use < least_recently.last_use
      least_recently := b;

      // Release the previously held bucket's lock (but never release the key bucket's lock),
      // and keep holding bucket i's lock...
      if(holding_bucket != -1 && holding_bucket != key) release(bufmap_locks[holding_bucket]);
      holding_bucket := i;
    }
  }

  b := least_recently;

  // At this point we are still holding bufmap_locks[holding_bucket]
  // Remove the cache that b originally stored (remove it from the original bucket)
  evict(b);
  release(bufmap_locks[holding_bucket]); // Release the lock on the bucket where B was

  // Add b to the new bucket
  append(bucket[key], b);

  release(bufmap_locks[key]); // Release the key bucket lock

  // Set the properties of b
  setup(b);
  return b;
}

Problem 2: Deadlock

The second problem: at the start of bget we acquire the lock of blockno's bucket to check whether a cache already exists. If it does not, we then iterate over all buckets and acquire each bucket's lock in turn while still holding the key bucket's lock. Consider this case:

Suppose block b1 hashes to 2, block b2 hashes to 5, and neither block is cached before this runs.

----------------------------------------------------------------------
        CPU1                                CPU2
----------------------------------------------------------------------
  bget(dev, b1)                       bget(dev, b2)
        |                                   |
        V                                   V
  acquire bucket 2's lock             acquire bucket 5's lock
        |                                   |
        V                                   V
  cache miss,                         cache miss,
  iterate over all buckets            iterate over all buckets
        |                                   |
        V                                   V
       ...                            reach bucket 2,
        |                             try to acquire bucket 2's lock
        V                                   |
  reach bucket 5,                           V
  try to acquire bucket 5's lock      bucket 2's lock is held by CPU1,
        |                             wait for it to be released
        V
  bucket 5's lock is held by CPU2,
  wait for it to be released!

CPU1 waits for CPU2, and CPU2 waits for CPU1 -- deadlock!

A circular wait forms: CPU1 holds bucket 2's lock and requests bucket 5's lock, while CPU2 holds bucket 5's lock and requests bucket 2's lock.

Recall the four conditions for deadlock:

  1. Mutual exclusion (a resource can belong to only one thread at any time)
  2. Hold and wait (a thread holds one lock while requesting another)
  3. No preemption (a thread's resources cannot be forcibly taken away from it)
  4. Circular wait (the order in which resources are requested forms a cycle)

Breaking any one of the four conditions breaks a deadlock. To try to solve this deadlock problem, we consider the feasibility of breaking each condition:

  1. Mutual exclusion: a bucket must be handled by only one CPU (thread) at a time; this condition is required and cannot be broken.
  2. Hold and wait: see below.
  3. No preemption: forcibly make one side release its lock when a circular request appears while scanning buckets? Even if this could be detected, the bget of the side forced to give up its lock would fail, making the corresponding file-system system call fail. Not feasible.
  4. Circular wait: change the access order, e.g. only ever scan buckets to the left of the current key bucket, so that no cycle can form? That would prevent deadlock, but if blockno hashed to the first bucket and missed the cache, there would be nowhere to evict from to free up a buf (there are no buckets to the left of the first one).

Since attacking "mutual exclusion", "no preemption", and "circular wait" does not solve the problem, we have to attack "hold and wait".

The deadlock arises because we hold one lock while requesting another, and the request order forms a cycle. Since a cyclic request order cannot be avoided here, the only option is to give up the key bucket's lock before requesting any other bucket's lock, and to reacquire the key bucket's lock only after the least-recently-used free buf b has been found and evicted, at which point b is added to the key bucket.

The code looks like this:

bget(dev, blockno) {
  acquire(bufmap_locks[key]); // Obtain the key bucket lock
  
  // Check whether the cache for blockNo exists. If yes, the value is returned. If no, continue
  if(b := look_for_blockno_in(bufmap[key])) {
    b->refcnt++
    release(bufmap_locks[key]);
    return b;
  }

  release(bufmap_locks[key]); // Release the key lock first to prevent loop waiting during search eviction

  // Cache does not exist, look for an evictable buf b
  
  holding_bucket := -1;
  for i := [0, NBUFMAP_BUCKET) {
    acquire(bufmap_locks[i]); // The request does not hold the key bucket lock, and no loop waiting occurs
    if(b := look_for_least_recently_used_with_no_ref(bufmap[i])) {
      if(holding_bucket != -1) release(bufmap_locks[holding_bucket]);
      holding_bucket := i;
      // If a new block that has not been used for a longer time is found, the lock on the bucket of the original block is released, and the lock on the bucket of the new block is kept...
    } else {
      release(bufmap_locks[i]);
    }
  }

  // Remove the cache that b originally stored (remove it from the original bucket)
  evict(b);
  release(bufmap_locks[holding_bucket]); // Release the lock on b's bucket

  acquire(bufmap_locks[key]); // Get the key bucket lock again
  append(b, bucket[key]);     // Add b to the new bucket
  release(bufmap_locks[key]); // Release the key bucket lock


  // Set the properties of b
  setup(b);
  return b;
}


In this way, at any point in bget a bucket lock is acquired either while holding no other lock or while holding only locks of buckets to its left (the scan for an evictable buf b always walks the buckets in increasing order), so no cycle can ever form and the deadlock is avoided. But this scheme creates a new problem.

The new problem

Notice that before starting the "scan all buckets for an evictable buf" phase, we release the key bucket's lock (key being the hash of the requested blockno) to prevent circular wait, and we do not reacquire it until the scan over all buckets and the eviction of the least-recently-used free buf are complete. The problem is that once the key bucket's lock is released, the first critical section (the "does a cache for blockno exist? if so, return it" part) is no longer protected by any lock. This means that between "releasing the key lock" and "reacquiring the key lock", i.e. while we are doing the eviction + reallocation, another CPU may request the same blockno, acquire the key lock, pass the initial "no cache exists" test, and then also enter eviction + reallocation, producing the error of "multiple caches for one block".

How to guarantee that the same block will not have two caches?

The problem is relatively tricky, and the limitations we know so far are:

  • While scanning the buckets for an evictable buf, the key bucket's lock must not be held, otherwise a deadlock may occur.
  • If the key bucket's lock is not held, another CPU may request the same blockno and complete eviction + reallocation first, so the same blockno ends up being cached more than once.

I have to admit I don't have a particularly elegant solution in mind, only a scheme that sacrifices a little efficiency but guarantees safety in the extreme cases:

  • Add eviction_lock to limit the eviction + reallocation process to a single thread

    Note that eviction_lock must be acquired after the bucket lock has been released. Doing it in the opposite order can deadlock between eviction_lock and a bucket lock (thread 1 holds bucket A's lock and requests eviction_lock, while thread 2 holds eviction_lock and requests bucket A's lock).

    bget(dev, blockno) {
      acquire(bufmap_locks[key]); // Obtain the key bucket lock
      
      // Check whether the cache for blockNo exists. If yes, the value is returned. If no, continue
      if(b := look_for_blockno_in(bufmap[key])) {
        b->refcnt++
        release(bufmap_locks[key]);
        return b;
      }
      
      // Notice the order of acquire and release here
      release(bufmap_locks[key]); // Release the key bucket lock first to prevent loop deadlock during search eviction
      acquire(eviction_lock);     // Acquire the eviction lock so that only one CPU evicts at a time
    
      // Cache does not exist, look for an evictable buf b
      
      // ...
    
      acquire(bufmap_locks[key]); // Get the key bucket lock again
      append(b, bucket[key]);     // Add b to the new bucket
      release(bufmap_locks[key]); // Release the key bucket lock
    
      release(eviction_lock);     // Release the eviction lock
    
      // Set the properties of b
      setup(b);
      return b;
    }
  • After eviction_lock is acquired, check again whether a cache for blockno exists; if it does, return it, otherwise continue

    bget(dev, blockno) {
      acquire(bufmap_locks[key]); // Obtain the key bucket lock
      
      // Check whether the cache for blockNo exists. If yes, the value is returned. If no, continue
      if(b := look_for_blockno_in(bufmap[key])) {
        b->refcnt++
        release(bufmap_locks[key]);
        return b;
      }
      
      // Notice the order of acquire and release here
      release(bufmap_locks[key]); // Release the key bucket lock first to prevent loop deadlock during search eviction
      acquire(eviction_lock);     // Acquire the eviction lock so that only one CPU evicts at a time
    
      // ** Check once more whether a cache for blockno exists **. If yes, return it; if not, continue
      // While we hold eviction_lock, no other thread can run eviction or reallocation,
      // so no other thread can change the structure of the bufmap[key] bucket list.
      // It is therefore safe to traverse it here without first acquiring the corresponding bucket lock.
      if(b := look_for_blockno_in(bufmap[key])) {
        acquire(bufmap_locks[key]); // must get, protect non-atomic operation 'refcnt++'
        b->refcnt++
        release(bufmap_locks[key]);
    
        release(eviction_lock);
        return b;
      }
    
      // Cache does not exist, look for an evictable buf b
      
      // ...
    
      acquire(bufmap_locks[key]); // Get the key bucket lock again
      append(b, bucket[key]);     // Add b to the new bucket
      release(bufmap_locks[key]); // Release the key bucket lock
    
      release(eviction_lock);     // Release the eviction lock
    
      // Set the properties of b
      setup(b);
      return b;
    }

Thus, even if multiple threads request the same blockno at the same time, and all of them happen to pass the initial "does a cache for blockno exist" check with the result "no cache exists", once they enter the eviction + reallocation code protected by eviction_lock, only the first thread to enter actually performs the eviction and reallocation.

The first thread releases eviction_lock only after its eviction + reallocation is done, at which point the cache for blockno has gone from non-existent to existent. Every thread that enters afterwards is caught by the second "does a cache for blockno exist" check and simply returns the already-allocated buf, so the same blockno is never evicted and reallocated twice.

The advantage is that no deadlock can occur during lookup, and the extreme case of one block ending up with multiple caches is ruled out. The downside is that the global eviction_lock serializes the eviction scan, which previously could run concurrently, and every cache miss now pays the cost of one extra traversal of the key bucket.

However, since cache misses themselves (hopefully) are rare events, and since subsequent reads from disk are orders of magnitude longer than a bucket traversal, I think the overhead is acceptable.

Complete pseudocode

bget(dev, blockno) {
  acquire(bufmap_locks[key]); // Obtain the key bucket lock
  
  // Check whether the cache for blockNo exists. If yes, the value is returned. If no, continue
  if(b := look_for_blockno_in(bufmap[key])) {
    b->refcnt++
    release(bufmap_locks[key]);
    return b;
  }
  
  // Notice the order of acquire and release here
  release(bufmap_locks[key]); // Release the key bucket lock first to prevent loop deadlock during search eviction
  acquire(eviction_lock);     // Acquire the eviction lock so that only one CPU evicts at a time

  // ** Check once more whether a cache for blockno exists **. If yes, return it; if not, continue
  // While we hold eviction_lock, no other thread can run eviction or reallocation,
  // so no other thread can change the structure of the bufmap[key] bucket list.
  // It is therefore safe to traverse it here without first acquiring the corresponding bucket lock.
  if(b := look_for_blockno_in(bufmap[key])) {
    acquire(bufmap_locks[key]); // must get, protect non-atomic operation 'refcnt++'
    b->refcnt++
    release(bufmap_locks[key]);

    release(eviction_lock);
    return b;
  }

  // Cache does not exist, look for an evictable buf b
  
  holding_bucket := -1; // The bucket lock currently held
  for i := [0, NBUFMAP_BUCKET) {
    acquire(bufmap_locks[i]); // The request does not hold the key bucket lock, and no loop waiting occurs
    if(b := look_for_least_recently_used_with_no_ref(bufmap[i])) {
      if(holding_bucket != -1) release(bufmap_locks[holding_bucket]);
      holding_bucket := i;
      // If a new block that has not been used for a longer time is found, the lock on the bucket of the original block is released, and the lock on the bucket of the new block is kept...
    } else {
      release(bufmap_locks[i]);
    }
  }

  acquire(bufmap_locks[key]); // Get the key bucket lock again
  append(b, bucket[key]);     // Add b to the new bucket
  release(bufmap_locks[key]); // Release the key bucket lock

  release(eviction_lock);     // Release the eviction lock

  // Set the properties of b
  setup(b);
  return b;
}


The complete code

struct buf {
  int valid;   // has data been read from disk?
  int disk;    // does disk "own" buf?
  uint dev;
  uint blockno;
  struct sleeplock lock;
  uint refcnt;
  uint lastuse; // *newly added, used to keep track of the least-recently-used buf
  struct buf *next;
  uchar data[BSIZE];
};
// kernel/bio.c

// bucket number for bufmap
#define NBUFMAP_BUCKET 13
// hash function for bufmap
#define BUFMAP_HASH(dev, blockno) ((((dev)<<27)|(blockno))%NBUFMAP_BUCKET)

struct {
  struct buf buf[NBUF];
  struct spinlock eviction_lock;

  // Hash map: dev and blockno to buf
  struct buf bufmap[NBUFMAP_BUCKET];
  struct spinlock bufmap_locks[NBUFMAP_BUCKET];
} bcache;

void
binit(void)
{
  // Initialize bufmap
  for(int i = 0; i < NBUFMAP_BUCKET; i++) {
    initlock(&bcache.bufmap_locks[i], "bcache_bufmap");
    bcache.bufmap[i].next = 0;
  }

  // Initialize buffers
  for(int i = 0; i < NBUF; i++){
    struct buf *b = &bcache.buf[i];
    initsleeplock(&b->lock, "buffer");
    b->lastuse = 0;
    b->refcnt = 0;
    // put all the buffers into bufmap[0]
    b->next = bcache.bufmap[0].next;
    bcache.bufmap[0].next = b;
  }

  initlock(&bcache.eviction_lock, "bcache_eviction");
}

// Look through buffer cache for block on device dev.
// If not found, allocate a buffer.
// In either case, return locked buffer.
static struct buf*
bget(uint dev, uint blockno)
{
  struct buf *b;

  uint key = BUFMAP_HASH(dev, blockno);

  acquire(&bcache.bufmap_locks[key]);

  // Is the block already cached?
  for(b = bcache.bufmap[key].next; b; b = b->next){
    if(b->dev == dev && b->blockno == blockno){
      b->refcnt++;
      release(&bcache.bufmap_locks[key]);
      acquiresleep(&b->lock);
      return b;
    }
  }

  // Not cached.

  // to get a suitable block to reuse, we need to search for one in all the buckets,
  // which means acquiring their bucket locks.
  // but it's not safe to try to acquire every single bucket lock while holding one.
  // it can easily lead to circular wait, which produces deadlock.

  release(&bcache.bufmap_locks[key]);
  // we need to release our bucket lock so that iterating through all the buckets won't
  // lead to circular wait and deadlock. however, as a side effect of releasing our bucket
  // lock, other cpus might request the same blockno at the same time and the cache buf for  
  // blockno might be created multiple times in the worst case. since multiple concurrent
  // bget requests might pass the "Is the block already cached?" test and start the 
  // eviction & reuse process multiple times for the same blockno.
  //
  // so, after acquiring eviction_lock, we check "whether cache for blockno is present"
  // once more, to be sure that we don't create duplicate cache bufs.
  acquire(&bcache.eviction_lock);

  // Check again, is the block already cached?
  // no other eviction & reuse will happen while we are holding eviction_lock,
  // which means the linked-list structure of the buckets cannot change.
  // so it's ok here to iterate through `bcache.bufmap[key]` without holding
  // its corresponding bucket lock, since we are holding the much stronger eviction_lock.
  for(b = bcache.bufmap[key].next; b; b = b->next){
    if(b->dev == dev && b->blockno == blockno){
      acquire(&bcache.bufmap_locks[key]); // must do, for `refcnt++`
      b->refcnt++;
      release(&bcache.bufmap_locks[key]);
      release(&bcache.eviction_lock);
      acquiresleep(&b->lock);
      return b;
    }
  }

  // Still not cached.
  // we are now only holding eviction lock, none of the bucket locks are held by us.
  // so it's now safe to acquire any bucket's lock without risking circular wait and deadlock.

  // find the one least-recently-used buf among all buckets.
  // finish with its corresponding bucket's lock held.
  struct buf *before_least = 0; 
  uint holding_bucket = -1;
  for(int i = 0; i < NBUFMAP_BUCKET; i++){
    // before acquiring, we are either holding nothing, or only holding locks of
    // buckets that are *on the left side* of the current bucket
    // so no circular wait can ever happen here. (safe from deadlock)
    acquire(&bcache.bufmap_locks[i]);
    int newfound = 0; // new least-recently-used buf found in this bucket
    for(b = &bcache.bufmap[i]; b->next; b = b->next) {
      if(b->next->refcnt == 0 && (!before_least || b->next->lastuse < before_least->next->lastuse)) {
        before_least = b;
        newfound = 1;
      }
    }
    if(!newfound) {
      release(&bcache.bufmap_locks[i]);
    } else {
      if(holding_bucket != -1) release(&bcache.bufmap_locks[holding_bucket]);
      holding_bucket = i;
      // keep holding this bucket's lock....
    }
  }

  if(!before_least) {
    panic("bget: no buffers");
  }
  b = before_least->next;
  
  if(holding_bucket != key) {
    before_least->next = b->next;
    release(&bcache.bufmap_locks[holding_bucket]);
    // rehash and add it to the target bucket
    acquire(&bcache.bufmap_locks[key]);
    b->next = bcache.bufmap[key].next;
    bcache.bufmap[key].next = b;
  }
  
  b->dev = dev;
  b->blockno = blockno;
  b->refcnt = 1;
  b->valid = 0;
  release(&bcache.bufmap_locks[key]);
  release(&bcache.eviction_lock);
  acquiresleep(&b->lock);
  return b;
}

// ...

// Release a locked buffer.
void
brelse(struct buf *b)
{
  if(!holdingsleep(&b->lock))
    panic("brelse");

  releasesleep(&b->lock);

  uint key = BUFMAP_HASH(b->dev, b->blockno);

  acquire(&bcache.bufmap_locks[key]);
  b->refcnt--;
  if (b->refcnt == 0) {
    b->lastuse = ticks;
  }
  release(&bcache.bufmap_locks[key]);
}

void
bpin(struct buf *b) {
  uint key = BUFMAP_HASH(b->dev, b->blockno);

  acquire(&bcache.bufmap_locks[key]);
  b->refcnt++;
  release(&bcache.bufmap_locks[key]);
}

void
bunpin(struct buf *b) {
  uint key = BUFMAP_HASH(b->dev, b->blockno);

  acquire(&bcache.bufmap_locks[key]);
  b->refcnt--;
  release(&bcache.bufmap_locks[key]);
}


The results

$ bcachetest
start test0
test0 results:
--- lock kmem/bcache stats
lock: kmem_cpu_0: #fetch-and-add 0 #acquire() 32897
lock: kmem_cpu_1: #fetch-and-add 0 #acquire() 77
lock: kmem_cpu_2: #fetch-and-add 0 #acquire() 61
lock: bcache_bufmap: #fetch-and-add 0 #acquire() 6400
lock: bcache_bufmap: #fetch-and-add 0 #acquire() 6685
lock: bcache_bufmap: #fetch-and-add 0 #acquire() 6696
lock: bcache_bufmap: #fetch-and-add 0 #acquire() 7018
lock: bcache_bufmap: #fetch-and-add 0 #acquire() 6266
lock: bcache_bufmap: #fetch-and-add 0 #acquire() 4206
lock: bcache_bufmap: #fetch-and-add 0 #acquire() 4206
lock: bcache_bufmap: #fetch-and-add 0 #acquire() 2193
lock: bcache_bufmap: #fetch-and-add 0 #acquire() 4202
lock: bcache_bufmap: #fetch-and-add 0 #acquire() 2196
lock: bcache_bufmap: #fetch-and-add 0 #acquire() 4359
lock: bcache_bufmap: #fetch-and-add 0 #acquire() 4409
lock: bcache_bufmap: #fetch-and-add 0 #acquire() 6411
lock: bcache_eviction: #fetch-and-add 0 #acquire() 83
--- top 5 contended locks:
lock: proc: #fetch-and-add 397110 #acquire() 70988
lock: proc: #fetch-and-add 262715 #acquire() 70988
lock: proc: #fetch-and-add 222165 #acquire() 70987
lock: virtio_disk: #fetch-and-add 161088 #acquire() 1098
lock: proc: #fetch-and-add 45459 #acquire() 71331
tot= 0
test0: OK
start test1
test1 OK
$

Summary

Multithreading problems are often much harder to find than problems in single-threaded programs; effectively finding and solving them requires a solid understanding of what happens at the instruction level and of how the CPU actually works. To conclude, a few suggestions from the lecture:

don't share if you don't have to
start with a few coarse-grained locks
instrument your code -- which locks are preventing parallelism?
use fine-grained locks only as needed for parallel performance
use an automated race detector

Finally, my own words:

Multithreading is a pain😭, only worth it if there is non-insignificant performance increase. maybe try multi-process architecture for your next project, so you don't have to deal with all the multithreading hassles. you get the bonus of being able to scale horizontally (and almost infinitely) as well :)