Peer Code Reviews: How Did We Do?

This is the sixth in a series of posts on our peer code review process at Loose Cannon Studios. Here’s the full series so far:
- Peer Code Reviews: the “what” and “why” of code reviews for our studio
- First Attempts: our version 1 process that didn’t go so well
- Success: the final process and tools that we ended up with
- Good Commenting Practices: some best practices for making review comments
- About Our Crucible-Perforce Bridge: about our tool that creates Crucible reviews from P4V/Win
Time to analyze how the process worked out. Did we get the benefits we expected? Did we run into the problems we predicted we would? What new things came up that surprised us? Or did it all just devolve into a nit-picky passive-aggressive waste of time?
I’ll compare the last twelve months, where we were doing reviews as per my “success” post, with the six months before it, where almost no reviews were being done and it was the wild west as in the “first attempts”.
Code Reviews Strongly Recommended
Given the number of posts I’m doing on this subject, it should be obvious that I consider our peer code review process to be one of the most vital and successful things we’re doing at Loose Cannon. In fact, I’ve come to believe that code reviews should be a part of every studio’s process.
Not the way we’re doing it, necessarily. Our process was the result of planning by senior team members plus a lot of ongoing feedback from the team to keep it relevant and effective. So it’s necessarily specific to our culture, personalities, and projects. Another studio may need different tools. Or they may ditch the primary reviewer group, or go with pair programming instead.
But I think every studio needs some kind of code review process. I’d do it even with a team size of two.
Revisiting Our Goals
First things first. Let’s revisit our original goals and see if we hit them. Did code reviewing live up to its promises? The short answer is a definite “yes”. Here’s the list from the first post:
- Share Knowledge
- Catch And Correct System Misuse
- Raise The General Quality Of Code
- Mentor Junior Engineers
- Educate About And Enforce Standards
In all of these areas, we had significant improvement.
Share Knowledge
This was a clear win, but there are tradeoffs that highlight the differences between using a real-world conversation and a virtual-world tool. You could say that, ideally, sharing knowledge is best done face-to-face. In person, you get a tight feedback loop that quickly cuts off irrelevant discussion. And obviously speaking and listening are a lot faster than typing and reading, so you can cover a lot more ground.
Yet I already covered how, in our first code review attempts, face-to-face reviews had logistical issues. But more importantly, in the context of knowledge sharing, the spoken word doesn’t get a permalink. None of it gets captured. It can’t be referenced nor linked. It can’t be adapted into a wiki page. It has no recorded history and can’t be searched. It’s inherently private, not public, so we have a lot of duplication, on top of the warping that happens when it becomes the “telephone game”. It gets more difficult and time-consuming the more people that are involved. Lots of problems.
Once the conversation is over, it can only live on as tribal knowledge.
With a code review tool, every comment gets a permalink and is publicly viewable and searchable. Everything gets captured by design. And it’s captured directly in context, right with the lines of code that are being discussed.
I saw many examples of knowledge sharing in our comments. Most of it was casual – throwing out a bit of info about how a function works or what a particular order of calls should be. I gave some examples of these in my post on Good Commenting Practices. But often it was more structured. The reviewer knew that there was an opportunity to explain in detail some nuanced part of a system and took the time to write a fuller description. A frequent reply I saw was “oh! I didn’t know you could do that!” Cool.
Reviewees often pre-commented their reviews with questions. Even though they already had a working, tested change, they wanted to know more about the systems they were using. This is great – so often I see engineers not caring much about how a system works, just wanting to hit “.” and flip through the Intellisense list to look for a relevant function to call. In a review they’re able to ask questions when it’s already on their minds, in context. This happens much less often in casual conversation!
Sharing went the other way as well. As I reviewed systems under development I learned more about how they worked. Through the small snapshot of a single change I wouldn’t usually understand much. But over time, as I watched things evolving, and reviewed different parts, I eventually got a pretty good idea of how parts of the game worked that I normally would have little contact with. I also learned how different engineers tended to do things, and learned some new techniques along the way. When it came time for me to go into any given system to help debug it at the end of the project, I usually had some familiarity with it and knew what I was getting into.
Catch And Correct System Misuse
This is a special case of Share Knowledge, and was clearly improved with the new process. In particular, the ability to easily add multiple extra reviewers for their domain-specific knowledge was a key advantage. There are many examples in our review database of a domain expert chiming in on something like where to put new system-wide Nant properties, or how a particular change to a state machine isn’t taking into account some dependency. People also got comfortable bringing in new reviewers with domain expertise and directing them to some extra-grognardy part of their change for a focused review.
It was also a good way to detect where we had bad API design. This was something that seems obvious but I wasn’t expecting. If people are continuing to get confused and doing the wrong thing across multiple reviews when using a given system, it means there’s something very wrong with it.
For example, a couple years ago I wrote a set of four array classes, each meant to be used in a very specific way. After doing a lot of reviews involving usage of these classes, it seems that most people still don’t understand the difference among them despite heavy documentation and regular reminders in reviews. Time to rewrite. I get a much stronger message that way than I would hear in casual conversation.
Raise The General Quality Of Code
This is an area I saw huge improvement. It really was the wild west before we started doing reviews again. Everybody had a different style and used different patterns, and the code looked like a mess. Parts are still a mess now, of course, but I feel like we’re mostly rowing the boat in the same general direction.
Further than just stylistic improvement, the overall quality of changes went up. I seriously think this is just the peer pressure at work here. If you know that someone is going to be seriously looking over your code and not just complaining when it crashes the game, you will take more time to get it right before considering it ready to go. As a user of code, I’d normally only care about the public interface to it and its performance characteristics. As long as it is self documenting and works as it should, I shouldn’t care a whole lot what’s going on inside. But as a reviewer, I’m looking deep inside the implementation, and the reviewee knows it.
This also lets me more easily become a maintainer of the code in the future if necessary, like if the owner goes on vacation. Towards the end of a project, we all tend to jump from system to system, fixing all kinds of random bugs wherever we can help out. Being suddenly dumped into working on someone’s crazy implementation code that I previously hadn’t been caring about is frustrating. It’s so much better to have been familiar with it already through regular reviews.
Mentor Junior Engineers
I mentioned earlier in this article that knowledge sharing may be done best face-to-face, but in the case of mentoring junior engineers I think the written form is probably better. At minimum a set of written comments should be the start of a conversation. It lets you clearly define what the new technique is that you’re trying to show to them, and it’s recorded so they (or anybody else) can refer back to it whenever they need. Having it in text form lets the junior engineer really take their time in absorbing the knowledge and thinking about it. They can try out the unfamiliar technique and get to know it a little before responding to find out more information or perhaps argue about it.
Speaking of which, being written makes it easier for engineers to challenge their mentors in a less confrontational way. As we all have learned with the Internet, people are willing to say things in text that they wouldn’t in person. So if someone doesn’t buy an argument, it’s a lot easier to state that as a reply to a comment in a review than by directly challenging them on it. I often prefer the direct-challenge approach myself, but lots of folks aren’t like this. You could expect this to lead to passive-aggressive discussions but from what I’ve seen so far in our reviews this hasn’t been a significant issue.
Educate About And Enforce Standards
Given that we didn’t have coding standards until the new review process was in place, I can’t compare this with the pre-review period. But with the new process, I noticed a few things:
- New code written by different people started to all look similar. People were adhering to standards. This was awesome. During the wild west time, we had at least four drastically different styles in play.
- New engineers who came on board and had trouble adjusting did it a lot quicker than I’ve seen in the past due to the regular reminders they got in reviews.
- Weak or ambiguous points in the coding standards were found very quickly. Reviewees often pointed these out, which then got escalated and either resolved or postponed for later discussion.
- Areas where we needed to decide on new standards were found quickly. And those new rules got out to the rest of the team quicker and more effectively through reviews than just by email.
We also realized that we needed best practices on top of basic coding standards. What’s the standard naming for resource allocation/release functions? When and how is it appropriate to assert state? How should we lay out platform-common and platform-specific code for a given class or system? Recently, this has expanded into big discussions about our overall engineering design process. I plan to post on this topic in the future as we figure things out.
Comments related to standards compliance varied a lot among reviews and reviewers. Some reviewers spotted violations right away, and others will miss 95% of them. It didn’t matter when things were missed. Eventually, with enough reviews, people would start coding to the standard. I think the mix of attention levels helped in keeping things from getting too nit-picky.
Towards the end of the project we altered our process to just stop bothering with comments about coding standards. None of that code (usually certification-related) was going to be brought forward to the next game and we wanted to really focus on getting the game shipped.
Version 1 Problems Fixed?
Back in the first attempt, I identified three critical problems that contributed to the death of the original process. So how did we fare on these in the final process?
Mini-Cliques
This was primarily a physical location issue that disappeared when we switched to using Crucible and started working mostly in the virtual world.
At one point, though, one of the primary reviewers noticed that they were never getting reviews from a couple engineers on the team. These engineers were overriding the random automatic choice of crucreate and always going with the same reviewer. In both cases, it was a misunderstanding that was easily resolved.
I ran some stats to see what the spread was for each engineer and everybody else was ok, spreading their reviews out pretty well. But it did bring up some enhancements I’d like to make to crucreate’s chooser dialog when showing potential reviewers. Just a little more info to help the random chooser and give more information to the reviewee:
- What percent of your reviews has each reviewer been on in the last 30 days?
To prevent favoring a reviewer too much by accident. - What is the current review load of each reviewer?
To prevent overloading a reviewer who already has a big queue. - Is the reviewer active at their workstation?
To prevent people forgetting when someone’s taken vacation or is out to a doctor appointment or whatever. We can detect this from their idle status on the Jabber server.
It would be fun to implement those features, but it’s admittedly low priority stuff.
Lack of Simultaneous Availability
This is another physical issue that was mostly solved by doing reviews in the virtual world. I say “mostly” because obviously it doesn’t change the situation when you need to ask questions and discuss a change in person.
In fact, it’s sometimes worse in these cases. Unless someone is in a rush to check in, they usually don’t pay any attention to a reviewer being available at that moment. There’s no need, and furthermore it’s a major benefit of the process. So nobody’s thinking about timing all that much.
The problem comes up when a reviewer needs to discuss a change in person, and the reviewee just took off. Now you have to do the scheduling time thing to get together. It’s not a big deal, all that happens is the review is delayed. But as a reviewer it’s always frustrating when I’m down in the middle of a big review, and have a small question, and the reviewee is nowhere to be found.
I’ve dealt with this a couple ways. As a reviewer, I can spend a minute skimming the review to get an idea of its complexity. If it’s going to be a tough one, I can check on the availability of the reviewee before digging in over IM. “This is going to be a tricky review – you around for the next hour or so in case I need to ask you anything?” Or as a reviewee, I usually know in advance if a review will need discussion or not. If it will, I’ll avoid sending it out if I’m about to leave the office. Or I’ll add a general comment at the top saying “this might be tricky, we may need to discuss before you get too deep, let me know” kind of thing.
The Blind Leading the Blind
We handled this in a couple of ways that I think worked: the primary reviewers group, and adding multiple reviewers.
Requiring a primary reviewer means you’re guaranteed a more senior person. And the ability to have multiple reviewers working on a review simultaneously means it’s easy to add in people when you don’t know what you’re doing in a particular area. This helps out significantly with the blind leading the blind.
There’s also accountability with the system, so you can trace back problems and get them resolved. With in-person reviews the entire conversation is lost due to being all verbal. But with our process, every change has an associated review. It’s easy enough to find out who reviewed a change that caused big problems and “review the reviewer”.
Special Thanks
I want to give some extra props to Atlassian. It took us many months to really settle on Crucible as our tool of choice and then pay for it and get a real license. We had a few false starts and it took a while to give it a real shot. During this time, Atlassian had no problem continuing to renew our evaluation licenses (as they did previously with Jira and Confluence).
This was great because it let us get the experimentation and testing we needed, not to mention the budget. 30 days is a really short time to check out a workflow tool, particularly one of this magnitude for us. I wish everybody had such a liberal policy with their evals. Not to mention using a simple license key and not requiring some obnoxious DRM licensing server or annoying things like IP address locking like so much other software we use. So hats off to them.
Series
My full series on code reviews:
- Part 1: Peer Code Reviews At Loose Cannon
- Part 2: Peer Code Reviews: First Attempts
- Part 3: Peer Code Reviews: Success!
- Part 4: Peer Code Reviews: Good Commenting Practices
- Part 5: About Our Crucible-Perforce Bridge
- Part 6: Peer Code Reviews: How Did We Do?


Just wondering what is the reason for moving the code review process into the virtual world with Crucible?
I would think that a code review face to face, 2 engineers staring at the same screen talking to each other would be better simply because of the face to face communication going on.
If it was because cliques were forming like your wrote about in the post, then could you not just tell the engineer to make sure dont’ always grab the same person for the review?
The advantage I see in doing the code review offline is that the review is documented and can be referred back again in future. But this could be addressed by writing a simple summary for the comments in the change list before p4 check in.
We’ve practiced 2 engineers to 1 screen code review I find the face to face communication during the review works really well – bugs could be found quickly, small discussions, knowledge transfer and mentoring works well in this situation.
Jason Chionh
3 Sep 09 at 11:32 pm
I thought I covered pretty well why using a tool was generally superior to face-to-face for us. But I can talk about a few points you bring up.
Limiting to one reviewer is a problem. We frequently include more than that on reviews. Getting that to work in person is just too hard.
Cliques can’t be legislated away. People are people and they’re going to do what is most convenient. Yeah, we’re all professionals, but when you come right down to it, you can’t make rules to govern social relationships. In addition, I just can’t get past the problem of synchronizing schedules and mental spaces. Only by using a tool and permitting asynchronous, offline reviews can you run the “80% case” (mostly simple reviews) efficiently.
We of course use face-to-face reviews for the big stuff. Though we still use a review to at least track the questions that came up in order to do the face-to-face review.
As for documentation, “writing a simple summary” is just not going to work, nor will it be anywhere near complete. Engineers aren’t going to do that reliably or consistently. It will be the first thing to fall away when we get busy and never get picked up again. The documentation must be a core part of the process or it will just feel like a waste of time. Which means it will get done poorly and incompletely. Having it all in the tool to begin with guarantees that you have 100% of the comments documented.
Scott
10 Sep 09 at 9:46 am
Regarding face to face versus ‘offline’ reviews:
We do face to face check in ‘buddying’ (all changes in a CL must be described/explained to a fellow programmer prior to a P4 submission) and it’s sort of a half-way house between no / a full review. It’s effective for picking up simple mistakes/omissions/oversights but that’s often where it ends.
When sitting down and reviewing a change list, folk usually point out things like:
* “you’ve duplicated functionality that already exists in one of our other libs”
* “extract this into a sub-method and get rid of the duplicate logic”
* “Change this magic number to a well named const”
These typically come under the umbrella of code cleanliness and common conventions. If someone stacks up a lot of changes I am more inclined to read their code more thoroughly before putting my name to it, but it’s tough to buddy each and every change list with a high level of scrutiny, especially when a deep discussion can take hours and energy levels start flagging.
Some people are very good at being stringent in buddying situations (I’m always grateful when someone ‘calls bullshit’ on my stuff), whereas others just want to quickly muddle through it and get on with their work.
Personality types come into play; for some, the to and fro’ing of buddying feels more like a sparring session than a useful exercise and it can become a style/opinion stalemate. For those reason, we usually keep the buddy process pretty light — I don’t think scaling it up any more would work.
When reviewing is done separately (or ‘offline’, whatever you want to call it) with multiple reviewers, it gives a good mix of programmers (from different groups, abilities, specialities etc.) time to think (I view this as ‘catching the stuff the buddy missed’). It also helps the less vocal (or more vocal
) members of the group participate without butting heads mano-a-mano.
Formal code reviews (which ‘offline’ reviews most closely resemble) also have a lot of published studies backing them up. Numerous studies have shown that code reviews are more effective than any of manual/unit/integration/functional testing for finding defects. There is also a strong correlation between lines of code read per hour and the number of defects found. If you read too quickly, you find fewer bugs. If you read for a small amount of time, you find less bugs than if you read for a larger amount. If you read the code for too long, you reach a point of diminishing returns.
In my opinion, you cannot realistically review code on someone’s screen while simultaneously talking to them and find defects in the same way — everyone moves at a different pace and the atmosphere is totally different. In a one to one discussion, it’s easier to derail the discussion or become fixated with specific aspects.
My only concern with our review system is that it happens after the changes go in, but I’m not sure that we can move to a thorough pre-checkin review. As Scott said, once it’s in, it’s in. Prevention is better than the cure, and once code is checked in and used, it hardens much more quickly and sucks up a lot more time than the writing phase ever consumed. E.g. someone ninja’ed some rebadged legacy code into a common namespace (probably a bad merge or a misunderstanding) and, before it was picked up, it’d already hit our main branch and been used by various people! Small change, lots of knock-on effects.
Out of interest, what sort of granularity were the change lists you guys were reviewing? Are we talking daily change sets containing a handful of files with minor changes, or entire features that took a week to create?
If you had wildly varying sizes of change lists to check over, how did you find it affected the process? E.g. did folk get blocked while waiting to check in an important (large-ish) feature for a day or two while key reviewers were busy? If the changelists were small and frequent, did the overhead of performing numerous small reviews cause any hassle?
Finally, did you have any code review guidelines in terms of how long you should spend on x lines of code, or how to pay more or less attention to a change list?
Sorry for the long reply, but this post piqued my interest.
Cheers,
-Mark
Mark
25 May 10 at 4:11 pm
Wow, I really let this comment sit without a reply for a long time. I’m sorry about that.
So, moving to a pre-commit review system is a tough sell, especially as the team size grows. My current studio is pretty big and I wouldn’t even consider rolling it out to the full team. I’d propose it just on a very small scale, just within my immediate group, then grow it from there with some evangelism.
But I don’t think we’ll ever really be able to let go of the frustration with “I know this works and I want to check it in now so Bob can use it” yet having to do a review before checkin. I think this frustration is actually a good thing, though. It keeps the machine moving with little nags.
Regarding granularity – we reviewed every changelist, required before checking in. Sometimes with emergency fixes someone would check in and we’d do a post-review while the build was running or whatever but that was rare. I occasionally ran an audit to make sure of this (just a script that made sure every changelist description had a review link in it).
Over time I tried to encourage people to break up their changelists into functional, atomic packets. Instead of a giant 30-file feature change, a few smaller ones broken down into core lib changes, service changes, and feature change is better. It’s better for post-hoc diagnosis of problems down the road, and it’s better for review discussion.
Sometimes we’d still have a giant changelist but have it broken into several separate reviews just for reviewer convenience. While that checkin to P4 must be atomic, so you do not break the build, the review service doesn’t have that requirement. So I would often break out the reviews I created into “trivial crap you can probably skim”, “ok here’s the real change”, and (sometimes) “pay careful attention to this stuff”.
We could end up with reviews of any size, though they were typically in the <5 file range if I remember right. Sometimes it would be a one-line change, and sometimes it would be a new system created over a week or two. Obviously the stage in development would affect this.
As we started our next project, I experimented a bit with no-checkin reviews to get early feedback on new systems. So: just create a review with the headers and some key cpp files, and send it around to the affected parties to see what they think. So, using it as a discussion tool, with no intention of checking anything in.
But when it comes to really big monolithic checkins of “here’s my new system, all done!” that is going to happen on occasion, then that’s only going to work by starting with a sit-down review for the lay of the land, leading to offline for the more detailed review work, with followup as you go, either in IM or in person. Or by writing up questions in comments and having a dialogue in there.
How did this affect the process? Well, obviously it slowed it down, but again I think that’s a good thing. Nobody should be in a hurry to check in a massive changelist with confidence. We varied a lot in our return time on reviews, but typically it was no more than half a day, probably two days at worst if the change was huge, had a lot of comments, and the reviewers were slammed with their own work. There were problems on occasion but overall the timing worked out well enough. I also think people have a responsibility to be able to work on multiple things at once, using multiple source control clients. Engineers shouldn’t work in serial. There’s just too much we always have to do. So nobody should ever be blocked. And I don’t think I saw it happen at all at my last gig.
Performing lots of small reviews was really no more overhead than a larger batched review. Just a couple more clicks. I actually liked smaller reviews better because they more clearly separated the changes from each other. And everyone gets more of a sense of accomplishment as they check off reviews. Feels like real forward progress to check those boxes. Though obviously I wouldn’t use any of these counts for any kind of employee performance metrics.
And finally, regarding code review guidelines of time spent, we didn’t do anything like this. Our informal standard was just “do a good job”. We clearly had different review styles. I got a reputation for spending a lot more time on tiny details than others, for example. That’s why we want to spread the load around, and why I made the tool choose someone at random by default on review creation. Everyone’s going to look for and catch different things, and while an individual review won’t get total coverage, over time we will get it, due to the overlap among all the reviewers.
Scott
20 Jun 10 at 6:53 pm
Thanks for the reply and for taking the time to answer my questions — I appreciate it
Cheers,
-Mark
Mark
21 Jun 10 at 10:47 am
I really appreciate you sharing your experiences with finding a system that works. We’re in the process now of trying to get a code-review system up and running and I expect there will be a few bumps in the road before we find something that works and that everyone’s happy with.
One thing I found interesting was the idea of doing code-review pre-checkin, and your reasoning makes sense. Unfortunately I think it would be a hard-sell here. We’re slowly adopting an “agile” programming style, which means constantly checking in code which is tested on an almost daily basis by people within the company. Our software lead is adamant that all developers review all code, so requiring that step be done pre-checkin would hold things up an unacceptable amount I think. The idea – at present – is to complete an agile “story”, have it checked in and tested by Q.A., and at the same time do a code review.
That being said, this is all new to us. Just like your “first attempts”, we may give it a go and find it just doesn’t work at all. But being able to see what your company have done and how you’ve adapted over time is greatly helpful! Thanks!
Bryn
25 Aug 10 at 3:42 pm