I have mentioned a few times, mostly in newsgroup discussions, that I strongly believe that the rapid release model Firefox is following now has a good chance to improve stability.
Some people without a deeper knowledge of how our new process works have at times implied that releasing way more often must make the product more unstable and worse quality than the one or two year cycles we had before. Given my multi-year experience in release management of a Mozilla product (SeaMonkey) and along with that insight into Firefox release management of the last few versions up to and including Firefox 4, my comparison of those experiences with the new model point into the exact opposite direction: Stability and quality should actually improve the more we get used to this "train" model and also the more we near the prospected user volume on the different "channels".
Let's first look at how things worked with the old process that we used so far, including for Firefox 4: New work landed in the code for more than a year with first having only nightly testers run it every day, later alpha/beta testers running snapshots created along the way that included fixes found in some internal QA in addition to the nightly testing, but that was it for the alphas and betas - and at the point where those got shipped, we already had land the next set of feature changes on top of the code shipped there. From the view of crash analysis, this meant that we had a smaller audience of nightly testers sending crash reports we could analyze and from that see the larger and more obvious regressions from daily changes. And then there was a larger audience of beta testers that sent more data, which allowed a look at what happened with somewhat more real-world usage, but as soon as we got some good data in on those betas, the code on nightly on the way to the next beta might already have changed significantly again. With that, the most grave issues could be addressed, but sometimes it was hard to see how relevant the data from even the current beta still was. This game went on until the final betas, with increasing urgency of getting things in that should still make the release at the last second, and of course us as well as testers seeing new regressions that needed to be fixed. The criteria for accepting things into the code was being tightened up a lot towards final release, but some new feature work or invasive changes could even still be rushed into the code almost to the last minute. And the pressure was high to "get this in now or wait at least another year until users get it", so even with release drivers tightening possible changes up, some of those could still be argued for. When we shipped the final release to the really larger user audience with more than a year of piled up feature work and fixes, we very soon, usually even directly on release day or the next day, already have a list of quite visible stability problems we needed to get fixed a couple of weeks out in a stability update.
I hope you can see from this description that while we managed to control stability reasonably, the process was far from ideal for providing a product with which we could be happy in terms of stability. So when planning went into improving the processes and becoming more agile and fit for delivering features more quickly than before, a lot of thinking also went into how to make the new process give us a better story of stabilization - and I think the solution holds up pretty well.
So, what we're doing now is getting in feature work and invasive changes into the base code and to Nightly testers almost as before, with only the difference that every such change must have an easy off-switch or be easy enough to reverse the change ("back it out") otherwise. We also still analyze crash data for this and spot major regressions there.
But with going to a next level, there comes the first major change: Every six weeks we're taking a snapshot of this Nightly code and put it on what we now call "Aurora", test it internally, disable things that are absolutely broken (as we have the off-switch/backout possibility) as found by internal QA and send it out to a somewhat larger testing audience. In the next six weeks, we are collecting data from that, reacting to user feedback and crash analysis and bringing in rather small fixes to those problems only or disable further broken features when a fix would be too invasive. We deliver the result daily to that Aurora audience in updates, getting more testing and crash data to analyze, based on the very same snapshot of code, without any more new feature or invasive work to go into it - that continues only on Nightly, no place for that in Aurora.
After those six weeks, this already fixed and stabilized snapshot is going to yet another level, which we call "Beta", and which has even more testers it's being delivered to (while Aurora picks up a new snapshot from Nightly). When the snapshot comes into the Beta phase, we have already put in six weeks of exclusively stabilization and fixing, so it is good enough for what we in earlier times probably would have called a "release candidate". It is as ready as we know at this stage as it can be - but exposing it to an even wider audience, now going into the millions, and which uses it for more normal day-to-day production work, usually turns up another class of potential problems. To deal with those, we could go and disable even more code if needed, and can apply some more small fixes, including of course crash fixes, and we deliver those to Beta testers with roughly weekly updates. Due to this being the first time this code snapshot is being exposed to a public of millions, it's usually the first time we get enough data to see some crash patterns more clearly and can get those fixed. Once again, no new feature or invasive work going into those six weeks of Beta, only disabling of problematic changes, fixing problems found in feedback and of course stability/crashes.
Having spent another six weeks in Beta, twelve weeks or three months of only fixing and stabilizing after taking the snapshot from development, and being OKed by a go/no-go meeting of release drivers, we ship this code to hundreds of millions of users as our next Firefox release (while the other snapshot moves from Aurora to Beta and yet another one is taken from Nightly into Aurora). Of course, we keep analyzing crash reports even from the release users and are able to react to large issues we haven't found before to do a fast fixup release (which we shouldn't need after looking at all the Aurora and Beta data from essentially the same code) and to smaller issues in the next round of Beta etc. before they go to being the next release.
In all this, we always have only six weeks of new development work isolated in every such snapshot (or "version") and not more than a year like previously, so pinpointing a cause gets easier. Then, we less of a rush to get a feature into a specific version as there's another one coming just six weeks earlier, so things will only go into the code in a better thought-out state. Even more, we have switches of some way we can throw to disable problematic code and give developers six more weeks to get it into shape if needed. And over all that, we have roughly three months (twelve weeks) of pure fixing and stabilization period on every snapshot/version to get problems worked out, with different sizes of testing audiences.
Of course, there are still some kinks to be worked out and the transition is not easy for everyone. Next to other concerns we've heard of some people and which belong in different forums than this particular blog entry, we have not scaled up the audiences esp. on Aurora but also on Beta up to what we want yet and therefore are not seeing as much data on them yet as we'd like to (the top crash/hang issue on Beta is typically seen by less than one in every 1000 daily users). So, there are still ways we can and need to improve things here to make it work for stability even better.
Still, having smaller sets of changes per release, no rushed landings of features and built-in calm stabilization periods of that length are all working together to improve stability, in my eyes - as long as people send in their crash reports and we continue to analyze them, of course.