
Why Rapid Releases Can Improve Stability

I have mentioned a few times, mostly in newsgroup discussions, that I strongly believe the rapid release model Firefox now follows has a good chance of improving stability.

Some people without deeper knowledge of how our new process works have at times implied that releasing far more often must make the product less stable and lower in quality than the one- or two-year cycles we had before. Given my multi-year experience in release management of a Mozilla product (SeaMonkey), and the insight that gave me into Firefox release management for the last few versions up to and including Firefox 4, comparing those experiences with the new model points in exactly the opposite direction: stability and quality should actually improve the more we get used to this "train" model, and the closer we get to the projected user volume on the different "channels".

"Traditional" Process

Let's first look at how things worked with the old process that we used until now, including for Firefox 4. New work landed in the code for more than a year. At first only nightly testers ran it every day; later, alpha/beta testers ran snapshots created along the way, which included fixes for problems found by some internal QA in addition to the nightly testing. But that was it for the alphas and betas - and by the time those shipped, we had already landed the next set of feature changes on top of the code shipped there.
From the view of crash analysis, this meant that we had a smaller audience of nightly testers sending crash reports we could analyze, and from those we could see the larger and more obvious regressions from daily changes. Then there was a larger audience of beta testers who sent more data, which allowed a look at what happened with somewhat more real-world usage. But as soon as we got some good data in on those betas, the code on nightly on the way to the next beta might already have changed significantly again. With that, the gravest issues could be addressed, but sometimes it was hard to see how relevant the data from even the current beta still was.
This game went on until the final betas, with increasing urgency to get things in at the last second that should still make the release, and of course with us as well as testers seeing new regressions that needed to be fixed. The criteria for accepting things into the code were tightened up a lot towards final release, but some new feature work or invasive changes could still be rushed into the code almost at the last minute. And the pressure was high to "get this in now or wait at least another year until users get it", so even with release drivers tightening up on possible changes, some of those could still be argued for.
When we shipped the final release to the much larger user audience with more than a year of piled-up feature work and fixes, we very soon, usually on release day or the day after, already had a list of quite visible stability problems that we needed to fix a couple of weeks later in a stability update.

I hope you can see from this description that while we managed to control stability reasonably, the process was far from ideal for providing a product with which we could be happy in terms of stability. So when planning went into improving the processes and becoming more agile and fit for delivering features more quickly than before, a lot of thinking also went into how to make the new process give us a better story of stabilization - and I think the solution holds up pretty well.

"Rapid" Process

So, what we're doing now is getting feature work and invasive changes into the base code and out to Nightly testers almost as before, with the only difference being that every such change must have an easy off-switch or otherwise be easy enough to reverse ("back out"). We also still analyze crash data for this and spot major regressions there.
But with the move to the next level comes the first major change: Every six weeks we take a snapshot of this Nightly code and put it on what we now call "Aurora", test it internally, disable things that internal QA finds to be absolutely broken (as we have the off-switch/backout possibility), and send it out to a somewhat larger testing audience. Over the next six weeks, we collect data from that audience, react to user feedback and crash analysis, and bring in only rather small fixes for those problems, or disable further broken features when a fix would be too invasive. We deliver the result daily to that Aurora audience in updates, getting more testing and crash data to analyze, based on the very same snapshot of code, without any new feature or invasive work going into it - that continues only on Nightly; there is no place for it in Aurora.
After those six weeks, this already fixed and stabilized snapshot goes to yet another level, which we call "Beta", and which is delivered to even more testers (while Aurora picks up a new snapshot from Nightly). When the snapshot comes into the Beta phase, we have already put in six weeks of exclusively stabilization and fixing, so it is good enough for what we in earlier times probably would have called a "release candidate". At this stage it is as ready as we know how to make it - but exposing it to an even wider audience, now going into the millions, which uses it for more normal day-to-day production work, usually turns up another class of potential problems. To deal with those, we can disable even more code if needed and apply some more small fixes, including of course crash fixes, and we deliver those to Beta testers with roughly weekly updates. Because this is the first time this code snapshot is exposed to an audience of millions, it's usually the first time we get enough data to see some crash patterns more clearly and can get those fixed. Once again, no new feature or invasive work goes into those six weeks of Beta, only disabling of problematic changes and fixing of problems found through feedback and, of course, stability/crash analysis.
Having spent another six weeks in Beta, for a total of twelve weeks or three months of only fixing and stabilizing after the snapshot was taken from development, and having been OKed by a go/no-go meeting of release drivers, this code ships to hundreds of millions of users as our next Firefox release (while the other snapshot moves from Aurora to Beta and yet another one is taken from Nightly into Aurora). Of course, we keep analyzing crash reports even from release users, and we are able to react to large issues we haven't found before with a fast fixup release (which we shouldn't need after looking at all the Aurora and Beta data from essentially the same code) and to smaller issues in the next round of Beta etc. before they become the next release.
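
To make the "train" schedule a bit more concrete, here is a minimal Python sketch of how snapshots move between channels. It is only an illustration under simplifying assumptions: snapshots are numbered from 0, every stage lasts exactly six weeks, and the names `CYCLE_WEEKS`, `CHANNELS`, `STABILIZATION_WEEKS` and `snapshot_on` are made up for this example and are not part of any Mozilla tooling.

```python
from typing import Optional

CYCLE_WEEKS = 6  # length of one stage, as described in the post
CHANNELS = ("Nightly", "Aurora", "Beta", "Release")

# Aurora and Beta together are the pure fixing/stabilization period.
STABILIZATION_WEEKS = CYCLE_WEEKS * 2


def snapshot_on(channel: str, week: int) -> Optional[int]:
    """Return the number of the snapshot on `channel` in the given week.

    Assumption for this sketch: snapshot n collects new work on Nightly
    during weeks [6n, 6n + 6) and then moves up one channel every six weeks.
    Returns None if no snapshot has reached that channel yet.
    """
    cycle = week // CYCLE_WEEKS
    snapshot = cycle - CHANNELS.index(channel)
    return snapshot if snapshot >= 0 else None


if __name__ == "__main__":
    # Print which snapshot sits on which channel at each six-week boundary.
    for week in range(0, 25, CYCLE_WEEKS):
        state = {channel: snapshot_on(channel, week) for channel in CHANNELS}
        print(f"week {week:2d}: {state}")
```

In this simplified model, the first snapshot reaches Release in week 18, after six weeks of development on Nightly and twelve weeks of stabilization, and from then on every channel carries a different snapshot at the same time.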

In all this, we always have only six weeks of new development work isolated in every such snapshot (or "version"), not more than a year as previously, so pinpointing a cause gets easier. Also, we have less of a rush to get a feature into a specific version, as there's another one coming just six weeks later, so things will only go into the code in a better thought-out state. Even more, we have switches of some kind that we can throw to disable problematic code and give developers six more weeks to get it into shape if needed. And on top of all that, we have roughly three months (twelve weeks) of pure fixing and stabilization on every snapshot/version to get problems worked out, with different sizes of testing audiences.
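
The off-switch idea can be sketched in a few lines of Python as well. This is a hypothetical model, not how Firefox actually implements its switches: the `Snapshot` class, the `branch` helper and the feature names are all invented for illustration. The point it tries to show is that disabling a feature on a stabilizing snapshot does not touch Nightly, where developers can keep working on it for the next train.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Snapshot:
    """Hypothetical model of a code snapshot with per-feature off-switches."""
    channel: str
    switches: Dict[str, bool] = field(default_factory=dict)

    def land(self, feature: str) -> None:
        # New work lands enabled, but always behind a switch.
        self.switches[feature] = True

    def disable(self, feature: str) -> None:
        # Throw the off-switch instead of attempting an invasive fix.
        if feature not in self.switches:
            raise KeyError(f"unknown feature: {feature}")
        self.switches[feature] = False

    def enabled(self) -> List[str]:
        return sorted(name for name, on in self.switches.items() if on)


def branch(source: Snapshot, channel: str) -> Snapshot:
    """Take an independent snapshot of `source` for another channel."""
    return Snapshot(channel=channel, switches=dict(source.switches))


# Two made-up features land on Nightly during one six-week cycle.
nightly = Snapshot("Nightly")
nightly.land("feature_a")
nightly.land("feature_b")

# The snapshot moves to Aurora; QA finds feature_b broken and disables it there.
aurora = branch(nightly, "Aurora")
aurora.disable("feature_b")

print(aurora.enabled())   # features that continue on this train
print(nightly.enabled())  # Nightly is unaffected by the Aurora switch
```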

Of course, there are still some kinks to be worked out, and the transition is not easy for everyone. Apart from other concerns we've heard from some people, which belong in different forums than this particular blog entry, we have not yet scaled up the audiences, especially on Aurora but also on Beta, to what we want, and therefore are not yet seeing as much data on them as we'd like (the top crash/hang issue on Beta is typically seen by fewer than one in every 1000 daily users). So, there are still ways we can and need to improve things here to make it work even better for stability.

Still, having smaller sets of changes per release, no rushed landings of features and built-in calm stabilization periods of that length are all working together to improve stability, in my eyes - as long as people send in their crash reports and we continue to analyze them, of course.

Entry written by KaiRo and posted on August 27, 2011 04:03 | Tags: CrashKill, Firefox, Mozilla, release | 13 comments

TrackBack/Pingback

Home of KaiRo: Weekly Status Report, W34/2011 (Pingback)

Comments


Author	Entry
njn	Shorter paragraphs, please! I didn't read most of that, if it had shorter paragraphs I probably would have. Oh, the "accept our policy" checkbox on comments is really obnoxious. Even worse is the fact that you click on it, then hit "preview" and it gets cleared, so then when you hit "send" it says "you have to accept our policy" even though you already did. Argh. Oh, for heaven's sake, you have to answer a new captcha arithmetic question every time you hit "preview" as well? 27.08.2011 05:17
Pete	njn: Shorter Paragraphs njn, who are you that you cannot read some sentences in a row? Maybe you should start reading a novel to get some practice. Books are these books with letters only, no pictures. Pete 27.08.2011 06:37
Pete	You need at least two versions While answering to non I complete forgot to post my comments to kairos statements... Kairo, the problem is that HTML5 by itself is changing and changes between HTML4 and HTML5 need to be implemented. This means that for enterprise users a 6 week cycle simply does not work. So, Asa's ways is consistent when he only focuses on the end user. But you have to be aware to lose enterprise users. Simple as that. Further moving from the current API to Jetpack also needs a lot of changes for extension developers and a lot of frustration for the user when every six week a now popup shows up saying that extension XYZ does not work anymore. Even when three days later this extension is working again. Bottomline: When you want to go along with the rapid release cycle, you need a stable version with almost the same UI as a backup. The UI (hopefully!!!) does not change every six weeks. I can completely understand that the new process streamlines the release process but paying that with losing enterprises and extensions seems for me too expensive. 27.08.2011 06:51
Thinus from South Africa	Why Rapid releases is a fail I just think that Mozilla just does not get it. I am a user. I don't use any any add-on except the dictionary. I have been a Firefox user since Firefox 1.0. I am about to stop using Firefox. You want to know why? Because the end user experience have become really painful. Here are a couple of examples: Apptab: Love the concept, but, since I have grouped with sub menus, I use the open all tabs feature a lot. Well, do that in an apptab and you get your bookmarks, rather than your apptab. Open a single bookmark, it jumps to a new tab, keeping your apptab. Stupid? Do you know how many times that happen? Solution, don't bother with apptabs. Opening multiple tabs at once, the browser freezes for a couple of seconds. Open a Facebook game, all tabs froze for a couple of seconds. Since Firefox 4 I have much more complete Firefox crashes. Firefox simply quit. Restarting is also painful as Firefox try to stumble back to life. One of the cool features of Firefox 3 was asking to save your tabs when exiting. Now it is gone without a easy tickbox setting to get it back. These are just some of the reasons why I dislike Firefox 4 and beyond. Rapid release have done little to improve my Firefox experience. In the end, end users don't care about all the cool tech or cutting edge HTML5. They just want a smooth and great user experience. Firefox 4, 5 or 6 is it not. So you want me to stick around, stop wasting time about defending stupid version numbers or rapid releases. Just built me a browser that works, where the driving force is the end user experience, security and stability. 27.08.2011 18:46
Maurice from California	RAGE INDEX I agree that the rapid release can improve code quality (Although I wonder if having more versions in the pipe causes some confusion for developers when jumping between release, beta, alpha, & nightly; "is that change in this version?!?") But a sane versioning system has to come back or some designers (whose work I have mostly been impressed by) need to leave. Watching the disdain for user opinion in some controversial bugs has led me personally not to bother searching for/filing a bug because its tied to lower rights win xp accounts and administrated/business needs firefox bugs seem unwelcome. Sorry for the rant but some of us who have helped spread Firefox and are not so close to the development see a lot of regular users turning against the product for small irritations. Ironically, I believe the underlying code has improved greatly. I hope the people doing the work on this project understand that we critics are trying to sound a warning, not trying to stop all change. Hopefully I have communicated that without offending. Since you are the go to guy for Firefox stats now, perhaps you could add a "rage index" stat connected to a RAGE button in the UI! Yeah, probably not. 27.08.2011 21:29
Tony Mechelynck from Brussels, Belgium	to njn: Captcha and checkbox Know what? As long as you are only previewing, you can do away with the arithmetic captcha and the "I accept the policy" checkbox. It's only when clicking "Send" that they have to be set "the only right way". (And I just tested it now by previewing with the captcha box empty and the checkbox unchecked: my comment appeared in preview as it was before I added this parenthese.) 27.08.2011 23:24
KaiRo Webmaster	I will not answer things about the general release process here, I am neither a release manager/driver nor a product manager nor something similar for Firefox and I don't think I will be any time soon (though I have learned to be careful with such words). I'm only doing crash analysis and working with others to improve our instruments for said analysis, and all this blog post is doing is to shed a light on the rapid release process from that point of view. 28.08.2011 03:23
EP	well Kairo, they have spoken (i'm referring to Pete & Thinus being critical of the mozilla rapid release process). those two may want to direct their complaints about the rapid release process to the Mozilla Support site. fortunately for me, I DONT have to upgrade to newer versions of Firefox & Seamonkey every 6 weeks. there's no need for me to. at least I HAVE A CHOICE. turned off automatic updating for both FF & SM on all my machines so they won't install the newer versions every six weeks. I'll just consider upgrading Firefox & Seamonkey to every other new version, meaning if I have FF5 on my machines, I'll skip FF6 and wait for FF7 to come out and upgrade to that one and ditto for SM, upgrade SM 2.2 to 2.4 and skip SM 2.3. again, I HAVE A CHOICE, which Mozilla can't take away from me. 29.08.2011 19:11
EP	although I like Wladimir Palant's idea on how mozilla's rapid release process should work. he blogged about it on the Ad Block plus web site recently: http://adblockplus.org/blog/on-rapid-releases-and-version-numbers 29.08.2011 19:24
jmdesp	Bad stability story for Thunderbird 5 Kairo, it's not a 100% success yet, at least for Thunderbird. When my mum installed Thunderbird 5, it would just crash at start every time. Searching for the cause on crash-stat, quite quickly led me to see there was a lot of complaints of users with a similar experience, and then to bug 662634 and several other associated bugs. That bug generated several different backtraces that I believe were almost all in the top crashes for Thunderbird 5, from what I've seen in the bugs I think it was responsible for at least 3 of the top 5. And whilst the fix (bug 660778) was checked in on the 11 of July, users had to wait until Thunderbird 6 for the fix which is a long time when your Thunderbird is completely broken. I think you should consider checking the users having this problem and that left a description and their mail on stack-trace, and warn them they should download and install TB 6 that will probably fix it. I think the gesture would be appreciated. 30.08.2011 21:00

