The roads I take...

KaiRo's weBlog

October 2024
123456
78910111213
14151617181920
21222324252627
28293031

Displaying recent entries in English and tagged with "CrashKill". Back to all recent entries

Popular tags: Mozilla, SeaMonkey, L10n, Status, Firefox

Used languages: English, German

Archives:

July 2023

February 2022

March 2021

more...

May 16th, 2016

Tools I Wrote for Crash (Stats) Analysis

Now that I'm off the job that dominated my life (and almost burned me out) for the last years, I finally have some time again to blog. And I'll start with stuff I actually did for that job, as I still am happy to help others to continue from where I left.

The more fun part of the stability management job was actually creating new analysis - and tools. And those tools are still helpful to people working on crash analysis or crash stats analysis now - so as my last task on the job, I wrote some documentation for the tools I had created.

One of the first things I created (and which was part of the original job description when I started) was a prototype for detecting crash "explosiveness", i.e. a detector for crashes that are rising significantly in volume. This turned out to be quite helpful for me and others to use, and the newest reports of it are listed in my Report Overview. I probably should talk about it in more detail at some point, but I did write up a plan on the wiki for the tool, and the (PHP) code is on hg.m.o (that was the language I knew best and gave me the fastest result for a prototype). I had plans to port/rewrite it in python, but didn't get to it. Calixte, who is looking after most of "my" tools now, is working on that though, and I have already promised to review his work as a volunteer so we can make sure we have this helpful capability in better code (and hopefully better UI in the end) for future use.

In general, I have created one-line docs for all the PHP scripts I had in the Mercurial repository, and put them into the run-reports script that is called by a daily cron job. Outside of the explosiveness script, most of those have been obsoleted by Socorro Super Search (yay for Adrian's work and for the ElasticSearch backend!) nowadays.

Also, the scripts that generate the summed-up data for Are We Stable Yet dashboard and graphs (also see an older blog post discussing the graphs) have been ported to python (thanks Peter for helping me to get started there) - and those are available in the Magdalena repository on GitHub. You'll see that this repository doesn't just have more modern code, using python instead of PHP and the public Socorro API instead of private PostgreSQL access, it also has a decent README documenting what it and every script in it does. :)

The most important tools for people analyzing crash stats are in the Datil repository on GitHub (and its deployment on crash-analysis), though. I used all those 4 dashboards/tools daily in the last months to determine what to report to Release Managers and other parties, find out what we need to file as bugs and/or push to get fixed. Datil, like Magdalena, has good docs right in the repository now, readable directly on GitHub.

So, what's there?
Well, the before-mentioned "Are We Stable Yet" dashboard and graphs, for sure (see the longtermgraph docs for what graphs you can get and a legend of what the lines mean).
There's also a tool/prototype for "what's important" weighed top crash lists that I called "Top Crash Score", see the score docs for what it does and examples on how to use that tool.
And finally, I created a search query comparison tools that did let me answer questions like "which crashes happen more with or without multi-process support (e10s) being active?" or "which crashes have vanished with the new beta and which have appeared (instead)?" - which was incredibly helpful to me at least. Read the searchcompare docs for more details and examples.

I probably won't spend a lot of time with those tools any more, neither in usage nor in development, but I'm still happy about people using them, giving me feedback, and I'm also happy to review and merge pull requests that feel like making sense to me!

By KaiRo, at 22:33 | Tags: analysis, CrashKill, explosiveness, Mozilla, stability | no comments | TrackBack: 0

May 4th, 2016

Projects Done, Looking For New Ones

I haven't been blogging much recently, but it's time to change that - like multiple things in my life that are changing right now.

I'll start with the most important piece first: My contract with Mozilla is ending in a week.

I had been accumulating frustration with pieces of my role that were founded in somewhat tedious routine like the whack-a-mole on crash spikes which was not very rewarding as well as never really giving time to breath and then overworking myself trying to get the needed success experiences in things like building dashboards and digging into data (which I really liked).
Being very passionate about Mozilla's Mission and Manifesto and identifying with the goals of my role I could for years paper over this frustration and fatigue but it kept building up in the background until it started impairing my strongest skill: communication with other people.

So, we had to call an end to this particular project - a role like this is never "finished", but it's also far from "failed" as I accomplished quite a bit over those 5 years, in various variants of the role.

After some cooldown and getting this out of my system, I'm happy to take on a new role of project management, possibly combined with some data analysis, somewhere, hopefully in an innovative area that aligns with my interests and possibly my passion for people being in control of their own lives.

As for Mozilla, no matter if an opportunity for work comes up there, I will surely stay around in the community, as I was before - after all, I still believe in the project and our mission and expect to continue to do so.

In other project management news, I just successfully finished the project of taking over my new condo and move in within a week. It took quite some coordination and planning beforehand, being prepared for last-minute changes, communicating well with all the different involved people and making informed but swift decisions at times - and it worked out perfectly. Sure, to put it into IT terms, there are still a few "bugs" left (some already fixed) and there's still a lot of followup work to do (need more furniture etc.) but the project "shipped" on time.

I'm looking forward to doing the same for future work projects, wherever they will manifest.

By KaiRo, at 16:51 | Tags: burnout, CrashKill, Mozilla, project management, stability, stress | no comments | TrackBack: 0

October 13th, 2015

Shortening Crash Signatures: Dropping Argument Lists

Crash signatures derived from the function on the stack are how we group ("bucket") crashes as belonging to a certain issue or bug. They should be precise enough to identify this "bucket" but also short enough so we can handle it as a denominator in lists and when talking about those issues. For some time, we have seen that our signatures are very long-winded and contain parts that make it sometimes even harder to bucket correctly. To fix that, we set out to shorten our crash signatures.

We already completed a first step of this effort in June: After we found that templates in signatures were often fluctuating wildly in crashes that belonged to the same bug, all <sometemplate> parts of crash signatures were replaced by just <T>.

That made a signature like this (from bug 1045509, the [@ …] are our customary delimiters for signatures, not really part of the signature itself though):
[@ nsTArray_base<nsTArrayFallibleAllocator, nsTArray_CopyWithMemutils>::UsesAutoArrayBuffer() | nsTArray_Impl<unsigned char, nsTArrayFallibleAllocator>::SizeOfExcludingThis(unsigned int (*)(void const*)) ]

be shortended to:
[@ nsTArray_base<T>::UsesAutoArrayBuffer() | nsTArray_Impl<T>::SizeOfExcludingThis(unsigned int (*)(void const*)) ]

Which is definitely somewhat better to read and put in tables like topcrash reports, etc. - and we found it did not munge bugs together into the same signature more than previously, at least to our knowledge.


But we found out we can go even further: Different argument lists of functions (mostly due to overloading) did as far as I remember not help us distinguish any bugs in the >4 years I have been working with crashes - but patches changing types of arguments or adding one to a function often made us lose the connection between a bug and the signature. Therefore, we are removing argument lists from the signatures.

The signature listed above will turn out as:
[@ nsTArray_base<T>::UsesAutoArrayBuffer | nsTArray_Impl<T>::SizeOfExcludingThis ]


Today, we have run a script on Bugzilla (see bug 1178094) to update all affected bugs to add the new shortened signature to the Crash Signatures field without sending a ton of bugmail.

We have tested in the last weeks that Socorro crash-stats can create the new shortened signatures fine on their staging setup and that generation of the special "shutdownhang | …" signatures for browser processes that did take more than 60s to shut down and "OOM | …" for out-of-memory crashes do still work in all cases where they worked before.


As all preparation has been done, we will flip the switch on production Socorro crash-stats in the next days, and then those shortened signatures will be created everywhere.


Note that this will impede some stats that are comparing signatures across days, even though we will see to reprocess some crashes to make the watershed be at a UTC day delimiter so that as few stats as possible are disturbed by the change.


Please let me know of any issues with those changes (as well as any other questions about or issues with crash analysis), and thanks to Lars, Byron (glob) and others who helped with those changes!

By KaiRo, at 20:14 | Tags: Bugzilla, CrashKill, Mozilla, Socorro | 2 comments | TrackBack: 0

April 1st, 2014

How Effective is the Mozilla Stability Program?

One of my goals for last quarter was to get some basic metrics for the effectiveness of Mozilla's stability program. This can most easily be determined by measuring how often Firefox Desktop and Firefox for Android crash over time. Below you'll find some graphs and discussion on the data I could gather on that topic so far.

The Crash Rate

The crash rate is our primary stability measure used at Mozilla. We measure this rate in "crashes per 100 active daily installations (ADI)" or "crashes / 100 ADI". (ADI is the number of daily requests sent by Firefox Desktop and Firefox for Android to update their copy of our add-on blocklist. This value is considered a good enough estimation for usage for our purposes.)

Challenges for a Long-Term Rate

In our daily work, we tend to look at crash rates in terms of short-term changes within a single version, esp. development versions, so we can determine regressions and then dig deeper into what those are. For determining long-term program efficiency, it makes sense though to look at cross-version crash rates instead, so we know how our releases (or betas) improved. So it might make sense to look at all users on the release "channel", i.e. anyone using a stable release. On the other hand, we sometimes have leftover users of old and unsupported users producing a lot of crashes, but those are not really relevant to our current effectiveness of the stability program, so I wanted some way to age out old versions from this overall rate. To take all that into account, I needed some way to more or less "concatenate" the stability rate graphs of a series of versions. Also, people updating to or installing a release very soon after it's published tend to have somewhat different usage patterns than those installing it only after some time and therefore crash rates to those updating late in the cycle, so I needed to find some way to smoothen over that as well and ideally make this into an algorithm that can be automatically requested and put into an SQL query (as the data I base this on is in a PostgreSQL database).

Used Methodology

So, I began to think we could always sum up the crash and ADI numbers of the most recent two releases, or the ones that have the most users. But sometimes we release two adjacent versions 6 weeks apart and sometimes we do a fast update after a week and when the second of those is released, the one before might not have a lot of people updated to it yet so taking only those two might only cover a small portion of users and skew the numbers. So in the end, I decided to go with a moving window that always counts all versions where the builds have been created within the last 12 weeks for the Release channel, and the last 4 weeks for the Beta channel (I had 9 and 3 in the beginning but extended that to make numbers smooth over the impact of the 2-week hiatus we had over New Year's this year). The data we have in usable form goes back to the last few days of September 2011, so that's what I could use for the graphs (I'm trying to get some older data but that is harder to dig out).

Graphs & Discussion

So, here are some screen copies of the graphs I have created out of the data collected with that algorithm (includes data up to March 5, which was current when I originally wrote up this post):

Image No. 23207

The first graph, with data from the Firefox desktop release channel, shows three lines, as the legend says the include crashes of the browser process, those of a plugin process (the vast majority of the plugin processes are Adobe Flash), and so-called "hangs" where we kill the plugin process after it doesn't react to contact from the browser process for a long time (by default, 45 seconds).
For one thing, you'll see that weekends have higher crash rates than weekdays. This could for example be because the ADI data isn't as reliable/accurate as one would hope or because people using Firefox on weekends do things that are more crash-prone (including work/home usage pattern and possibly machine differences).
In this graph we can also clearly see the results of known stability events in this time frame: For example, it nicely shows the Google Doodle crash of August 2012, where almost every startup of the browser crashed when Google was set as the home page, and where we scrambled to get a fix out in very short time (and Google helped us by putting a workaround in their doodles as well). It's also easy to see a few other sharp spikes where we had ADI (upwards) or crash submission (downwards) issues, as well as the crash-and-hang-rich Flash 11.3 release in June of 2012 and subsequent fixes for Flash, including the concerted efforts between Adobe and us to get down to the old levels with fixes on both sides in May/June 2013. For the most time on the graph, you'll see that the browser crash rate didn't change very much (other than the sharp spikes mentioned). In January of 2013, though, it's possible to see the rise in crashes that caused us to ship Firefox 18.0.2 with a fix for that. Right following that, at the end of February, you'll see the sharp rise in crashes when we released Firefox 19.0, triggered by a bug in certain AMD CPUs, which we worked around by rebuilding and releasing a 19.0.1. Those examples, like anything showing up in that graph significantly and not being a data error has pretty intricate story, any of those could make up a separate blog post.
That said, the fact that we could keep the crash rate pretty much at 1.0 browser crashes / 100 ADI over that whole time (and even slightly improve to just below that with the Firefox 26 release in December 2013) is a statement on how effective the Mozilla Stability Program is on keeping Firefox crashes down even though a whole lot of code has been added to support a ton of new features that the web has gained over that time.

Now, let's see how Firefox Beta looks in comparison:

Image No. 23208

At the end of 2012, we apparently did manage to improve base level stability of the Beta channel, but you'll see that this channel is more noisy - which is expected as here we still see regressions and work on fixes before the issues hit release. For example, you can see that Firefox 27 Beta regressed stability in December 2013. We fixed that only very late in the cycle so that you don't see 27 being worse on the Release channel, but 28 had other regressions in the beginning and a rather large one in 28 Beta 4 (mid February 2014) - once we fixed that, you see that we come down to the 1.0 line in the last one or two weeks, so that looks pretty good for the 28 release, which was to be released ~2 weeks after the end of that data.
Also, you'll see that the plugin improvements of early 2013 are about 6 weeks earlier in Beta than in Release, which shows pretty well that there were actual patches in our code that helped with Flash hangs and crashes (as our code is on a 6-week cadence while Adobe's releases hit both channels pretty much at the same time).

Now, let's see how the picture looks when we look at a product that was newly created while we already had the mechanisms in place to record this data, like the current "native UI" Firefox for Android:

Image No. 23211

The early releases had higher crash rates, but we significantly improved over time due to our efforts in the Stability Program. You also can make out that the sharper changes happen pretty exactly at the edges of the 6-week release cycles. Also, you'll see that Firefox 23 for Android in September 2013 was pretty good but we became worse in the following months. Because of that, we started a renewed effort to improve stability of Firefox for Android this January. The current Firefox 27 for Android release is somewhat better than the one before, but it's not where we want to be yet, obviously. We didn't have too much time to pound on issues from the start of the year until 27 was release, but Beta can show us if our newer efforts are pointing in the right direction:

Image No. 23212

Now this graph looks pretty nice, doesn't it? When we started off putting this product on Beta the first time, we were seeing the usual churn of exposing a new product to a wider audience for the first time, but we burned down the issues pretty well. Then we had a big regression, fixed it, and burned down bugs slowly over multiple months again. The regressions of late 2013 look even more dramatic here as we had even worse issues there but could actually fix the worst parts of those so that the regressions on the Release channel weren't as bad as the first Betas we had there. Many of the 6-week cycles in this graph look like burn-down charts, high in the beginning, going down over the cycle as we push for bugs being fixed. It's also pretty awesome to see how the efforts since the start of this year have really paid off and current Beta is rivaling the best Beta numbers we had so far - you can imagine how I was looking forward to Firefox 28 for Android hitting Release based on that data! :)

All that said, we know there's more we can do on both products, and while holding crash rates pretty stable over a long time while adding a ton of features is awesome, we strive for improving overall stability. Those graphs are one part of measuring the effectiveness of the stability program. I hope we will be able to put them up in a more dynamic and daily updating form at some point (right now I manually construct them in LibreOffice).

And in case you're interested in digging deeper into the source of the graphs, the code to pull the data from the crash-stats DB is in my crash-report-tools repo and the JSON coming out of that and powering my charts is in my directory on crash-analysis (F*-bytype.json files). Also feel free to contact me for more details.

By KaiRo, at 19:58 | Tags: CrashKill, Mozilla, stability | 1 comment | TrackBack: 0

November 1st, 2013

Badges for Stability Contributions?

Various groups at Mozilla have been thinking about awarding Open Badges for contributors in the recent year or so (there's even a contributor badges site for easily awarding them). I again and again wonder if we could do that for stability contributions as well, but it's a tough nut to crack.



One idea would be to award badges automatically to people who leave their email in a crash report (i.e. a badge for taking part in our efforts by delivering data through crash reports) - but that would probably end up being a bad badge because you' award people for crashing Firefox (and not award people who don't see crashes). We really do not want people to search for how they can crash so that they can get a badge - after all our mission is to avoid crashes, not provoke them. So, no badges for submitting crashes (thanks to Benjamin for pointing me to that issue right away when I was rolling that idea).

So, how can we potentially award a badge for Stability/"CrashKill" work without rewarding bad behavior?

One thing I could think of was "filed x crash bugs that ended up fixed". That should be relatively easy to get out of Bugzilla data and I think there's no doubt that filing bugs that end up with a fixed resolution is a good thing.
This idea would also open up the avenue for different badges for different amounts of badges (similar to the webdev badges), or to create a badge for developers who fixed x crash bugs, and similar things.

What do you think? Which ideas do you have for awarding badges for contributions to the Mozilla Stability program?

By KaiRo, at 18:14 | Tags: CrashKill, Mozilla | 1 comment | TrackBack: 0

April 4th, 2013

15, 14, 13, 8, 7, 2 years ago, and the future? My Web Story

It all started on March 31, 1998. Just a few days off from 15 years ago.

Netscape open-sourced the code to its "Communicator" Internet suite, using its own long-standing internal code name as a label for that project: Mozilla.

I always liked the sub-line of a lot of the marketing material for this time - under the Mozilla star/lizard logo and a huge-font "hack", the material said "This technology could fall into the right hands". And so it did, even if that took time. You can learn a lot about that time by watching the Code Rush movie, which is available under a Creative Commons license nowadays. And our "Chief Lizard Wrangler" and project leader Mitchell Baker also summarized a lot of the following history of Mozilla in a talk that was recorded a couple of years ago.

Just about a year later, in May 1999, so 14 years ago, I filed my first bug after I had downloaded one of the early experimental builds of the Mozilla suite, building on the brand-new Gecko rendering engine. This one and most I filed back then were rendering issues with that new engine, mostly with my pretty new and primitive first personal homepage I had set up on my university account. After some experiments with CSS-based theming of the Mozilla suite, I did some playing around with exchanging strings in the UI and translating them to German, just to see how this new "XUL" stuff worked. This ended up in my first contribution contact and me providing a first completely German-language build on January 1, 2000.

A few months after that, in May, I submitted my first patch to the Mozilla project, which was a website change, actually. But only weeks later, I created a bug and patch against the actual Mozilla code - in June of 2000, 13 years ago. And it would by far not be the last one, even though my contributions the that code were small for years, a fix for a UI file here, a build fix for L10n stuff there. My main contributions stayed in doing the German localization for the suite and in general L10n-related issues. Even when Firefox came along in 2004, I helped that 1.0 release with some localization-related issues, esp. around localized snippets for its Google-based and -hosted start page - and stayed with L10n for the full suite otherwise (while Kadir would do the German Firefox L10n). I wrote a post in 2007 about how I stumbled into my Mozilla career.

As Firefox became rapidly successful and took an increasingly large standing in the project and community, I stuck with the suite as I liked a more integrated experience of email and browser - and I liked the richer feature set that the suite had to offer (Firefox did cut out a lot of functionality in the beginning to be able to found its new, leaner and more consumer-friendly UI). When in March of 2005, it became clear that the suite was going into strict maintenance mode and be abandoned by the "official" Mozilla project, I joined the team that took over maintenance and development of that suite - once again using a long-standing internal code name for that: SeaMonkey. In all that project-forming process 8 years ago, I took over a lot of the organizational roles, so that the coders in our group could focus at the actual code, and eventually was credited as "project coordinator" within the project management group we call the "SeaMonkey Council".

When I founded my own business 7 years ago, in January of 2006, I was earning money in surprising ways, and trying to lead the SeaMonkey project into the future. We were just about to release SeaMonkey 1.0 and convince the first round of naysayers that we actually could have the suite running as a community project. In the next years, we did quite some interesting and good work on that software, and a lot of people were finally realizing that "we made it" when we could release a 2.0 version that was based on the same "new" toolkit that Firefox and Thunderbird were built upon, removing a lot of old, cruft code and replacing it with newer stuff, including the now common-place add-ons system and automated updates among a ton of other things. I would end up doing a number of the major porting jobs from Firefox to SeaMonkey, including the places-based history and bookmarks systems, the download manager (including a UI that was similar to the earlier suite style), and the OpenSearch system. With the Data Manager, I even contributed a completely new and (IMHO) pretty innovative component into SeaMonkey. In those times, I think I did more coding work (in JS, mostly) than ever before, perhaps with the exception of the PHP-based CBSM community and content management platform I had done before that.

The longer I was in the SeaMonkey project, the more I realized, though, that the innovation I would like to have seen around the suite wasn't really happening - all the innovation to the suite came from porting Firefox and Thunderbird features and/or code, and that often with significant delay. Not sure if anything other than the Data Manager actually was a genuine SeaMonkey innovation, and I only came up with that when trying to finally get some innovation going, back in 2010. I was more and more unsatisfied with the lack of progress and innovation and the incredible push-back we got on the mailing list on every try to actually do something new. In October of 2010, I took a flight to Mountain View, California, to meet up with Mitchell Baker and talk about the future of SeaMonkey - and I also mentioned how I wanted to be more on the front of innovation even though I seem to not manage to get the SeaMonkey community there. Not sure if it came out of this or was in the back of her head before, in one of those conversations I had with her, she asked me if I would like to work for Mozilla and Firefox. I said that this caught me by surprise but we should definitely keep that conversation going. Just after that I met then-Mozilla-CEO John Lilly, and he asked if Mitchell had offered me a job - just to make sure. As you can imagine, that got me thinking a lot more about that, and gave me the freedom to think outside SeaMonkey for my future. I was at the liberty to think about my personal priorities in more depth, and it became clear that the winds of change were clearly blowing through my life.

After some conversations with people at Mozilla, I decided I wanted to try a job there, and Chris Hofmann proposed my working on tracking crashes and stability, so I started contracting for Mozilla on the CrashKill team in February 2011, first half-time, finally full-time. So, 2 years ago, I opened a completely new chapter in my personal web story. Tracking crash statistics for our products - Firefox desktop, Firefox from Android, and now Firefox OS - and working with our employees and community to improve stability has turned out to be a more interesting job than I expected when I started. Knowing that my work actually helps thousands or even millions of people, who have a more stable Firefox because of what I do, is a quite high award. And I'm growing into a more managerial role, which is something I really appreciate. And I'm connected to all kinds of innovation going on at Mozilla: A lot of the new features landing (like new JIT compilers for JavaScript, WebRTC, etc.) need stability testing and we're tracking the crash reports from those, Firefox for Android needed a lot of stability work to become the best mobile browser out there - and with Firefox OS, I was even involved in how the crash reporting features and user experience flow were implemented. I'm also involved in a lot of strategic meetings on what we release and when - an interesting experience by itself.

Where this all will lead me in the future? No idea. I'm interested in moving to the USA and working there at least for some time - not just because it would make my day cycle sane and having most or all my meetings within the confines of the actual work days in the region I'm living in, but also because I learned to like a lot that country has to offer, from Country Music to Football and many other things (not to mention Louisiana-style Cajun cuisine). I'm also interested in working from an office with other Mozillians for a change, and in possibly becoming even more of a manager. Of course, I'd like to help moving the Mozilla mission forward where I can, openness, innovation and opportunities on the web are something I stand behind nowadays more than ever - and Firefox OS as well as associated technologies promise to really make a huge impact on the web of the future. I'm looking forward to quite exciting times! :)

By KaiRo, at 00:13 | Tags: CrashKill, Firefox, future, history, L10n, Mozilla, SeaMonkey | 6 comments | TrackBack: 0

July 6th, 2012

Life Cycle of a Firefox Crash - Video and Graph

You may have seen my blog post about The Life Cycle of a Crash some time ago.

On June 14, I talked about this at a Mozilla Brown Bag, and there is a video of this presentation up on Air Mozilla.



The slides for that talk are available as well.

And here's the graph on the Firefox Crash Life Cycle I used at the end of that talk:

Image No. 22660

By KaiRo, at 16:09 | Tags: CrashKill, Mozilla, Socorro, video | 3 comments | TrackBack: 0

June 12th, 2012

Stability Brown Bags This Week

The Stability Work Week is ongoing at Mozilla Mountain View, and we have a couple of Brown Bag presentations this week. Both presentations will be streamed live on Air Mozilla and recorded for posterity.

Tuesday, June 12 at 12 noon Pacific
Location: Ten Forward, Mt. View
Mozilla CrashKill Investigation and Analysis
Presenter: Marcia Knous

Description: Stability is an important part of the Mozilla mission. Mozilla uses the Socorro tool to catch, process, and
present crash data for Desktop and Mobile Firefox, as well as other Mozilla products. The CrashKill team then takes that
data, analyzes it, detects trends, and drives the crash bugs to resolution. This presentation will give a general overview
of the crash analysis and investigation process focusing on workflow, as well as some fascinating insight into how the team
uses the Socorro tool to help keep the browser stable.

Thursday, June 14 at 2 PM Pacific
Location: Ten Forward, Mt. View
The Life Cycle of a Firefox Crash
Presenter: Robert Kaiser

Description: Unfortunately, even Firefox crashes sometimes. Everyone knows that, most of us have encountered such a case.
What many people don't know is that Mozilla cares about that and actually acts on crash reports. Take a look behind the
scenes of how that works in this presentation and find out what we are doing every day to improve stability of Firefox
(and other Mozilla products).


Join us live in Mountain View or on Air Mozilla, or watch the recorded sessions afterwards!

By KaiRo, at 19:32 | Tags: CrashKill, Mozilla | 1 comment | TrackBack: 0

May 3rd, 2012

The Life Cycle of a Crash

Someone from outside Mozilla who wants to deploy our crash reporting and statistics system asked me if I can "explain the life cycle of a crash". I found that this might be interesting for more people in- and outside of Mozilla, so here it is. ;-)

Image No. 22615

From the user point of view as well as the existence of a crash report, everything starts with the crash reporter dialog. Of course, a number of things have happened before, but as it's (supposed to be) a "life" cycle, so this point is as good as any other to start with. :)

Now a "crash" is nothing else other than an unsuspected exit of a running process - usually because execution ran into a state that was not intended and the code does not "know" how to continue executing. When this happens to a Firefox process, the routines for that unplanned exit call up the Breakpad code that fetches data like the "stack" of running function calls, the loaded libraries/modules, activated add-ons and some metadata from the crashed Firefox process, building up a crash report. In addition to that, the dialog prompts the user if the reports should be sent to Mozilla, and if so, for a comment, and email, and inclusion of the address for the currently active browser tab. On clicking any of the "Restart" or "Quit" buttons, the report is being sent (unless the checkbox for sending is deactivated).
In case the crashing process is a plugin container (the separate process Firefox runs for executing plugins), what happens is almost the same, just that the dialog window is replaced by a prompt for sending the report, which is being displayed instead of the plugin right on the web page. This prompt has no possibilities for specifying comments, email, or website addresses, so plugin reports miss those. Also, in case such a plugin container does not react to messages from the host Firefox process for a certain time (currently 45 seconds), that host process kills the plugin process and reports are being sent from the state of both the plugin and browser process, marked as "hangs" with special IDs to match them together on the server side.

Inside a running Firefox browser, the about:crashes page can be used to look up reports that have been generated by this installation. This page contains the report ID and a submission or creation date. If the ID ends up with six digits representing the submission date (in YYMMDD format), then it has been received by the servers correctly and the ID is linked to the server-side page for the report. If the ID ends in a random hex code, it was not received correctly (e.g. because of connection problems) and clicking the link triggers re-submitting it.

The server side of this system we're using at Mozilla is called Socorro. When a crash report is being submitted, it's first being received by a collector, which stores a raw dump of it and signals back to the client - this is where the date at the end of report IDs comes from (replacing the random hex sequence created by the client). Then, reports are being processed - though Socorro "throttles" this processing in certain cases to save disk space and analyzing CPU power. For release versions of Firefox desktop, usually only 10% of the incoming reports are processed, but for development/beta versions or other products, all reports are beings processed, as well as any reports including comments. The processor makes the data of the report more accessible to analysis, its main job is to make the crash stack human-readable and derive a "signature" for the crash. The original stacks in the raw reports are only combinations of names of libraries and addresses inside those, the processor connects that information via symbol files to function names as well as source file names and line numbers. Based on information of what function names are in the exit path after the crash has already been caused, the stack can be cleaned up from those unneeded frames, and based on knowledge of some less interesting function names, the top frame/function and/or some of its callers form the signature. After this treatment, the report is ready for being analyzed.

Every morning, Socorro runs jobs to create reports from the processed crash data of the last (UTC) day. For example, it counts how often a signature has been seen on that day for a certain Firefox version and saves that data to be displayed in the "topcrash" reports (internally called TCBS for "top crashes by signature", probably our most important report format).
To know what versions to produce those aggregations for, Socorro looks at Mozilla's FTP server for newly appearing builds a few times a day and automatically adds them to its list of known versions.
Those daily jobs also incorporate ADI data from the metrics team ("active daily installations", often also called ADU for "active daily users" although it's actually the number of requests to the addon blocklist that every profile should make once a day when being used), so that we can calculate "crash rates", which are displayed even at the crash-stats.mozilla.com front page in a graph to identify how well different versions are doing overall.
In addition, a number of important fields of all the processed reports of that last day for mobile and desktop Firefox are put into machine-readable files (*-pub-crashdata.csv.gz in the date-based sudirectories of the crash_analysis machine) so that some people can run their own reports on them, either for their own analysis or for prototyping functionality that later should make it into Socorro itself. One outcome of that data is also the "Are We Stable Yet?" dashboard I created for a quick Firefox stability overview.

Image No. 22617

Now that our crash report has made it far enough that it's available in reports, the CrashKill team (which I'm working in) and volunteers helping that effort come into play. We're taking a look at those reports (in Socorro as well as chofmann's and my own places for prototyping new ones) every day to make sure we catch all series of crashes that need someone to look into. As part of our mission is to ensure and improve Firefox stability, we make sure there are bug reports filed in Bugzilla for all important crash issues, work with the QA team to try to get the crashes reproduced where needed to get more data for developers to look into, try to get developers assigned to work on the problems, work with release management to get proper tracking and prioritization of highly-visible crashes as well as fixes landed in stabilization phases/branches, and coordinate with others as needed, e.g. for relations to external partners. On the latter, a number of the crashes we catch are 3rd-party issues that can't be fixed in our code, in those cases we try to work with the respective code vendor (mostly add-on creators, plugin and security suite vendors) to come to a solution, and many of them are quite responsive and fix issues on their side. In extreme cases, where Firefox becomes (mostly) unusable in combination with the 3rd-party software, we can make use of add-on or DLL blocklists that prohibit loading of that software. In cases of crashes with 3D or hardware acceleration features of graphics card drivers, we can block usage of those features with those drivers and fall back to non-accelerated and/or 2D display (except for WebGL, which needs to get deactivated in those cases). Of course, we try other avenues first before deploying one of those blocklists, and are way happier, just like our users, when we can actually find a fix in our code.

Developers sometimes take a look at Socorro themselves to find items to work on, and often start looking into a crash when a bug is being filed for which they might get bugmail - otherwise, CrashKill tries to get the right people involved and looking into the bugs. They look at the crash stack as listed in our crash data (see an example, those are the pages as linked in about:crashes as mentioned before) as well as other data submitted with the crash, and try to figure out how to avoid the problem in our code. When they find a fix, it goes through the usual review process and gets and "checked in" to the code, just like other code changes. At that point, it might make sense to apply the same fix to the Aurora and Beta channels, which are the Firefox versions that are in a testing and stabilization phase - CrashKill and release management will track those and get the needed approvals in place where warranted, so that the fix can go out to more of our users faster than most other code developments and releases become more stable.

Image No. 22616

From this code, binary builds are generated regularly - every day for the on-the-edge-of-development Nightly channel as well as Aurora, roughly every week for Beta, and every six weeks for releases (unless grave security or stability issues arise, where we'll do an updated release in between with just the fixes for those specific issues). In the building process that creates those binaries, symbol files are created next to the actual builds that are to be delivered to users and testers. Those symbol files are sent to our symbol server, where developers have access to them for debugging - and from where the Socorro processors fetch them to translate addresses in binary files into function names and source lines, as mentioned above. The builds themselves are what we deliver to the user, who now hopefully won't see this particular crash happening again, so its "life" is concluded and everybody is hopefully happier.

Though, in some unfortunate cases, a user might hit a different unexpected exit of the Firefox process, Breakpad triggers, showing the Crash Reporter dialog - and the life cycle of another crash begins...

By KaiRo, at 20:33 | Tags: CrashKill, Mozilla, Socorro | no comments | TrackBack: 1

January 4th, 2012

The Winds Of Change

Quote:
The winds of change continue blowing
And they just carry me away. -- Albert Hammond

Like many others, I've been thinking quite a bit these days about what went on last year and what will or might come up in 2012. (And I figure I should bring in a bit more from my overall personality into my future blog posts and mention or quote songs I have in my mind on a particular topic, so I'll start with that here).

One topic that has been with me throughout the year and will probably also continue to be with me is change. A lot of it started with my visit to Mozilla headquarters in Mountain View, CA, in October 2010, actually - I posted about my changing personal priorities back then. And I still remember driving my rental car up to Lake Tahoe, thinking about all those things and listening to the then-just-released Zac Brown Band album "You Get What You Give" and in particular the song "Let It Go", whose lyrics gave me the right mindset for what was I was going through and what 2011 would bring: "Save your strength for things that you can change, forget the ones you can't, you gotta let it go."

Following that, I started 2011 by transferring the vast majority of my responsibilities in SeaMonkey over to other people (we have built up a great team there over the last years, including awesome people like Callek, InvisibleSmiley, etc. - kudos to them to be able to take all that over in their free time) and get the ball rolling on making the project even more sustainable in the future (I hope we'll have news for you on that soon).

Instead, I followed another piece of advice from this song - "When the pony he comes ridin' by, you better sit your sweet ass on it" - and started contracting for Mozilla on the CrashKill team in February, first half-time, finally full-time. With that, my focus changed from SeaMonkey to Firefox and from project management to crash analysis.

For one thing, I ended up growing into that role better than I imagined at first, finding crash analysis more interesting than expected, for the other, this change ended up having more influence on my life than I had imagined. With the need to communicate a lot with different people in this job, from the CrashKill team via the Socorro team that works on the crash-stats server and which I'm coordinating with to various devs, engineering managers or release managers as the need arises in crash analysis.
Unfortunately with me being a "remotie" all communication needs to be online (or via phone) and is stripped down to the essentials needed for the job. Being a very social person, I'm missing the additional nuances that face-to-face communication would bring to the table, and more need for communication as part of the job makes that more obvious to me.
Then, the whole CrashKill team is based in Mountain View, the vast majority of the Socorro team spread across the US, and most engineering or release managers also based in Northern America, so most of that communication as well as all my meetings is happening during US working hours, which from my point of view in Europe is in the evening to night hours, which requires my work time to be mostly at the end of the day. I have been doing work at late hours in the years before, but there was not as much requirement of that before, while now I have to make at least the meetings, and should be available for more conversation on IRC at those times. Making evening appointments becomes quite difficult in that light.
And speaking of requirements, while I could basically completely make my own schedule before, I now should bring in 8 hours of work per day, and with doing that at the end of every day, I need to make all shopping and other private stuff in the afternoon, leaving me all day with "I still have a full work day to deliver today" in mind - until I achieve that and fall into bed. This causes its own share of subconscious stress.
And I'm doing all the work from my own private apartment, not getting out unless I go shopping or take my usual Monday and Tuesday evening off for some Karaoke.
So, I learned that working from home and remotely has its downsides, esp. for the kind of job I'm in there. This is one area I need to work on a lot in 2012 and find solutions that will be connected with another share of change I'm sure.

But not only my role and work life have changed - Mozilla went in a direction I had often spoken for and has changed to a rapid release cycle and started planning for that shortly after I started contracting. I commented in the planning phase and tried to help shape this process and always was convinced it was a good idea, even though we hit more road bumps than expected. I was heavily involved in coordinating to get crash-stats support rapid releases usefully and also laid out publicly how the new process can improve stability.
Mozilla also has revamped its mobile efforts completely - both with a completely new "native UI" version of Firefox for Android, which is in Aurora testing now and with a completely open mobile stack in the form of Boot To Gecko (B2G), a complete "operating system" based on the browser and open web standards (requiring new WebAPIs), which is also coming together piece by piece now.
And next to those changes, we're also working on changing how identity and logins work on the web and changing the current "silo"ed app store model by bringing open concepts for web apps and markets into the fold that easily allow decentralization and users really "owning" their apps.
In the middle of all that, Mozilla has restructured a bit, brought some previously split-off groups back into the common Mozilla fold, hired a lot of new people, lost (as employees but not as community members) a few high-profile ones who were looking for new challenges, worked on the MPL 2.0, founded exciting new initiatives like WebFWD and went stronger on marketing that we are a non-profit - clearly a lot of change happening everywhere, with the mission and the Manifesto standing unchanged and as clear as ever over all of it, though.

All this makes it clear that a lot of change has come in 2011, both to me and Mozilla, and that it's still only the seed for what's to come in the year(s) ahead. The winds of change are still blowing, and I'm excited for what they propel and which interesting experiences they drag in for all of us.

Quote:
The future's in the air
I can feel it everywhere
Blowing with the wind
Of change. -- Klaus Meine / The Scorpions

By KaiRo, at 21:26 | Tags: CrashKill, Firefox, future, history, Mozilla, SeaMonkey | 1 comment | TrackBack: 0

Feeds: RSS/Atom