The roads I take...
KaiRo's weBlog
May 3, 2012
The Life Cycle of a Crash
Someone from outside Mozilla who wants to deploy our crash reporting and statistics system asked me if I could "explain the life cycle of a crash". I figured this might be interesting for more people inside and outside of Mozilla, so here it is.
From the user's point of view, as well as for the existence of a crash report, everything starts with the crash reporter dialog. Of course, a number of things have happened before, but as it's (supposed to be) a "life" cycle, this point is as good as any other to start with.
Now a "crash" is nothing other than an unexpected exit of a running process - usually because execution ran into a state that was not intended and the code does not "know" how to continue executing. When this happens to a Firefox process, the routines for that unplanned exit call up the Breakpad code, which fetches data like the "stack" of running function calls, the loaded libraries/modules, activated add-ons and some metadata from the crashed Firefox process, building up a crash report. In addition to that, the dialog asks the user whether the report should be sent to Mozilla and, if so, for a comment, an email address, and whether to include the address of the currently active browser tab. On clicking either the "Restart" or the "Quit" button, the report is sent (unless the checkbox for sending is deactivated).
If the crashing process is a plugin container (the separate process Firefox runs for executing plugins), what happens is almost the same, except that the dialog window is replaced by a prompt for sending the report, which is displayed in place of the plugin right on the web page. This prompt offers no way to specify comments, email, or website addresses, so plugin reports lack those. Also, if such a plugin container does not react to messages from the host Firefox process for a certain time (currently 45 seconds), the host process kills the plugin process and reports are sent from the state of both the plugin and the browser process, marked as "hangs" with special IDs to match them together on the server side.
Inside a running Firefox browser, the about:crashes page can be used to look up reports that have been generated by this installation. This page lists the report ID and a submission or creation date. If the ID ends with six digits representing the submission date (in YYMMDD format), then it has been received by the servers correctly and the ID is linked to the server-side page for the report. If the ID ends in a random hex code, it was not received correctly (e.g. because of connection problems) and clicking the link triggers re-submitting it.
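To make the ID rule above concrete, here is a minimal sketch in Python of how one could tell the two kinds of IDs apart. The function name and the example IDs are made up for illustration, and this is not the code Firefox itself uses. Note that it is only a heuristic: a random hex suffix could, by chance, consist of six digits that form a valid date.

```python
import re
from datetime import datetime


def classify_report_id(report_id: str) -> str:
    """Guess whether a report ID was accepted by the server.

    Returns "submitted" if the last six characters are digits that form a
    valid YYMMDD date, and "pending" otherwise (random hex suffix).
    """
    suffix = report_id[-6:]
    if re.fullmatch(r"[0-9]{6}", suffix):
        try:
            # %y%m%d parses a two-digit year, month and day, e.g. 120503.
            datetime.strptime(suffix, "%y%m%d")
            return "submitted"
        except ValueError:
            # Six digits, but not a real calendar date (e.g. month 99).
            pass
    return "pending"


# Hypothetical IDs, invented for this example only.
print(classify_report_id("0a1b2c3d-1111-2222-3333-444455120503"))
print(classify_report_id("0a1b2c3d-1111-2222-3333-4444559fe3ab"))
```

The first ID ends in a date (May 3, 2012) and would be treated as received by the server, the second still carries a hex suffix and would be a candidate for re-submission.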
The server side of this system we're using at Mozilla is called Socorro. When a crash report is submitted, it's first received by a collector, which stores a raw dump of it and signals back to the client - this is where the date at the end of report IDs comes from (replacing the random hex sequence created by the client). Then, reports are processed - though Socorro "throttles" this processing in certain cases to save disk space and CPU power for analysis. For release versions of Firefox desktop, usually only 10% of the incoming reports are processed, but for development/beta versions or other products, all reports are processed, as are any reports that include comments. The processor makes the data of the report more accessible to analysis; its main job is to make the crash stack human-readable and derive a "signature" for the crash. The original stacks in the raw reports are only combinations of library names and addresses inside those libraries; the processor uses symbol files to connect that information to function names as well as source file names and line numbers. Based on knowledge of which function names are in the exit path after the crash has already been caused, those unneeded frames can be removed from the stack, and based on knowledge of some less interesting function names, the top frame/function and/or some of its callers form the signature. After this treatment, the report is ready to be analyzed.
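The throttling rule and the signature derivation described above could be sketched roughly like this in Python. This is an illustration under assumptions, not Socorro's actual code: the function names, the two frame lists and the " | " separator are chosen for the example, and the random draw is passed in as `roll` so the behavior is easy to check.

```python
from typing import List

RELEASE_SAMPLE_RATE = 0.10  # "usually only 10%" for desktop release versions

# Hypothetical lists: frames on the exit path after the crash has already
# happened, and "less interesting" functions that say little on their own.
EXIT_PATH_FRAMES = {"hypothetical_abort_handler", "hypothetical_raise"}
BORING_FRAMES = {"memcpy", "malloc"}


def should_process(product: str, channel: str, has_comment: bool, roll: float) -> bool:
    """Decide whether a report gets processed; `roll` is a random number in [0, 1)."""
    if has_comment:
        return True  # reports with comments are always processed
    if product == "Firefox" and channel == "release":
        return roll < RELEASE_SAMPLE_RATE
    return True  # development/beta versions and other products


def make_signature(frames: List[str]) -> str:
    """Build a signature from function names, topmost stack frame first."""
    start = 0
    while start < len(frames) and frames[start] in EXIT_PATH_FRAMES:
        start += 1  # drop frames that only belong to the exit path
    parts = []
    for name in frames[start:]:
        parts.append(name)
        if name not in BORING_FRAMES:
            break  # stop at the first frame that is interesting by itself
    return " | ".join(parts)


stack = ["hypothetical_abort_handler", "memcpy", "nsFoo::Copy", "main"]
print(make_signature(stack))  # memcpy | nsFoo::Copy
print(should_process("Firefox", "release", False, roll=0.5))  # False
```

In real use, `roll` would come from something like `random.random()`. Keeping the caller of a "boring" frame in the signature is what separates crashes that all end up in the same low-level function but arrive there from different code.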
Every morning, Socorro runs jobs to create reports from the processed crash data of the last (UTC) day. For example, it counts how often a signature has been seen on that day for a certain Firefox version and saves that data to be displayed in the "topcrash" reports (internally called TCBS for "top crashes by signature", probably our most important report format).
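As a rough idea of what such a daily aggregation boils down to, here is a small Python sketch. The record layout (dictionaries with "signature", "version" and "date" keys) and the sample data are assumptions made for this example, the real jobs work on Socorro's own data store.

```python
from collections import Counter
from typing import Dict, List, Tuple


def top_crashes(reports: List[Dict[str, str]], version: str, day: str,
                limit: int = 10) -> List[Tuple[str, int]]:
    """Count signatures for one version on one (UTC) day, most frequent first."""
    counts = Counter(
        r["signature"]
        for r in reports
        if r["version"] == version and r["date"] == day
    )
    return counts.most_common(limit)


# Invented sample data, only to show the shape of the result.
sample = [
    {"signature": "nsFoo::Bar", "version": "12.0", "date": "2012-05-02"},
    {"signature": "nsFoo::Bar", "version": "12.0", "date": "2012-05-02"},
    {"signature": "memcpy | nsFoo::Copy", "version": "12.0", "date": "2012-05-02"},
    {"signature": "nsFoo::Bar", "version": "13.0", "date": "2012-05-02"},
]
print(top_crashes(sample, "12.0", "2012-05-02"))
# [('nsFoo::Bar', 2), ('memcpy | nsFoo::Copy', 1)]
```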
To know what versions to produce those aggregations for, Socorro looks at Mozilla's FTP server for newly appearing builds a few times a day and automatically adds them to its list of known versions.
Those daily jobs also incorporate ADI data from the metrics team ("active daily installations", often also called ADU for "active daily users", although it's actually the number of requests to the add-on blocklist that every profile should make once a day when being used), so that we can calculate "crash rates", which are even displayed on the crash-stats.mozilla.com front page in a graph that shows how well different versions are doing overall.
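A crash rate in this sense is just the crash count put in relation to the ADI count. As a minimal sketch in Python, assuming the rate is expressed as crashes per 100 ADI (the scaling factor and the numbers are assumptions for this example, and any correction for throttled reports is left out):

```python
def crash_rate(crashes: int, adi: int, per: int = 100) -> float:
    """Crashes per `per` active daily installations."""
    if adi <= 0:
        raise ValueError("ADI must be positive")
    return crashes * per / adi


# Invented numbers: 1,500 crashes on a day with 100,000 ADI.
print(crash_rate(1500, 100000))  # 1.5
```

Normalizing by ADI is what makes versions comparable: a version with many users will always collect more crash reports than one with few users, even if it is just as stable.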
In addition, a number of important fields of all the processed reports of that last day for mobile and desktop Firefox are put into machine-readable files (*-pub-crashdata.csv.gz in the date-based subdirectories of the crash_analysis machine) so that some people can run their own reports on them, either for their own analysis or for prototyping functionality that should later make it into Socorro itself. One outcome of that data is also the "Are We Stable Yet?" dashboard I created for a quick Firefox stability overview.
Now that our crash report has made it far enough that it's available in reports, the CrashKill team (which I'm working in) and volunteers helping that effort come into play. We're taking a look at those reports (in Socorro as well as chofmann's and my own places for prototyping new ones) every day to make sure we catch all series of crashes that need someone to look into them. As our mission is to ensure and improve Firefox stability, we make sure there are bug reports filed in Bugzilla for all important crash issues, work with the QA team to try to get the crashes reproduced where needed to get more data for developers to look into, try to get developers assigned to work on the problems, work with release management to get proper tracking and prioritization of highly visible crashes as well as fixes landed in stabilization phases/branches, and coordinate with others as needed, e.g. for relations to external partners. On the latter, a number of the crashes we catch are 3rd-party issues that can't be fixed in our code. In those cases we try to work with the respective code vendor (mostly add-on creators, plugin and security suite vendors) to come to a solution, and many of them are quite responsive and fix issues on their side. In extreme cases, where Firefox becomes (mostly) unusable in combination with the 3rd-party software, we can make use of add-on or DLL blocklists that prohibit loading of that software. In cases of crashes with 3D or hardware acceleration features of graphics card drivers, we can block usage of those features with those drivers and fall back to non-accelerated and/or 2D display (except for WebGL, which needs to get deactivated in those cases). Of course, we try other avenues first before deploying one of those blocklists, and are way happier, just like our users, when we can actually find a fix in our code.
Developers sometimes take a look at Socorro themselves to find items to work on, and often start looking into a crash when a bug is filed for which they might get bugmail - otherwise, CrashKill tries to get the right people involved and looking into the bugs. They look at the crash stack as listed in our crash data (see an example, those are the pages linked in about:crashes as mentioned before) as well as other data submitted with the crash, and try to figure out how to avoid the problem in our code. When they find a fix, it goes through the usual review process and gets "checked in" to the code, just like other code changes. At that point, it might make sense to apply the same fix to the Aurora and Beta channels, which are the Firefox versions that are in a testing and stabilization phase - CrashKill and release management will track those and get the needed approvals in place where warranted, so that the fix can go out to more of our users faster than most other code developments and releases become more stable.
From this code, binary builds are generated regularly - every day for the on-the-edge-of-development Nightly channel as well as Aurora, roughly every week for Beta, and every six weeks for releases (unless grave security or stability issues arise, where we'll do an updated release in between with just the fixes for those specific issues). In the building process that creates those binaries, symbol files are created next to the actual builds that are to be delivered to users and testers. Those symbol files are sent to our symbol server, where developers have access to them for debugging - and from where the Socorro processors fetch them to translate addresses in binary files into function names and source lines, as mentioned above. The builds themselves are what we deliver to the user, who now hopefully won't see this particular crash happening again, so its "life" is concluded and everybody is hopefully happier.
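The translation from addresses to function names that the processors do with those symbol files can be pictured as a lookup in a sorted table of function start addresses. The following Python sketch shows only that idea: the class name, the table layout and the sample entries are assumptions for this example, and real Breakpad symbol files carry more detail (function sizes, source files and line records).

```python
import bisect
from typing import List, Optional, Tuple


class SymbolTable:
    """Maps an offset inside one module to the function that contains it."""

    def __init__(self, entries: List[Tuple[int, str]]) -> None:
        # entries are (start_offset, function_name) pairs for one module
        self._entries = sorted(entries)
        self._starts = [start for start, _ in self._entries]

    def lookup(self, offset: int) -> Optional[str]:
        # Find the last function that starts at or before the offset.
        index = bisect.bisect_right(self._starts, offset) - 1
        if index < 0:
            return None  # offset lies before the first known function
        return self._entries[index][1]


# Invented entries for a hypothetical module.
table = SymbolTable([
    (0x1000, "main"),
    (0x1200, "nsFoo::Bar"),
    (0x1800, "nsFoo::Baz"),
])
print(table.lookup(0x1234))  # nsFoo::Bar
```

Because this sketch has no function sizes, an offset past the end of the last function would still be attributed to it, a real implementation would check the size as well.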
Though, in some unfortunate cases, a user might hit a different unexpected exit of the Firefox process: Breakpad triggers and shows the Crash Reporter dialog - and the life cycle of another crash begins...
By KaiRo, at 20:33 | Tags: CrashKill, Mozilla, Socorro