The roads I take...
KaiRo's weBlog
October 13th, 2015
Shortening Crash Signatures: Dropping Argument Lists
We already completed a first step of this effort in June: After we found that templates in signatures were often fluctuating wildly in crashes that belonged to the same bug, all <sometemplate> parts of crash signatures were replaced by just <T>.
That made a signature like this (from bug 1045509, the [@ …] are our customary delimiters for signatures, not really part of the signature itself though):
[@ nsTArray_base<nsTArrayFallibleAllocator, nsTArray_CopyWithMemutils>::UsesAutoArrayBuffer() | nsTArray_Impl<unsigned char, nsTArrayFallibleAllocator>::SizeOfExcludingThis(unsigned int (*)(void const*)) ]
be shortened to:
[@ nsTArray_base<T>::UsesAutoArrayBuffer() | nsTArray_Impl<T>::SizeOfExcludingThis(unsigned int (*)(void const*)) ]
Which is definitely somewhat better to read and put in tables like topcrash reports, etc. - and we found it did not munge bugs together into the same signature more than previously, at least to our knowledge.
But we found out we can go even further: Different argument lists of functions (mostly due to overloading) did as far as I remember not help us distinguish any bugs in the >4 years I have been working with crashes - but patches changing types of arguments or adding one to a function often made us lose the connection between a bug and the signature. Therefore, we are removing argument lists from the signatures.
The signature listed above will turn out as:
[@ nsTArray_base<T>::UsesAutoArrayBuffer | nsTArray_Impl<T>::SizeOfExcludingThis ]
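The two shortening steps can be sketched in a few lines of Python. This is only an illustration and not the code Socorro runs: the function names are made up, and the sketch assumes that every "<" and "(" in a signature opens a template or argument list, so it would mangle things like operator< or "(anonymous namespace)".

```python
def collapse_templates(signature: str) -> str:
    """Replace every outermost <...> group with <T> (assumes balanced brackets)."""
    out = []
    depth = 0
    for ch in signature:
        if ch == "<":
            if depth == 0:
                out.append("<T>")
            depth += 1
        elif ch == ">" and depth > 0:
            depth -= 1
        elif depth == 0:
            out.append(ch)
    return "".join(out)


def drop_argument_lists(signature: str) -> str:
    """Remove every outermost (...) group, including nested parentheses."""
    out = []
    depth = 0
    for ch in signature:
        if ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
        elif depth == 0:
            out.append(ch)
    return "".join(out)


def shorten(signature: str) -> str:
    """Apply both steps: templates first, then argument lists."""
    return drop_argument_lists(collapse_templates(signature))


full = (
    "nsTArray_base<nsTArrayFallibleAllocator, nsTArray_CopyWithMemutils>"
    "::UsesAutoArrayBuffer() | "
    "nsTArray_Impl<unsigned char, nsTArrayFallibleAllocator>"
    "::SizeOfExcludingThis(unsigned int (*)(void const*))"
)
print(shorten(full))
```

Run on the signature from bug 1045509, this prints the same shortened form as shown above, without the [@ …] delimiters.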
Today, we have run a script on Bugzilla (see bug 1178094) to update all affected bugs to add the new shortened signature to the Crash Signatures field without sending a ton of bugmail.
We have tested over the last weeks that Socorro crash-stats can create the new shortened signatures fine on their staging setup and that generation of the special "shutdownhang | …" signatures for browser processes that took more than 60s to shut down and "OOM | …" for out-of-memory crashes still works in all cases where it worked before.
As all preparation has been done, we will flip the switch on production Socorro crash-stats in the next days, and then those shortened signatures will be created everywhere.
Note that this will impede some stats that are comparing signatures across days, even though we will see to reprocess some crashes to make the watershed be at a UTC day delimiter so that as few stats as possible are disturbed by the change.
Please let me know of any issues with those changes (as well as any other questions about or issues with crash analysis), and thanks to Lars, Byron (glob) and others who helped with those changes!
By KaiRo, at 20:14 | Tags: Bugzilla, CrashKill, Mozilla, Socorro | 2 comments | TrackBack: 0
July 6th, 2012
Life Cycle of a Firefox Crash - Video and Graph
On June 14, I talked about this at a Mozilla Brown Bag, and there is a video of this presentation up on Air Mozilla.
The slides for that talk are available as well.
And here's the graph on the Firefox Crash Life Cycle I used at the end of that talk:
By KaiRo, at 16:09 | Tags: CrashKill, Mozilla, Socorro, video | 3 comments | TrackBack: 0
May 3rd, 2012
The Life Cycle of a Crash
From the user's point of view, as well as for the existence of a crash report, everything starts with the crash reporter dialog. Of course, a number of things have happened before, but as it's (supposed to be) a "life" cycle, this point is as good as any other to start with.
Now a "crash" is nothing other than an unexpected exit of a running process - usually because execution ran into a state that was not intended and the code does not "know" how to continue executing. When this happens to a Firefox process, the routines for that unplanned exit call up the Breakpad code that fetches data like the "stack" of running function calls, the loaded libraries/modules, activated add-ons and some metadata from the crashed Firefox process, building up a crash report. In addition to that, the dialog asks the user whether the report should be sent to Mozilla, and if so, for a comment, an email address, and inclusion of the address for the currently active browser tab. On clicking either the "Restart" or the "Quit" button, the report is sent (unless the checkbox for sending is deactivated).
In case the crashing process is a plugin container (the separate process Firefox runs for executing plugins), what happens is almost the same, just that the dialog window is replaced by a prompt for sending the report, which is being displayed instead of the plugin right on the web page. This prompt has no possibilities for specifying comments, email, or website addresses, so plugin reports miss those. Also, in case such a plugin container does not react to messages from the host Firefox process for a certain time (currently 45 seconds), that host process kills the plugin process and reports are being sent from the state of both the plugin and browser process, marked as "hangs" with special IDs to match them together on the server side.
Inside a running Firefox browser, the about:crashes page can be used to look up reports that have been generated by this installation. This page contains the report ID and a submission or creation date. If the ID ends with six digits representing the submission date (in YYMMDD format), then it has been received by the servers correctly and the ID is linked to the server-side page for the report. If the ID ends in a random hex code, it was not received correctly (e.g. because of connection problems) and clicking the link triggers re-submitting it.
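As a rough illustration of that ID convention, here is a small Python sketch that reads the trailing six characters of a report ID as a YYMMDD date. The function name and the sample IDs are invented for this example. Note that a random hex suffix can consist of digits only by chance, so this check is a heuristic and not proof that the server received the report.

```python
from datetime import date, datetime
from typing import Optional


def submission_date_from_id(report_id: str) -> Optional[date]:
    """Return the date encoded in the last six characters, or None."""
    suffix = report_id[-6:]
    if len(suffix) != 6 or not (suffix.isascii() and suffix.isdigit()):
        return None  # looks like the client-side random hex ending
    try:
        return datetime.strptime(suffix, "%y%m%d").date()
    except ValueError:
        return None  # six digits, but not a valid calendar date


# Made-up IDs, only to show both cases:
print(submission_date_from_id("6a3f0c2e-1111-4222-8333-abcdef120503"))
print(submission_date_from_id("6a3f0c2e-1111-4222-8333-abcdef9e4b7c"))
```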
The server side of this system we're using at Mozilla is called Socorro. When a crash report is submitted, it's first received by a collector, which stores a raw dump of it and signals back to the client - this is where the date at the end of report IDs comes from (replacing the random hex sequence created by the client). Then, reports are processed - though Socorro "throttles" this processing in certain cases to save disk space and analyzing CPU power. For release versions of Firefox desktop, usually only 10% of the incoming reports are processed, but for development/beta versions or other products, all reports are being processed, as well as any reports including comments. The processor makes the data of the report more accessible to analysis; its main job is to make the crash stack human-readable and derive a "signature" for the crash. The original stacks in the raw reports are only combinations of names of libraries and addresses inside those; the processor connects that information via symbol files to function names as well as source file names and line numbers. Based on information about which function names are in the exit path after the crash has already been caused, the stack can be cleaned of those unneeded frames, and based on knowledge of some less interesting function names, the top frame/function and/or some of its callers form the signature. After this treatment, the report is ready for being analyzed.
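The signature step can be pictured roughly like this in Python. It is a simplified sketch and not Socorro's real algorithm: the two name lists, their contents, and the function name are assumptions made for the example.

```python
# Hypothetical lists; a real system would maintain much longer ones.
EXIT_PATH_FRAMES = {"RaiseException", "CrashReporter::WriteMinidump"}
LESS_INTERESTING_FRAMES = {"memcpy", "malloc"}


def derive_signature(frames: list[str]) -> str:
    """Build a signature from symbolicated frames, top of the stack first."""
    parts: list[str] = []
    for function in frames:
        # Drop exit-path frames that sit above the frame that really crashed.
        if not parts and function in EXIT_PATH_FRAMES:
            continue
        parts.append(function)
        # A less interesting function says little by itself, so its caller
        # is added as well; the first interesting frame ends the signature.
        if function not in LESS_INTERESTING_FRAMES:
            break
    return " | ".join(parts)


print(derive_signature(["RaiseException", "memcpy", "Foo::Copy", "Bar::Run"]))
```

With the made-up stack above, the exit-path frame is dropped and the result joins the less interesting top frame with its caller.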
Every morning, Socorro runs jobs to create reports from the processed crash data of the last (UTC) day. For example, it counts how often a signature has been seen on that day for a certain Firefox version and saves that data to be displayed in the "topcrash" reports (internally called TCBS for "top crashes by signature", probably our most important report format).
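In spirit, such a daily aggregation is just counting per signature. A minimal Python sketch, assuming processed reports are available as dictionaries with "signature", "version" and "day" keys (those key names and the sample data are invented here, not Socorro's schema):

```python
from collections import Counter


def top_crashes(reports, version, day, limit=10):
    """Count signatures for one version on one (UTC) day, most frequent first."""
    counts = Counter(
        report["signature"]
        for report in reports
        if report["version"] == version and report["day"] == day
    )
    return counts.most_common(limit)


# Made-up sample data:
reports = [
    {"signature": "Foo::Copy", "version": "6.0", "day": "2011-08-16"},
    {"signature": "Bar::Run", "version": "6.0", "day": "2011-08-16"},
    {"signature": "Foo::Copy", "version": "6.0", "day": "2011-08-16"},
    {"signature": "Foo::Copy", "version": "5.0", "day": "2011-08-16"},
]
print(top_crashes(reports, "6.0", "2011-08-16"))
```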
To know what versions to produce those aggregations for, Socorro looks at Mozilla's FTP server for newly appearing builds a few times a day and automatically adds them to its list of known versions.
Those daily jobs also incorporate ADI data from the metrics team ("active daily installations", often also called ADU for "active daily users" although it's actually the number of requests to the addon blocklist that every profile should make once a day when being used), so that we can calculate "crash rates", which are displayed even at the crash-stats.mozilla.com front page in a graph to identify how well different versions are doing overall.
In addition, a number of important fields of all the processed reports of that last day for mobile and desktop Firefox are put into machine-readable files (*-pub-crashdata.csv.gz in the date-based subdirectories of the crash_analysis machine) so that some people can run their own reports on them, either for their own analysis or for prototyping functionality that later should make it into Socorro itself. One outcome of that data is also the "Are We Stable Yet?" dashboard I created for a quick Firefox stability overview.
Now that our crash report has made it far enough that it's available in reports, the CrashKill team (which I'm working in) and volunteers helping that effort come into play. We're taking a look at those reports (in Socorro as well as chofmann's and my own places for prototyping new ones) every day to make sure we catch all series of crashes that need someone to look into. As part of our mission is to ensure and improve Firefox stability, we make sure there are bug reports filed in Bugzilla for all important crash issues, work with the QA team to try to get the crashes reproduced where needed to get more data for developers to look into, try to get developers assigned to work on the problems, work with release management to get proper tracking and prioritization of highly-visible crashes as well as fixes landed in stabilization phases/branches, and coordinate with others as needed, e.g. for relations to external partners. On the latter, a number of the crashes we catch are 3rd-party issues that can't be fixed in our code, in those cases we try to work with the respective code vendor (mostly add-on creators, plugin and security suite vendors) to come to a solution, and many of them are quite responsive and fix issues on their side. In extreme cases, where Firefox becomes (mostly) unusable in combination with the 3rd-party software, we can make use of add-on or DLL blocklists that prohibit loading of that software. In cases of crashes with 3D or hardware acceleration features of graphics card drivers, we can block usage of those features with those drivers and fall back to non-accelerated and/or 2D display (except for WebGL, which needs to get deactivated in those cases). Of course, we try other avenues first before deploying one of those blocklists, and are way happier, just like our users, when we can actually find a fix in our code.
Developers sometimes take a look at Socorro themselves to find items to work on, and often start looking into a crash when a bug is being filed for which they might get bugmail - otherwise, CrashKill tries to get the right people involved and looking into the bugs. They look at the crash stack as listed in our crash data (see an example, those are the pages as linked in about:crashes as mentioned before) as well as other data submitted with the crash, and try to figure out how to avoid the problem in our code. When they find a fix, it goes through the usual review process and gets "checked in" to the code, just like other code changes. At that point, it might make sense to apply the same fix to the Aurora and Beta channels, which are the Firefox versions that are in a testing and stabilization phase - CrashKill and release management will track those and get the needed approvals in place where warranted, so that the fix can go out to more of our users faster than most other code developments and releases become more stable.
From this code, binary builds are generated regularly - every day for the on-the-edge-of-development Nightly channel as well as Aurora, roughly every week for Beta, and every six weeks for releases (unless grave security or stability issues arise, where we'll do an updated release in between with just the fixes for those specific issues). In the building process that creates those binaries, symbol files are created next to the actual builds that are to be delivered to users and testers. Those symbol files are sent to our symbol server, where developers have access to them for debugging - and from where the Socorro processors fetch them to translate addresses in binary files into function names and source lines, as mentioned above. The builds themselves are what we deliver to the user, who now hopefully won't see this particular crash happening again, so its "life" is concluded and everybody is hopefully happier.
Though, in some unfortunate cases, a user might hit a different unexpected exit of the Firefox process, Breakpad triggers, showing the Crash Reporter dialog - and the life cycle of another crash begins...
By KaiRo, at 20:33 | Tags: CrashKill, Mozilla, Socorro | no comments | TrackBack: 1
August 16th, 2011
Crash-stats Now Splits Data For Betas And Release
After working late hours last week, working on the weekend for a first deployment on Sunday, and doing a bugfixing all-nighter until this morning, this great group of people made sure that we have better-fitting crash analysis infrastructure in place for today's Firefox release than for the last one six weeks ago.
So, what has changed? Doesn't the crash-stats front page look the same as before? Not entirely. The devil is in the details. The old one was almost, but not quite unlike the tea we wanted to drink. The new one actually is brewed out of leaves and hot water, to stay with the analogy borrowed from Douglas Adams. In the updated version we're running now, you'll see that on the front page we replaced "6.0" with "6.0(beta)" at this moment and in the next days we won't have completely unusable crash rates for the release, like we had for 5.0 six weeks ago.
The reason is that betas and releases are now processed very differently. We now get graphs and reports for every single beta build we push out, and for the final release build separately on the beta channel and on "the release channel" - even though all of those report in with exactly the same version number. When you see or select graphs for "6.0b1" through "6.0b5", Socorro actually internally looks for a "6.0" version number, the "beta" release channel and the build ID of the matching beta build (for "6.0b5", that is the fifth build we created on the beta channel for 6.0).
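To illustrate the idea of mapping a build ID to a beta number, here is a tiny Python sketch. It assumes build IDs are timestamp-like strings of equal length that sort chronologically; the function name and the IDs below are made up, and the real lookup in Socorro happens in its database and may work differently.

```python
def beta_label(version: str, build_id: str, beta_build_ids: list[str]) -> str:
    """Name a beta-channel build by its position among that version's beta builds."""
    ordered = sorted(beta_build_ids)  # chronological for timestamp-like IDs
    return f"{version}b{ordered.index(build_id) + 1}"


# Invented build IDs for five betas of 6.0:
betas = [
    "20110705000001",
    "20110712000001",
    "20110719000001",
    "20110726000001",
    "20110802000001",
]
print(beta_label("6.0", "20110802000001", betas))
```

A build ID that is not in the list raises a ValueError, which seems a reasonable way to flag builds the system does not know about.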
When we generate the final release builds, we also push them to the beta channel, which is reported as "6.0(beta)" there, while "6.0" now only looks at other channels (mostly "release" but also things like the "default" channel used by e.g. Linux distro builds). As we process only 10% of all crashes in the latter category but 100% for the former, splitting those apart makes both have correct crash rates, being able to account for the difference with a factor (not being able to do that and mixing values for both caused unusable crash rate numbers in the last cycle).
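The correction factor works out as follows in a small Python sketch. The function name is my own and the numbers in the usage lines are invented; only the 10% and 100% processing shares come from the description above.

```python
def crashes_per_100_adu(processed_crashes: int, adu: int,
                        processed_fraction: float) -> float:
    """Estimate crashes per 100 ADU from a throttled sample of reports."""
    if not 0 < processed_fraction <= 1:
        raise ValueError("processed_fraction must be in (0, 1]")
    # Scale the processed sample back up to the estimated total of reports.
    estimated_total = processed_crashes / processed_fraction
    return 100 * estimated_total / adu


# Invented numbers: release channel throttled to 10%, beta fully processed.
release_rate = crashes_per_100_adu(1500, 1_000_000, processed_fraction=0.10)
beta_rate = crashes_per_100_adu(300, 20_000, processed_fraction=1.0)
print(round(release_rate, 2), round(beta_rate, 2))
```

If the counts of both groups are summed under one version number, no single factor fits, which is why mixing them produced unusable rates.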
In addition, the team also fixed a discrepancy between crash counts that have been previously done per Pacific Time day and ADUs which are done per UTC day - now both (for betas and releases) are counted per UTC day, making the rates more meaningful.
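A quick Python sketch shows why the day boundary matters. It uses a fixed UTC-7 offset as a stand-in for Pacific Daylight Time, which is an assumption for this example; real code would use a proper time zone database.

```python
from datetime import date, datetime, timedelta, timezone

# Assumption: a fixed UTC-7 offset stands in for Pacific Daylight Time.
PACIFIC_DAYLIGHT = timezone(timedelta(hours=-7))


def day_bucket(timestamp: datetime, tz: timezone) -> date:
    """Return the calendar day a timestamp falls on in the given zone."""
    return timestamp.astimezone(tz).date()


crash_time = datetime(2011, 8, 3, 2, 30, tzinfo=timezone.utc)
print(day_bucket(crash_time, timezone.utc))      # counted for August 3
print(day_bucket(crash_time, PACIFIC_DAYLIGHT))  # counted for August 2
```

A crash at 02:30 UTC lands on different days in the two schemes, so dividing Pacific-day crash counts by UTC-day ADUs compares slightly different 24-hour windows.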
With all that, we now will be able to compare different betas against each other in a meaningful way, as well as beta and release, look for differences and spot regressions more easily. Still, note that this is for betas and releases only, while we have plans for improving Nightly and Aurora reporting as well, those for now stay with the "old" reports. Also, this is only the first stage, and small glitches are possible, though some more visible regressions have been fixed earlier today as mentioned.
Getting this to work was not "just adding a line of SQL" as someone suggested to me some time ago, but it required getting the necessary data in the correct tables, creating new data aggregation tables and mechanisms, fetching the needed data from the proper places, making the UI use the new aggregations and making other parts of the system play together with those changed reports properly. Many thanks to the Socorro team for getting all this done in time for today's Firefox release!
I hope the team gets some good sleep and rest after this now while we are starting to actually use their newest work, so they're fit for the future. In the end, we have more requests for improvements come their way as we're trying to get all the data we need for making Firefox even more stable - it's surely not a boring place to work at for either one of us...
By KaiRo, at 22:42 | Tags: CrashKill, Mozilla, Socorro | no comments | TrackBack: 1
August 3rd, 2011
Crash-stats Update, Planned Changes, And Crash Rates
Now, did released "stable" and beta versions of Firefox Mobile suddenly become almost 3 times as crashy within a day?
Thankfully not. The data on the graphs actually was undercounting crashes until the newest set.
Last night, the Socorro team released their newest release, 2.1, to our production servers. That didn't just mean that colors on more detailed graphs are fixed and that source code should be linked correctly even for aurora and beta trees; most of all, it means that crash counts now include all actual reports. For Fennec that is very significant, as crashes in content (website) processes have not been counted so far.
So, starting with the August 2 numbers (no old numbers are being backfilled), the system actually counts and displays crashes in websites in Fennec numbers and rates - and as we encounter almost double the amount of crashes in content processes than in browser processes on mobile, the total numbers seem to roughly triple.
This also means that the reported rates for Fennec are in the same general area as the Firefox desktop numbers - at least they match what was listed for them until yesterday. People who have been watching those numbers might recognize a change in this graph as well:
And I mean neither that the blue line for Nightly is very high due to some recent regressions that our developers are working on, nor that Aurora (orange) is visibly going down after we fixed a prominent Flash hang (but more work is needed on crashes there), nor that Beta (green, the Flash hang fix not yet being visible) and Release show slightly higher rates on the weekend (Jul 30/31). Those are all normal mechanics.
I mean that numbers there for all days and versions looked somewhat lower yesterday than they do today. Beta and Release were almost identical before, slightly below that 1.5 line, and now they're distinct and mostly over that line.
As I hinted above, this is due to including *all* reports now, and due to a difference with mobile. And it's not content processes, which we don't have on desktop - but websites are still crashing mostly the same. It's data we have counted before (that's why it affects previous days as well) but not displayed in graphs: plugin crashes.
Fennec right now doesn't support plugins, so the 0.4-0.5 "crashes per 100 ADU" we see on Firefox desktop releases, mostly from Flash, but also other plugins, probably some even crashes in our plugin process that are not the plugins themselves, are missing out there, and that explains the remaining difference in rates between Fennec and Firefox. It also explains why graphs changed across the board today.
But the team is already working on more changes to come very soon:
Right now all builds with the same version number are lumped together in the same graphs and topcrash reports - but in the future, we need to tell apart different Beta builds so we can see if one beta is better than the previous one. We also need to be able to tell apart a release build on the beta and release channels, as the people on those channels have different usage habits, and even more importantly, we throttle crash reports on the release channel (we only process and therefore count 10% of the reports, actually, to not overwhelm server storage) but not on beta, which made crash rate calculations be quite useless around the last release. Because of that, the team is working hard to get this work done before we go for the next release on August 16, and then we should have useful numbers this time around.
This will also mean that we will have distinct numbers for every beta in the next cycle, which will be very helpful but probably will have its own fun repercussions - I should blog about that when this comes around.
I'd like to close out this post with some words of general caution when looking at crash rates or numbers: Never take them as a general stability measure without monitoring more closely what's behind them and why they look like they do!
Those rates lump together all kinds of issues:
- Regressions in our code that cause crashes,
- a long tail of residual crash issues in our code,
- binary libraries of valid third-party software or adware/malware hooking into our code and causing crashes due to incompatibilities, often because they're crafted for peculiarities of some other Firefox version,
- security tools like e.g. Norton Site Advisor or AVG not knowing that particular Firefox version and causing crashes by blocking it,
- add-ons with crashy code or incompatibilities,
- operating system or driver libraries that we call ending up crashing or showing incompatibilities with our code,
- out of memory issues,
- plugins (often Flash) crashing or not reacting in time (hanging), sometimes even crashing the browser process,
- website changes or new websites uncovering lingering crash issues (often from the long tail of residual crashes),
- and probably some more.
If you look at this list, it becomes pretty clear that using those numbers as a firm measure of how stable a release or version is probably is not a really fair value. Still, the numbers are quite helpful in discovering if there is some kind of regression or new issue, if there is a good or bad trend, if some fix for a large issue has an overall effect - but following that discovery, we need to look closely at what the discovered issue really is and how we need to deal with it - that can be getting our developers to look into it, contacting third parties to work on a fix, blocking some add-on or library, advising release drivers of problems to track, etc. - and that's what's actually the job the CrashKill team, including myself, ends up doing most of the day.
By KaiRo, at 18:32 | Tags: CrashKill, Mozilla, Socorro | no comments | TrackBack: 1
June 10th, 2011
The New "Crash Signature" Field in Bugzilla
It wasn't just messy up to now for Socorro (crash-stats) to parse the signatures out of headers, we also had a number of bugs where the constraints on the summary length caused problems with adding multiple signatures or a very long signature - not to speak of making summaries hard to read for humans. All that should get solved with the new special-purpose field that also allows greater length of its content.
You might have noticed that this "Crash Signature" field has turned up in many products of Mozilla's Bugzilla in the last days, below the tracking/status/blocking flags on the right above the comments.
The syntax for entering signatures is similar to what we did put into summaries so far: a signature is listed as [@ crash::signature(params) ] - for multiple signatures, we prefer to use a new line for every signature to make the field more readable, though.
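For anyone who wants to read that field from a script, here is a minimal Python sketch of pulling the signatures out of the field text. It is not the parser Socorro or Bugzilla uses; the function name is invented, and the sketch assumes each signature is closed by a "]" that is followed by the next "[@" or the end of a line.

```python
import re

# A signature starts at "[@" and ends at a "]" that is followed by the next
# "[@" or by the end of a line, so brackets inside names like operator[] survive.
SIGNATURE_RE = re.compile(r"\[@\s*(.+?)\s*\](?=\s*(?:\[@|$))", re.MULTILINE)


def parse_signatures(field_text: str) -> list[str]:
    """Return all signatures found in a Crash Signature field value."""
    return SIGNATURE_RE.findall(field_text)


field = "[@ crash::signature(params) ]\n[@ Foo::operator[] ] [@ Bar::Run ]"
print(parse_signatures(field))
```

Both layouts are accepted: one signature per line, as preferred above, or several on the same line.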
We're migrating over the contents from summaries to the new field, so you don't need to copy those manually - some of that happened already, but the script didn't catch all necessary bugs the first time it ran. Also, the work on Socorro to use the new field has not landed on production yet. Both should be done soon (within days), and then we can fully use the new field. Until then, it's best to leave the field alone and keep using the summary (though for those cases where the summary is too small, populating the new field right away is safe).
In the future, i.e. once Socorro uses the new field, we'll also be able to make summaries more human-readable, e.g. "[meta] EnterMethodJIT crashes" or something similar will be far easier on the eyes for bug 595351.
Thanks to everyone who worked and is still working on implementing that!
By KaiRo, at 19:14 | Tags: Bugzilla, CrashKill, Firefox, Mozilla, Socorro | no comments | TrackBack: 0
May 12th, 2011
Full Time at "CSI:Mozilla"!
I've now also gained a real @mozilla.com account and can regard myself as a real part of the Mozilla workforce now. I'm looking forward to more analysis of crashes on one hand, of areas where we need better crash stats on the other hand, along with finding out specs of what exactly we need there and trying to drive and accompany those to completion. That basically means I'm continuing what I did in the last months, but with even more intensity.
As the internal phonebook has a freeform field for a job description, I've been thinking a lot recently about what I should put in there (left it blank for the moment) and as I've been putting most of my work into investigating the surroundings of crashes and the scenes of crash analysis, I'm more and more leaning toward "Crash Scene Investigator - CSI:Mozilla", what do you think?
Whatever exact wording I put in there, I think I found an interesting corner in this project to work on, with a lot of interaction with other people in the project, just as I like it, and I'm proud of helping to improve the stability of Firefox and reduce frustration for hopefully a lot of people out there!
Update: The @mozilla.com account mentioned above is now reachable with "kairo" before the "@"!
By KaiRo, at 02:55 | Tags: CrashKill, Firefox, future, Mozilla, Socorro | 7 comments | TrackBack: 0
March 1st, 2011
What Should crash-stats Do For You?
In this context, if you are using the crash-stats system, I would like your input: What are some of the use cases you run into most often? I'd be interested in hearing from developers, QA, release managers, etc. What questions are you trying to answer from crash-stats?
I will be going through many of the items currently in bugzilla and helping to prioritize them as well as put together detailed specs. Helping determine the criteria for detecting explosive crashes is one issue I am currently working on. If you think of additional reports that would be useful, please make sure there are bugs filed (under Webtools/Socorro) and CC me so I can understand the requests and help to get them in the development pipeline.
Thank you!
By KaiRo, at 22:11 | Tags: Firefox, Mozilla, Socorro, Veridian 3 | 8 comments | TrackBack: 0
February 16th, 2011
Contracting for Mozilla!
Given all that, I started to shift a number of my responsibilities in the SeaMonkey project over to other people - like Callek for release engineering, for example - and I'm working on more in that area. In the end, I probably want to end up only having the German localization and the comm-central build system ownership left, as well as being one of the members of the SeaMonkey Council - for now. But there's still some way to go there, and I'm trying to make this a smooth, step-by-step process so that the SeaMonkey project can come out as strong or even stronger than before from this period.
At the same time - and here's the meat of this post as well as something that has been brewing for a while - I'm starting to get my feet wet in different areas. Starting today, I'm officially working for Mozilla part of my time - for now, this means contracting on a roughly 3-month project I'm working on half-time, but I hope this proves to be a fruitful relationship that has some great times still to come.
My work within Mozilla is positioned in the general area of program management - the concrete project I will be working on in the upcoming months is putting together a strategy for Socorro (the crash stats system we're using). I will not work on the code, but rather work with the Socorro team and the other Mozilla managers and developers to find a roadmap for what the developers will work on in the following months that will bring some additional perspective of how to use the crash system to help enable people to make product release decisions, and not just around analyzing specific crash bugs. There is a lot of work here to consolidate the over 400 change requests and bugs that have built up around the Socorro system, and create some more detailed specs for some of the more intricate areas that will help us to build systems that help to understand our crash data better, and how to use it more effectively. I can definitely use suggestions from across the entire community on how we can meet this goal more effectively.
The positive side in terms of a transition for myself is that Socorro affects all Mozilla applications and I have some experience (even if not too much) in looking at it and seeing what this system is actually about. Also, having release management experience for a Mozilla project and having been following discussions of Firefox release managers as well as the security group helps in having insights into what is in the focus when it comes to looking at crash reports and statistics.
For me personally, I have dubbed this project "Veridian 3" and in the first round of bug triage I'll be doing, my internal tags will start off with "V3" because of that - but this name will probably not leak outside my private use, for other people's sanity.
There's a good amount of intense work ahead and some things I'm not used to, like reporting to a specific person or making plans other people are actually paid to work on, but I'm confident I can master those and come out having done my part in helping us all to have a better crash stats system.
So be prepared to see me appearing somewhat less in the usual places but instead in some new areas in the near future!
By KaiRo, at 14:24 | Tags: Firefox, future, Mozilla, SeaMonkey, Socorro, Veridian 3 | 7 comments | TrackBack: 2