♪ [MUSIC] ♪ [NARRATOR] Welcome to Unite Now, where we bring Unity to you wherever you are.
[ABEL] Hi, everyone, thanks for joining us on this talk: Delivering Better Games Faster by Leveraging Next-Gen Crash Data.
Given that this is happening during COVID-19, I'm really excited about the change in format. This talk was originally supposed to happen at GDC with Fred Gill, who's Head of Technology at Respawn, and myself.
But given that we're doing this remotely, we're gonna have a much more conversational format where both of us get to talk about exactly this topic: Delivering Better Games Faster by Leveraging Next-Gen Crash Data.
Just a quick round of intros: the guy talking right now, my name is Abel, I'm Co-Founder of Backtrace, where we build a crash analysis and debugging solution for the game market as well as a number of other markets.
Fun fact about me is that I got addicted to FPS games starting at CS 1.6, so I've been playing games for most of my life.
Fred, wanna introduce yourself? [FRED] Hi, I'm Fred Gill, I'm Head of Technology at Respawn Entertainment.
I don't really have any fun facts, but I've been in the games industry for over 36 years now.
Before joining Respawn, I was the tech director of EA Partners for 10 years, helping third-party studios bring their products to the market with EA as the publisher, and helped Titanfall 1 and Titanfall 2 ship with Respawn.
Respawn was acquired by Electronic Arts in December 2017, and I joined as the Head of Technology in June of 2018.
[ABEL] Great, awesome, super excited to have you as part of the conversation, especially such a big industry veteran as yourself.
And also, for all the gamers out there, I want to thank you for Apex Legends, Titanfall 2, and Star Wars Jedi: Fallen Order.
Though they've taken up many hours of my life, I definitely appreciated all the great gameplay.
[FRED] Cool, glad you're enjoying it.
[ABEL] So, just quickly as part of this intro: who should be listening? Who's this talk really for?
If you're a games programmer or a systems programmer (say, you work on the infrastructure or tools as part of the game studio), you're in QA or play testing, or you're managing any of these resources as a technical director or engineering manager, this talk is really centered around actionable and insightful information for your day-to-day.
As part of this talk we're gonna start off with the obligatory intro of why you should care more about crashes.
I feel like this is common knowledge in the gaming industry today, but it's always good to refresh ourselves. Even though crash analysis and debugging crashes are part of everything that we do, they're still something that's been kind of put off to the side, so remembering why crashes are important is what we'll focus on initially.
But really what we wanna talk about is, given where most systems are today, what next-gen systems look like (in fact, those next-gen systems already exist), and how they can improve your best practices and the way that you operate for resolving crashes across the game lifecycle.
One key thing that I feel like I don't have to say to anyone in the games world is that crashes and hangs are inevitable; with very few exceptions, your game is going to crash or hang in the wild.
If you're producing a PC game, there could be massive amounts of different configurations: different graphics cards, different amounts of RAM, machines not meeting the minimum requirements.
But even on consoles, with fixed sets of hardware, we still see massive amounts of crashes happening to games, and that's just because you cannot predict all of the different scenarios in which your game will be tested.
Even the best QA and testing systems at the biggest AAA game studios, which we'll talk about with Fred as well, don't catch all the crashes.
Millions of users means millions of combinations of environments, which means more game crashes.
And the really easy way to talk about why you should focus on game crashes is this: game crashes mean angry gamers, which means lost revenue and potential trolls or haters on forums talking about how much they hate your game's crashes, which ultimately leads to a less successful game.
It's reputational damage, it's player churn, but it's also frustrated engineers.
When you don't have what you need to be able to resolve these crashes, you're taking away from all the hard work, all of the love, that your team has put in to be able to release this game.
So I wanted to actually hand it off to Fred here, because given his industry experience, it's one thing for me to say that crashes will happen and lead to negative outcomes, but, Fred, you've got so many different war stories, I thought you could share a couple here.
[FRED] So cast your minds back to 2013, for anybody that was around in the industry then. EA had a couple of very notable releases that year, but sadly they were notable for all the wrong reasons.
We had SimCity launch in March of 2013, which had lots of server issues, and then later that year, sadly, we had Battlefield 4 launch, which had lots of client stability issues.
And as Abel said: reputation, angry gamers. Recovering from that was extremely hard, and I'm not sure we did with SimCity; it was very difficult with Battlefield 4 to get that goodwill back once you've lost it, and the clients were crashing all over the place.
EA had, in its infancy, something then called system test, which set out within EA to create what we call boring launches.
So we do a lot of work around the time a game ships, kind of the year up to when a game ships, and around 2013, even then, you could see that a lot of games were becoming live services.
They'd launch, but there'd be an online component, whether that was free updates for a year or a live service that people paid for, et cetera. Those were developing, so we wanted boring launches: few issues when we launched, few server crashes, few client crashes, everything working together seamlessly.
So we run this program called system test, which is load testing, spike testing, reliability testing, and stability testing on the client and the servers before we launch.
And a key part of that, for an online game like Apex that has a client and a server, is understanding the crash rate of the clients and the crash rate of the servers.
But as Abel said, with the best will in the world, system test gets close, but it doesn't mirror what the public do when they get hold of our game. So as good as our QA is and as good as system test is, there are still errors and problems that leak out when we ship a game.
No game launch is ever completely perfect, but with the system test process we've had in place for the past seven years, every time we launch a game we learn from it, and we get better and better and better, so you'll have seen that lots of EA launches recently have been much more stable and, what we call, boring.
The best thing for us is when the war room of people that we bring together for those launches is completely bored, because there's nothing to do: no fires to go and fight, no service to go and investigate while it's crashing, et cetera.
So as Abel said, it's a key part of our philosophy to minimize those issues before we launch.
[ABEL] Yeah, and it's also really interesting to hear that even as late as 2013, marquee titles like Battlefield 4 were the catalyst for creating system test, the boring launches, the live services.
Just curious, talk to me about the effort that went into actually making boring launches a thing. Was it a massive team effort? [FRED] We created a group within EA to kind of help manage it and pull everybody together from all the different groups, and to make sure the learnings of every single launch were carried forward to the next launch.
So it was a few hundred people that ended up consolidating this into a repeatable process for all the games.
So yeah, an absolutely massive focus, because the reputation of our game launches fed into the reputation of EA.
It's not a good place to be as a company, or as a team working on a product, when it launches and it's crashing lots, because it can take days, if not weeks, to go and find and fix those issues in the wild, depending on the systems you've got.
Again, we'll come to some of the systems EA built later on as part of this system test process.
[ABEL] It's fascinating to me, hundreds of people contributing to that, because as you said, you work on a game for years just to have it fail, potentially, because of game crashes. It's just a huge travesty.
And by the way, as someone who tried to play Battlefield 4, I feel the pain.
I was really looking forward to that title, and the massive amount of client crashes really was a detriment to my gameplay experience.
So I really appreciate that perspective, Fred, because it really gets us onto this idea that something as big as EA, with as big a title as Battlefield 4, can still struggle here.
In-house and free solutions often suffer from the same problems.
What you guys were probably seeing is that you had to come together to make sure that you had full coverage.
What we'll talk about later in this conversation is that these kinds of in-house solutions often have long maintenance cycles, because it's not necessarily a dedicated team that's constantly working on the crash analysis solution.
It's a team that's working on a full set of tools for the development of games.
Oftentimes we run into them not being cross-platform and having different silos of data.
Your PS4 data may live in another place compared to your Xbox data, compared to your PC data, compared to your mobile data.
And honestly, in the way you've kind of described it, Fred, in some of our conversations in the past, many times it becomes more of a pain than a solution, where you're just constantly shaking your fist in the air at the in-house solution because you want more data from it, you want to be able to do different types of analysis.
So today, in the rest of this conversation, we really wanna talk about how next-gen crash analysis solutions bridge this gap, and do that with Fred by doing a deep dive into all of their different processes.
When we talk about next-gen crash analysis, I kind of want to break it down in the following ways.
The first is how the next-gen actually changes the kind of data we can capture. In many cases it's far more data, but what does that data look like? Once we have more data, we go to the next idea of crash grouping, or deduplication, which is a fundamental part of any good crash analysis solution.
When you have millions of crashes coming in, being able to understand whether a crash is unique really represents the switch over into these next-generation systems.
With that fundamental base unit of a crash group, we can improve analytics to improve prioritization and investigation, as well as build intelligent workflows on top of it.
When we talk about crash data, and the first facet of the next-gen of crash analysis, it's now about gathering as much data as we can at the event of a crash.
This could be everything from the crash dump to auxiliary log files, and the operating system, driver, and system information.
If you're used to systems like WER, you may see some of these things from those reporting mechanisms.
Even in-house solutions, like the BugSentry system Fred will talk about, capture some of this information, but it's really about making sure you can consistently capture it across all platforms.
Additionally, you may want to add your own custom metadata: things like which game level was being played, or the session ID so that you can trace the crash back to the game server.
Also, with consoles, and even in some cases PC, you may have video playback of the events that led up to the crash.
All of this data really matters for your ability to prioritize the crashes.
So you can say things like, I want to make sure that all the crashes on my PC launch are being resolved as quickly as possible.
Additionally, it matters to investigation.
There's nothing more frustrating as an engineer, being a software engineer myself, than investigating a crash and telling myself, if only I had this one piece of data I could test this hypothesis, when that data oftentimes isn't available to me through existing capture mechanisms.
I thought it might also be interesting for Fred to talk about the type of data they capture in their crash analysis solutions across a couple of different titles at Respawn.
[FRED] So one of the key things for us is ensuring we comply with GDPR and respect people's privacy, and that ties into the way we work with the crash dumps anyway, in that we retain all of the crash logs for less than 30 days.
If you're not looking at them within a couple of days, then you're not really caring about the live service.
You really have to be looking at these regularly and nailing the top offenders, but we'll come to that later.
We are always evolving the data that we want to add because we're thinking of new use cases.
Every time we kind of move the bar, we go, “Oh, maybe if we did this we can get some more insights,” et cetera. One of the key things that we try to do for both titles is have a unique identifier for the players, but one that isn't associated back to the actual player.
We want to track crashes from an individual, and the reason we want that is, for example, on Jedi: Fallen Order, some crashes that we get are the result of people trying to hack the DRM that exists.
So as part of their hacking endeavors, they will cause lots and lots of crashes.
We need to remove that noise from the crash data so that we can concentrate on the people that really matter.
In a similar way, on Apex, we have cheat developers that are trying to create cheats around the Apex client, and they are causing crashes all the time as well.
So again we want to remove those from the signal, but we do this with anonymous IDs that are the same for each player, so we know it's coming in from the same person, but we don't know who that person is.
Another problem we encountered on Apex in the early days is that a lot of our crashes were from people that just ignored the min spec and were trying to run the game on CPUs that didn't support certain instructions that we needed in the game, so a lot of our crashes were from people trying to play the game on machines that weren't capable of running it.
And again, you need to remove that noise from the signal so you can concentrate on fixing the problems for the real players.
Those are the main kind of insights from us on data at this point.
[ABEL] Yeah, it's actually fascinating, because I hadn't even thought about the idea that you want to be able to identify a user not just to see crashes by user for larger analysis, like how often a user is crashing, but also because there are malicious users out there trying to cheat or hack your game, and you want to remove that noise from all of your data analysis.
That's fascinating.
And then the min spec one too. I feel like I've heard that one before, but system-level information like memory, CPU, and model often can really play into that, so that's fascinating too.
Good stuff.
So when we talk about crash data and everything that Fred talked about, from the top level, which is the crash dump, down to having a specific retention period for that metadata or attribute data so that you can actively prioritize an issue: when we talk about games like Apex Legends, and Fred, correct me if I'm wrong, you're talking about potentially hundreds of thousands of crashes at any given point in time, given the massive user base that you guys have.
[FRED] It's not at that level.
It's certainly thousands a day.
[ABEL] Okay, well, even at thousands a day, as a human being I could not properly analyze a thousand crashes and be able to look at which ones are similar and which ones are not.
Fred will actually share a story here in a bit, but in our conversations we've talked about how at a lot of game studios that's actually what they're doing.
They're getting an email report and they're looking at which issues are happening, which ones they're used to, which requires someone who knows the ins and outs of the system and knows historical data about the crashes. This is where deduplication really comes in.
Deduplication attempts to map crashes and errors into an associated set or an underlying group.
So here you see in the diagram being able to map errors A and B to a single fingerprint, which allows you to then really focus on these unique issues.
And the idea is that a group, or a fingerprint, fundamentally maps one-to-one to a root cause.
Obviously, though, and we'll talk about this, this becomes a very complicated problem.
Deduplication is non-trivial, given all the information that comes in with a crash, but the difference between looking at 3,000 crashes versus 150 crashes is significant to any development team that's trying to produce a good quality game.
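To make the idea concrete, here is a minimal sketch of mapping a crash's call stack to a fingerprint so identical crashes collapse into one group. This is an illustration under simple assumptions, not Backtrace's actual engine, and the frame names are invented:

```python
import hashlib

def fingerprint(callstack):
    """Reduce a call stack (outermost-to-innermost frame names) to a
    short, stable group identifier by hashing the frames together."""
    digest = hashlib.sha256("\n".join(callstack).encode()).hexdigest()
    return digest[:16]  # short prefix is enough to act as a group key

# Two crashes with identical call stacks land in the same group, so
# 3,000 raw reports can collapse into a much smaller set of groups.
crash_a = ["main", "update_world", "spawn_entity", "alloc_buffer"]
crash_b = ["main", "update_world", "spawn_entity", "alloc_buffer"]
crash_c = ["main", "render_frame", "alloc_buffer"]

assert fingerprint(crash_a) == fingerprint(crash_b)
assert fingerprint(crash_a) != fingerprint(crash_c)
```

As the next section explains, hashing the raw stack like this is too naive for production: platform-specific frames and optimized function variants split groups that belong together.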
It also allows you to use that group as the base unit for your analysis.
So on a per-fingerprint or per-group basis, you can now look at, say, what are all the different PC graphics cards that were being used during this specific unique group of crashes, which we'll talk about here in a bit.
However, as I said, deduplication is a non-trivial problem, and that's because if you look at existing methods, and if you just think for a couple of minutes about how you would potentially solve this problem: using the complete call stack means that you're going to get a lot of groups that really should belong together, because on different platforms, and even different versions of an operating system, your beginning frames may actually look different even though the crash is fundamentally the same thing.
Using the innermost frame also doesn't work, because it doesn't fully summarize all the different paths that could reach that frame, which is the same shortfall as using an assertion string or exception class, which we see a lot of different solutions attempt to do.
If you're simply saying that these are all the crashes I saw from an unhandled exception or because of a hang, then you're not really getting all of the data you need to be able to accurately group.
This is further complicated when you're developing in C++, or having your program translated to C++, where you get incorrect call stacks due to things like optimization; anonymous functions and templated functions make this problem even harder.
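Both shortfalls can be shown in a few lines. The frame names below are made up for illustration:

```python
# Two crashes that reach the same innermost frame via different paths.
crash_1 = ["main", "net_recv", "parse_packet", "memcpy"]
crash_2 = ["main", "save_game", "serialize", "memcpy"]

def group_by_full_stack(stack):
    # Over-splits: platform-specific entry frames make identical
    # root causes look like different groups.
    return tuple(stack)

def group_by_innermost(stack):
    # Over-merges: every path that ends in the same frame becomes
    # one bucket, hiding distinct root causes.
    return stack[-1]

# Innermost-frame grouping lumps both into a single "memcpy" bucket,
# even though two unrelated code paths are crashing.
assert group_by_innermost(crash_1) == group_by_innermost(crash_2)
# Full-stack grouping keeps them apart, which is correct here but
# would over-split the same bug reported from different platforms.
assert group_by_full_stack(crash_1) != group_by_full_stack(crash_2)
```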
And so deduplication can often be the bane of one's existence when looking at crashes, because you may be looking at one crash but not have enough data, since there's a larger group of crashes that belong to the same root cause.
Or you may be looking at tons of different crashes that are all being grouped together but are completely different things, which actually leads to this example of event-based and concurrent systems.
So here you can see in the diagram that everything after these red arrows is fundamentally the same call stack:
“cr_pu_enter” goes to assert.
However, the entry-level functions that led up to “cr_pu_enter” are different.
Now, in a traditional system, if we are just grouping on assert, this would all be grouped into a single group.
Even if we were grouping on “cr_timerset_dispatch_current_level1” it would be one group.
But if we were grouping by the complete call stack, this would be two different groups.
So event-based systems and concurrent systems can actually introduce yet another level of complexity into how you accurately group your crashes together.
This also comes with additional complexity when you consider that there are multiple backing functions for things like “memcpy”.
These are often used within games to get the most performance, and again this introduces more complexity into deduplication.
All of this is to say, this is a problem that we've been thinking about quite a bit at Backtrace, and when we look at how typical solutions solve it, we understand the complexity of it.
In fact, we've put a lot of effort into trying to think about what's the best way to group crashes.
But the flip side of that is, when crash grouping just isn't doing you any favors, it becomes a huge pain.
And so, Fred, I know you've got a couple of stories with Apex Legends, WER, and the like.
[FRED] So EA has a technology called BugSentry that is pretty good.
It does crash reporting, can collect memory stats, et cetera.
But at the time we were looking to launch Apex, we estimated the amount of work to integrate that and get the data flowing through to the backend so that we could work with it.
And you have to remember that at this time Respawn had just been an independent studio, so it was fiercely independent, had its own technology base, and so hadn't leveraged any of EA's technology stack.
So for adding in new technology from EA, there's a barrier to entry.
And when we looked at it, the cost of adoption was a couple of game features, and we just couldn't take that given the launch window and everything that we had planned.
So we didn't adopt EA's BugSentry.
We went with Windows Error Reporting for the PC with Microsoft, and when we launched, we couldn't get it working.
We kept getting reports through about crashes that were inside antivirus software.
It was just a load of noise.
And so very quickly, I think within the first month of launch, after trying multiple times with Microsoft to get the data flowing properly, we went to an alternative solution: we wrote a very, very simple crash handler ourselves that just created a text file in your local Documents folder, and we had the community mail those through to us.
They could see there was no personally identifiable information in there, so we had them mail them through.
And we got a couple of hundred crashes through.
But there was no deduplication, so we had one of our senior engineers going through all 200 of those bugs, looking at them one at a time and deciding whether each was a dupe of a previous one.
And by the time we got through those 200, I think there were four or five unique crashes.
We managed to nail four of those, and we got rid of a significant number of our crashes in the game.
And we had built a system, a fairly simple system, for just logging how many crashes, disconnects, et cetera, we get, that's pumped through to us live so we can see data in real time that we can react to.
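The stopgap Fred describes, a handler that writes a small plain-text report a player can mail in by hand, can be sketched roughly as follows. This is a hypothetical reconstruction in Python, not Respawn's actual (native) handler:

```python
import datetime
import pathlib
import sys
import traceback

def write_crash_report(exc_type, exc, tb, out_dir="crash_reports"):
    """Write a small plain-text crash report containing only the stack
    trace -- no personally identifiable information -- so a player can
    inspect it and mail it in by hand."""
    pathlib.Path(out_dir).mkdir(exist_ok=True)
    stamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S_%f")
    report = pathlib.Path(out_dir) / f"crash_{stamp}.txt"
    report.write_text("".join(traceback.format_exception(exc_type, exc, tb)))
    return report

def install_crash_handler():
    """Route unhandled exceptions through the report writer, then fall
    back to the default hook so the error still reaches the console."""
    def handler(exc_type, exc, tb):
        write_crash_report(exc_type, exc, tb)
        sys.__excepthook__(exc_type, exc, tb)
    sys.excepthook = handler
```

As the story shows, this gets data flowing, but with no deduplication someone still has to read every report by hand.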
But this crash handler helped us get into an okay position, and then it was obvious to us that we needed something stronger and better than that, and so we looked at Backtrace.io around GDC last year, had some initial conversations, and we actually tried it on JFO first.
Again, Star Wars was coming up to launch, and we were looking at how long it would take to integrate BugSentry.
Again, no tech stack from EA, and so, again, we were looking at the cost in features.
When we adopted Backtrace, I think from the initial call to having data flowing through into the backend, in a dashboard that we could look at, was something like four hours, because there was a plug-in that existed for UE4.
And after that experience we decided, actually, that was a hugely positive experience, it hadn't cost us a game feature, so let's go back and integrate it into Apex Legends.
And that took us a couple of days, with some quirkiness around our anti-cheat solution, et cetera, but we had data flowing through within a couple of days, and again for Apex that's not a game feature that we're having to cut at that point.
[ABEL] Yeah.
So interesting to hear the genesis of using Backtrace, or a next-gen crash analysis solution.
When we talk about deduplication and grouping, you mentioned with WER this kind of egregious grouping of crashes all into antivirus.
Talk to me about how frustrating that was for your engineering team, and how did you guys even solve crashes at that point?
[FRED] Oh, we didn't.
We had nearly four weeks with not just us being frustrated at the fact we couldn't help our community, but the community, particularly the PC players that this was affecting, being very frustrated with us. It was not healthy at all.
I think our community manager put out a few kinds of messages to calm people, saying that we were working on it and trying to get the data to flow through properly with Microsoft.
It was just so frustrating, and in the end one of our senior engineers just said, “Look, I'm gonna write a crash handler and spew out a small text file, and people can just email them in.” That took about a day or two to get through the pipeline, so I think it was about a month after we shipped that we finally started getting proper PC crash data through.
And from that point on we were able to help people.
But it was hugely debilitating; we just felt like we were letting the community down.
Not a good place to be.
So moving to Backtrace.io allowed us to put in something that's repeatable. This is deduplication, so we don't have to have an engineer looking through them all, and it really focuses you down on the handful of issues you really do need to solve that affect 90% of your community.
[ABEL] Well, actually, what I love about your story is that it also shows the lifecycle of what it looks like with bad deduplication.
On one end of the spectrum, everything just gets grouped together, and so you have no idea what the unique issues are; it's all antivirus.
Your engineers are frustrated, your community is frustrated, you're throwing your fists in the air, you just don't know what's going on.
On the other end, though, when you built your own crash handler, you were getting these individual reports.
And so, as you mentioned earlier, you'd be looking at each one via email, with a human analyzing it.
When you have highly granular groups, you actually don't know whether something's already come in or not, and so you're manually doing that work.
So it really points out how deduplication, and getting it as accurate as possible, can mean the difference between incredibly frustrated engineers and community, or wasted engineering cycles, versus something that is super powerful and lets you focus on the right thing. That's something we've been working on at Backtrace for years now, and why, when Fred saw what we produce, in addition to ease of integration, it's really about thinking through what a good deduplication engine looks like.
It means being able to remove things like all of the entry-level functions or exit functions.
You don't want a common crash handler to affect how you're grouping the crash.
Additionally, you want to be able to take functions like “memcpy_sse2” and say this is just fundamentally a call to “memcpy”, because that's what it looks like in my source code.
And so Backtrace has spent a lot of time building a rules-based finite state engine to really solve this problem, because again the fundamental issue is getting deduplication as accurate as possible, and what we've noticed is that this leads to a fairly good system for getting the base unit of a crash group as close to the root cause as possible.
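In that spirit, a simplified sketch of rules-based frame normalization looks like the following. The skip list and patterns here are illustrative assumptions, not Backtrace's actual rules:

```python
import re

# Platform entry frames and crash-handler frames are dropped, and
# optimized backing functions are folded into their canonical name.
SKIP_FRAMES = {"_start", "mainCRTStartup", "BaseThreadInitThunk",
               "crash_handler", "abort", "raise"}
CANONICAL = [(re.compile(r"^memcpy(_\w+)?$"), "memcpy"),
             (re.compile(r"^memset(_\w+)?$"), "memset")]

def normalize(callstack):
    """Rewrite a raw call stack into a canonical signature suitable
    for grouping: skip boilerplate frames, rename variant frames."""
    out = []
    for frame in callstack:
        if frame in SKIP_FRAMES:
            continue
        for pattern, name in CANONICAL:
            if pattern.match(frame):
                frame = name
                break
        out.append(frame)
    return tuple(out)

# Windows and Linux reports of the same bug now share one signature.
win = ["mainCRTStartup", "main", "load_level", "memcpy_sse2"]
lin = ["_start", "main", "load_level", "memcpy"]
assert normalize(win) == normalize(lin) == ("main", "load_level", "memcpy")
```

Normalizing first and then fingerprinting the result is what keeps one root cause from splitting into a group per platform.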
It also leads to better visuals.
Going back to the visual I showed you before, which was, sorry, two different crash groups when it should have been one single crash group: this visualization is a flame graph visualization, which comes from the performance management world.
If you guys are interested, see brendangregg.com/flamegraphs.
We've used this visualization so that you can view several different call stacks at once in a single view, to help you understand similarities and differences.
With good grouping, this type of visualization lets you get merge recommendations: which groups are fundamentally the same versus different.
And it's not an arduous task for you to understand that.
Good deduplication also lets you do things like merging similar fingerprints or groups, and getting recommendations.
When you've got this base unit of fingerprints that you can trust, you can apply algorithms from the string matching world, like Levenshtein distance.
So you can ask, how far is the call stack signature of this fingerprint from this other call stack?
These are things that we've been working on at Backtrace, and what we really push for in terms of next-gen deduplication.
It's super important because it ultimately leads to better signal through the noise, and to better tools for visualization, as well as for recommending and merging similar fingerprints.
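As an illustration, a Levenshtein distance computed over frames (rather than characters) might look like this; the signatures are hypothetical:

```python
def levenshtein(a, b):
    """Edit distance between two call-stack signatures, treating each
    frame name as one token rather than comparing characters."""
    prev = list(range(len(b) + 1))
    for i, fa in enumerate(a, 1):
        cur = [i]
        for j, fb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete a frame
                           cur[j - 1] + 1,              # insert a frame
                           prev[j - 1] + (fa != fb)))   # substitute a frame
        prev = cur
    return prev[-1]

sig_a = ("main", "update_world", "spawn_entity", "alloc_buffer")
sig_b = ("main", "update_world", "spawn_npc", "alloc_buffer")
# Distance 1: a single differing frame, a strong merge candidate.
assert levenshtein(sig_a, sig_b) == 1
```

A small distance between two fingerprints is exactly the kind of signal that can drive a merge recommendation.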
Once you've got the base unit of groups, though, the real power comes in your ability to do analytics on top of that.
And when I say real power here, we're gonna talk to Fred pretty soon about how they prioritize crashes.
But you've heard, even up to this point in the conversation, how things like the specs of the game machine matter to how you prioritize an issue.
You don't want to look at crashes that are coming from machines that are below the min spec, because you knew it was gonna crash anyway.
It didn't have the proper CPU instructions.
It matters whether or not it's a crash coming from the same single user who's actually trying to hack your DRM.
So all of this data actually feeds into how you prioritize these issues.
Additionally, it also feeds into how you investigate issues.
Is this only happening on a single GPU? Is this only happening on a single driver version of that GPU? We've seen time and time again folks looking at crashes and really wanting to be able to test the hypothesis of: is this the latest graphics driver that's causing these issues? And in many cases it actually is.
A group of crashes comes in because there is some level of badness happening that causes the crash.
Additionally, it could actually be game logic, it could be somewhere within your game map, or it could be a certain situation.
So you want additional data, and this is where analytics allows you to do large-scale, bird's-eye-view hypothesis testing.
Just a reminder, when we talked about what data matters here: it's being able to do analytics on top of everything, from the dump itself, to OS and driver information, to memory utilization, even to in-game state like session ID, and even being able to aggregate on things like application version or some larger indicator you have, like a build ID.
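As a toy example of that bird's-eye hypothesis testing, here's a sketch of breaking one crash group down by a metadata attribute. The reports, fingerprints, and attribute names are invented for illustration:

```python
from collections import Counter

# Hypothetical crash reports, each tagged with a fingerprint and
# a few of the metadata attributes described above.
reports = [
    {"fingerprint": "f1", "gpu_driver": "442.19", "level": "kings_canyon"},
    {"fingerprint": "f1", "gpu_driver": "442.19", "level": "worlds_edge"},
    {"fingerprint": "f1", "gpu_driver": "442.19", "level": "kings_canyon"},
    {"fingerprint": "f2", "gpu_driver": "441.87", "level": "kings_canyon"},
]

def breakdown(reports, fingerprint, attribute):
    """Count attribute values within one crash group -- the bird's-eye
    hypothesis test: is this group confined to a single driver version?"""
    return Counter(r[attribute] for r in reports
                   if r["fingerprint"] == fingerprint)

# Every f1 crash came in on one driver version: evidence pointing at
# the graphics driver rather than game logic.
assert breakdown(reports, "f1", "gpu_driver") == Counter({"442.19": 3})
```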
I thought it would be interesting, for a production-level game like Apex Legends, to talk about how Respawn prioritizes their crashes.
Yeah, I'm going to keepthis one simple because we actuallylet the deduplication do most of the work.
So we look regularly, if not daily, at the crashes coming in, and we're just looking atthe top three to top five to go and nail those.
Mostly, they are system bugs, so fixing them once in our enginefixes them for everybody.
And we just rinse and repeat.
It's as simple as thatbecause the deduplication does a lot of the hard work for us.
As you say, we're removing noise from the systemas much as possible, so we understand those top three, top five are the real ones.
so that if it's runningbelow min spec, if it's coming infrom a single user, there's a spike there, that's likely somebody who's trying to do something badto our game code, or could have a systemthat's just unstable and crashing, because users willquite often report that they're having a problemwith a particular game when they don't mentionthat actually their whole system is unstable for any game.
So we just let the deduplicationdo the work for us and just follow those numbers but filtering out the stuff that we don't care aboutlike min spec, et cetera, first.
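Fred's triage loop, filter out below-min-spec machines and single-user spikes, then take the top few deduplicated groups, can be sketched roughly like this. The min-spec threshold and record fields are assumptions for illustration only:

```python
from collections import Counter, defaultdict

MIN_SPEC_RAM_GB = 6  # assumed min-spec threshold, purely illustrative

def top_groups(records, n=5):
    """Rank deduplicated crash groups after filtering the noise out."""
    # Ignore machines below min spec: we knew those would crash anyway.
    eligible = [r for r in records if r["ram_gb"] >= MIN_SPEC_RAM_GB]
    users = defaultdict(set)
    for r in eligible:
        users[r["fingerprint"]].add(r["user_id"])
    counts = Counter(r["fingerprint"] for r in eligible)
    # Drop single-user spikes: likely tampering or one unstable machine.
    ranked = [(fp, c) for fp, c in counts.most_common() if len(users[fp]) > 1]
    return ranked[:n]

reports = [
    {"fingerprint": "gpu_hang", "ram_gb": 16, "user_id": u} for u in ("p1", "p2", "p3")
] + [
    {"fingerprint": "drm_probe", "ram_gb": 16, "user_id": "p9"},  # one user only
    {"fingerprint": "oom",       "ram_gb": 4,  "user_id": "p4"},  # below min spec
]
print(top_groups(reports))  # -> [('gpu_hang', 3)]
```

Whatever survives the filters is, as Fred puts it, "the real ones": the list the team looks at daily.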
[ABEL] It's really refreshing to hear that when the crash system does what you need it to do, like deduplication, and provides you the analytics that you need to be able to prioritize, you can filter through the noise and simply let the crash system say where you should be focusing your efforts.
I'm curious with Apex Legends: when you compare the previous days with WER and your own crash handler, one thing that we often don't talk about is that when the crash analysis system really works, you're able to do other things.
So it's interesting to compare and contrast: now that you have the system in place, what are some of the benefits you see with respect to engineering frustration or happiness, being able to focus on features, and the like? Yeah, so I think from an engineering perspective and a team health perspective, and also a community sentiment perspective, because that's actually one of the things we care most about with live-service games and even single-player offline games like JFO: what is the community feeling about the games.
We track that regularly.
And all of that actually feeds into the team health.
Because if we're making the players happy, then we feel better about what we're doing and how we're servicing them.
And deduplication and the tools that are in the backend in backtrace.io just let us focus in, so we don't waste time, which means we're freeing up our engineering time so they can fix bugs and then go off and do other things like make new features, and that's really important to us.
[ABEL] Yeah, it makes a lot of sense too.
As an engineer who's worked on many crashes, not being frustrated by the crash analysis solution makes my job a lot easier, for sure.
And I appreciate the perspective.
And some of the things you actually just mentioned here are worth reiterating, in that a good crash analysis system allows you to focus on the right priorities because you can filter out the noise.
It's not just looking at people who are trying to exploit your DRM system or antivirus.
Or, you know, if you get to another stage in tying your crash analysis into tickets, you can see whether a crash is actually being investigated at the time, and see if it's in progress or if it's something completely new.
Server crashes are even more important because they can impact several users at once instead of one user at a time, like on the client.
And so oftentimes, by now getting exposed to this data and being able to prioritize more effectively, it's not just which crashes are happening the most frequently, it's which crashes are affecting the blast radius that you really care about.
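One way to picture blast-radius prioritization is to weight each crash group by users affected rather than raw frequency. In this sketch, a server crash counts every player in an assumed 60-player lobby; the group records are invented for the example:

```python
PLAYERS_PER_SERVER = 60  # assumed lobby size, per the dedicated-server example

def blast_radius(group):
    """Estimate users affected: a server crash takes out a whole lobby at once."""
    if group["side"] == "server":
        return group["count"] * PLAYERS_PER_SERVER
    return len(group["unique_users"])

groups = [
    {"id": "client_hang", "side": "client", "count": 500, "unique_users": set(range(120))},
    {"id": "server_oom",  "side": "server", "count": 4,   "unique_users": {"srv-7"}},
]
ranked = sorted(groups, key=blast_radius, reverse=True)
print([g["id"] for g in ranked])  # -> ['server_oom', 'client_hang']
```

Four server crashes outrank five hundred client ones here, which is exactly the inversion that frequency-only sorting would miss.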
We'd already talked about this previous to this conversation, but really being able to slice and dice according to the dimensions of your crash data helps you do better hypothesis testing, which really leads to better investigation.
You know, are these crashes only happening on a specific type of GPU, or a specific game level? Additionally, it helps you make sure that you're not duplicating work.
In some cases, we've heard that there's a front line to game crash response: someone's looking at the dashboard.
But there could be people actually trying to resolve the crash and fix the code, and they could be working on the same issue that you're looking into initially.
And so analytics really helps you get better triage, helps you remove duplicate work, and then it also helps you with investigation as well.
You had mentioned this, Fred, but being able to just easily pull up a dashboard, when you were talking about Star Wars Jedi: Fallen Order, and being able to easily answer questions helps you immediately prioritize, triage, and investigate.
It's not just based off of a version; it could be based off of any dimension of the crash data.
So we're building on top of analytics, which we built on top of crash data.
With next-gen solutions, we can get more of the data that's necessary for prioritization and investigation.
With the analytics, we can actually triage better.
But next-gen systems also help us build a better workflow.
What I mean by this is really best represented by practical examples.
You could alert specific team members based off of the call stack itself.
You could route messages to the owner or to support based off of context.
Additionally, it means that you don't have to alert or notify on an existing crash group that's already being worked on as well.
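A workflow like that, route by call stack and stay quiet on groups already being worked on, could be sketched as follows. The routing table, channel names, and function are all hypothetical, not any product's actual API:

```python
# Hypothetical routing table: substring of the faulting frame -> owning channel.
ROUTES = {"render": "#gfx-team", "net_": "#netcode-team"}
acknowledged = set()  # crash groups someone already owns

def route_alert(fingerprint, top_frame):
    """Pick a channel from the faulting frame; suppress known groups."""
    if fingerprint in acknowledged:
        return None  # no duplicate pings for a group already being worked on
    acknowledged.add(fingerprint)
    for needle, channel in ROUTES.items():
        if needle in top_frame:
            return channel
    return "#crash-triage"  # fallback owner for unrecognized stacks

print(route_alert("f1", "render_submit_frame"))  # -> #gfx-team
print(route_alert("f1", "render_submit_frame"))  # -> None (already known)
```

The suppression step is what turns a firehose of crash emails into a handful of actionable notifications.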
Fred mentioned, in the case of Apex Legends building their own crash handler, having folks look at a text file and going through emails manually. If you're still comparing graphs of crash rates every week, then you could be doing something wrong, when you could instead be alerting immediately and building a better workflow there. Which I thought would lead to an interesting discussion just generally about how Apex Legends does their workflow from build to release.
So we have a build team that's part of our engineering team, and they build the servers and they build the clients for us; we're building a couple hundred times a day with continuous integration and continuous deployment.
And so when we make a new patch that goes out to the public, all we have to do is prop the new symbols to backtrace.io, and when that launches, everything is resolved and deduped, and then we're pulling that data regularly: we go into the dashboard, we pull things.
I know there are things like JIRA integration; we haven't needed or used those yet. They're nice-to-haves for us at the moment, things we're looking at for the future, where we can pull all the top bugs in and filter them.
But for the moment we're manually going in there.
And I'd also say, in development we have some tooling that pre-existed backtrace.io, so we have our own kind of crash reporting that playtesters and our dev team use, and that already tags a lot of metadata.
So we've used that and continue to use that in development, and where we've been finding the power of Backtrace is in the retail product at the moment.
But as we've gotten more familiar with Backtrace and started to see some of the power, we're now starting to look at how we adopt it more in our pipeline.
And as an example, we're looking to put it into the crash handler on our dedicated servers, to get the deduplication there, because that's still a manual effort at the moment, and as you say, if a server crashes, that's 60 players potentially affected at once.
And so we're adopting that in the next few weeks to get better crash reporting and better response times on those servers, to make the game more stable again.
[ABEL] Yeah, it's good to hear, also because everyone has a different story about how they're adopting Backtrace into different parts of their workflow.
And for you all, providing value on the retail side was the immediate problem, but we can see that you can then begin to think about how to apply this to dev, QA, and the like, and you get to pick and choose when you do that.
So it may not be the immediate problem that you need to solve, but you begin to see the value on the problem you're trying to solve, and then you can let that grow across your workflow as necessary.
And it's also really interesting to hear that it's not just the client side, but it's also the server side as well.
Games are incredibly complex, and it's not even those two simple components anymore.
There could be multiple backend components that are actually responsible for operating your game, everything from the download and distribution to the actual gameplay itself.
As games get more complicated, it also speaks to how powerful it is to be able to roll out this type of next-gen solution based on your own timelines.
So one thing I wanted to quickly call out here: this is actually not something Apex Legends is doing today; as Fred mentioned, they're not linking to JIRA or necessarily sending intelligent alerts, but this is something they could easily do.
And the reason this is called out is that even someone as big as Respawn and Apex Legends is continuing to improve the way they use crash reporting, and with something like Backtrace, a purpose-built solution, this is available to them whenever they want to roll it out.
And in fact, this could be something incredibly important to them when they do a massive patch release for Apex Legends: being able to do intelligent alerts and the like.
But this is where a good, purpose-built, next-gen crash analysis solution lets you reap the benefits, with our team working on it while you ship game features in the background.
So that concludes our conversation.
Fred, I really appreciate you jumping on the call and sharing your perspective.
It's been super helpful: as someone who doesn't have as much experience developing games as you do, it's good to know that crash analysis is super important, and that the kinds of things that we put into our next-gen system have driven a lot of value and solved pain points you've been experiencing across multiple titles as well.
So thanks, Fred, for joining, we really appreciate it.
No problem at all, I'm happy to be part of something that promotes such a useful tool.
[ABEL] Great, so just two call-outs before we end the call here: one, you can learn more about Backtrace at the URL shown here.
We are a Unity Verified Solutions Partner.
Additionally, we offer a free trial, so you can go to backtrace.io/create-unity/ and actually start using it today, which we encourage.
Thanks to you all for joining us.
We really appreciate your time, stay safe out there.
♪ [MUSIC] ♪