> he system can't recover from an error in an individual flight plan, bringing the whole system down with it
From the system's POV maybe this is the right way to resolve the problem. Could masking the failure by obscuring this flight's waypoint problem have resulted in a potentially conflicting flight not being tracked among other flights? If so, maybe it's truly urgent enough to bring down the system and force the humans to resolve the discrepancy.
The systems outside of the scope of this one failed to preserve a uniqueness guarantee that was depended on by this system. Was that dependency correctly identified as one that was the job of System X and not System Y?
> obscuring this flight's waypoint problem have resulted in a potentially conflicting flight not being tracked among other flights?
Flights are tracked by radar and by transponder. The appropriate thing to do is just flag the flight with a discontinuity error but otherwise operate normally. This happens with other statuses like "radio failure" or "emergency aircraft."
It's not something you'd see on a commercial flight, but a private IFR flight (one with a flight plan), you can actually cancel your IFR plan mid flight and revert to VFR (visual flight rules) instead.
Some flights take off without an IFR clearance as a VFR flight, but once airborne, they call up ATC and request an IFR clearance already en route.
The system is vouchsafing where it does not need to.
The appropriate thing to do was to reject the flight plan (remember, the flight plan is processed before the flight starts, and anyways there were hours over U.S. in which the flight could have been diverted if manual resolution was not possible), not to let the flight continue with the apparent discontinuity, nor to shutdown the whole system.
Yes I agree. The reason the system crashed from what I understand wasn't because of the duplicate code, it was because it had the plane time travelling, which suggests very serious corruption.
> From the system's POV maybe this is the right way to resolve the problem. Could masking the failure by obscuring this flight's waypoint problem have resulted in a potentially conflicting flight not being tracked among other flights? If so, maybe it's truly urgent enough to bring down the system and force the humans to resolve the discrepancy.
Flagging the error is absolutely the right way to go. It should have rejected the flight plan, however. There could be issues if the flight was allowed to proceed and you now have an aircraft you didn't expect showing up.
It seems fundamentally unreasonable for the flight processing system to entirely shut itself down just because it detected that one flight plan had corrupt data. Some degree of robustness should be expected from this system IMO.
It does not seem reasonable when you put it like that, but when could it be said with confidence that it only affected just one flight plan? I get the impression that it is only in hindsight that this could be seen to be so. On the face of it, this was just an ordinary transatlantic flight like thousands of others, with no reason to think there was anything unusual about it to make it more vulnerable than the rest - and really, there was not, it just had an unlucky combination of parameters.
In general, the point where a problem first becomes apparent is not a guideline to its scope.
Air traffic control is inherently a coordination problem dependent on common data, rules and procedures, which would seem to limit the degree to which subsystems can be siloed. Multiple implementations would not have helped in this case, either.
Shutting down a flight control system might have other knock-on effects on flight safety. Even if it merely only grounded flights not yet in the air, the resulting confusion might lead to manual mistakes and/or subsequent air lane congestion that might cause collisions.
I think you’re on the right track, I assume it’s safety.
If one bad flight plan came in, what are the chances other unnoticed errors may be getting through?
Given the huge danger involved with being wrong shutting down with a “stuff doesn’t add up, no confidence in safe operation” error may be the best approach.
I've worked on a (medical, not aviation) system where we tried as much as possible to recover from subsystem failures or at least gracefully reduce functionality until it was safe to shut everything down.
However, there were certain classes of failure where the safest course of action was to shut the entire system down immediately. This was generally the case where continuing to run could have made matters worse, putting patient safety at risk. I suspect that the designers of this system ran into the same problem.
There is no need to shut down the whole system just because of one flight plan that the system was able to reject. Canceling (or forcing manual updates to) one flight plan is a lot better than canceling 1,500 flights.
From the system's POV maybe this is the right way to resolve the problem. Could masking the failure by obscuring this flight's waypoint problem have resulted in a potentially conflicting flight not being tracked among other flights? If so, maybe it's truly urgent enough to bring down the system and force the humans to resolve the discrepancy.
The systems outside of the scope of this one failed to preserve a uniqueness guarantee that was depended on by this system. Was that dependency correctly identified as one that was the job of System X and not System Y?