Monday, January 2, 2023

Southwest Airlines and computers

Southwest Airlines garnered a lot of attention last week. A large winter storm caused delays on a large number of flights, a problem with which all of the airlines had to cope. But Southwest had a more difficult time of it, and people are now jumping to conclusions about Southwest and its IT systems.

Before I comment on the conclusions to which people are jumping, let me explain what I know about the problem.

The problem in Southwest's IT systems, from what I can tell, has little to do with the age of their programs or the programming languages that they chose. Instead, the problem is caused by a mix of automated and manual processes.

Southwest, like all airlines, must manage its aircraft and crews. For a large airline, this is a daunting task. Airplanes fly across the country, starting at one point and ending at a second point. Many times (especially for Southwest) the planes stop at intermediate points. Not only do airplanes make these transits, but crews do as well. The pilots and cabin attendants go along for the ride, so to speak.

Southwest, or any airline, cannot simply assign planes and crews at random. They must take into account various constraints. Flight crews, for example, can work for only so many hours before they must rest. Aircraft must be serviced at regular intervals. The distribution of planes (and crews) must be balanced -- an airline cannot end its business day with all of its aircraft and crews on the west coast, for example. The day must end with planes and crews positioned to start the next day.
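To make the idea of constraints concrete, here is a minimal sketch in Python of a checker for a day's plan. Everything in it is an assumption of mine for illustration: the Leg record, the check_plan function, the 10-hour duty limit, and the airport codes. It is not how Southwest (or any airline) actually does this, and real rules are far more detailed.

```python
from dataclasses import dataclass


@dataclass
class Leg:
    """One flight leg in a day's plan (hypothetical record layout)."""
    crew: str
    plane: str
    origin: str
    destination: str
    hours: float


# Hypothetical duty limit for illustration; not a real regulatory figure.
MAX_DUTY_HOURS = 10.0


def check_plan(legs, required_end):
    """Return a list of constraint violations for an ordered list of legs.

    required_end maps each plane to the airport where it must finish the day.
    """
    problems = []
    duty = {}       # total hours worked per crew
    position = {}   # last known airport per plane

    for leg in legs:
        duty[leg.crew] = duty.get(leg.crew, 0.0) + leg.hours
        # A plane can only depart from the airport where it last landed.
        if leg.plane in position and position[leg.plane] != leg.origin:
            problems.append(
                f"{leg.plane} is at {position[leg.plane]}, not {leg.origin}"
            )
        position[leg.plane] = leg.destination

    for crew, hours in duty.items():
        if hours > MAX_DUTY_HOURS:
            problems.append(f"{crew} exceeds the duty limit ({hours} hours)")

    for plane, airport in required_end.items():
        if position.get(plane) != airport:
            problems.append(
                f"{plane} ends the day at {position.get(plane)}, not {airport}"
            )

    return problems


plan = [
    Leg("crew-1", "plane-1", "BWI", "MDW", 2.0),
    Leg("crew-1", "plane-1", "MDW", "DEN", 2.5),
]
print(check_plan(plan, {"plane-1": "DEN"}))  # no violations
print(check_plan(plan, {"plane-1": "BWI"}))  # plane ends in the wrong city
```

A checker like this only validates a plan that someone (or something) else produced. Building a good plan in the first place is the much harder problem.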

For a very small airline (say one with two planes) this scheduling can be done by hand. For an airline with hundreds of planes, thousands of employees, and thousands of flights each day, the task is complex. It is no surprise that airlines use computers to plan the assignment of planes and crews. Computers can track all of the movements and ensure that constraints are respected by the plan.

But the task does not end with the creation of a set of flight assignments. During each day, random events can happen that delay a flight. Delays can be caused by headwinds, inclement weather, or sick passengers. (I guess crew members, being people, can get sick, too.)

Delays in one flight may mean delays in subsequent flights. Airlines may swap crews or planes from one planned flight to another, or they may simply wait for the late equipment. Whatever the reason, and whatever the change, the flight assignments have to be recalculated. (Much like a GPS system in your car recalculates the route when you miss an exit or a turn, except on a much larger scale.)
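The knock-on effect can be sketched in a few lines of Python. This is a toy model under my own assumptions: each leg is a tuple of (flight, plane, scheduled departure, duration, initial delay) in minutes, the flight and plane names are made up, and a plane needs a fixed 30-minute turnaround. It ignores crews, swaps, and cancellations entirely.

```python
def propagate_delays(legs, turnaround=30):
    """Return the departure delay (in minutes) for each flight.

    legs must be sorted by scheduled departure time. A flight leaves at its
    scheduled time plus its own delay, or when its plane is ready, whichever
    is later.
    """
    plane_ready = {}  # earliest minute each plane can depart again
    delays = {}

    for flight, plane, sched_dep, duration, own_delay in legs:
        departure = max(sched_dep + own_delay, plane_ready.get(plane, 0))
        delays[flight] = departure - sched_dep
        plane_ready[plane] = departure + duration + turnaround

    return delays


# Hypothetical schedule: times are minutes after midnight.
schedule = [
    ("F1", "plane-1", 480, 120, 90),  # held 90 minutes by weather
    ("F3", "plane-2", 500, 60, 0),    # different plane, unaffected
    ("F2", "plane-1", 660, 60, 0),    # same plane as F1
]
print(propagate_delays(schedule))
```

In this example F2 has no problem of its own, yet it leaves an hour late because its plane is still finishing F1. Multiply that by thousands of flights and the need for constant recalculation becomes clear.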

Southwest's system has two main components: an automated system and a manual process. The automated system handles the scheduling of aircraft and crews. The manual process handles the delays, and provides information to the automated system.

During the large winter storm, a large number of flights were delayed. So many flights were delayed that the manual process for updating information was overwhelmed -- people could not track and input the information fast enough to keep the automated system up to date.
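A rough way to picture "overwhelmed" is as a backlog that grows whenever updates arrive faster than people can enter them. The Python sketch below is purely illustrative: the function name and all of the numbers are made up, and I have no data on Southwest's actual volumes or staffing.

```python
def backlog_over_time(arrivals, capacity):
    """Return the backlog of unprocessed updates at the end of each hour.

    arrivals is a list of updates arriving per hour; capacity is how many
    updates the staff can enter per hour.
    """
    backlog = 0
    history = []
    for arriving in arrivals:
        # Work through as much as capacity allows; the rest carries over.
        backlog = max(0, backlog + arriving - capacity)
        history.append(backlog)
    return history


# Hypothetical day: a quiet morning, then the storm hits.
print(backlog_over_time([20, 30, 120, 150, 40], capacity=50))
# prints [0, 0, 70, 170, 160]
```

Note that the backlog barely shrinks in the last hour even though arrivals have dropped below capacity. Once a manual process falls behind, it takes a long time to catch up, and in the meantime the automated system is working from stale information.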

A second problem happened on the automated side. So many people visited the web site (to check the status of flights) that it, too, could not handle all of the requests.

This is what I think happened. (At least, this makes sense to me.)

A number of people have jumped to the conclusion that Southwest's IT systems were antiquated and outdated, and that led to the breakdown. Some people have jumped further and concluded that Southwest's management actively prevented maintenance and enhancements of their IT systems to increase profits and dividend payouts.

I'm not willing to blame Southwest's management, at least not without evidence. (And I have seen none.)

I will share these thoughts:

1. Southwest's IT systems -- even if they are outdated -- worked for years (decades?) prior to this failure.

2. All systems fail, given the right conditions.

One can argue that Southwest's system, a combination of automated and manual processes, could be redesigned to have more work handled by the automated side. It would require some way to track flights and record crews and planes arriving at a destination. Such changes are not trivial, and should be made with care.

One can argue that Southwest's IT systems use old programming techniques (and maybe even old programming languages), and Southwest should modernize their code. I find this argument unpersuasive, as newer programming languages, and code written in those languages, are not necessarily better (or more reliable) than the old code.

One can argue that Southwest's IT system could not scale up to handle the additional demand, and that Southwest should use cloud technologies to better meet variable demand. That is also a weak argument; moving to cloud technologies will not automatically make a system scalable.

Clearly this event was an embarrassment for Southwest, as well as a loss of some customer goodwill. (Not to mention the expense of refunds.) Given that a large winter storm could happen again (if not this year then possibly next year), Southwest may want to make adjustments to its scheduling systems and processes. But I would caution them against a large-scale re-write of their entire system. Such large projects tend to fail. Instead, I would recommend small, incremental improvements to their databases, their web sites, and their scheduling systems.

Whatever course Southwest chooses, I hope that it is executed with care, and with respect for the risks involved.
