Software engineering teams face many struggles. Some are small problems that surface during feature development: “Oh shit, there’s no way for a user to change their last name…” or “How do I update a client’s old web page to autoplay in-screen video with scrolling?” Others are far more dire, like having to fix a catastrophic failure in real time (see the case study below on one company’s crash and burn). Or consider a bug fix whose last push included a test for only one failure mode, and so doesn’t capture or reproduce every possible mode. Can anyone be sure it really fixed the bug?
What is a Checklist?
A checklist is a memory aid: a tool for eliminating failure. A “to-do” list jotted on the back of an envelope is an informal reminder of actions that need to take place, maybe prioritized by urgency. A more formal checklist might nest lists within lists in checkbox bullet form. An automated checklist could include fail-safes that prevent a user from progressing in a script before a required step is completed. The more that can be automated, the better, since automation reduces the room for human error.
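To make the idea of an automated checklist with a fail-safe concrete, here is a minimal sketch in Python. The `Checklist` class, the `ChecklistError` exception, and the step names are all hypothetical, invented for illustration. The sketch assumes the steps must be completed strictly in order.

```python
from __future__ import annotations

from dataclasses import dataclass, field


class ChecklistError(RuntimeError):
    """Raised when a step is unknown or attempted out of order."""


@dataclass
class Checklist:
    steps: list[str]                      # ordered list of required steps
    done: set[str] = field(default_factory=set)

    def complete(self, step: str) -> None:
        """Check off a step, refusing if any earlier step is still open."""
        if step not in self.steps:
            raise ChecklistError(f"unknown step: {step!r}")
        # Fail-safe: every earlier step must already be checked off.
        index = self.steps.index(step)
        missing = [s for s in self.steps[:index] if s not in self.done]
        if missing:
            raise ChecklistError(
                f"cannot do {step!r} before: {', '.join(missing)}"
            )
        self.done.add(step)

    def remaining(self) -> list[str]:
        """Return the steps not yet checked off, in order."""
        return [s for s in self.steps if s not in self.done]


if __name__ == "__main__":
    # Hypothetical release checklist.
    release = Checklist(["run tests", "back up database", "deploy", "smoke test"])
    release.complete("run tests")
    try:
        release.complete("deploy")  # skipped the backup, so this is blocked
    except ChecklistError as err:
        print(f"Blocked: {err}")
    print("Remaining:", release.remaining())
```

The point of the design is that the script, not the operator's memory, enforces the order. A real workflow might allow some steps to run in parallel or record who completed each one.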
A Knight’s Fall
What happens when humans are left to their own devices to do unfamiliar or infrequent tasks? In the context of DevOps, we have the tragic tale of Knight Capital Group, a 400 million dollar market-making company that was brought to the brink of bankruptcy in just “45 minutes of hell.” In 2012, Knight represented a significant portion of the market share on NYSE and NASDAQ, and its Electronic Trading Group (ETG) handled a high volume of trades, in the billions, every day. All was well until NYSE planned the launch of a new program, the Retail Liquidity Program (RLP), prompting Knight to update its algorithm for routing orders. As part of this update, one fatal mistake occurred at the level of manual deployment.
“During the deployment of the new code, […] one of Knight’s technicians did not copy the new code to one of the eight SMARS computer servers. Knight did not have a second technician review this deployment and no one at Knight realized that the Power Peg code had not been removed from the eighth server, nor the new RLP code added. Knight had no written procedures that required such a review.”
SEC Filing | Release No. 70694 | October 16, 2013
The root cause was a botched deployment, but taking a step back, you might say Knight put undue responsibility on the engineers deploying it. The ethical onus is greater on any entity whose dealings have such immediate and high-visibility leverage on the global economy. Knight’s failure to include automated processes as part of the software update, or for that matter any documentation (EMERGENCY C/L, anyone?) on how to respond to a potential failure, belies any claim that it took the weight of its role seriously.
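As an illustration of the kind of automated check that could catch a server left behind, here is a rough sketch in Python. It is not a reconstruction of Knight's systems: the server names, the function names, and the idea of comparing SHA-256 checksums of the deployed build are all assumptions made for the example. In practice, the checksums would have to be collected from each host.

```python
from __future__ import annotations

import hashlib


def sha256_of(artifact: bytes) -> str:
    """Return the hex SHA-256 digest of a build artifact."""
    return hashlib.sha256(artifact).hexdigest()


def find_stale_servers(expected: str, deployed: dict[str, str]) -> list[str]:
    """Return servers whose deployed checksum differs from the expected one."""
    return sorted(name for name, digest in deployed.items() if digest != expected)


def verify_deployment(expected: str, deployed: dict[str, str]) -> None:
    """Raise if any server is not running the expected build."""
    stale = find_stale_servers(expected, deployed)
    if stale:
        raise RuntimeError(
            f"deployment incomplete, stale servers: {', '.join(stale)}"
        )


if __name__ == "__main__":
    new_build = sha256_of(b"new routing code")
    old_build = sha256_of(b"old routing code")
    # Hypothetical fleet of eight servers; the eighth was skipped.
    fleet = {f"server-{i}": new_build for i in range(1, 8)}
    fleet["server-8"] = old_build
    try:
        verify_deployment(new_build, fleet)
    except RuntimeError as err:
        print(f"Halt release: {err}")
```

A check like this works as an automated second reviewer: the release cannot be declared finished until every server reports the same build.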
What can checklists do and what can’t they do?
In The Checklist Manifesto: How to Get Things Right, accomplished surgeon and New York Times bestselling author Atul Gawande conveys the importance of the simplest tool in a surgeon’s tool belt: the checklist. Checklists are used in fields ranging from disaster recovery and the military to medicine and business. In high-stress or time-sensitive tasks, like emergency procedures, the stress of the situation can impair critical thinking and decision making, and that is where checklists are absolutely vital. What about other, lower-stakes situations? A checklist can’t replace judgment, so use one wherever it seems appropriate, and don’t use one where it isn’t.
Do you plan on implementing checklists at your workplace to prevent potential failures? What pitfalls have you experienced with the use of checklists, or the lack of them? Leave a comment.