discussion 2020-10-23 | Devops Enterprise Summit Slack Archive

inactive18:10:12

@allspaw I was talking with @mik about your awesome talk (now posted here! 🎉 http://videolibrary.doesvirtual.com/?video=467489131) He was relating to me a story that after a critical incident, @mik decided that he would no longer be on the Slack incident channels, because of the dynamics when he, as a senior leader, (CEO, Tasktop) was in the channel. Specifically, people refrained from sharing certain data because of how it might make other people look, and he wanted to eliminate this dynamic. Did you have similar experiences as VP Ops at Flickr, or CTO at Etsy, where you specifically removed senior people from certain incident forums? And why? (I am editing Part #2 of my interview with @david627 and @jessica.reif, and they mention how a similar dynamic is institutionalized — David mentions the famous picture of the Situation Room during the Osama bin Laden raid, which was probably taken right after the first helicopter crashed… here, the leaders are watching, but safely buffered from the operational units. He used the term “eyes on, hands off” to describe this dynamic.)

🎉 1

inactive18:10:54

Link: https://en.wikipedia.org/wiki/Situation_Room_(photograph) (IdealCast episode I’m referring to will be posted next week. Gotta finish up editing first!)

Beth Beese (Tasktop)18:10:28

@genek101 Reminds me of some themes from Jon Smart’s talk on leadership as well…

💯 3

inactive18:10:24

Oops. Meant to cc @erica.morrison and @scott.prugh as well!

John Allspaw19:10:09

Thanks, Gene. Did I have similar experiences? Yes.

John Allspaw19:10:54

https://twitter.com/allspaw/status/1177230162196410368?s=20

John Allspaw19:10:24

(including my own abstaining from joining the incident channel (slack/IRC) at Etsy when I was CTO) I frankly took too long to stop joining those chat channels.

Frotz Faatuai (Cisco IT - he/him)20:10:13

Cisco IT Enterprise Operations Center (our global NOC) separates Leadership (Management) from Technical to great effect. They run both incidents on different cycle times with periodic information reflection between the two spaces. This allows the Management track to stay informed and to use back-channel assistance to the technical teams if appropriate / necessary while allowing the Technical tracks to make forward progress. The trick that some of us have learned is that for high-profile Major Incidents, if the Technical track(s) know that the Management track needs a clear update, they will sometimes formulate that message in a given Technical track and it will likely be copied verbatim to the Management track. The typical separation point is Senior Managers, Directors and above are in the Management track. Line Managers might be involved in the Technical Track if invited by their technical engineers. Very infrequently, the Management track requests a direct Technical update in their space / their voice bridge. Usually those invited to do so are comfortable with skip-level communications, but are usually on their best behavior for obvious reasons (skip-level). Looking forward to that session.

Ferrix Hovi - Principal Engineering Avocado - SOK (S Group)20:10:23

Separating the management and technical tracks too well (honestly, some have two tracks with zero overlap or communication) is probably not helping. Pardon my stupidity, but why would you have two tracks? Isn't it a single incident? I can see how management abstaining can keep it blameless, open, whatnot, but why do they have their own track? How are they helping?

John Allspaw20:10:04

We’ve come across lots of orgs where hands-on engineers (responding to the incident) will spin up parallel side-channel chat/bridge channels because tech leadership is present in the “official” channel. This then adds a 2nd workload for them, in addition to working through the incident.

🎯 4

😂 1

😭 1

Ferrix Hovi - Principal Engineering Avocado - SOK (S Group)20:10:49

I sympathise in not having management in the official channel. So, what is the management channel for?

John Allspaw20:10:24

the Blackrock3 folks call those (largely management) channels “Unified Command” (https://learning.oreilly.com/library/view/Incident+Management+for+Operations/9781491917619/ch05.html#idm139643632612656) - but @scott.prugh and @erica.morrison would be able to give their impression of it

👍 2

Frotz Faatuai (Cisco IT - he/him)20:10:28

@ferrix - When your CEO and your CIO are in the Olympics Opening Ceremony and it is Quarter End and something big goes wrong, you’re just going to have separate tracks. I think my description is more common in large organizations. Having the CIO know your name is a good thing. Having the CIO (4 levels above you) in a Major Incident asking for updates is intimidating.

Ferrix Hovi - Principal Engineering Avocado - SOK (S Group)20:10:02

I am of the opinion that incidents are situations where you need to put kids and management in front of a screen with some educational programme on to keep them out of the way. So, I get why you'd have the bread and circus. I still find it a trust issue.

Frotz Faatuai (Cisco IT - he/him)20:10:13

@ferrix This may have been a function of our ITIL practice several years ago. I run the monitoring platform as an internal service. The NOC is my primary customer as well as all other internal teams. In theory I stay up to date on what each of those teams want from me and especially the formalized procedures at play in the NOC. I think (guessing here) that a primary reason why they spin off the Management bridge is because someone needs to be held accountable (at those top levels). I don’t understand the politics there, but I’m told it is very political. I would guess that the NOC spins those off so that the Technical Teams have a safe-buffer from those politics. In general I’m not seeing / hearing of anything really political in the organization any longer, but that would be my guess.

Frotz Faatuai (Cisco IT - he/him)20:10:25

In general I think the current NOC / Major Incident and subsequent Problem Management space is very safe and it is obvious to all of us that Management strives hard to make that a safe playing field.

Ferrix Hovi - Principal Engineering Avocado - SOK (S Group)20:10:48

Oh, okay. Power games sound important when the future of the entire company is at stake. So far, I have told only three CEOs to fuck off during an incident, so I guess I am not easily intimidated. I'll take a note that a play pen is a viable option if that becomes a problem for the less courageous people around me.

Scott Prugh (DOES Prog Committee)21:10:13

As @allspaw mentions the Blackrock3 folks have some great training that helps mitigate a lot of these issues. The IMS/IC(Incident Command) model puts specific roles, cadence and rules of engagement in place. You have to truly want to adopt those and enforce them as part of your culture or it just won't work. As a leadership team we put a lot of value on this and trained all of our management and our executive leadership in the IC protocols and the roles.

👍 1

Scott Prugh (DOES Prog Committee)21:10:13

There are a few patterns that really help mitigate the effect of poor behavior: 1) Well defined incident roles, 2) Incident command training and certification, 3) Well defined behavior protocols, 4) Periodic CAN reporting, 5) Shared incident state document

Scott Prugh (DOES Prog Committee)21:10:24

We tried to bring together a bunch of patterns we have seen that help improve incident response and learn from incidents. It's far from perfect but hopefully a good start.

Scott Prugh (DOES Prog Committee)21:10:38

https://myresources.itrevolution.com/id006657105

Ferrix Hovi - Principal Engineering Avocado - SOK (S Group)21:10:07

Thanks @scott.prugh. I am still reading between the lines and not seeing a lot of what upper management would need to do other than staying out of the way and letting their recruits have at it. My mind is working in binary and I can see only that one is either contributing or not.

Scott Prugh (DOES Prog Committee)21:10:20

One of the root patterns is: Make incidents visible and part of daily work. Often incidents are treated as an exception with no preparation and only viewed as a failure. By normalizing them as part of what we deal with every day, we have to prepare and learn as a team.

Ferrix Hovi - Principal Engineering Avocado - SOK (S Group)21:10:06

So, rounding up the corner office to watch the show does not sound like normalizing.

Scott Prugh (DOES Prog Committee)21:10:08

@ferrix Staying out of the way is one thing. If it was that simple we could be done. But note that those upper management are answering to other irate stakeholders and the board, etc. They deserve timely information and updates. As the BR3 folks say: "Feed the information dragon". This is the role of the LNO in partnership with the IC. And if needed a roll-up to Unified Command.

Scott Prugh (DOES Prog Committee)21:10:55

Note the normalizing part was less about the executive interference problem and more about how we shift the thinking to incidents being part of what we prepare for every day.

Scott Prugh (DOES Prog Committee)21:10:56

This video hopefully explains it: https://www.youtube.com/watch?v=CmfbyIWpMpw

Ferrix Hovi - Principal Engineering Avocado - SOK (S Group)21:10:04

I see. In the environments where I have been part of it, there have been two lines of status updates: the internal and the external, with the latter going out directly from the comms folk who are part of the incident task force. The internal communication would have more details. So, the upper management would still be in the loop like the rest of the organization and hence be able to answer any angry calls or organize any management meetings necessary. Granted, those have been rather small companies, but having an official top management call by default sounds pretty useless to me.

Scott Prugh (DOES Prog Committee)22:10:44

You don't need a top level management call by default. If the updates are on cadence and accurate then management is informed as are external parties. If the issue is really bad and very long you might need that top level call.

Frotz Faatuai (Cisco IT - he/him)22:10:25

(Sorry, working…) Mostly we don’t include top level management (VPs below the CIO) unless they need to inform the next level up. I suspect our use case is a function of our depth from top to bottom. Also, I could be inappropriately outsizing the invocation of the Management bridges because if things get that big, I tend to get dragged into the Major Incidents. I like the well defined roles callout from @scott.prugh Keep in mind that Cisco IT is some 9K people (30%/70% red/blue) vs our official 77K (blue-only) count. The scenario of screaming customers will tend to be other VPs who are trying to stay in front of their actual screaming customers. I think our last big one was last year.

inactive16:10:04

Sorry I missed it, @ffaatuai — what is the definition of “red” vs. “blue?” Thanks!!

Frotz Faatuai (Cisco IT - he/him)20:10:23

Red == Contractors (color on their badge). Blue == Employees (color on their badge). Barely any reference to the Halo Red vs Blue 😉
