This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2020-06-23
So good to see your face on the screen, my friend @fernando.cornago!
IT --> Technology is a good rename. How did that come about?
… or should I have said @fernando.cornago442? There's two of you!
That was actually a proposal of our CIO last year. Sometimes you need an outside-in perspective. Or just rename to create a reboot.
managing IT to enable biz with tech < reminds me of the CAMS blog post from @patrick.debois256 way back in 2012 http://www.jedi.be/blog/2012/05/12/codifying-devops-area-practices/
Great start there @fernando.cornago442
"You don't want to be the bottleneck" < words to live by!
We have a very lightweight tech request process if you want/need to deviate. But actually we make it very appealing to use the Platform.
So short answer no, but longer answer is: better come with a good reason if you want to deviate.
Thanks, that makes sense. At Fannie Mae we started with one single CI/CD platform but we are starting to offer 2 solutions as we move to the cloud.
Yep, that is usually also happening over here that we have 2-3 choices globally you can pick from. We don't believe in one-size-fits-all.
but after 2 years we made it so easy to use some of the offerings and invested so much into onboarding and training that it doesn't make a lot of sense
we even had issues implementing serverless at scale, as we had made it very easy to jump on K8s
@fernando.cornago442 and @daniel.eichten in terms of platform evolution, do you have any structures to help the platform teams build the platforms which their customers actually need, rather than the platform which the platform team thinks they need (or wants their customers to need)?
I bumped into the Adidas DevOps Maturity Framework a few days ago: https://github.com/adidas/adidas-devops-maturity-framework. Are you actively using it, and what kind of experiences have you had with it?
we have an agile maturity framework, that at the end links to the DevOps one for the tech teams
can you share more about how you linked both? we currently have 2 separate teams (1 for Agile, 1 for DevOps) which are developing their own.
Thanks! I found the framework very valuable myself. It's always easier to start your own work when someone else has done the groundwork
Hi @Fernando! In the Dojo slide it showed you're migrating from Quay to Harbor - I'd be really interested in the reasons for that?
Simple reason: global reach. http://quay.io wasn't available in mainland China.
yep, Harbor we install as a sidecar of all our stack (K8s, Kafka, Kong, Jenkins aaS...) in private and public
Is there a similar reason you chose Harbor over ECR?
Did you consider running your own instances of Quay? (I'm interested because they are submitting to CNCF)
@liz actually not really. We looked into Harbor and http://port.us at that point in time. http://Quay.io was only available on-prem as an enterprise subscription, and that was provided by our K8s partner Giant Swarm.
But I can connect you to the team who did it at that time so they can share their current point of view on the registries.
We are also using DOJOs to help the teams learn through immersive learning, working on real product backlogs with help from Product and Tech coaches who have "hands on the keyboard" vs "hands on the whiteboard" skills.
(I'm loving the fact that @fernando.cornago442 is answering all these questions in near real time - thank you Fernando!!!)
i'm stressed out for Fernando and Daniel
Hey @genek101, I was bummed I missed your session @ Fannie Mae due to vacation.
have you shared your RCA process anywhere @fernando.cornago442 @daniel.eichten?
That would be great! I find it really interesting to compare the process from organization to organization.
if you ask me, our vision is that it stays central for the 4 or 5 most critical value streams of the company
and at the same time set up the standards for release/incident and problem management of the company
So SRE teams are more of enablers and help other teams with standards/advisory.
that's the vision, for the time being they still hold a lot of ops responsibility
now they keep track of the teams taking over ops for themselves and the ones not doing it
I thought so. I had a similar setup at a few clients and we introduced "swarming" and embedding SRE as part of our team's platform dev guidelines. It worked well at some places where SRE became everyone's business. At some places, it is still maturing. Thanks for the details.
@fernando.cornago442 Great to see QA Strategy is a key to succeed. What is the main point to focus on in a renewed QA Strategy?
E2E Quality assurance, with shift left and shift right? And I loved to see that you focus on Exploratory testing next to automation
the rest... the teams work together as a single team for the time of engagement
outcome is value for the receiving team and updates on practices for the platform team
I can put you in contact with our Lean Delivery leader that is the most experienced one with that
Hi @fernando.cornago looking forward to know more about how to put in place DOJO practice and ways to organize it.
@fernando.cornago442 What's the general make up of your teams? Are they multi-discipline? What's the (average) ratio of qa to dev?
that work for our team as well as help the rest of the QA leaders of the company with QA platforms and practices
though I guess that means you might end up with questions all week!
@fernando.cornago how do you cope with this level of transparency within your organisation?
it depends on the culture and the current level of bureaucracy in the organization, how far you can go with increasing transparency
@fernando.cornago442 do you have transparency on cloud cost consumption by team and service?
I'm reminded of that "great artists steal" quote
I will ask my "how do you guys measure Psychological Safety?" question later then @fernando.cornago442 and @daniel.eichten, when you guys have caught your breath
@me1342 not really directly. We measure our employer NPS which can be an indicator.
True, it can if you're forensic about engagement, but they can love you and not necessarily be as magical as they can be. So happiness is good, but it's not all of PS by far - we also measure flexibility, courage, speaking up, negative behaviours etc. Anyhow, happy to chat, don't wanna monopolise you. @nuwayser we make one and have seen some in-house ones in banks too
Have you seen https://engineering.atspotify.com/2014/09/16/squad-health-check-model/ ? We've had luck with this framework, and you could easily extend it to add a question about Psychological Safety.
What was in that sandwich?! noodles, pickles, flat meat? (apologies for the off-topic question - it can wait)
https://pixabay.com/de/photos/di%C3%A4t-kalorienz%C3%A4hler-gewicht-verlust-695723/ source of the picture.
it makes sense if you think of a Git repo with all your definitions in, say, Terraform or CloudFormation, and the PR merge will apply the change
the trick, IMHO, is to have a repo just for the security/account stuff. well, I love Terraform and have some ideas of how to implement this
Bitbucket is integrated into Azure AD for SSO. And then a linked account request is basically one file, YAML/JSON depending on your preference. If you wanna change something, you only change the file. The PR goes in and the guys sitting on top of AWS review it. The rest is then automated, filling Terraform templates or CDK scripts.
Well, it was a journey. Started with TF only, went to CloudFormation, back to a mix of both. Now all Terraform and some CDK.
To add on top - in the data mesh implementation we do at adidas, permissions to get access to the data also go through Git: a PR from a subscriber serves as the access request tool, approved by the data owner. Self-service from that perspective. The rest is automation.
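The "one request file per account, reviewed via PR, then rendered into Terraform" flow described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the field names (`account_name`, `owner_email`, `cost_center`), the module path `./modules/account`, and the template itself are hypothetical and not taken from the adidas setup.

```python
import json
from string import Template

# Hypothetical request schema: these field names are illustrative only.
REQUIRED_FIELDS = ("account_name", "owner_email", "cost_center")

# Hypothetical Terraform snippet; "./modules/account" is a made-up module path.
TF_TEMPLATE = Template(
    'module "$account_name" {\n'
    '  source      = "./modules/account"\n'
    '  owner_email = "$owner_email"\n'
    '  cost_center = "$cost_center"\n'
    '}\n'
)


def render_account_request(request_json: str) -> str:
    """Validate one request file and render it into a Terraform module block."""
    request = json.loads(request_json)
    missing = [field for field in REQUIRED_FIELDS if field not in request]
    if missing:
        # A CI check on the PR could fail here, before any human review.
        raise ValueError("request is missing fields: " + ", ".join(missing))
    return TF_TEMPLATE.substitute({field: request[field] for field in REQUIRED_FIELDS})


if __name__ == "__main__":
    example = (
        '{"account_name": "team-sandbox", '
        '"owner_email": "owner@example.com", '
        '"cost_center": "CC-1234"}'
    )
    print(render_account_request(example))
```

In such a setup the request file is the only thing a team edits; review happens on the PR, and the rendering step runs after merge.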
thx @dmitry.luchnik and @daniel.eichten, I would love to do something similar but I see lots of org issues in our company. Looking forward to insights on the org/social issues in the next days
@fernando.cornago it'd be great to understand what specific technologies/tools you are using to measure the business value of sw engineering/sw delivery?
well... in our Global Metrics Portal there is an open field where each team can put a value
in our case.... we automate the metrics of adoption technically (number of jobs in jenkins, number of apis, etc.) measured by the areas using them
and sorry, Global Metrics Portal is our tool collecting metrics from JIRA, GIT, Jenkins, Sonar, ServiceNow....
@daniel.eichten Can you share more about the watchdog strategy you mentioned on the Azure and GC usage? Is that something you have as code and might even be able to share?
Right now we make use of Dome9 with custom rulesets and extract percentages on how compliant accounts are to the rulesets. That is being read out via API and put into the metrics portal.
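The "percentage of compliance per account" idea can be sketched as below. This does not use or imitate the Dome9 API; it assumes the rule results have already been fetched and flattened into hypothetical `(account_id, rule_id, passed)` tuples.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical input shape: one tuple per evaluated rule on an account.
Finding = Tuple[str, str, bool]  # (account_id, rule_id, passed)


def compliance_percentages(findings: Iterable[Finding]) -> Dict[str, float]:
    """Return the share of passed rules per account, in percent."""
    totals: Dict[str, int] = defaultdict(int)
    passed_counts: Dict[str, int] = defaultdict(int)
    for account_id, _rule_id, passed in findings:
        totals[account_id] += 1
        if passed:
            passed_counts[account_id] += 1
    return {
        account_id: round(100.0 * passed_counts[account_id] / totals[account_id], 1)
        for account_id in totals
    }


if __name__ == "__main__":
    sample = [
        ("account-a", "rule-1", True),
        ("account-a", "rule-2", False),
        ("account-b", "rule-1", True),
    ]
    print(compliance_percentages(sample))
```

The resulting dictionary is the kind of value that could then be pushed into a metrics portal per account.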
What led to the decision that every 6 sprints is the right interval for the cleanup work to improve velocity?
In the past we usually parked 15% of the teams' capacity for technical debt and cleanup activities. So I'd say it's something that developed over time. Although it was hard to convince some biz colleagues in the beginning, most of them see the benefit by now.
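As a quick sanity check on how "every 6 sprints" relates to the earlier 15%, here is the arithmetic as a small sketch. Both readings of "every 6 sprints" below are assumptions, since the thread does not say which one applies: one cleanup sprint in every block of 6 is about 16.7% of capacity, and one cleanup sprint after 6 regular sprints is about 14.3%. Either way it lands close to the former 15%.

```python
def cleanup_share(cleanup_sprints: int, total_sprints: int) -> float:
    """Return the percentage of sprint capacity spent on cleanup."""
    if total_sprints <= 0:
        raise ValueError("total_sprints must be positive")
    return 100.0 * cleanup_sprints / total_sprints


if __name__ == "__main__":
    # Reading 1 (assumption): one cleanup sprint within every block of 6 sprints.
    print(round(cleanup_share(1, 6), 1))
    # Reading 2 (assumption): one cleanup sprint after 6 regular sprints, so 1 in 7.
    print(round(cleanup_share(1, 7), 1))
```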
Does your platform provide the agile teams with low-level software engineering metrics and trends, e.g. commit frequency, build frequency, failure rates, test execution, etc.?
Yes, correct! We have all the low level metrics and are increasing now piece-by-piece as well as aggregates.
@fernando.cornago442 You've mentioned something about improving developer productivity. What metrics are you using for that?
@fernando.cornago can you explain what the goal of the Licence to Automate program is? Who is managing the program? (Is it the marketing team?)
Program was initiated by the automation team in Platform Engineering. So rather technical. But they calculate together with their biz partners the RoI.
@fernando.cornago442 Thank you for the presentation! I'm curious, how do you gather the metrics for your performance? I'm trying to do something similar for my company as well
So is Vegas in person for sure?
Better be! Getting airport-lounge withdrawals at last
In the future all talk titles will be emoji based
@james839 @victoria_mayo are here to answer your questions!!
@fernando.cornago @daniel.eichten - do you have formal Product Managers for your platform products or do your Engineering Managers lead that function?
No real formal product managers. Engineering Leads do this in role-union. As they are internally usually speaking engineer to engineer that works pretty well.
thanks…. it's a continued debate at Target. I find that if the leaders are customer focused and treat the engineering teams they are enabling as customers with choices, then it works.
Hi @fernando.cornago! I also work for a German company (http://XING.com) and we are also doing our DevOps transformation right now, and a lot of what you said totally resonated. I have a very specific Cloud question for @daniel.eichten: We know that in Germany there is a very present PR-related cloud opposition. How was the cloud movement? Was it driven bottom-up? What success stories can I pitch as a simple team-lead and Cloud enthusiast to help my VP pitch cloud adoption to our C-level?
@eduardo.escarti Cool! Send some greetings to Jens Pape please. On your question: yes, I can echo this, but it usually also differs depending on which companies you talk to. Big global enterprises like us or BMW are usually quite open to adopting public cloud. Super small startups as well. Where I typically see opposition is in the small to medium-sized (sometimes family-run) companies.
And don't get me wrong, there are also good reasons not to go to the cloud. Although I'd say data privacy shouldn't be on the list. You just have to do it right. This is even reflected in the fact that the German BaFin now allows banks and insurers to make use of public cloud.
Sure, will send Jens your greetings. Very nice guy. Yeah, you're probably very right… XING being 100% DACH focused now maybe falls into some "cultural" pitfalls. Thanks… BaFin seems like a good argument to pitch
I am also in Germany. In the regulated Healthcare domain working for Siemens Healthineers. At the Digital Health business unit, lots of our products run on the Microsoft Azure Cloud.
Runs on Azure: https://www.siemens-healthineers.com/digital-health-solutions/digital-solutions-overview/service-line-managment-solutions/teamplay
Runs on Azure: https://www.siemens-healthineers.com/digital-health-solutions/digital-technologies/teamplay-cloud-platform
Runs on Azure: https://www.youtube.com/results?search_query=siemens+healthineers+teamplay
@eduardo.escarti I can only mirror your challenges, we face the same problem at TUI, especially in Germany. We proved it in the UK (where we had senior buy-in) and then we got some of the Germans to evangelise it in the teams.
Really interesting @christian.rudolph. Seems to be "a thing" in Germany. Do you see results already? I think teams could be easily motivated to go cloud, but we sometimes faced slow-downs due to trying to use on-prem practices with cloud
Yes, we see results and now we have a movement for things to move over. However, if we operate it like on-prem we fail a lot of the time, based on cost comparison. If we then optimize together with our Enabler teams we see better results. So training and review of tools/practices are very important. However, I can only encourage taking a small cost bump through lift'n'shift instead of trying to do it correctly right away.
Can you provide a few more details about the org change for inverse Conway @fernando.cornago442
@james839 Can you give some info about how you set up the support process at IptiQ? I understand that you had a greenfield, right?
we were able to leverage the group for workplace, productivity tools and corporate services
How do you deal with support requirements (e. g. 24/7 you build it, you run it) and autonomous product teams?
How do you position the teams in regard to your support organisation (if you have any)?
@james839 "trying to change mindsets and getting nowhere fast"?!? Get outta here! Unheard of
Also @daniel.eichten do you transparently communicate the costs of their infra to the teams using it? Is that a KPI they look at regularly?
This year there is a mix, but from next year the platform team is actually aimed to have a 0€ budget. So that means every product team is being recharged for the services they use.
And yes: we had some surprise moments in the past when half a year's cloud budget was burned in one month 'cause someone spun up the biggest managed Oracle DB you can find in the catalog.
We also got funny things like "Hmmm, why is this Sandbox account no-one is using burning 1k a month?" Literally burning money
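The "platform team has a 0€ budget, product teams get recharged" model and the "unused sandbox burning money" check could be sketched like this. It is a simplified illustration, not the actual adidas mechanism: the proportional-to-usage split, the usage units, and all team and account names are assumptions.

```python
from typing import Dict, List, Set


def recharge(platform_cost: float, usage_by_team: Dict[str, float]) -> Dict[str, float]:
    """Split the platform cost across teams in proportion to their usage."""
    total_usage = sum(usage_by_team.values())
    if total_usage <= 0:
        raise ValueError("no usage recorded, nothing to recharge against")
    return {
        team: round(platform_cost * usage / total_usage, 2)
        for team, usage in usage_by_team.items()
    }


def idle_but_costly(
    monthly_cost_by_account: Dict[str, float],
    active_accounts: Set[str],
    threshold: float = 0.0,
) -> List[str]:
    """List accounts that cost more than the threshold but show no activity."""
    return sorted(
        account
        for account, cost in monthly_cost_by_account.items()
        if cost > threshold and account not in active_accounts
    )


if __name__ == "__main__":
    # Hypothetical numbers: 1000 of platform cost, usage in arbitrary units.
    print(recharge(1000.0, {"checkout": 3, "search": 1}))
    print(idle_but_costly({"sandbox-1": 1000.0, "prod": 5000.0}, {"prod"}))
```

With exact proportional splits the recharged amounts add up to the platform cost, which is what leaves the platform team at zero; with rounding, a small remainder may need to be assigned somewhere.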
For me, being at a really data-center-heavy company and having the "it's cheaper" argument always waved for PRO-On-PREM, I find this cost focus really interesting to demystify
"going native" sounds much like "joining the rebellion" :)
I like how you guys are 10 mins into it and still on how you created the team
Build vs Buy is something I'm looking to achieve, would be interested how you convinced the executive team. Mine is very rooted in Buy
Build was important from the outset but that work was done by our iptiQ Life and Health colleagues who came about 6 years before us. Swiss Re wanted to create that access and create it greenfield, but the buy option can always remain on the table as part of inorganic growth.
this makes sense, thanks. I'm looking to identify which products really differentiate us from our competitors and look to build there. Buy for standard services such as the ones you mention (HR, Finance, etc.)
maybe have a look at Wardley mapping, it gives some nice ideas about technical evolution and can help focus the mind on where to build vs buy
This is so great, @victoria_mayo and @james839! @victoria_mayo - I meant to ask: what does a law degree let you do? Were you ever a practicing lawyer, and if not, how close were you?
I was indoctrinated into Swiss Re from the outset and never had my heart set on being a lawyer, so not a qualified solicitor in the UK. Maybe one day for a new challenge
@john.booth have you tried Wardley mapping? https://medium.com/wardleymaps
A great intro at last year's SEACON https://youtu.be/L3wgzl2iUR4
https://devopsenterprise.slack.com/archives/CB0JGND0C/p1592902578105800
Ah, this is the clip that eternally cuts in the middle of my binaural sounds on YouTube during a Pomodoro :the_horns:
"let's not destroy a 100+ year old organization before we even launch"
"to be innovative, you are always going to be one step ahead of regulators" Absolutely
Agile methodology - came with too much agile evangelism. Always.
"we're agile - we don't need milestones. but we WANTED milestones." (Sorry, I know I'm supposed to ask questions, but it's so GOOD!)
Sacrilege! "Too much Agile evangelism!"
"Agile religion to the letter, not the spirit" @victoria_mayo
This is giving me an aha revelation that I don't hear many conversations from my compliance team outside of the "how". What are some ways or types of questions I can use to understand the "why"? Do you have suggestions on how to do this proactively instead of only when there is something urgent?
I also think a lot is down to engagement with compliance and legal. If people only speak to them when it's urgent, they will learn to only give urgent and often conservative advice. Bring them into the business meetings, let them get the context. People can be nervous to do so, but they're not there to be the police! They're there to support. We only get paid if the business does well after all
Loved the 'applying the letter vs the spirit'. Too many going through the motions without the values and principles.
@genek101 @jonathansmart1 is a brave man, many know where he lives
…I loved that confession about "yeah, I was pushing agile forms too much" from @james839 - soo good!
i have to remind my teams all of the time about the difference between "being Agile" vs "doing Agile"… usually includes a "i dont give a f* about how long your sprints are or if you had the grooming meeting or not"… all in good spirits
I appreciate @victoria_mayo pointing out the challenge that a diverse team brings. Diversity should be a strength, however if the people don't know how to discuss their differences then the team can end up with low psychological safety.
If they refrain from even trying it's worse
Completely agree. Diversity should be strived for, but be prepared to spend time to manage it from the outset
"numerous warnings; DON'T TOUCH ANYTHING IN PRODUCTION!"
we should introduce testing in production in that case :rolling_on_the_floor_laughing:
PS: I have a new appreciation for how terrifying these launch events are - it was one of my favorite scenes in Unicorn Project. But the first 30m of this online conference was… a bit rocky. THANK YOU for bearing with us!! Thanks to @sam and @patrick.debois256 and the ITREV team for getting us through this!
Don't breathe on those apps or servers. How many times has legacy technology that has been disregarded for years become an anchor to your ability to deliver fast and safe?
Conversely: sometimes those reliable, well-established systems and ways of working are the anchor you need to deliver fast and safe changes on top
(we did a lot of COVID-related change here at Nationwide on the back of COBOL-based mainframe systems and existing digital journeys -- well known, reliable and safe -- with great well-embedded teams)
@victoria_mayo said the key to gaining traction was the diverse executive team finally gelling together... could you elaborate on that please
It's hard when it's a diverse team, especially in terms of "thought diversity" and even more so when you're starting greenfield. Everyone has a different view and even vision. Once we cemented the short and long term goals and we learnt to communicate effectively, the machine really clicked into place.
Getting a shared vision...was that a matter of offsites etc? Leadership selling it in or something more collaborative?
Both. We have this concept of going from a village to a city. Communicating insurance business goal context to a wide, mostly tech-centric company has been a learning process.
We had A LOT of offsites. I think they're a great tool, but always interested in other ways as sometimes you can get offsite fatigue!
Official unicorn? Who grants that one?
I bet there is a Chinese vendor for great-looking badges and trophies.
And their potentially western competition
@victoria_mayo: can you send me a copy of your slides via snail mail?
"Can you please send me my policy statement by mail?" (ah, the non-digital natives!)
no actors were used in the making of this advertisement !!:rolling_on_the_floor_laughing:
Depends on your location and the rules there, some people have chosen to come into the Zurich office occasionally, but otherwise yes, everyone has been doing 100% remote!
Really great job @victoria_mayo and @james839 both in building an InsureTech challenger but also in building with the WoTNotWoW Agile bit at DNA level -and in keeping this about the EQ of it-
@victoria_mayo @james839 how far is iptiQ in terms of direct management from Swiss Re management? is it more like a startup within a large company or a department within a large company? how supportive was the Swiss Re management in terms of innovations? I saw cases where the top management was initially all in for innovation but only later realized that it might "break" the existing culture and was reluctant to implement innovation. thanks for a great talk!
It's very close! So our iptiQ CEO's boss reports directly to the Group Swiss Re. And our Group CEO is very passionate about iptiQ. I can say our group CEO is very keen to disrupt the reinsurance culture, including Swiss Re's own. We can be a bit elephant-like - reliable but slow. iptiQ allows for genuine agility, so as long as our underwriting remains sharp, top management is very supportive
That "friendly VC" model can be super-powerful if the team is really given freedom to operate e.g. using different tools from the parent company
@liz right, feels like the best way to implement changes (especially in a regulated niche)
@james839 @victoria_mayo thanks again, might be a great case for other companies
the security and budget from a big corp but with autonomy on tech, ways of working, hiring etc.
@victoria_mayo It was really interesting to hear from a legal and compliance perspective. Thank you for the talk.
As someone who's had to deal with pushback from developers on whether or not to fix bugs, I love it when I'm working on a compliance product - it's comforting when you've got compliance on your side to talk to devs as to why it needs fixing!
Bug fixing has become a daily part of my job that I never expected coming from Reinsurance!!
Amazing presentation, @victoria_mayo and @james839!!! PS: here's a talk from a legal/compliance executive at Nike! Let me know if I can introduce you to Anne Bradley! https://www.youtube.com/watch?v=13C95oShKgQ&t=4s > DOES18 Las Vegas - Leveraging the power of a matrixed organization to solve problems and build solutions bigger than the individual. > > Build a Bigger Team - Nike > > Anne Bradley, Chief Privacy Officer and Global Counsel for Nike Direct > Courtney Kissler, Vice President, Nike Digital Platform Engineering
And here's some fantastic auditors from Big Four from panel session last year - they were from audit/assurance practice, not the consulting side. They're so great, too! https://www.youtube.com/watch?v=iiQY9qiDQCE&list=PLvk9Yh_MWYuwXC0iU5EAB1ryI62YpPHR9&index=15&t=174s > Matt Bonser, Director, Digital Risk Solutions, PricewaterhouseCoopers LLP > Yosef Levine, Managing Director, Global Technology Controls, Confidentiality & Privacy, Deloitte > Jeff Roberts, Senior Manager, Advisory Services, Ernst&Young > Michael Wolf, Managing Director Modern Delivery Lead, KPMG
It is actually impossible not to smile when @steve773 starts speaking
@genek101 you talked a lot about community, where does the community persist outside of the summits? Is there a permanent home for exchanging ideas?
This Slack stays open all year round but it tends to be pretty quiet outside of conferences. Maybe we could use it more?
I was gonna say - found resources on here from a couple of years ago - it's not going anywhere
For sure - the Slack instance stays open. I'm open to any ideas you have on how to enable people to engage with each other, help each other, etc., outside the conferences. cc @jeff.gallimore, who can help operationalize it.
Agreed... pretty dead between conferences... love to find a way to keep the flow going...
Barry at SEACON conference keeps conversations going by having occasional sessions outside of the actual conference itself. Maybe some DOES curated virtual sessions (Zoom? Slack? Something on this platform?) more frequently than once / twice a year?
Yep DataScience festival does something similar with meet-ups to share real world learnings in between major events
Stealing this one for the next virtual keynote
not your entire presentation! The joke :face_with_hand_over_mouth:
PS: I told @steve773 after his recording how mind expanding it was - I learned so much about how to use the medium of the "small screen." Like Bill Nye the Science Guy! You can see some adjustments I made to my own recordings after watching him!
Ship on the left looks like old school "Man of War." More and more guns bristling from the side of the ship.
I still can't believe the tech changes given that there's only 3 years of difference in those ships!!!
Finally some light swearing #FFStechConf we are not
If the U.S. Government can change how they design their ships in just 3 years to better serve their mission, then we all should be able to change our organizations to better serve the organization's mission with DevOps. It just takes people with the idea and the drive to keep pushing for a better outcome for everyone.
They just need a lead from the Brits, 5 years earlier..
Related, on a more pessimistic note: https://timharford.com/2019/12/cautionary-tales-ep-6-how-britain-invented-then-ignored-blitzkrieg/
That's the whole point of the conference, right? Learn from others' successes and failures and then put into practice what works for you.
Actually reading about it, it was a parallel development... John Ericsson a Swede introduced it in the Monitor...
I've got a feeling this keynote is likely to come up in the #bof-leadership channel later...
Excellent keynote! We are struggling with centralisation vs de-centralisation debate. Would be good to understand how you break down problems against strategy
@anand.patil Most of my knowledge is from movies. But seriously, I've got a riff on how "The Martian" demonstrates a "recursive pre-mortem" to show how to go from a huge problem to individual pieces (and then piece them back again).
You can share lessons learned also through Blameless Postmortems. We do host a monthly session on the last Friday of the month called Blameless Fridays where teams come share learnings from recent failures in Production. Stole this idea from the Google SRE practices.
As long as you don't push it too far and start toying with Failure awards and rewards like I saw a team do
I have a very positive take on the notion of "failure awards" (if I understand your meaning of it)... so much that as CTO we brought a great deal of productive attention to it: https://www.infoq.com/articles/crafting-resilient-culture/
"the information will be parsed and go to the right places" < sounds like Westrum's generative culture
Technology choice decisions - Should it be an EA-led process or should it be a decision made in the trenches??
A bit of both - see my talk later at 11:55 on EA :)
(Also happy to chat more in detail - my 1/2 hour is a bit of a blistering pace run through lots of modern EA)
Think movies vs history have you tried https://www.amazon.co.uk/Most-Dangerous-Enemy-History-Britain/dp/1854108018
Make failures an enterprise learning opportunity not a shaming one. Creates a safe culture where problems are not shoved under the rug.
EA should set Enterprise guard rails but leave a decent level of local autonomy.
Yep "freedom in a framework" was a great eye opener for me. thinking about how you can do "improv" within the guardrails
@jose_mingorance @richard431 So, The Martian was a deconstructive pre-mortem. "Astronaut Mark" won't get home. Why not? No food, no shelter, no communication, no transportation. Why no food? Didn't pack enough for more than a short period and don't have fresh food. Why not? No soil, no nutrients, no water, etc. Why not? ... OH! We can fertilize soil. We can burn fuel and make water...
More good lessons from the military: https://www.goodreads.com/book/show/22529127-team-of-teams and https://www.goodreads.com/book/show/16158601-turn-the-ship-around
@akis.sklavounakis David Silverman, co-author of Team of Teams, will be talking tomorrow. @steve773 and I have had great fun talking with him.
I want an "aha" emoji... was in need of it several times already
@stijn.claes Thanks. Several pots of coffee coming into the recording. Keeps the energy high and keeps the duration short
heh. I just read a blog post on this mistake, deciding in advance, on Friday, this time using Saruman and LotR: https://acoup.blog/2020/06/19/collections-the-battle-of-helms-deep-part-viii-the-mind-of-saruman/
@steve773 didn't the projection in the Pacific start at the end of the 19th century with the Philippines?
@stijn.claes - agree! nice talk by @steve773. The talk reminds me of a couple of CCRP papers - "Network Centric Warfare", "Power to the Edge", and "The Agile Organization".
Props!!! :muscle::skin-tone-2:
@ciaran.byrne Thanks. We're going through "decided" with Covid, right? People trying to "plan" for September as if there is one possibility as opposed to "prepare" for the possible Septembers we might encounter.
life will throw you situations you don't have the answer on.... love the quote @steve773!!
AFAIK the firing of someone with a novel idea happened also in US i think around 2002 - MC02 with Lt Gen Van Riper being fired and the war games being restarted
That reminds me of a quote: "Everyone has a plan until they get punched in the face" @jose_mingorance
@steve773 There is a lot of good takeaways about how to help lift issues up to leadership. Do you have suggestions on how to build the knowledgebase horizontally and learn from other teams?
@steve773 - COVID could be a new reference, alternative to famous war battles? Learning from distributed experimentation (enabled by technology) vs. a battle plan?
So many experts and vested interests vying for control... and top-down structures having a field day...
It is! I genuinely had no idea that was from Mike! The things you learn at conferences...
@dacahill7 Thanks Daniel. Have a separate riff on building "knowledge sharing" mechanisms.
Thanks @steve773!!!! So great!! (I understand you'll be available for further Q&A in a Zoom or something? Can you share when/where to go?)
At first wasn't certain, completely had me at the end... well done
A video on a better way to learn from your wargames (UK WW2 vs the pre-war Japanese experience described by @steve773): https://www.youtube.com/watch?v=fVet82IUAqQ
@steve773 Hi Steve! We'll hold your Q&A during networking - I'll work with you on getting that set up
Tim Harford on how organisational architectures can be impediments to taking on new technologies and ways of working https://timharford.com/2019/12/cautionary-tales-ep-6-how-britain-invented-then-ignored-blitzkrieg/
How do you solve SOX and SOD concerns for the App Teams to be able to change the code and deploy it to production?
You can have peer reviews and let the pipeline release to production. SOX tenets do not mandate that reviewers belong to different orgs or even teams.
Being in financial industry the same person that has access to the code cannot deploy
in CI/CD, NO ONE deploys. Deployment is automated based on pushes to master
Having quality gates such as Definition of Ready and Definition of Done also helps with SOX compliance in my experience
We do have CI/CD and deployments are automated but we are required to have a DevOps engineer be the one that can push that button. It cannot be a software developer.
We are thinking of implementing a pull request like process where the change will be reviewed and approved before the code gets deployed.
An approval step immediately before deployment might make sense. But the criteria should be "did all the automated tests pass? was there a peer review?" It should not require more than about 15 minutes to "approve" a deployment.
@jonny disclaimer: this presentation is from our customer conference and Goldmans are investors in our company, but thought it might be interesting: https://www.youtube.com/watch?v=Q4QbL-BToLg
@tom.sheeran we are very close to that. We have all testing, code quality and security scans automated and running on a continuous basis. What we are missing is the quick approval in our pipeline and the biggest step: convincing Internal Audit, Risk and Controls that it will meet the spirit of SOX and SOD.
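A minimal sketch of the kind of quick approval step discussed in this thread, assuming the pipeline can hand the gate a record of the change. The `Change` fields and the `approval_problems` function are hypothetical names for illustration, not part of any CI/CD product, and whether such a check meets SOX or SOD expectations is something only your own audit and risk teams can decide. It encodes the criteria raised above: automated tests and scans passed, and at least one reviewer who is not the author.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """Hypothetical record of a change waiting to be deployed."""
    author: str
    reviewers: tuple  # people who approved the pull request
    tests_passed: bool
    scans_passed: bool


def approval_problems(change: Change) -> list:
    """Return the reasons to block a deployment; an empty list means approve."""
    problems = []
    if not change.tests_passed:
        problems.append("automated tests did not pass")
    if not change.scans_passed:
        problems.append("quality or security scans did not pass")
    # Segregation of duties: at least one approver must not be the author.
    independent = [r for r in change.reviewers if r != change.author]
    if not independent:
        problems.append("no peer review by someone other than the author")
    return problems


# Assumed example data: one peer-reviewed change, one self-approved change.
reviewed = Change("dev_a", ("dev_b",), True, True)
self_approved = Change("dev_a", ("dev_a",), True, True)

print(approval_problems(reviewed))
print(approval_problems(self_approved))
```

Because the gate returns reasons rather than a bare yes/no, the same output can be attached to the deployment record as evidence for auditors.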
This confession is giving me high anxiety and I am not even sure what that certificate meant!
@dominica.degrandis - this was a great story. TLS certificates are one of the biggest sources of unavailability. The story is very inspiring.
@rohrersm Thanks Simon. I've got a piece about the interwar period, 1917 Trenches to Blitzkrieg. Here's a preview. Looking forward to comparing with what you mentioned.
Really interesting, thanks @steve773. Tim Harford's audio goes beyond Clayton Christensen and into Architectural Innovation https://timharford.com/2019/12/cautionary-tales-ep-6-how-britain-invented-then-ignored-blitzkrieg/ JFC Fuller's tale is just fascinating on its own though.
Have you read Stephen Bungay's book, the Art of Action? He talks a lot about Von Moltke etc
@rohrersm Gene mentioned something about "structure and dynamics." I've got a piece around "the right way to organize." Attached, below.
Thanks so much! This looks Conway's Law adjacent too. I will take time to take a read.
"DevOps didn't come from Agile, but is a reaction to Agile" TIL
Thank you @cdavis for Confession! Thank you @damon and John Willis for hosting the Lightning talks!
yes, thank you Damon and John. I had so much fun preparing my lightning talk! :lightning:
Really cool and funny
"Will I live with purpose?" < :thumbsup:
When my weight shoots up after holiday meals: "that's just my error budget"
"I can help people - I am a people!" - genius stuff from @damon
DOJO seems to be an overloaded term in the industry. I would be interested in learning the different implementations. For us it is an Immersive learning program acquiring new skills in Product, Tech, Agile and Lean while working on real products and real backlogs. Agile squads/teams learn together. No siloed role training.
So for the dojo team I'm a part of with @bryan.finster, we do our embeds by joining the team in their space. We do VSM with the team and start by breaking down the work they currently have in flight, and then work our way backwards to the backlog. During this time we ensure they have a template pipeline for each of the tech stacks within that team. We ensure they are able to gather all the metrics they need to measure improvement, while also discussing the roadblocks the team faces and addressing those issues. We become a part of that team for 6 weeks, introduce the best practices we have learned, and adapt them to work within these teams.
BTW, I am part of the DOJO consortium with Walmart, US Bank, Verizon, Target, etc
Yes, We provide everything and anything the team needs.
We're very focused on delivery outcome goals, so it's less open ended than some other dojos.
We have Dojos anything from a half-an-hour discussion to introduce some core concepts or tools to a handful of people joining a team or team of teams for weeks or months to do whatever together to accelerate them. So, for me, even locally, it is a very diverse thing but always very specifically "investing time working together to improve something".
@daniel.eichten A bit late hopefully you are still around - We are struggling with the lock in conversations at the moment and obsessing with the myth of portability. Do you have any advice for convincing the exec about loving the lock in? or rather picking the lock ins you love?
I've got one. Every technical decision is a sunk cost. Every economic decision should be based on marginal cost and marginal return. So decide where you're going to put X today based on the ROI.
@eadwin sorry for the late reply. Had to join a couple of other meetings here unfortunately. I can strongly recommend reading and using the material provided by Gregor Hohpe on martinfowler.com: https://martinfowler.com/articles/oss-lockin.html. Please specifically check the matrix. And typically we pick based on the accepted lock-in topic. So if we see that a high level service is giving us a real benefit we go all in. E.g. we did this earlier this year launching a product completely based on AWS serverless technologies. If there are services on the other hand that we don't like or have limitations we can't live with, we pick something vanilla and try to decouple from underlying IaaS as much as possible. E.g. we didn't like EKS nor AKS but do vanilla k8s managed by a partner. Or we didn't want to go with Kinesis and didn't really like MSK so we rolled our own Kafka on AWS. But if we had to move it would not be a huge deal.
@nick.jenkins yes we do this as well. But usually for us we also include Time-to-Market as a big item on top.
<!here> @steve773 will be doing a follow-up Q&A to his morning keynote at 330pm BST (in 90 min). Join here: https://us02web.zoom.us/j/8908483265
This could be you in here! https://us02web.zoom.us/j/8908483265 (live now)
Reference for Capacity for Maneuver (CfM): https://www.researchgate.net/publication/312624891_Patient_boarding_in_the_emergency_department_as_a_symptom_of_complexity-induced_risks
"...you could feel it walking through the hallways, the absolutely crushed morale." "see, DevOps doesn't work!"
it's oddly similar to the stages of grief
The mental states of the Responsibility Process: Denial, Lay Blame, Justify, Shame, Obligation --- Responsibility ...
Okay, that's upside down and there is a side exit of Quit there on the side... These are easy to detect in others, harder in self and boy does it make sense to be aware.
@erica.morrison - how was the outage connected to "DevOps does not work"? Was it a timing question?
The sheer fact that we had an outage like this. There was a belief that we should have prevented this and that we should have been able to fix much faster if DevOps did work.
People create narratives. I am guessing that anyone in the org that didn't like DevOps used it as a stick to beat people with.
@erica.morrison It just occurred to me that the horrors of this outage were so great that it now defines a day on the calendar. Wow.
Yes, we actually noted the anniversary this year. It's a day we'll never forget.
That BlackRock3 training is truly excellent... We had SVPs and EVPs take it...
It can be anyone who shows the ability to lead. We tend to leave our SMEs out of this group because they are so often involved in the actual troubleshooting. The IC role is a dedicated role and if done right, the only thing that person is doing is running the call. For us, that has largely meant managers are the group free to run a call vs lower level technical team members who are assisting with troubleshooting. It absolutely does not have to be that way and in fact, the training insists that your normal title should not come into play for ICs. At the end of the day, you need someone with the confidence to run a call, make tough decisions (by asking the right questions), assert authority, and be technical enough to at least understand context.
IMS/IC is a pretty foundational shift. Even after having researched it, I didn't fully understand how to apply it until I took the training and practiced it. Then, it all clicked. And we needed support from others on the call to ensure we were following protocols, behaviors, and roles
The shift to broad training has been fantastic in our org. We kicked off our shift a couple of years ago with Blackrock3, and have since expanded to require all engineers in onboarding to take an Incident Responder course. Starting out with training puts everyone on the same page and gives them some confidence.
@alexa IMS/IC has certain behavior protocols and roles that are important for many stakeholders to understand. We needed many leaders across groups to understand those.
I've heard a lot of through line themes today related to Leader Training, Stakeholder understanding/education, Business appreciation.
Given me a test idea, random vLAN killing and low layer LLDP/spanning-tree protocol or other weird packet network storms/floods seem like particularly evil tools to add to a network chaos testing kit
random firewall accept rule disabling for a host
oh this is ๐: password expiring for the most senior member of staff
Indeed, when hypervisors and other cloud platforms offer APIs to configure (and misconfigure!) things like virtual firewalls, switches, LBs and other network infrastructure elements as code, things like dropping a whole VLAN or certain low layer protocol/traffic become more likely/realistic. Fun for dev/test but scary from an ops perspective!
oh... and false positive health check!
I think you can chaos engineer anything during office hours that at some point woke you up in the middle of the night and made you haul donkey to the DC
okay, and most health checks are false positive... HTTP 200 is not the same as "has had connection with the DB in the last fifteen minutes"
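One way to make a health check say more than "the web server answered", sketched in Python. The class name `HealthCheck`, its methods, and the choice of 503 for the unhealthy case are illustrative assumptions, not taken from any framework; the fifteen-minute window comes from the message above. The idea is that the endpoint reports healthy only if the service has actually completed a database call recently.

```python
import time

FRESHNESS_SECONDS = 15 * 60  # "the last fifteen minutes" from the message above


class HealthCheck:
    """Hypothetical deep health check; not tied to any web framework."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so the check can be tested
        self._last_db_success = None  # no successful DB call seen yet

    def record_db_success(self):
        """Call this after every successful database round trip."""
        self._last_db_success = self._clock()

    def status_code(self):
        """Return the HTTP status the health endpoint should answer with."""
        if self._last_db_success is None:
            return 503
        age = self._clock() - self._last_db_success
        return 200 if age <= FRESHNESS_SECONDS else 503
```

A service that is up but has lost its database would start answering 503 once the window expires, instead of a misleading 200. The trade-off is that a genuinely idle service also goes stale, so in practice the window (or a periodic probe query) needs tuning to the traffic pattern.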
Had a personal false positive at home today, my IDS caught a malformed packet that apparently looked like an emerging threat and decided to stop trusting and auto-block the whole slack.com domain. Doh!
Based on the phishing I get for my Slack credentials, I would happily announce that this Slack thing must be something evil.
@erica.morrison You mentioned that many people felt angry and wounded because of the incident. Who were the people who got past that and understood that your organization must learn from the incident? So champions of this learning experience?
A select group of leaders from our organization saw the need and championed this. It was not initially welcomed by all, but I believe almost everyone now sees the value in retrospect
I think if you asked most people, they would say it looks pretty much the same. Our mean time to assemble is probably faster now that most people are home most of the time anyway
One thing we did find is that you can run urgent events with incident command. We delayed a deployment due to covid across 65 people. That call was expertly coordinated by @steve.robert.barr using Incident Command.
In our experience there hasn't been an impact in our Incident Response at Slack, if anything when folks are on call for our Major IC rotation, they are likely to be closer to the keyboard. No hard decisions about where to try and grab a quick lunch when it is your kitchen.
How many outages do you get? How do you decide that an incident is worthy of opening a bridge and going into "full" Incident Response mode?
I don't have the count handy, but we go into full incident response mode on issues we deem "Major Incidents." Typically, that is some sort of major functionality in a product or products is not working.
great point - trying to get updates all the time during an outage is highly disruptive to someone that is trying to fix the issue
I am normally one of the incident responders for one of our main products, no matter what team I'm on. I've tried to implement some of these strategies and worked with the product teams to have an incident response team with defined roles, really similar to some of these steps. We even acted through some past incidents to understand roles better. However, after a few months, incidents returned back to the normal chaos and explaining to managers instead of problem solving when I wasn't there to try to help the team use the flow again. Did you see regressions back to the old way? If so, how did you get people to buy back in to your incident framework?
We have not had a lot of regression. This is one of the reasons I highly recommend engaging with a group like Blackrock 3 and hitting a broad swath of people. We then had a formal program around this with the pilot and the checkpoints, with official reporting to executive leadership on a regular cadence about progress. There is enough visibility and buy-in at this point that it continues on its own momentum.
@dacahill7 Sounds familiar. When being pulled back, a change in approach that helped a ton was: "Help them only in ways that makes them own the product"
I think it aligns with my anti-disservice promise when coaching a team: 1. I will not do anything alone 2. I will not finalize anything 3. You'll own whatever I did
@nickeggleston Umm... I'll take that as an incentive to write a blog post on it. The whole thing is something I have cooked up when thinking of how "they can own their own path". It appears that the amount the teacher is speaking is inversely proportional to the amount the learners are completing their mental model. So, given that the three points can be turned around into: 1. If I am not working with them, how are they learning? 2. If I am taking the joy of completing something, what are they enjoying? 3. I am less important than the results.
If you remember, I would love if you would comment on this thread with that blog post
Did you find that the company then wanted to treat everything as an incident since it was so well managed?
We have had some requests to use IC on lower priority incidents, which I mostly take as a compliment to the process but not something that we can fully staff for every lower level ticket. We do try to use a mini version of this for any bridge we have, with some people simply wearing multiple hats. We haven't hit a point where the odds and ends requests are unsustainable
Our incident managers have been asked to manage a number of non-incident stuff because incident management "works so well"
I worked for a company (won't name names) where they built a great incident management department, and then everything became an incident because it worked so great to just manage everything that way forever.
"we measure the level of pain that we cause our customers." "after 2/4, we ate up the entire budget of allotted impact to customer."
"you will look at this incident as a blessing, even though it won't feel like it at the time." (holy cow.)
Thanks @erica.morrison - such an incredible story, and wonderful teachings!!!
Brilliant talk, love the idea of (often hidden/invisible or less visible!) complexity and failure combining to amplify a negative feedback loop but ends up being a huge positive learning experience, we should celebrate failure and anomalies, it's how we learn
Fans of this talk will love the talk tomorrow that @scott.prugh is giving on the amazing work at CSG
Thanks for the intermittent preview of coming attractions (to follow your cinematic mind set)
love this one: Avoiding failure requires failure - some things you can't learn except by doing.
@genek101 I see what you're doing... you're filling the afternoon with incident management talks to make sure we stay awake.
Wow - what an incredible talk, and what great (hard won) advice. Thanks, @erica.morrison
@allspaw: on typical reporting on incidents: "we think we're doing astronomy, but we're actually doing astrology"
"Why do we fall? So that we can learn to pick ourselves back up." - Batman. Great Talk @erica.morrison! Thank you:)
Hah. "We think we are doing astronomy but we are doing astrology instead." @allspaw
This gap that @allspaw is talking about reminds me of James Mickens points in his USENIX security talk: https://www.usenix.org/conference/usenixsecurity18/presentation/mickens
David Silverman (co-author of Team of Teams) has a pretty interesting take on blunt end vs. sharp end, and the sometimes vast difference between the two.
Leaders on the blunt end need to do oncall and participate on the bridge too.
HIPPOs tend to introduce reality distortion fields and infringe on psychological safety
Hahahaha. @allspaw It's so interesting to have seen the episode before it airs on TV! (Biting tongue!!)
I was once called out to an incident bridge, ironically while in a hotel room in London, and we wound up with 2 VP's, 1 Senior Director and 6 Directors on the bridge. It did not go well.
An hour into the bridge no one could say what the issue was or even if there actually was an incident
Incident Command has some good protocols to deal with HIPPOs...
(I had to look that up - I meant to look it up after @allspaw's recording.)
ding! about the fears of higher leaders thinking that incidents reflect on their leadership
technology leaders don't believe it applies to them :rolling_on_the_floor_laughing:
PS: I'm loving the interaction with speakers while their talk airs. I've never experienced anything quite like this before!
it's a bit double edged though. Slacking (no pun intended) and watching at the same time is a bit difficult
True, but imo it really helps with retaining the main ideas, like a study group
I love this model for virtual conferences, I was a bit skeptical about the canned recording of the talks but being able to interact live with the attendees and the speaker during the talks is awesome. Great work team!
I try to answer Tweets when on panels on stage, but mostly to flaunt the multitasking. This was seriously good, and it goes well with the current trend, which surveys found is devs' behaviour, of listening to podcasts while doing other work @genek101
Me, too! Totally love tweeting on panels as a display of a feat of multitasking prowess. But I'm observing that watching talks and Slack pushes me to capacity. Unlike every other conference experience, I didn't have Twitter front and center.
The problem is also compounded by the fact that in many organizations, the Technology leaders on the blunt end are the ones that communicate with business leaders and hence try to abstract the problem statement for business consumption by toning down the incident complexity
Nice summary: "you don't need the chart...just ask the [insert emphatic here] questions"
@siva.ss this needs to go both ways...they should provide customer context as well as just broadcasting out
...now that I think about it, I'm a bit surprised that @allspaw hasn't dropped five PDFs into this channel by now...
@allspaw Wouldn't this be the efficiency or flow metric of the incident resolution?
Interesting point about Mean Time to Restore having questionable value in this context - curious how that tallies with the findings in Accelerate, where MTTR is one of the four key metrics? :thinking_face:
I think a low MTTR, as in < 1 h, covers those things. However, the metric is not enough to teach how to drive it down.
The way I teach others about MTTR is "you need to get people out of there to average below an hour"
Thanks @ferrix. Can you expand on “you need to get people out of there to average below an hour”? I think I understand but I’m not sure. Thanks.
So, when you get nightly incidents that involve waking up, driving and debugging, you need 10 half-hour incidents to fix your MTTR to < 1h. So, removing people from your equation and automating robustness is the way to go.
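The arithmetic behind that can be sketched roughly as below. The 5.5-hour "wake up, drive, debug" incident and the half-hour automated ones are made-up illustration values, not figures from this thread, and `short_incidents_needed` is a hypothetical helper name.

```python
import math


def short_incidents_needed(long_hours, short_hours=0.5, target_hours=1.0):
    """Smallest number of short incidents that pulls the mean of one long
    incident plus n short ones strictly below target_hours."""
    if short_hours >= target_hours:
        raise ValueError("short incidents must be shorter than the target")
    if long_hours < target_hours:
        return 0
    # Solve (long + n * short) / (n + 1) < target for n.
    bound = (long_hours - target_hours) / (target_hours - short_hours)
    return math.floor(bound) + 1


def mean_hours(durations):
    """Plain arithmetic mean of a list of durations in hours."""
    return sum(durations) / len(durations)


# Assumed example: one 5.5 h night-time incident handled by a person.
n = short_incidents_needed(5.5)
print(n, mean_hours([5.5] + [0.5] * n))
```

Under these assumptions a single long, human-handled incident takes ten quick ones to average away, which is the reason given above for taking people out of the restore path.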
Adding automation can increase challenges for people in incidents as much as it can help.
Automation, especially successful automation, hides how it does what it does. Incidents often require people to make sense of what the automation was doing or is doing, which is difficult when, ironically, the automation was built to “abstract away” exactly that.
There are studies on how influential this paper has been, and continues to be, after 30+ years: https://ckrybus.com/static/papers/Bainbridge_1983_Automatica.pdf
Very true. Black box automation bad.
What I have caught myself saying many times: "When you are automating something as simple as deployment and you are suddenly designing an algorithm, you are probably hurting the future you by making it too complex"
Automation is code, and code gets read (way) more often than it gets written. If the automation code is written with care for readability (which is still not the norm, sadly, I find), then the problem should be mitigated somewhat I guess . . https://www.goodreads.com/quotes/835238-indeed-the-ratio-of-time-spent-reading-versus-writing-is
This conversation has made me think of something new to me: the concept of designing for ease of incidents.
Nicely put. That has been a standard practice in automation for me, but the wording is new.
Yep, it sounds obvious doesn’t it, but I have to say I’ve rarely seen or been in teams who have explicitly designed for ease of incidents. It’s making me think that it would be great to have ease of incidents as one of the standard considerations for every new feature to take into account . .
So that's a cocktail of other abilities...
testability for sure, traceability, transparency, repeatability...
yep . . what really got me thinking was John A’s insightful comment that it’s the successful automation that can be particularly opaque. Means that communicating the intent and purpose of each piece of automation is super important, I’m thinking.
The quote brings to mind Joel Spolsky with leaky abstractions. So somebody has to fix the plumbing eventually.
and when the plumbing breaks, it puts the human operator in an unenviable position
I hope it is the fresh water side of the plumbing (50% of the time it isn't)
to your very first question, @john710: my understanding is that in Accelerate, the metric isn’t literally the calculated mean of a set of timings collected
After a 13h outage, getting back to the <1h mean is probably not the factual priority
in any case, even if it was referring to that (which would fall down as being useful to represent a non-normal distribution of unique data, never mind the fact that the “beginning” and “ending” of an incident are negotiable and much more flexible than many admit) it still doesn’t have any explanatory or predictive value.
It does have the feature of not reducing on its own when repeating problems are not addressed.
So it is predictive of the current best case to be expected.
it’s not, actually. the number (averaged over what time period?) doesn’t indicate anything about the incidents
if we look at a plot of this and ask to forecast past what’s in the plot, analysis there doesn’t make sense and empirically isn’t valid
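To make the point about averaging a non-normal distribution concrete, here is a small sketch. The durations are hypothetical (nine short incidents plus one 13-hour outage like the one mentioned above), not data from any real system.

```python
from statistics import mean, median

# Hypothetical incident durations in minutes: nine short ones
# and one 13-hour (780 minute) outage.
durations_min = [12, 18, 25, 30, 35, 40, 45, 50, 55, 780]

mean_min = mean(durations_min)
median_min = median(durations_min)

# How many incidents were actually shorter than the "mean time to restore"?
under_mean = sum(1 for d in durations_min if d < mean_min)

print(f"mean={mean_min} min, median={median_min} min, "
      f"{under_mean} of {len(durations_min)} incidents below the mean")
```

With these assumed numbers the mean lands well above nine of the ten incidents and far below the tenth, so it describes none of them, which is one way a single averaged value can fail to indicate anything about the incidents themselves.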
> Traditionally, reliability is measured as time between failures. However, in modern software products and services, which are rapidly changing complex systems, failure is inevitable, so the key question becomes: How quickly can service be restored? We asked respondents how long it generally takes to restore service for the primary application or service they work on when a service incident (e.g., unplanned outage, service impairment) occurs, offering the same options as for lead time.
The above is from the Accelerate book, where they introduce MTTR as one of their 4 key metrics.
Yeah, I see that side of it. However, it is indicative of "this organization is struggling to fix anything in under a couple of days" or "At any given time there are probably up to x dumpsters on fire and that seems to be acceptable"
@john710 trying to find the word average in there. If you look closely at the questions asked in the survey, they ask about <x or >y time, not asking for the distribution
Yep, agreed, here’s what they asked for in the survey:
> less than one hour
> less than one day
> between one day and one week
> between one week and one month
> between one month and six months
> more than six months
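As a rough illustration of how coarse those answer options are, the sketch below maps a restore time onto them. The cut-offs of 7 days for a week, 30 days for a month, and 180 days for six months are assumptions made for this example, and `survey_bucket` is a hypothetical helper, not anything defined in the book.

```python
from datetime import timedelta

# Upper bounds paired with the survey's answer options as quoted above.
# 30 days per month and 180 days per six months are assumptions.
BUCKETS = [
    (timedelta(hours=1), "less than one hour"),
    (timedelta(days=1), "less than one day"),
    (timedelta(weeks=1), "between one day and one week"),
    (timedelta(days=30), "between one week and one month"),
    (timedelta(days=180), "between one month and six months"),
]


def survey_bucket(restore_time):
    """Return the survey answer option a given restore time would fall into."""
    for upper, label in BUCKETS:
        if restore_time < upper:
            return label
    return "more than six months"


# A 5-minute and a 55-minute restore land in the same answer option.
print(survey_bucket(timedelta(minutes=5)))
print(survey_bucket(timedelta(minutes=55)))
print(survey_bucket(timedelta(hours=13)))
```

The ranges ask for an order of magnitude as the respondent perceives it, which is a different thing from a mean calculated over individually timed incidents.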
@ferrix this is the point I made in my talk…what makes it indicative of the “organization’s” abilities, and not the complexity of the system they’re charged with being responsible for?
My original reason for asking, btw, was that (from my non-expert position) I’ve been a fan of the 4 key Accelerate metrics and have been evangelising them to the organisations I work with - was asking as I was curious if you were highlighting some situations where MTTR wouldn’t be valuable.
the key issue is that the length of time an incident takes can be influenced by a multitude of things, especially what’s defined as an “incident”, and who gets to say when it began or ended
you’ll notice that the book doesn’t speak about literally a calculated average across data points spread in time
@allspaw Having been the only guy to be woken up in the middle of the night to fix network problems where the network wasn't very complex, it was more of a staffing according to the needs problem than a complexity thing, seeing a high MTTR does not point you to complexity or, for example, enough people.
which is what “MTTR” means and what many organizations actually look at like tea leaves
Can be argued that the organization is biting more than it can chew since it has complexities that it cannot digest in the middle of the night...
@ferrix and that reflection would be excellent data, were I doing our assessment project in ACL for your company. I wouldn’t ask just you, and I’d ask more details about that.
True. Yes was going to say that (of course) MTTR contains the word mean, which suggests a calculated average (to a layman anyway)?
ps I’m sold on what you’re saying about incident learning, namely the need to talk to people rather than relying on arbitrary data
@allspaw Yup and at the end of the day it is a matter of which types of problems you can handle
"Maybe we invest on these kinds of risks next"
who is “they” “we” “you” organization…the key idea is to understand in a grounded and concrete way what is difficult for people, what rationales they have for doing the things they do, and then begin to aggregate based on that data, not to assert an abstracted aggregate view that can’t be supported by underlying data
qualitative data and research is indeed difficult to do, but engineers (aka the people responsible) don’t tend to see their experiences reflected in these high-level statistics and therefore can be skeptical (rightly so) when decisions that affect their work are based on them
I never collected it at Etsy where I was CTO. The value is negligible compared to other forms of learning from incidents.
The situation I often find myself in as a coach is that an organisation is going down a path of truly awful “transformation” metrics, and I have the chance to influence them to do something better. My go-to recently has been the 4 key Accelerate metrics, as they are much better than what the organisation would otherwise choose.
I’d argue that what is in Accelerate is a fine basis to stand on. Just note that it’s not literally “Mean Time To X”
When used in a team for the team I've found that to be helpful for seeing progress when starting from some instability.
“change” is ambiguous, negotiable, and when a metric collapses a high diversity “population” into a single value, it wipes away the important parts to understand
change and even fail can be really flexible, and tend to get stretched if what they describe is used for decisions
Here’s the definition for CFR from the Accelerate book:
> We asked respondents what percentage of changes for the primary application or service they work on either result in degraded service or subsequently require remediation (e.g., lead to service impairment or outage, require a hotfix, a rollback, a fix-forward, or a patch).
it just doesn’t capture the genuine ambiguity and difficulty people face, and doesn’t shine light on situations where an adverse situation was productively anticipated and headed off
@john710 if I asked you what percentage of messages in this Slack in the past two days had questions in them, what would you say?
if I were to use the Slack API to collect all messages during that time period, tabulate questions versus non-questions, and then compare that to your answer of “not many” - might I find a difference?
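The tabulation half of that thought experiment could look something like the sketch below. Fetching the messages through the Slack API is deliberately left out; the function just takes a list of message texts. Counting a message as a question when it contains a "?" is a crude assumption made for illustration, and `question_share` is a hypothetical name.

```python
def question_share(messages):
    """Fraction of messages that contain at least one question mark.

    Treating "contains a '?'" as "has a question in it" is a simplifying
    assumption; rhetorical questions and unpunctuated ones are miscounted.
    """
    if not messages:
        return 0.0
    questions = sum(1 for text in messages if "?" in text)
    return questions / len(messages)


# A few messages from this channel, used as sample input.
sample = [
    "IT --> Technology is a good rename. How did that come about?",
    "Very true. Black box automation bad.",
    "So that's a cocktail of other abilities...",
    "Wouldn't this be the efficiency or flow metric of the incident resolution?",
]

print(f"{question_share(sample):.0%} of the sample messages contain a question")
```

Even in this toy form, the tabulated number depends on a definition somebody chose, and it can easily differ from a respondent's impression of "not many".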
You made the point above that the imprecise ranges asked for in the survey are fine - so conclusion is that (for MTTR and CFR) precision will be arbitrary and is not required, is that right?
the data they’re collecting is about the respondent’s perspective, which is valid data. It’s not, however, representative of what I think readers sometimes walk away thinking it represents (even if the book spends ink on interpretation of the data)
from my understanding, the conclusions Accelerate makes from the data about respondents’ perspectives on how long it takes an incident to be handled are not at all the same as what the companies we see use it for, which is trending and tabulating and decision-making
In Accelerate the metrics are proposed as measures for software delivery effectiveness.
They’re not proposed as measures for incident learning, for instance (if I recall correctly)
Right, which is very different in many ways than a VP of Eng looking at a chart of “MTTR” and concluding that the teams need to “buckle down” or “hire more people” or “move quicker to the cloud”
They are proposed as measures for devops maturity or something like that but that doesn't prevent setting bonus targets based on them.
these are significantly more valuable than wasting time and money tabulating โlengthโ of incidents
So, a high MTTR and CFR will tell you that "there is some struggle somewhere" or "it seems like the struggle is not constant".
A high MTTR and CFR can also tell you what the collector of that data wants to tell you
Gotcha. Maybe it is an individual struggle
BTW, in my current environment those get collected automatically as a side product. Standardises the way they are gamed, at least.
a CSS change that moves a logo into a barely noticeable different position…and a CSS change that hides a “Checkout” button. a 100-line code change…and a 1-line code change. <- the 100-line one is only an HTML comment, the 1-line one is a feature flag turning on a new feature
@allspaw so would you recommend that, if applying the 4 Accelerate metrics in an organisation, (for software delivery effectiveness, not learning from incidents, I mean) to use a similar approach to that taken in the book - i.e. individual surveys with broad ranges?
a blog post being published (hosted on entirely different systems) triggered an outage with http://Etsy.com once. is a blog post a change? was it included in a CFR as a “change” to begin with?
@john710 I’d ask the authors. research can be done with all sorts of methods, and the authors might even use different methods in the future
Would be great to get your thoughts on this conversation too, @genek101 - I love that Slack is asynchronous, so I don’t feel too guilty about asking you, please take your time as I suspect you’re more than a little busy right now!
“…often hands on practitioners don’t capture what made the situation difficult.” Ah, yes. The daily workarounds and near misses! So good!
Enabling decentralized decision making and a continuous learning culture is something key here. Since the folks closer to the problem are better informed. However, fear of failure and lack of trust can be a major hurdle in getting to that team dynamic. I wonder how this cultural change can be achieved?
The most important measurement is time-to-learn-and-improve which is all about measuring the learning for reuse
'Tyranny of Metrics' was one of the books - didn't catch the title of the other.
Safety Can't Be Measured by Andrew Townsend and The Tyranny of Metrics by Jerry Muller
@allspaw - are you going to later cover best practices for writing those reviews in such a way that they are consumable by others?
@kenny not in a 30min talk, but note that it’s not just about writing, it’s the analysis that happens prior and during the analysis
That does somehow rhyme with the high and low context environments from earlier today.
…ah, I forgot. @allspaw hasn’t been putting the PDFs here, because the links are in his slides! Would you mind putting the links into this channel, John?
NTSB = “U.S. National Transportation Safety Board”. Famous for studying airline crashes. https://www.ntsb.gov/Pages/default.aspx
Totally read that as “U.S. National Trampoline Safety Board”. I may need a lie down…
Maybe in this virtual delivery, it would be helpful to have one place to go for all reference materials for any presentation.
These tips are so useful for any retrospective.
"Half of my job is to get people to genuinely look forward to and participate in the next incident analysis" Oof. I need to get better at this.
Taking action item generation out of group review meetings is probably the most controversial suggestion, we’ve found.
I can imagine. "But what are we going to do?" "You're going to sit on the feedback and take it in for at least 24 hours before writing a single action item."
@allspaw regarding incident analysts being outsiders... how do you think the result of the inquiry into the Challenger would have been different without Richard Feynman?
I doubt anyone else would have used a glass of ice water on stage the way he did
@nickeggleston ask your CTO to lead the next group incident review meeting…it’d look a little like that.
OMG, YMMV but the first three CTOs coming to mind leading any conversation...
The comparison with solving coding problems when we've come away from the computer and taken a break makes sense. We don't always solve the problem there and then.
@brian.martin - thanks. i think others (but not everyone) is having an issue. hmm weird
nice talk though. the change of pov from just solving incidents to learning from it is nice
Ok leaders here in the channel: who’s going to take me up on these challenges?
Well, @erica.morrison, that’s some pretty awesome kudos from @allspaw!!! I agree!!!
Really nice job @allspaw and @erica.morrison!!
I’m super honored to follow up @erica.morrison! Thanks for listening folks!
Our entire neighborhood lost power like one minute into your talk. Grrrrrr, I missed it. We have power again now - will have to check out the recording. Thanks again for all you have taught us!
My favourite talk of the day - hands down! Thank you @allspaw! Already want to go back and listen again!
@jeff.gallimore will be sharing when/where @allspaw will be hosting a Q&A Zoom (or whatever) session later today! Thanks for doing that, John!!!!
@allspaw Unless the 'other' teams have equity or skin-in-the-game generally, I don't see how this would happen...Thanks for the insights.
Superb talk, thanks @allspaw thinking of wild goose chases down rabbit holes reminded me of one particular 'near incident' lesson I learned a long time ago... OMG was that a Natural England logo and some maps my sense of deja vu is so strong today :thinking_face:
@helen.beal phrases I wasn't expecting to hear at DOES: "These are the sheep."
Haha I'm glad you enjoyed it - I thought I'd do something completely different! It is a very special place and you should come visit!
@allspaw will be doing Q&A in a bit! https://sched.co/cnwL
@allspaw I frequently study reports for motorcycle accidents because I try to learn from others to stay alive. With an organization like mine that has so many stacks and so many different environments, what suggestions do you have to make incidents relevant to others in related stacks more obvious? I'd probably not attend a Mainframe incident review if I didn't think the information would help me improve.
Doing incident analysis well means understanding how others understood the event. When you can represent what was surprising or difficult for people closest to the incident (especially after they’ve expressed it in interviews) in a compelling narrative, people start to build expectations that they can get something out of a well-captured analysis even if they’re not expert in the technical details.
@allspaw You mentioned tracking specifically who is "voluntarily" reading the incident reports. If an important part of my job as a practitioner is to get people excited about incident response, do you have suggestions to find ways to get people motivated and energized?
We find that practitioners are already enthusiastic…with engineers, once we get them talking about incidents…we can barely get them to stop!
The key is to identify what about the incident was/is mysterious or interesting to them, and represent that faithfully in the writeup. Once people see a demonstration that writeups can be a place where they can learn things they can’t elsewhere (and that it reads differently than what they’ve been used to in past reports) … the interest will be difficult to stop.
I think there is some pride in finally finding the solution and you can tap into that to get them started and then you'll have to keep their timeboxes
Yes, @ferrix - you can’t stop people from thinking of ideas, but you can instruct them to write them down during the group review meeting, so they can get away from the fear of losing the idea.
I guess it's a matter of engineering aloud
FWIW, we’ve worked with a client whose writeups later were getting 10-20 unique views per day up to 6 months after the incident, and people were still commenting, highlighting, and linking to them from other places
Oh wow. Since engineers are not usually authors by profession, that level of interest is really impressive.
you might be interested to know that a significant part of the audience were folks from customer support, product management, and design
I have noticed that side jobs as a stand-up comedian and a copywriter really help with writing exceptionally readable company-internal blog posts.
So I'd go on to assume that there is a certain requirement for clarity and tone of voice to those documents that has made them very valuable outside the tech bubble.
Yep. Where there’s specific technical jargon, they’ll link to more tech details elsewhere. Where diagrams and pictures can be used to make things clearer, they’re used. Writing in plain language helps immensely.
A great deal is the idea that you’re writing them to be read, not just writing them to be filed.
I have noticed that "dumb it down" is not the way to go but rather "general smart it up"
Reminder: Don’t miss an opportunity for live Q&A with @allspaw at 6:25pm London time as a follow-up to his closing talk today. Join the discussion at https://us02web.zoom.us/j/83625560792?pwd=c1Y3b1dTTklqQ3NOTTkxa2N6SGxGdz09#success
@here Jon Smart is in his happy hour room! Come join us here for an AMA: https://us02web.zoom.us/j/8908483265
I'll join the fun.. Here's a zoom link where I'll be hanging out to talk about the Handbook or Beyond the Phoenix Project. https://us02web.zoom.us/j/87567853321?pwd=RVY2MjFQaUwxSmpQSWY0azRmTFNZZz09
Following up in the @allspaw Q&A re Conway's law - the authors of Team Topologies use it to argue for intentionally structuring your teams in order to facilitate communication patterns that are helpful.
@kboth_does attaching some materials on the “right way to organize” that might be of interest, apropos of Team Topologies reference and @allspaw+A. Happy to discuss.
More in my @steve773 read stack! Who needs sleep…
Right now these are joining my backlog of post-conference material to absorb. One downside of the virtual conference is that I don't have 10 hours in an airplane to work off the backlog. On the other hand, I did discover that jet lag happens even virtually - waking up at ~0300 BST this morning...