This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
2021-05-18
fantastic courage and commitment from everyone involved at TUI to bring about such a massive transformation!!
fyi, main conversation currently in #ask-the-speaker-plenary
Coming up in a few minutes – @ben.connolly and @sabina.kambersalamanc
Let's GOOOOOOO! 🙂 We're here for any questions or anything else folks.
Curious to hear the key to unlock! Also curious about the role leadership at the top played vs. bottom-up.
Are the 'red' shaded countries where software engineering is, or Vodafone more broadly?
that's vodafone's footprint generally. purple are partner markets. (mental note to add a legend in future)
and amazingly almost every one was hitting them. :thinking_face:
Yep, in every essence "it felt like that and continues to feel like that to date"
insanity is to keep doing the same thing over and over again and expecting different outcome... Except if you consider already delivering perfection, there's always room for improvement. 🙂
We tried lots of things before OKRs, but it was only after we started to trial OKRs that we began to feel the cultural change we were after: • empowerment dial increasing • alignment to a common purpose improving • fear of failure decreasing, as we learn to embrace failure and keep learning
Interesting. Did you start with OKR outside of any HR process (like "annual performance assessments"). Thought about submitting/incepting the idea to HR, but did not get that much echo
Perdoo company has a talk that specifically advocates against running OKRs via HR in any way. If I find it, I will share it with you. It's not recommended
Looks like this option may impose itself anyway, as the idea didn't find any echo. Would be great for the paper, Sabina! And thanks for the session!
Felipe Castro writes the following in his https://felipecastro.com/resource/The-Beginners-Guide-to-OKR.pdf
As soon as you involve HR, they are going to want to tie this somehow to end of year reviews and compensation, which can be counter productive. There is a lot of debate around this out there in the community.
Actually, one of the points I wanted to raise with them is the need to decouple bonuses from these assessments, as you end up with people focused only on the compensation question. I had a few other grievances with the system in place as well.
It’s a method pioneered by Andy Grove at Intel. https://en.wikipedia.org/wiki/OKR
• KPIs are indicators how we are doing. • OKRs are our navigational system as we go towards our north star
You can see it now in the talk. KPIs and OKRs
I like to think of KPIs as 'health indicators', e.g. heart rate. KRs are measures of progress, measures of movement, leading and lagging (e.g. I've run 5 miles towards my aspirational outcome of a 10-mile run).
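The running example above can be written as a small calculation. This is only a sketch: the function name and the choice to clamp progress to the range 0 to 1 are assumptions for illustration, not something the speakers described.

```python
def key_result_progress(start: float, current: float, target: float) -> float:
    """Progress towards a key result as a fraction clamped to [0, 1]."""
    if target == start:
        raise ValueError("target must differ from start")
    # Linear progress from the starting value towards the target value.
    raw = (current - start) / (target - start)
    return max(0.0, min(1.0, raw))


# 5 miles run out of a 10-mile aspirational outcome.
run_progress = key_result_progress(start=0, current=5, target=10)
```

A KPI such as heart rate would not go through a function like this: it is watched against a healthy range rather than driven towards a target.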
is this a SAFe implementation? I heard program increment
Increments are specific to Scrum (SAFe - just the implementation of SCRUM)
Blended I'd say. 🙂 We use some aspects that help bring structure where otherwise we're still a little nebulous.
Am curious how SAFe and empowered teams blend. What sort of team coaching approaches do you use?
https://doeseurope2021.sched.com/event/jMFb/sooner-safer-happier-ama-with-jon-smart
oh, I think you mean this talk: https://doeseurope2021.sched.com/event/i547/ok-notok-okrs-3ms-mindset-mission-and-measurement
@logankd to answer your question, a focus on outcomes over output is one of the patterns in Sooner Safer Happier. OKRs are a great way to have a focus on outcomes (and the mindset that comes with OKRs)
Dr. Mik Kersten (Project to Product, Flow Framework) is also talking about OKRs and metrics this week (Wednesday, 19th May @ 12:05pm BST) https://doeseurope2021.sched.com/event/jEMq/okrs-devops-from-micromanagement-misery-to-finding-flow?iframe=no
That's good, I've read his book and am still searching for ways to use the ideas (I'm a developer, but share a lot of ideas with the team I work with)
A place to start: https://www.whatmatters.com/faqs/okrs-objectives-key-results-explanation-examples/
I have usually found engineering teams often have a hard time articulating good OKRs though. I prefer focussing on the communication aspect, trying to ensure that every last person on the team understands what we are trying to communicate.
@sabina.kambersalamanc - what techniques did you use for multiple teams to deploy independently i.e. feature toggles etc?
hey @tim.bassett - yes, multiple techniques/tools. @robert.greville1 is giving a talk later today on exactly that!
Commonly we’ve been using parent child flags where one team would look after the parent and lower level teams would maintain their children. We’ve tried to ensure logical separation of flags between environments and teams. Hopefully more on this in our talk at 14:50 on #ask-the-speaker-track-4
Thanks - I had that down as one of the ones to go to 🙂 Also interested in any branching strategy, or whether the flags do away with the need for one, etc. Look forward to the talk.
Flags have enabled us to move fully to TBD (Trunk Based Development). Before that, although most of our teams had already moved to it, some were still cherry-picking into release branches. LaunchDarkly enabled us to move away from several branches, with varying levels of code in them, in various environments - it really cleaned it all up.
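The parent/child flag arrangement described in this thread can be sketched roughly as follows. This is a minimal, self-contained model and not the LaunchDarkly SDK: the `Flag` class, the flag keys, the team names and the environment names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Flag:
    """A feature flag that may depend on a parent flag (hypothetical model)."""
    key: str
    owner_team: str
    # Per-environment state, e.g. {"dev": True, "prod": False}.
    enabled_in: dict[str, bool] = field(default_factory=dict)
    parent: Optional["Flag"] = None

    def is_on(self, environment: str) -> bool:
        """A child is on only if it and every ancestor are on in this environment."""
        if not self.enabled_in.get(environment, False):
            return False
        return self.parent.is_on(environment) if self.parent else True


# One team looks after the parent; lower-level teams maintain their children.
checkout = Flag("new-checkout", "platform", {"dev": True, "prod": True})
express_pay = Flag("new-checkout.express-pay", "payments",
                   {"dev": True, "prod": False}, parent=checkout)
gift_cards = Flag("new-checkout.gift-cards", "loyalty",
                  {"dev": True, "prod": True}, parent=checkout)
```

Keeping state per environment is one way to get the logical separation between environments mentioned above; turning the parent off in `prod` switches off every child there without touching the children's own settings.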
"The teams started to feel empowered, excited. The throughput started to skyrocket" The first step to team nirvana?
@ben.connolly @sabina.kambersalamanc Really enjoying your talk, particularly the emphasis that mindset shift and psychological safety lead to better OKRs. How did you transition to a culture of better psychological safety? Did you find that there were some areas still looking for security using command and control and, if so, how did you tackle this?
Great question! This is one of my favourites as it really gets to the heart of the 'alternative' leadership values needed. Will do my best to summarise!
Very interested in this question, I find command and control to be so deeply embedded
Who defines the OKRs at Vodafone? Is it the teams themselves or are they set in a top down fashion?
During the trials, we shaped them in Ben's leadership team. Now, as we come out of the trials, we are starting to work with the teams to shape them together. More on this at some point in the future.
@sabina.kambersalamanc did you stick to company/tech wide OKRs or did you also ask teams to come up with what is relevant to them, on their level?
See https://devopsenterprise.slack.com/archives/C015DUDD9C5/p1621339961055600?thread_ts=1621339767.048400&cid=C015DUDD9C5. Does that help answer?
We are trying to cascade OKRs to the teams because we see different teams struggling with different obstacles
It’s a big step to decoupling from a large monolithic release into many, independent releases in terms of platform/software architecture. I’d love to hear a bit more about how this was done.
@ben.connolly, @sabina.kambersalamanc impressive numbers. Not sure whether my question will be answered in upcoming slides, but how was it linked to business outcomes?
Great question @siddharth.pareek - for these initial OKRs our real outcome was to achieve & demonstrate the business agility we're striving for. Our ability to deliver value faster has really seen a step change, which in turn has started to change behaviour around us (in order to better leverage that capability). Long way to go, but we're cooking!
also, did you have a coach helping with "good OKRs" vs "bad OKRs" or just learn through practice?
Psychological safety and removing fear of failure are critical to team success
I am the coach for the entire digital engineering, and our Scrum Masters act as coaches for our teams. I am not saying it is like that for the rest of the org, but it works for us.
@ben.connolly @sabina.kambersalamanc how do you deal with situations where you have set OKRs for a quarter or a year and now want to change them? As leadership, a common question that comes from teams is “but we had set those OKRs for a quarter, why should we change them now?”. Often teams struggle with a sudden need for change, driven either by management (change of business priorities due to forces out of their control) or by the team (change in understanding of the problem and the feasibility of providing effective solutions).
Hey @bfischer yes, sure. Might be worth a call later though. Lots involved in that one!
Main thing we're trying for is to share ownership, better leverage the scale of the team, and be able to work in a much more concurrent way, rather than always sequentially (and regularly forcing prioritisation decisions to be made)
So, did you measure the number of teams who had accepted pull requests from anyone outside of the team?
Our inner sourcing OKR when we commenced (which was last quarter) was the following. OKR: Every team must accept, approve and successfully deploy ONE PR from a different team into a service they own. Overarching objective we are working towards across our quarters: We will be inner source capable across all services.
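One way to measure progress against the inner sourcing OKR above is to count the teams that have accepted, approved and deployed at least one PR from another team. The sketch below assumes a simple in-memory record of PRs; the `PullRequest` fields and team names are hypothetical, and real data would come from whatever tool holds the PR history.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PullRequest:
    author_team: str   # team that raised the PR
    owner_team: str    # team that owns the target service
    approved: bool
    deployed: bool


def inner_source_progress(teams: list[str], prs: list[PullRequest]) -> float:
    """Fraction of teams that approved and deployed >= 1 PR from another team."""
    if not teams:
        return 0.0
    done = {
        pr.owner_team
        for pr in prs
        if pr.author_team != pr.owner_team and pr.approved and pr.deployed
    }
    return len(done & set(teams)) / len(teams)


teams = ["basket", "search", "billing", "identity"]
prs = [
    PullRequest("search", "basket", approved=True, deployed=True),    # counts
    PullRequest("basket", "basket", approved=True, deployed=True),    # own team
    PullRequest("billing", "search", approved=True, deployed=False),  # not live
    PullRequest("basket", "identity", approved=True, deployed=True),  # counts
]
progress = inner_source_progress(teams, prs)
```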
"OKRs drive cultural change, not just process improvement"
are you using specific tools to communicate and gather data on your OKRs?
i have heard good things about another tool called https://www.xto10x.com/
Good question. @ben.connolly @sabina.kambersalamanc are you using a tool for OKR transparency?
No specific tool at the moment. We are looking at this for the future, when Vodafone is ready to scale on this. Right now, it is a combination of data from Azure DevOps, packaged into end-of-sprint reports that our Agile Lead and SMs run every 2 weeks. The reports are in PowerPoint presentation style.
Everyone attends the end-of-sprint reviews, which transparently show the status of the OKRs. It is a community of technology, agile reps, POs, PMs, etc.
Yes - psychological safety and agile servant leadership - you can empower the team and release yourself from all the pressure of leadership on your shoulders by building PS
Thank you @ben.connolly and @sabina.kambersalamanc for sharing your journey, great talk!
Some awesome questions there! Thanks everyone! We're working through them. 🙂
Great talk @ben.connolly and @sabina.kambersalamanc - thank you for sharing
What sources have you used to learn about OKRs @ben.connolly? And of course, thanks for sharing your great story!
Mainly: Perdoo and Felipe Castro materials. Also, lots of others, but the above two were key for us to start with
@vladyslav.ukis a variety of mechanisms: • Built-in tags as part of any approved modules (e.g. Terraform modules, StackSets, template specs ... pick your poison) • A set of Finance modules that we expect product / app teams to run as part of their services • A default set of Finance-based policies (AWS SCPs, Azure Policies) that we run on the product environments (AWS accounts, GCP projects, Azure subscriptions) as part of the vending lifecycle of these environments
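The tagging mechanism above relies on every resource carrying finance tags. A rough sketch of an audit that flags resources missing them is shown below; the tag keys and resource ids are hypothetical, and in practice the check would more likely live in the cloud provider's own policy engine than in a script.

```python
# Hypothetical tag keys that a finance policy might require on every resource.
REQUIRED_FINANCE_TAGS = {"cost-centre", "product", "environment"}


def missing_finance_tags(resource_tags: dict[str, str]) -> set[str]:
    """Return required tag keys that are absent or blank on a resource."""
    return {
        key for key in REQUIRED_FINANCE_TAGS
        if not resource_tags.get(key, "").strip()
    }


def audit(resources: dict[str, dict[str, str]]) -> dict[str, set[str]]:
    """Map each non-compliant resource id to the tags it is missing."""
    report = {}
    for resource_id, tags in resources.items():
        missing = missing_finance_tags(tags)
        if missing:
            report[resource_id] = missing
    return report


resources = {
    "vm-1": {"cost-centre": "CC-100", "product": "checkout", "environment": "prod"},
    "vm-2": {"cost-centre": " ", "product": "search", "environment": "dev"},
}
report = audit(resources)
```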
This is cool! Would be interested in how to integrate F&P deep into the product delivery process!
@vladyslav.ukis, we've learnt to gamify the process and make Benefits and Cost visible to everyone .. just like you would do with Reliability metrics on Engineering screens. So as part of the Product team, we have a FinOps SME embedded, or Financial viability is part of the Well Architected review and Route to live.
Depends on the level of maturity of the teams and what outcomes you're trying to achieve. Starting off with one part-time SME to support under 6 teams is sufficient. As you move towards 12 - 15 teams, having a full-time person is useful, as the cost savings and benefits end up paying for the role.
It's adopted based on the maturity model by the FinOps foundation, which is open source 🙂
What are some of the most established tools for FinOps/Financial Traces aside from the CSP tools? Is there anything you recommend to work with?
Cloudability & Apptio has been a very mature one that I used. I do also like the Shared Cost allocation feature on Azure Cost Explorer
Love the financial trace - really useful framework for getting costs nailed down for organisations focused on cost. What about organisations focused on value - is there a similar FinOps trace/framework for orgs focused on value rather than cost?
I guess the "Benefits Summary" part of the screenshot could help in that direction
@andy.farmer as part of establishing the trace, you start to tag and align the consumption against the actual business value. You can pluck out the value levers from the Cloud Benefits framework, and tag either your AWS accounts or a collection of applications against them.
Have you been able to get data in from PPM/timesheet tools as well? (where companies use those)
we've focused our efforts more on outcomes as opposed to outputs. So we've not incl. timesheets or people time against it .. but I can see why this would be of interest to someone. Ironically, that's where integrating with something more value and flow focused like Tasktop would be a better way of looking at it 😉 .
My thinking is - including this financial traceability info automatically into the business results datasets that we work with... gives a nice realtime link.
I’m guessing the business loves the insight into IT spend. What are some of the responses you’ve had from the business about FinOps?
They've loved it. They've felt part of the product development and decision making process as opposed to feeling like outsiders who don't have a role to play in the world of cloud. It's also helped Finance teams realise how they need to change their processes, especially if Cloud is going to be the norm for their organisation. So all in all, it's one of the best things we could have done.
Less so of a crisis, more so a question being asked by Finance teams what role do we actually have to play in a cloud-first world
Learned a lot from the talk. Thank you very much, indeed, @deepak.ramchandani!
Coming back from the break, we welcome @christina_yakomin from Vanguard!
Please don't have any outages, a lot of attendees will have money in Vanguard 😆
She did just mention “Chaos Engineering”. Don’t they specifically generate incidents on production to test resilience? 😁
At Vanguard, we don’t run our chaos tests that we expect to cause outages in production, for reasons that are probably obvious! We do however frequently run these kinds of tests in non-production, and have run tests with a limited blast radius off-hours in an isolated segment of our production environment (separate from where our client traffic is routed!)
how is the dev/prod parity in case of vanguard?
@christina_yakomin Thanks for not running chaos tests in production 🙂
@simon.skelton and I will talk on Thursday about how John Lewis & Partners run Chaos Day tests in their one pre-prod env
Thanks for the clarification, @christina_yakomin. I hope it was clear my comment wasn’t entirely serious, but it’s certainly interesting to read how you have things set up at Vanguard. 😊
@kapoor.vaidik no non-prod environment will ever perfectly mimic production, so anyone who tells you they’ve achieved parity is lying (or confused)! But even though we know it’s not exact, we still get a lot of value out of our non-prod chaos experimentation. I talk a lot more about chaos testing at Vanguard in my presentation from SRECon last year, which you can find on youtube by searching my name and SRECon
Hi @christina_yakomin curious - did you move your database to the public cloud or did you just move the apps and have them connect back to the datacenter?
We have a LOT of different applications at Vanguard that are in various stages of migration. Some have “fully migrated,” data and all, while others still leverage data on-premise. Our goal is to migrate almost all of the data securely to the public cloud (think ~90%)
Most of our apps running in the cloud at least perform reads against data in the cloud
Does that mean you're doing multi-writes, one to the cloud DB and one to the on-prem DB, for the same transaction?
Or doing writes to the on-prem DB and simply replicating to the cloud DB?
I’ve heard from some people that they had rude surprises when they moved applications to the cloud but kept data on-prem: major networking costs and bottlenecks between the tiers. Any similar experience here?
@davidstanke532 We definitely co-located the data in the cloud primarily to address concerns about performance due to latency between the app and the data. Leveraging redundant AWS Direct Connect does help us a bit here. I’m no expert on all of our data architectures, but I will say from experience that hybrid data architectures (partly on-prem, partly cloud) should be temporary. Relying on replication long-term is going to bite you eventually!
My favourite saying is "temporary is permanent", yes you have to be careful with temporary DB replication
@steve.smith exactly. I’ve seen replication far more than multi-write. Teams that have had the most success in migrating their apps to the cloud have already moved on from that replication stage, having built confidence in its functionality, and are running 100% cloud, app and data.
My experience has been that multi-write was popular back in the day to replicate from one on-prem DB to another in code. Now it's on-prem to cloud, and AWS etc. have solutions we can just plug in 😓
+1 — IMHO, managed cloud SQL offerings are some of the most useful cloud services. The flexibility w/r/t backups, replication, scale up/down are amazing. Yes, it’s a bit scary to transplant the beating heart of your service to another host but I have found it to be super liberating. Throw away your backup tapes! 😄
@christina_yakomin Have I missed a bit on operating model, or is it coming up? Do you have product teams on call out of hours for their microservices?
Yes. Product teams are now on-call for their services. In the pre-microservices era, that prod support was much more centralized, and app teams rarely needed to support production. As part of the shift to microservices, the production support model shifted, too.
How did the teams cope with suddenly being on call? Did you do anything to ease them into that role? It's pretty scary to suddenly be on call for something..
It was a gradual shift. On-prem legacy: centralized prod support. Initial microservices: shared ownership of prod support. Cloud-native apps: app teams primarily own their own prod support.
That's great to hear @christina_yakomin! So a similar journey to John Lewis & Partners
@simon.skelton and I are talking about it on Thurs, hopefully you can make it and let us know what you think
We still have central teams of experts to assist with prod support as needed, have invested a LOT into training, and now with our rollout of SRE roles, we are building a lot more operations expertise on our product teams
so you will have SRE roles in every product team?
That's the kind of approach at Equal Experts we recommend to our clients - build up the product teams, spin up enablement teams to assist teams at scale when necessary 👍
@christian.rudolph we will have SRE roles for every family of related products, and certain products requiring very high availability may have dedicated SREs as well. The introductory level of our internal SRE training program will be shared with ALL engineers who are part of a product’s on-call rotation, even if they don’t have an SRE job title
I assume yes if you're looking for 99.9% availability, or maybe you've got an SRE on-call team for extreme high availability services say 99.99%?
At Vanguard it varies by service. For some critical trading platforms, we’ll aim for 99.99% or 99.999% while our marketing sites may not have the same criticality - more like 99.9%
I ask because @simon.skelton has overseen John Lewis & Partners moving to on-call product teams, 99.9% is needed for http://johnlewis.com. It wasn't practical for one ops team to be on-call for 20+ product teams. A superhero SRE on-call team wouldn't have been cost effective for 99.9%, either. Retail companies rarely need true 99.99% availability, in my experience
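To make the availability targets in this thread concrete, the arithmetic below converts a target into the downtime it permits. The 30-day window is an assumption chosen for illustration; neither organisation's actual measurement window is stated.

```python
def allowed_downtime_minutes(availability_pct: float, window_days: int = 30) -> float:
    """Minutes of downtime permitted in the window at the given availability target."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - availability_pct / 100)


for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {allowed_downtime_minutes(target):.2f} min per 30 days")
# 99.9% -> 43.20 min per 30 days
# 99.99% -> 4.32 min per 30 days
# 99.999% -> 0.43 min per 30 days
```

Each extra nine cuts the allowance by a factor of ten, which is why the cost of on-call cover rises so sharply between 99.9% and 99.99%.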
@christina_yakomin maybe you covered this, but who is typically the first responder to an alert in your model? Is it the developers who are owners of the microservices? Or the SREs? Or both? How does collaboration / knowledge exchange happen while an incident is being dealt with?
Today, in most cases, it is the on-call engineer on the product team for the impacted product. SRE experts for each line of business and operations experts from centralized technology teams may be brought in to assist by our incident commanders as needed. We have an entire team of excellent incident commanders who ensure that communication is effective during an incident call
Do alerts auto-escalate to “defined SREs” for areas of the business if not resolved within an SLA?
To SREs, no, but we do have auto-escalation in place if incidents are not acknowledged. Usually it is up to the discretion of the responding engineers and the incident commanders to escalate further and pull in additional engineering resources for troubleshooting, driven by the severity of the incident in most cases.
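Auto-escalation on unacknowledged incidents, as described above, can be sketched like this. The escalation chain, its thresholds and the role names are hypothetical and do not describe Vanguard's actual policy or any paging tool's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Incident:
    opened_at: float                        # seconds on some shared clock
    acknowledged_at: Optional[float] = None


# Hypothetical chain: (seconds without acknowledgement, who gets paged).
ESCALATION_CHAIN = [
    (0, "primary on-call"),
    (300, "secondary on-call"),
    (900, "engineering manager"),
]


def who_to_page(incident: Incident, now: float) -> list[str]:
    """Everyone paged so far; the list stops growing once the incident is acknowledged."""
    if incident.acknowledged_at is not None:
        end = incident.acknowledged_at
    else:
        end = now
    waited = end - incident.opened_at
    return [target for threshold, target in ESCALATION_CHAIN if waited >= threshold]
```

Escalation beyond acknowledgement is deliberately left out, since the thread says that decision rests with the responding engineers and incident commanders.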
That’s a classic where logs end up with 2 KB JSON objects making absolutely no sense
100% downtime is a much better easier goal! 😎
This is true. Although deceptively so if I look at how many people get pinched by their Cloud Service Provider bill because they somehow missed bringing down some part of their infra. 😂
It's scary what you find lying around in some cloud accounts indeed 👀
But more serious, how did you arrive at your SLOs, was there some period of measuring all interactions, or some hard requirements?
It is very hard to centralize this and provide standards. It really comes down to knowing your clients’ expectations. Our technical teams partner closely with business product owners to set these thresholds and re-evaluate them quarterly. Many teams are working to improve their overall availability, so they’re making their SLOs stricter each quarter as reliability increases. Eventually we want to see teams operating at their target SLO and using their error budgets effectively.
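The error budgets mentioned above follow directly from the SLO. As a hedged sketch, assuming a request-based SLI (good requests divided by total requests) and made-up traffic numbers:

```python
def error_budget_remaining(slo_pct: float, total: int, failed: int) -> float:
    """Fraction of the error budget left; a negative value means the SLO is breached."""
    allowed_failures = total * (1 - slo_pct / 100)
    if allowed_failures == 0:
        # A 100% SLO (or no traffic) leaves no budget to spend.
        return 0.0 if failed == 0 else float("-inf")
    return 1 - failed / allowed_failures


# 99.9% SLO over 1,000,000 requests allows about 1,000 failures.
remaining = error_budget_remaining(slo_pct=99.9, total=1_000_000, failed=250)
```

Here roughly three quarters of the budget remains. Tightening the SLO each quarter, as described above, shrinks `allowed_failures` for the same traffic.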
We’re trying to move away from alerting on causes to alerting on symptoms at my organisation, and I believe the setting of SLIs and SLOs for performance metrics is key to this.
Nice point on making good monitoring dashboards.
For this reason, we ended up building https://github.com/grofers/legend.
After doing my time building dynamic grafana dashboards I will certainly look into this, thanks!
the idea is to have best dashboard design practices built in the code
Thank you, Christina! Coming up next, @matthew.pegge and @ilia.shakitko !
@christina_yakomin how do you share your Post Incident Reviews to ensure everyone learns from them? P.S. Thanks for a great presentation!
We have internal “blogs” that I’ve leveraged in the past to share incident review write-ups. These allow likes and comments to encourage discussion and questioning. We recently added the ability to track “views” on these pages, as well. In my experience, my post-incident reviews are always my most-viewed and most-liked blog posts. Seems like the engineers at Vanguard are more interested in how I broke and later fixed something than they are in my new feature releases.😂
I agree, there's always lots to learn from things we break, and it's usually a very interesting set of circumstances that led to it. I guess we all just like detective and whodunit stories!! 🔎
@ben.conrad @mhyatt: Does HMRC still have an internal blog for sharing post-incident reviews? I remember something similar in 2017. Thanks
@christina_yakomin I'd definitely recommend tracking view counts on post-incident reviews. Out of context, it's a vanity metric. In context, it's a cheap, effective proxy of organisational learning. Until John Allspaw thinks of something better, that is
100% agreed. I was very excited about the recent addition of this simple feature.
At HMRC we publish all our PIRs and have a slack channel where we post links to them as well.
Thank you @christina_yakomin for a great presentation!
@christina_yakomin For your technical platforms, what does your support model look like? Do you have a separate production support team, or does Vanguard follow an SRE model where devs are also production support? Financial firms often have requirements for prod/dev segregation, so they maintain separate teams.
Great question. Most of our platforms have an engineering team dedicated to building and supporting them. That team would be on-call for the platform. In the event that a platform outage causes application outages, the incident calls can get very large very quickly. The incident commanders help platform engineers with internal stakeholder communication while they troubleshoot and triage.
Our non-prod and prod environments are very segregated, but the same product teams troubleshoot both. We have lots of controls around production changes, many of them automated, including separation of duties (person who wrote the code/config change can’t deploy it). In a pinch, there is a dedicated team with more privileged access - though they still are subject to many controls - to handle manual changes to remediate incidents.
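The separation-of-duties control mentioned above (the person who wrote the change can't deploy it) could be automated along these lines. This is an illustrative sketch only: the `ChangeRequest` shape, the ids and the names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeRequest:
    change_id: str
    authors: frozenset[str]   # everyone who wrote the code/config change
    deployer: str             # person triggering the production deployment


class SeparationOfDutiesError(Exception):
    """Raised when the deployer is also an author of the change."""


def check_separation_of_duties(change: ChangeRequest) -> None:
    """Block the deployment if the deployer also authored the change."""
    if change.deployer in change.authors:
        raise SeparationOfDutiesError(
            f"{change.deployer} authored {change.change_id} and cannot deploy it"
        )


# Passes silently: bob deploys a change written by alice.
check_separation_of_duties(ChangeRequest("CHG-1", frozenset({"alice"}), "bob"))
```

A pipeline step would call this before a production deploy and fail the run when the exception is raised.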
@christina_yakomin really nice talk. very relatable. thoroughly enjoyed it!
what was the mix of Accenture vs Fedex coaches?
@chris.gallivan278 we started with most of the coaches coming from Accenture | SIQ, but the goal was to “train the trainer” ASAP.
Current state is somewhere close to 50/50 at that particular program, @matthew.pegge correct me if I am wrong.
"Stephen Smith published a safety check article". I don't remember doing that. Was it good? 🎉
If anyone is interested in the Safety Check mentioned -> https://stevenmsmith.com/ar-safety-check/
Depends on the team maturity and team coach, but in my experience never worked effectively with having more than 2 teams/coach.
sounds familiar. many coaches say "1" when I ask this question. The best coaches usually say "1"
@ilia.shakitko I don't understand "extending traditional CD capabilities to support enterprise functions". I know CD pretty well, Where did you feel it needed extending?
Especially considering the knowledge reinforcement… teams step back when they are undergoing a massive change, and we had to give them time to absorb it and take that step back. So it’s an illusion if you expect to climb like stairs, 2, 2, 2, 2,… it's not linear….
@steve.smith what I meant is that, in my experience, a CI/CD process is always implemented to support the software part of it…
But we’ve incorporated automation pieces and tools to have audit, compliance, change management, and CAB (yeah…) requirements into the pipeline.
I see... I think CD is much more than just automation, auditing/change management are always a part of it!
I am not saying that it is never done, but in large solutions and legacy landscapes, in what I’ve seen teams are struggling to get out of only build&deploy automation activities.
Dave/Jez covered that in 2010 book, and I've always seen governance/service management as the much bigger part of it. I'm sure you have to
@bryan.kemp good question, can’t really share a ready-to-use playbook here… I am using some of the insights from the Theory of Constraints, plus looking at the current team maturity state, and adjusting based on observation of the flow. It also depends on where the bottleneck was initially found: the WIP limit there can be lowered first (while other parts can be left at a 1:1 ratio to the number of team members)…. That really helps to expose issues and lets people shout out and start reaching out to other folks to see how they can help.
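The idea of lowering the WIP limit at the bottleneck first, while leaving other stages at 1:1 with team size, can be sketched as a board that refuses pulls over the limit. The column names, the team size of four and the limits are hypothetical.

```python
class WipLimitExceeded(Exception):
    """Raised when a pull would take a column over its WIP limit."""


class Board:
    def __init__(self, wip_limits: dict[str, int]):
        self.wip_limits = wip_limits
        self.columns: dict[str, list[str]] = {name: [] for name in wip_limits}

    def pull(self, item: str, column: str) -> None:
        """Move an item into a column, refusing if the column is already full."""
        limit = self.wip_limits[column]
        if len(self.columns[column]) >= limit:
            raise WipLimitExceeded(f"'{column}' is at its WIP limit of {limit}")
        # Remove the item from wherever it currently sits, then place it.
        for items in self.columns.values():
            if item in items:
                items.remove(item)
        self.columns[column].append(item)


# Team of four: 'test' is the observed bottleneck, so its limit is lowered
# first; the other stages stay at 1:1 with the number of team members.
board = Board({"dev": 4, "test": 2, "release": 4})
```

The refusal is the point: a blocked pull makes the queue visible and prompts people to help clear the bottleneck instead of starting new work.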
If you look after the talk into the presentation, you’ll see in initial slide there is a delivery street mapped, with 3 main metrics (Delay, ProcessTime, and % Complete & Accurate). This is a good start, to answer your question better (w.r.t metrics) @bryan.kemp
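The three metrics named above (Delay, Process Time, % Complete & Accurate) are commonly rolled up across a value stream as shown below. The stage names and numbers are invented for illustration and are not taken from the presentation.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class Stage:
    name: str
    delay_hours: float         # waiting time before work starts
    process_hours: float       # hands-on time
    complete_accurate: float   # %C&A as a fraction between 0 and 1


def summarise(stages: list[Stage]) -> dict[str, float]:
    """Roll per-stage metrics up into value-stream totals."""
    lead = sum(s.delay_hours + s.process_hours for s in stages)
    process = sum(s.process_hours for s in stages)
    return {
        "lead_time_hours": lead,
        "process_time_hours": process,
        # Share of lead time spent actually working on the item.
        "activity_ratio": process / lead if lead else 0.0,
        # Chance an item passes every stage without rework.
        "rolled_complete_accurate": prod(s.complete_accurate for s in stages),
    }


street = [
    Stage("dev", delay_hours=8, process_hours=16, complete_accurate=0.8),
    Stage("test", delay_hours=24, process_hours=8, complete_accurate=0.5),
    Stage("release", delay_hours=40, process_hours=4, complete_accurate=1.0),
]
summary = summarise(street)
```

With these made-up numbers, only 28 of 100 lead-time hours are hands-on work, and 40% of items get through without being sent back.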
@ilia.shakitko @matthew.pegge - Thank you for awesome talk. How are you managing the flagging?
@bryan.kemp it's more about replicating the success we had with this particular team across the other ARTs and VSs. So selling the success and showing others what can be achieved, but hopefully shortcutting some of the challenges we had in the first case.
Do you mean what solution is used? Or whether it is working out well with Business? @amruta.raul
@amruta.raul well, for a certain reason we had to go fast and use a simple feature API that is available out of the box in that particular product technology. But there is a variety of open-source and commercial options available. We had to go fast while (new) things were being evaluated and approved.
The business got the flexibility and is actually using feature enablement. The “fear” of allowing teams to deploy to production is not yet totally gone, but we are getting there. It’s like driving a Tesla: for the first few turns you are really scared, and then you enjoy the road.
@lee.reid we didn’t determine, we measured. Based on the ALM work rejections (tickets moved from test/approve back to dev) and interviews with those who participate in value delivery. For some of the stages - it’s amount of failed tests, or amount of failed Release Candidates.
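Measuring rejections from ticket movements, as described above, might look something like this. The stage names and ticket histories are hypothetical; a real ALM export would have its own format and would need mapping first.

```python
def was_rejected(history: list[str]) -> bool:
    """True if the ticket ever moved from test/approve back to dev."""
    return any(
        prev in ("test", "approve") and nxt == "dev"
        for prev, nxt in zip(history, history[1:])
    )


def complete_and_accurate(histories: dict[str, list[str]]) -> float:
    """Share of tickets that went through without being sent back to dev."""
    if not histories:
        return 0.0
    rejected = sum(was_rejected(h) for h in histories.values())
    return 1 - rejected / len(histories)


# Hypothetical ticket histories, oldest state first.
histories = {
    "T-1": ["dev", "test", "approve", "release"],
    "T-2": ["dev", "test", "dev", "test", "approve", "release"],  # bounced once
    "T-3": ["dev", "test", "approve"],
    "T-4": ["dev", "test"],
}
ca = complete_and_accurate(histories)
```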