Andy Domeier16:12:58

👋 Hi Everyone! I'm super excited for the side discussion during my talk today. Don't be shy in starting threads for discussion, hoping to learn a bunch from everyone and your own experiences on this topic! 🎉

🎉 2
Jonathan DeMarks16:12:46

Looking forward to your talk @ajdomeier!

Andy Domeier16:12:56

Oh Hi Jonathan! I didn't know you were here, keep me honest! 😄

Andy Domeier16:12:15

/poll "As we get started today I'm curious to know how folks think about internal services your developers build on top of. How does your organization think about internally shared services and Infrastructure?" "Just Infrastructure that IT handles" "Nice to Have and Share" "Critical for Delivery" "No Opinions or Strategies"


Reminder: The breakout sessions are starting in 5 minutes. Get in front of your browser and start navigating your way to whichever session you’re attending.

Ann Perry - IT Revolution17:12:02

Welcome @ajdomeier, here to present: "Productize Undifferentiated Engineering"

thankyou 1
👍 1
👋 1
Gary Lesage17:12:07

Like the "priority friction" phrase. The struggle is real.

โค๏ธ 1
Andy Domeier17:12:41

The struggle can often be a sign of success. Too many valuable things to work on. ๐Ÿ™‚ (That's how I try to feel good about it. ๐Ÿ˜‰ )

Diego León17:12:12

Hi, I see that undifferentiated engineering must be balanced with a certain specialisation; an engineering mindset on developments is good until a deep technical issue occurs. In the market, job titles are still important and relevant, and it is difficult to apply for a job being only an “Engineer” (at least in Spain). In any case, I can recognise that it works really well for avoiding knowledge silos.

Jonathan DeMarks17:12:08

Agreed. Even in the USA it's hard to get some things done without a manager or higher title.

Andy Domeier17:12:24

Consensus building can also help, but can be super time consuming

Jonathan DeMarks17:12:23

Nice talk Andy! To sum up what I got:
• Know what you're building
• Count the cost
• Sell it
• Have a humble team that's focused on getting to the goal
  ◦ This is the most important aspect for "undifferentiated"; getting people to realize that we have "a tool" and though you may not think it's the "best tool" let's use it until we need something different.

❤️ 1
thankyou 1
Andy Domeier17:12:00

Thanks Jonathan! Cheers!

Ann Perry - IT Revolution17:12:01

Welcome @bryan.pinos and @yar.savchenko, here to present: "Why Does Capital One Test in Production?"

thankyou 1
Topo Pal17:12:12

โค๏ธ 2
๐Ÿ˜† 5
๐Ÿ˜‚ 1
Gene Kim, ITREV, Program Chair17:12:06

“both planned and… no-notice” (!! I love the name of “no notice!”). Hi, @bryan.kemp422 @yar.savchenko!

❤️ 1
Ryan Savage17:12:59

trying to convince others that when i say something like “test in prod” i don't mean “only test in prod”...

✔️ 1
Leaf (Jessica Roy), MassMutual17:12:06

Do you use test accounts in production? If so, how does that work in an environment where there is real money and real tax implications?

👍 1
Bryan Pinos22:12:22

When we do tests in production, it could be a number of things - either the test is non-intrusive - i.e. it doesn't actually impact the customers, or we use a mirror of the stack that receives the same traffic but doesn't interrupt the customer's experience.

🙏 1
Gene Kim, ITREV, Program Chair17:12:45

“we first like to do it in a controlled environment, as opposed to in production, in the middle of the night”

❤️ 1
Chandan Gudla17:12:03

I like the App level, AZ and regional failures but what about whole AWS failure or all region failure?

Bryan Pinos22:12:55

Hi @vgudla that's a great callout. While it's extraordinarily challenging to test an AWS complete outage, we run table-top exercises that help us size how that would impact the business and what actions we would be able to take in that circumstance.

👍 1
Chandan Gudla03:12:41

I bring that up because many customers are choosing hybrid cloud for that one reason alone, even though there are many other benefits. Red Hat OpenShift is one such solution that helps do that, and the most mature one IMHO!

Bryan Pinos13:12:04

Agreed, there are benefits of hybrid, but I think hybrid also limits how deep you can use some of the cloud native services of each provider to truly leverage the power of the cloud. So like everything, there are tradeoffs that have to be weighed.

Sabyasachi Chanda17:12:17

1. How do you handle side effects while performing this exercise? 2. How do you differentiate an actual failure event vs the chaos generated one during the exercise?

👆 2
Bryan Pinos22:12:28

Hi @sabyasachichanda! Side effects are going to happen. We measure them, and either abort the test if the side effect becomes too great, or we study them to see how we can make it more seamless the next time.

Bryan Pinos22:12:37

for your second question, we typically capture a steady state prior to any test, and then look for anomalies during the test, and finally ensure that things return to a steady state after the test has concluded.
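
The steady-state approach described in the two replies above can be sketched roughly as follows. This is only an illustration, not Capital One's tooling: the function names, the 3-sigma tolerance, and the abort ratio are all assumptions made for the example.

```python
from statistics import mean, stdev


def steady_state(samples):
    """Summarize baseline metric samples as (mean, standard deviation).

    Needs at least two samples, because stdev() is undefined for one.
    """
    return mean(samples), stdev(samples)


def is_anomalous(value, baseline, tolerance=3.0):
    """True if value is more than `tolerance` standard deviations from the baseline mean."""
    mu, sigma = baseline
    return abs(value - mu) > tolerance * sigma


def evaluate_experiment(before, during, after, abort_ratio=0.5):
    """Compare metrics captured before, during, and after a chaos test.

    abort_ratio is a hypothetical threshold: if more than this share of the
    in-test samples is anomalous, the side effect is treated as too great.
    """
    baseline = steady_state(before)
    anomalies = [v for v in during if is_anomalous(v, baseline)]
    should_abort = len(anomalies) / len(during) > abort_ratio
    # The system has recovered if the post-test average is back within tolerance.
    recovered = not is_anomalous(mean(after), baseline)
    return {"anomalies": anomalies, "abort": should_abort, "recovered": recovered}


result = evaluate_experiment(
    before=[100, 102, 98, 101, 99],  # e.g. latency in ms before the test
    during=[100, 130, 101, 99],      # one spike while the failure is injected
    after=[99, 101, 100],            # samples after the test has concluded
)
```

In practice the abort check would run continuously while the test is in progress rather than once at the end; the sketch only shows the comparison against the captured baseline.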

Heather Bannon17:12:10

How do you work with teams to allocate capacity to address items you might find in your testing? Are they treated like production failures or is it up to teams to prioritize in their backlogs (might not be prioritized as high as a prod defect would)?

Bryan Pinos22:12:12

YES! We treat these findings like any of the other findings that occur as a result of our problem management process. This ensures that there is a control that requires teams to address the findings in a timely manner.

Sara Mazer-Federal CTO LaunchDarkly17:12:11

@vgudla I've recommended switching DR and primary regions quarterly to ensure you can handle DR failovers well. Also, at LaunchDarkly, we are moving away from a legacy database to Cockroach which will hopefully give us multi-region support

Jonathan DeMarks17:12:03

Very interesting, I've been looking at Cockroach for the same problem space (DR/HA); have you been pleased with the results of testing? Were there any references that informed planning and implementation?

Sara Mazer-Federal CTO LaunchDarkly17:12:31

we are still in the process of migrating off a legacy database but have gone all in with CRDB. We had one major incident with them, kinda a "split brain" cluster issue that was fairly significant but otherwise it has been fairly smooth sailing. If you have specific questions on planning and implementation, I'd be happy to connect you with our platforms team leader, just let me know!

Bryan Pinos22:12:42

โค๏ธ the approach of switch primary and DR's regularly. We not only do this for our applications, but we also do this for our processes. Let's say you normally use the AWS console for making changes in route53, we'll setup events where we're require teams to do it solely using the CLI. We do the same for other tools that we use in-house.

Bryan Pinos22:12:12

as far as data tier, we have had a lot of success using AWS database services like Aurora, Dynamo, and Neptune with their global table/database features.

Chandan Gudla17:12:08

@smazer that is great! I was curious whether Capital One uses hybrid / multi-cloud, since we cannot rely on one cloud just as we cannot rely on one region. If they do, how can they run such failure tests in production?

Bryan Pinos22:12:52

Here's a great case study AWS did on Capital One: We leverage multiple regions within AWS to ensure that our services there remain always on for our customers.

Amit Roy17:12:33

How do you make sure customers are not affected?

Bryan Pinos22:12:05

While we test in production, we do a lot of vetting prior to a test making it into production. This gives us high confidence that customers won't be impacted. We also plan for failure and are always ready to roll back a test if it becomes problematic.

Chandan Gudla17:12:03

Yeah that was my major concern especially with chaos experimenting in production

Sabyasachi Chanda17:12:52

How is this different from Stress Testing or Load Testing or Disaster Recovery Exercise?

Sara Mazer-Federal CTO LaunchDarkly17:12:28

funny, I just reread the Phoenix Project and remember how chaos testing did have an impact at first. Then, later they saw a reduction in all incidents after implementing chaos testing 🙂

❤️ 1
Sabyasachi Chanda17:12:07

Do you involve third party service providers as well in the exercise? They may throttle to a spike.

Bryan Pinos23:12:57

Depending on the exercise and the relevance of the 3rd party, we do involve 3rd parties.

Chandan Gudla17:12:59

Service Mesh (a combination of multiple open source projects like Istio) is a great one to manage chaos testing with live traffic view & control! OpenShift from Red Hat has this included and I have personally seen great joy from Developers and Testers

Brandon Baker (IT MGR - O'Reilly Auto Parts)17:12:55

Something I've forever struggled with is how to deal with load, not how to create it. 😬 Google ( 👀 ) has a propensity for crawling our site until it crumbles. But then how do you absorb that load? That's a specific example, but the idea can be used in general for chaos testing.


Great talk on chaos engineering @bryan.pinos @yar.savchenko thank u 👍 ! You mentioned capacity planning - shaking out whether capacity is adequately configured. Curious - have you experimented with different autoscaling strategies?

Bryan Pinos23:12:54

Thanks Vikas! You bring up a good point. Because we like to be able to flip 100% of our traffic between regions as quickly as possible, we are mostly over provisioned in our environments, so scaling doesn't play as much of a role as it theoretically could.
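
The over-provisioning tradeoff mentioned above comes down to simple arithmetic, sketched here under stated assumptions. The function names and the load figures are hypothetical and are not taken from the talk; the sketch only shows why being able to move 100% of traffic to the surviving regions implies standing capacity well above normal demand.

```python
from math import ceil


def per_region_capacity(peak_load, regions, regions_lost=1):
    """Capacity each region needs so the survivors can absorb the full peak load."""
    survivors = regions - regions_lost
    if survivors < 1:
        raise ValueError("at least one region must survive")
    return ceil(peak_load / survivors)


def overprovision_factor(regions, regions_lost=1):
    """Total provisioned capacity divided by peak load."""
    survivors = regions - regions_lost
    if survivors < 1:
        raise ValueError("at least one region must survive")
    return regions / survivors


# Two regions, either of which must be able to take all traffic:
two_region_need = per_region_capacity(peak_load=10_000, regions=2)
two_region_factor = overprovision_factor(regions=2)

# Three regions, tolerating the loss of one:
three_region_need = per_region_capacity(peak_load=10_000, regions=3)
three_region_factor = overprovision_factor(regions=3)
```

With two regions every region is sized for the whole peak, so total capacity is twice the peak and there is little left for autoscaling to do; with three regions the factor drops to 1.5.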


Reminder: The breakout sessions are starting again in 5 minutes. Get in front of your browser and start navigating your way to whichever session you’re attending.

Ann Perry - IT Revolution19:12:00

Next up is @evan, here to present: "Clean Handoff: Giving Devs the Power and Speed to Deploy Without the Power of Production"

Evan Chiu19:12:51

👋 Hello everyone!

👋 5
Topo Pal19:12:57

Who manages Gitlab template @evan? A central team?

Evan Chiu19:12:07


👍 1
Evan Chiu19:12:43

We have a dedicated DevOps team, part focused on building new tools, integrations, part focused on onboarding development teams.

Vlad Ukis19:12:50

Did you go with templates from the beginning or did you have to transform to using the templates?

Evan Chiu19:12:32

We’ve added more templates and migrated teams onto those over time.

Evan Chiu19:12:01

We’re still migrating teams from their copy-pasted and lightly modified pipelines onto standardized templates.

Vlad Ukis19:12:54

Ah, then we are not alone with this 😄

😂 1
🙌 1
Ryan Taylor, Application Architect, Axim Geospatial19:12:48

Is the build pipeline and the deploy pipeline in the same repo for any given project?

Dipesh Bhatia19:12:11

@evan can you give more info on Security injected code review before deploy? is it Static or SCA or something else (pen test)?

Evan Chiu19:12:56

@ryanewtaylor, never. Dev teams own the build pipelines, which produce artifacts. Deploy projects are owned by centralized devops, and reference the artifacts via manifests.

Evan Chiu19:12:30

Security reviewed infrastructure is human review of the IAM policies at merge request time.

Ryan Taylor, Application Architect, Axim Geospatial19:12:11

I might have missed it (or maybe it's yet to be discussed), what event triggers a deploy operation?

Evan Chiu19:12:43

I don’t think I said it explicitly, but it’s the update of the manifest files in the deploy projects.

Topo Pal19:12:02

non-prod pipelines and prod pipelines are different or differently owned?

Ryan Taylor, Application Architect, Axim Geospatial19:12:12

Devs or DevOps update the manifest files?

Evan Chiu19:12:47

We’ll get to it in a minute, it’s the custom service that updates the manifests.

👍 2
Vlad Ukis19:12:21

Service Now DevOps Module - is it custom-built?

Evan Chiu19:12:48

But for teams that aren’t onboarded to the custom service yet, the devs can sometimes update the manifests for lower environments, and production support (Ops) will update the manifests for production.

👍 1
Diego León19:12:00

Veracode is too late in the process as developers must be aware of what they are breaking, before even creating the artifact

👍 1
Sabyasachi Chanda19:12:27

If any rollback/back out is needed does servicenow get updated?

Evan Chiu19:12:00

Kali does the official Veracode policy scan for compliance. Developers run sandbox scans throughout their development process to get the results early.

👍 1
Diego León19:12:07

Who is managing the flaws triage?

Evan Chiu19:12:38

The development teams own remediation.

Diego León19:12:01

From your presentation, I love the branch environments (for brand new functionality over an existing platform), for safe and stable rollout. We tried this by quickly switching envs to speed up development.

🎉 1
Chris Donahue19:12:29

Thanks for sharing @evan

🎉 2
Evan Chiu19:12:05

Our rollback process for most teams is deploying the previous version of the artifacts, and is noted in the approval as part of the same change request.

👍 1
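
The manifest-driven flow described in this thread (a deploy is triggered by a change to the manifest, and a rollback redeploys the previous artifact versions) can be sketched as follows. The manifest shape used here, a plain mapping of artifact name to version, and the function names are assumptions for illustration only; the actual manifest format and custom service were not shown in the discussion.

```python
def changed_artifacts(old_manifest, new_manifest):
    """Return {artifact: (old_version, new_version)} for every entry that differs.

    A non-empty result is what would trigger a deploy in this sketch.
    """
    names = old_manifest.keys() | new_manifest.keys()
    return {
        name: (old_manifest.get(name), new_manifest.get(name))
        for name in sorted(names)
        if old_manifest.get(name) != new_manifest.get(name)
    }


def rollback_manifest(history):
    """Given manifests ordered oldest-first, return the one to redeploy on rollback."""
    if len(history) < 2:
        raise ValueError("no previous version to roll back to")
    return history[-2]


# Hypothetical manifests for one deploy project.
previous = {"api": "1.4.0", "web": "2.0.1"}
current = {"api": "1.5.0", "web": "2.0.1"}

to_deploy = changed_artifacts(previous, current)
to_restore = changed_artifacts(current, rollback_manifest([previous, current]))
```

Treating rollback as just another manifest update keeps it on the same path as a normal deploy, which fits the comment above that the rollback is noted under the same change request.
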
Sabyasachi Chanda19:12:42

So, the SNOW ticket status remains successful? or it changes as Rolled back or something similar?

Evan Chiu19:12:53

I’m not sure what the resulting Snow ticket status is after a failed or rejected deployment and rollback.

Sabyasachi Chanda19:12:03

Thanks for clarifying.

Ann Perry - IT Revolution19:12:00

Welcome, @vladyslav.ukis, presenting: "Establishing SRE Foundations: Aligning The Organization On Ops Concerns Using SRE Team Topologies"

Vlad Ukis19:12:23

Thanks, @annp! Hello everyone! My name is Vlad. Welcome to my talk on SRE Team Topologies. 🔔 Thanks a lot for taking the time to attend! Looking forward to your questions throughout the presentation and beyond 🙂

Diego León19:12:51

I missed the “you build it” & “you run it” DEVOPS spirit of developers 🙂

Gene Kim, ITREV, Program Chair19:12:04

This is such a fun way of describing the dynamics at play between devs and SREs, @vladyslav.ukis !

Vlad Ukis19:12:13

Indeed 🙂

Gene Kim, ITREV, Program Chair19:12:21

Are there any of those that work "worst"? I.e., least likely to align in a stable state?

Vlad Ukis19:12:39

you build it, ops run it

Gene Kim, ITREV, Program Chair19:12:15

Actually, when you phrase it that way, that doesn't actually seem like a valid configuration. :)

😆 1
Dipesh Bhatia19:12:20

@vladyslav.ukis what we have is: Dev Build, Ops runs it, Monitoring / Issues - SRE , Ops, Dev. Fun we are the worst !! only way is up !!

🔥 1
😆 1
Vlad Ukis19:12:19

What is the difference between Ops and SRE in your org?

Diego León19:12:04

Ops are literally operating your system in prod, SRE might be more related to the tooling/system/provisioning

Vlad Ukis19:12:55

So, SREs do not run services, right?

Diego León19:12:09

it depends, if you want them to do so 🙂

Diego León19:12:44

and how to not burn people out in the process

Vlad Ukis19:12:49

That is the thing: all these things are not standardized in the industry :)

Diego León19:12:19

They depend on the knowledge (how it is spread from DEV to OPS), system/app maturity, the market you are serving (global, local, etc.), and many more things, but IMHO it is a live monster you need to attack based on where you are

Vlad Ukis19:12:12

Do you have 1 model in your org or several models at work? In terms of responsibility split between Devs, Ops and SREs?

Diego León19:12:39

Mix and evolving

Diego León19:12:24

great presentation, @vladyslav.ukis many thanks!!!

Vlad Ukis19:12:44

Glad you enjoyed it.

Diego León19:12:54

once we start a new project, it is on the table how to operate that :thinking_face:

Vlad Ukis20:12:03

Exactly! We faced this so many times 🙂 These questions prompted me to write "Establishing SRE Foundations" :)

👍 1
shaaron a alvares19:01:30

Hi @vladyslav.ukis , @leon, great convo here. Would one of you, or anyone in this Slack, have time, 30mn, to help me refine my understanding of these same areas, Ops Excellence, Service Ownership, SRE, Chaos Eng and how these models may fit/ intertwine? Thank you!! I can explain where I come from w the ask.

shaaron a alvares19:01:02

reading your book at the moment

Vlad Ukis06:04:44

sorry, reading this only now...

Vlad Ukis06:04:09

@shaaron.alvares are you still reading this here? 🙂

Gene Kim, ITREV, Program Chair19:12:04

I love all the talks from Healthineers. Will you be talking about any of the special requirements of working in a heavily regulated space? (And if not, can you share any healthcare specific techniques you've come up with, that other industries might not have had to create?)

Vlad Ukis19:12:16

That will not be part of this talk. Will post something about this a bit later.

Gene Kim, ITREV, Program Chair19:12:54

Wonderful! If you're interested in talking on that next year, we'd love that!!!

Vlad Ukis19:12:08

Cool 🙂

Gene Kim, ITREV, Program Chair19:12:07

One of my favorite phrases from Dr. @cleng from Google SRE was about how the more senior you are in the SRE org, the more you must care about the SRE customers (the dev/product org).

👍 1
Jonathan Mailhot19:12:45

I really like those SRE identity triangles.

👍 1
🤙 1
Gene Kim, ITREV, Program Chair19:12:49

I love the "SRE gives pager back to Devs" pattern, so great!

Vlad Ukis19:12:42

This is the trick to make the devs care about prod when implementing features even if they do not run the services themselves because it is done by SREs 😄

Topo Pal19:12:35

Great content @vladyslav.ukis

thankyou 1
👍 1
Vlad Ukis19:12:18

"Establishing SRE Foundations" book:

🔖 1
👍 1
Gene Kim, ITREV, Program Chair19:12:47

Thank you @vladyslav.ukis !!!!

👍 4
Gene Kim, ITREV, Program Chair19:12:59

And congrats on the book!

thankyou 1
Vlad Ukis19:12:05

Thank you all very much for a great conversation!

👍 1
Vlad Ukis19:12:39

If you have any questions, reach out any time!

Vlad Ukis19:12:59

LinkedIn is a good place to keep in touch for me.

👍 1
Chris Donahue19:12:18

Thanks @vladyslav.ukis I enjoyed your talk.

Vlad Ukis20:12:16

Thanks, @chris882!


Reminder: The plenary sessions are starting again in 5 minutes. Start making your way back to your browser and join us in #discussion-plenary to interact live with the speakers and other attendees.


Reminder: Please submit your feedback for the talks you attended. It’s so valuable for us and the speakers. And after all, feedback is a gift and sharing is caring! Enter your feedback for those talks here: