This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
Hi Everyone! I'm super excited for the side discussion during my talk today. Don't be shy about starting threads for discussion; I'm hoping to learn a bunch from everyone and your own experiences on this topic!
/poll "As we get started today I'm curious to know how folks think about internal services your developers build on top of. How does your organization think about internally shared services and Infrastructure?" "Just Infrastructure that IT handles" "Nice to Have and Share" "Critical for Delivery" "No Opinions or Strategies"
Reminder: The breakout sessions are starting in 5 minutes. Get in front of your browser and start navigating your way to whichever session you're attending. https://devopsenterprise.slack.com/files/UATE4LJ94/F04DG604H1C/image.png
Welcome @ajdomeier, here to present: "Productize Undifferentiated Engineering"
The struggle can often be a sign of success. Too many valuable things to work on. (That's how I try to feel good about it.)
Hi, I see that undifferentiated engineering must be balanced with a certain amount of specialisation: an engineering mindset in development is good until a deep technical issue occurs. In the market, job titles are still important and relevant, and it is difficult to apply for a job as only an "Engineer" (at least in Spain). In any case, I can recognise that it works really well for avoiding knowledge silos.
Agreed. Even in the USA it's hard to get some things done without a manager or higher title.
Nice talk Andy! To sum up what I got:
• Know what you're building
• Count the cost
• Sell it
• Have a humble team that's focused on getting to the goal
... This is the most important aspect for "undifferentiated": getting people to realize that we have "a tool", and though you may not think it's the "best tool", let's use it until we need something different.
Welcome @bryan.pinos and @yar.savchenko, here to present: "Why Does Capital One Test in Production?"
"both planned and... no-notice" (!! I love the name "no notice!"). Hi, @bryan.kemp422 @yar.savchenko!
trying to convince others that when i say something like "test in prod" i don't mean "only test in prod"...
Do you use test accounts in production? If so, how does that work in an environment where there is real money and real tax implications?
When we do tests in production, it could be a number of things - either the test is non-intrusive - i.e. it doesn't actually impact the customers, or we use a mirror of the stack that receives the same traffic but doesn't interrupt the customer's experience.
"we first like to do it in a controlled environment, as opposed to in production, in the middle of the night"
I like the App level, AZ and regional failures but what about whole AWS failure or all region failure?
Hi @vgudla that's a great callout. While it's extraordinarily challenging to test an AWS complete outage, we run table-top exercises that help us size how that would impact the business and what actions we would be able to take in that circumstance.
I bring that up because many customers are choosing hybrid cloud for that one reason alone, even though there are many other benefits. Red Hat OpenShift is one such solution that helps do that, and it is the most mature one IMHO!
Agreed, there are benefits of hybrid, but I think hybrid also limits how deep you can use some of the cloud native services of each provider to truly leverage the power of the cloud. So like everything, there are tradeoffs that have to be weighed.
1. How do you handle side effects while performing this exercise? 2. How do you differentiate an actual failure event vs the chaos generated one during the exercise?
Hi @sabyasachichanda! Side effects are going to happen. We measure them, and either abort the test if the side effect becomes too great, or we study them to see how we can make it more seamless the next time.
for your second question, we typically capture a steady state prior to any test, and then look for anomalies during the test, and finally ensure that things return to a steady state after the test has concluded.
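A minimal sketch of the steady-state approach described above, assuming a single numeric metric (for example, latency in ms) sampled before, during, and after a test. The function names, the 3-sigma tolerance, and the `abort_ratio` threshold are hypothetical choices for illustration, not Capital One's actual tooling.

```python
from statistics import mean, pstdev

def steady_state(samples):
    """Summarize a baseline window as (mean, standard deviation)."""
    return mean(samples), pstdev(samples)

def is_anomalous(value, baseline, tolerance=3.0):
    """True if value is more than `tolerance` std devs from the baseline mean."""
    mu, sigma = baseline
    # Guard against a perfectly flat baseline (sigma == 0).
    return abs(value - mu) > tolerance * max(sigma, 1e-9)

def run_experiment(before, during, after, abort_ratio=0.5):
    """Compare test-time and post-test samples against the pre-test baseline."""
    baseline = steady_state(before)
    anomalies = [v for v in during if is_anomalous(v, baseline)]
    # Hypothetical abort rule: too many anomalous samples during the test.
    aborted = len(anomalies) / len(during) > abort_ratio
    # The system should be back inside the baseline band once the test ends.
    recovered = not any(is_anomalous(v, baseline) for v in after)
    return {"anomalies": len(anomalies), "aborted": aborted, "recovered": recovered}
```

Anomalies that fall outside the test window, or a failure to recover afterwards, would point to a real incident rather than the injected one.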
How do you work with teams to allocate capacity to address items you might find in your testing? Are they treated like production failures or is it up to teams to prioritize in their backlogs (might not be prioritized as high as a prod defect would)?
YES! We treat these findings like any of the other findings that occur as a result of our problem management process. This ensures that there is a control that requires teams to address the findings in a timely manner.
@vgudla I've recommended switching DR and primary regions quarterly to ensure you can handle DR failovers well. Also, at LaunchDarkly, we are moving away from a legacy database to Cockroach which will hopefully give us multi-region support
Very interesting, I've been looking at Cockroach for the same problem space (DR/HA); have you been pleased with the results of testing? Were there any references that informed planning and implementation?
we are still in the process of migrating off a legacy database but have gone all in with CRDB. We had one major incident with them, kind of a "split brain" cluster issue that was fairly significant, but otherwise it has been fairly smooth sailing. If you have specific questions on planning and implementation, I'd be happy to connect you with our platforms team leader, just let me know!
❤️ the approach of switching primary and DR regions regularly. We not only do this for our applications, but we also do this for our processes. Let's say you normally use the AWS console for making changes in Route 53; we'll set up events where we require teams to do it solely using the CLI. We do the same for other tools that we use in-house.
as far as data tier, we have had a lot of success using AWS database services like Aurora, Dynamo, and Neptune with their global table/database features.
@smazer that is great! I was curious whether Capital One uses hybrid / multi-cloud, since we cannot rely on one cloud any more than we can rely on one region. If they do, how can they run such failure tests in production?
Here's a great case study AWS did on Capital One: https://aws.amazon.com/solutions/case-studies/capital-one-all-in-on-aws/ We leverage multiple regions within AWS to ensure that our services there remain always on for our customers.
While we test in production, we do a lot of vetting prior to a test making it into production. This gives us high confidence that customers won't be impacted. We also plan for failure and are always ready to roll back a test if it becomes problematic.
Yeah that was my major concern especially with chaos experimenting in production
How is this different from Stress Testing or Load Testing or Disaster Recovery Exercise?
funny, I just reread the Phoenix Project and remember how chaos testing did impact users...at first. Then, later they saw a reduction in all incidents after implementing chaos testing
Do you involve third party service providers as well in the exercise? They may throttle to a spike.
Depending on the exercise and the relevance of the 3rd party, we do involve 3rd parties.
Service Mesh (a combination of multiple open source software projects like Istio) is a great way to manage chaos testing with a live traffic view & control! OpenShift from Red Hat has this included, and I have personally seen great joy from developers and testers.
Something I've forever struggled with is how to deal with load, not how to create it. 😬 Google has a propensity for crawling our site until it crumbles. But then how do you absorb that load? That's a specific example, but the idea can be used in general for chaos testing.
@bbaker8 Maybe look at this: https://netflixtechblog.com/keeping-netflix-reliable-using-prioritized-load-shedding-6cc827b02f94
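A rough sketch of the prioritized load shedding idea, assuming each request is tagged with a priority (0 = most critical) and the service knows its current utilization as a fraction of capacity. The function name, the `start` threshold, and the number of priority levels are illustrative assumptions, not Netflix's implementation.

```python
def should_shed(priority, utilization, start=0.6, levels=4):
    """Return True if a request at this priority should be rejected.

    priority:    0 = most critical, levels - 1 = least critical
    utilization: current load as a fraction of capacity (may exceed 1.0)
    start:       utilization at which shedding begins (hypothetical default)
    """
    if utilization <= start:
        return False
    # How far we are between `start` and full capacity, clamped to 1.0.
    pressure = min((utilization - start) / (1.0 - start), 1.0)
    # Number of lowest-priority levels to drop; priority 0 is never dropped.
    shed_levels = int(pressure * (levels - 1))
    return priority >= levels - shed_levels
```

In the crawler example, bot traffic could be assigned the lowest priority so it is rejected first (for instance with an HTTP 429 or 503) while user-facing requests keep being served.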
Great talk on chaos engineering @bryan.pinos @yar.savchenko thank u! You mentioned capacity planning - shaking out whether capacity is adequately configured. Curious - have you experimented with different autoscaling strategies?
Thanks Vikas! You bring up a good point. Because we like to be able to flip 100% of our traffic between regions as quickly as possible, we are mostly over provisioned in our environments, so scaling doesn't play as much of a role as it theoretically could.
Reminder: The breakout sessions are starting again in 5 minutes. Get in front of your browser and start navigating your way to whichever session you're attending. https://devopsenterprise.slack.com/files/UATE4LJ94/F04DG604H1C/image.png
Next up is @evan, here to present: "Clean Handoff: Giving Devs the Power and Speed to Deploy Without the Power of Production"
We have a dedicated DevOps team, part focused on building new tools, integrations, part focused on onboarding development teams.
Did you go with templates from the beginning or did you have to transform to using the templates?
We're still migrating teams from their copy-pasted and lightly modified pipelines onto standardized templates.
Is the build pipeline and the deploy pipeline in the same repo for any given project?
@evan can you give more info on Security injected code review before deploy? is it Static or SCA or something else (pen test)?
@ryanewtaylor, never. Dev teams own the build pipelines, which produce artifacts. Deploy projects are owned by centralized devops, and reference the artifacts via manifests.
Security reviewed infrastructure is human review of the IAM policies at merge request time.
I might have missed it (or maybe it's yet to be discussed), what event triggers a deploy operation?
I don't think I said it explicitly, but it's the update of the manifest files in the deploy projects.
Devs or DevOps update the manifest files?
We'll get to it in a minute, it's the custom service that updates the manifests.
But for teams that aren't onboarded to the custom service yet, the devs can sometimes update the manifests for lower environments, and production support (Ops) will update the manifests for production.
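A small sketch of the manifest-driven trigger described in this thread, under the assumption that a manifest is a JSON object mapping a service name to the artifact version to deploy. The real manifest format and the custom service were not shown, so `update_manifest` and the JSON layout are hypothetical.

```python
import json

def update_manifest(manifest_text, service, new_version):
    """Pin `service` to `new_version`; return (new_manifest_text, changed).

    `changed` tells the caller whether committing the result would alter
    the manifest file, which is the event that triggers a deploy.
    """
    manifest = json.loads(manifest_text)
    changed = manifest.get(service) != new_version
    manifest[service] = new_version
    # Stable formatting so an unchanged manifest produces no diff.
    return json.dumps(manifest, indent=2, sort_keys=True), changed
```

The build pipeline never touches production directly in this model; it only produces an artifact, and whoever is allowed to write the manifest for a given environment controls what gets deployed there.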
https://www.servicenow.com/products/devops.html is a first-party module for ServiceNow
Veracode is too late in the process as developers must be aware of what they are breaking, before even creating the artifact
Kali does the official Veracode policy scan for compliance. Developers run sandbox scans throughout their development process to get the results early.
From your presentation, I love the branch environments (for brand new functionality on top of an existing platform) for safe and stable rollout. We tried this to switch environments quickly and speed up development.
Our rollback process for most teams is deploying the previous version of the artifacts, and is noted in the approval as part of the same change request.
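As a hedged illustration of "rollback is deploying the previous version, approved in the same change request", the sketch below records both versions up front. `ChangeRequest`, `plan_change`, and `version_to_run` are made-up names, and the version history is assumed to be a list ordered oldest first.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ChangeRequest:
    service: str
    deploy_version: str
    rollback_version: str  # approved together with the deploy itself

def plan_change(service, history, new_version):
    """Build a change request whose rollback target is the last deployed version."""
    if not history:
        raise ValueError("no previous version to roll back to")
    return ChangeRequest(service, new_version, history[-1])

def version_to_run(change, deploy_succeeded):
    """Pick the artifact version that should be live after the change."""
    return change.deploy_version if deploy_succeeded else change.rollback_version
```

Because the rollback target is part of the approved request, a failed deploy would not need a second approval cycle.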
So, does the SNOW ticket status remain successful, or does it change to Rolled Back or something similar?
I'm not sure what the resulting SNOW ticket status is after a failed or rejected deployment and rollback.
Welcome, @vladyslav.ukis, presenting: "Establishing SRE Foundations: Aligning The Organization On Ops Concerns Using SRE Team Topologies"
Thanks, @annp! Hello everyone! My name is Vlad. Welcome to my talk on SRE Team Topologies. Thanks a lot for taking the time to attend! Looking forward to your questions throughout the presentation and beyond
Here is where I work: https://www.siemens-healthineers.com/en-us/digital-health-solutions/teamplay-digital-health-platform
I missed the "you build it" & "you run it" DevOps spirit of developers
This is such a fun way of describing the dynamics at play between devs and SREs, @vladyslav.ukis !
Are there any of those that work "worst"? I.e., least likely to align in a stable state?
Actually, when you phrase it that way, that doesn't actually seem like a valid configuration. :)
@vladyslav.ukis what we have is: Dev Build, Ops runs it, Monitoring / Issues - SRE , Ops, Dev. Fun we are the worst !! only way is up !!
Ops are literally operating your system in prod, SRE might be more related to the tooling/system/provisioning
It depends on the knowledge (how it is spread from Dev to Ops), system/app maturity, the market you are serving (global, local, etc.), and many more things, but IMHO it is a live monster you need to attack based on where you are
Do you have 1 model in your org or several models at work? In terms of responsibility split between Devs, Ops and SREs?
once we start a new project, it is on the table how to operate that :thinking_face:
Exactly! We faced this so many times. These questions prompted me to write "Establishing SRE Foundations" :)
Hi @vladyslav.ukis , @leon, great convo here. Would one of you, or anyone in this Slack, have time, 30mn, to help me refine my understanding of these same areas, Ops Excellence, Service Ownership, SRE, Chaos Eng and how these models may fit/ intertwine? Thank you!! I can explain where I come from w the ask.
I love all the talks from Healthineers. Will you be talking about any of the special requirements of working in a heavily regulated space? (And if not, can you share any healthcare-specific techniques you've come up with that other industries might not have had to create?)
Wonderful! If you're interested in talking on that next year, we'd love that!!!
One of my favorite phrases from Dr. @cleng from Google SRE was about how the more senior you are in the SRE org, the more you must care about the SRE customers (the dev/product org).
I love the "SRE gives pager back to Devs" pattern, so great!
This is the trick to make the devs care about prod when implementing features, even if they do not run the services themselves because it is done by SREs
"Establishing SRE Foundations" book: https://www.amazon.de/-/en/Vladyslav-Ukis/dp/0137424604
Reminder: The plenary sessions are starting again in 5 minutes. Start making your way back to your browser and join us in #discussion-plenary to interact live with the speakers and other attendees. https://devopsenterprise.slack.com/files/UATE4LJ94/F04DG604H1C/image.png
Reminder: Please submit your feedback for the talks you attended. It's so valuable for us and the speakers. And after all, feedback is a gift and sharing is caring! Enter your feedback for those talks here: https://doesus2022.sched.com/ https://devopsenterprise.slack.com/files/UATE4LJ94/F04DG7DQMSS/image.png