This page is not created by, affiliated with, or supported by Slack Technologies, Inc.
@x95castle1 how did you implement DevOps mindset to the mainframe
Good question! We actually started with Value Stream Mapping the processes to production and just started tackling some of the low hanging fruit.
Do you have additional platforms that deploy in conjunction with mainframe changes?
but we've started using some IBM technologies to expose services on the mainframe as APIs.
Our core mainframe system has to run on the mainframe and not an option to move to Git
come in Nov to Feb so it's not as hot - unless you like the heat
It was a pretty big struggle culturally because folks are pretty used to working a certain way with mainframe. One thing that really made a difference for us has been moving all our mainframe code into Git and building automation against mainframe code using IDz and Jenkins.
Can you guys give me some examples of what kind of things you are doing with GitOps? I've been hearing a lot about it but I'm not super clear on how I would apply it to a typical dev team.
One thing we've been doing more is to focus collaboration on GitHub PRs/Issues and use the review/approval process of a PR as the trigger to kick off deployments: PR merge = deploy
Okay. I think we are moving in the same direction. My team has one repo where a PR merge results in a deploy to an environment, with the explicit understanding that the review/approval means it's good to go.
As you released features into production were you using feature flags to kill features?
Sure. We've decided to make GitOps our primary change management mechanism for our Kubernetes, PCF/Tanzu, and AWS based platforms. What that means is git also stores how the configuration of the environment looks in a config repo and is tied to TFE. We use GitLab Runners to power the pipelines. When changes are ready to go to a test environment or production the merge request serves as approval by management. That means teams no longer have to leave GitLab.
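A minimal sketch of the "merge request approval = deploy" idea described above. The `MergeEvent` fields, the branch-to-environment mapping, and the function name are all hypothetical and are not taken from GitLab, GitHub, or TFE; a real pipeline would read these values from the webhook payload or CI variables.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MergeEvent:
    """Hypothetical, simplified view of a merge/pull request event."""
    target_branch: str
    merged: bool
    approvals: int


# Assumed mapping from a config-repo branch to the environment it describes.
BRANCH_TO_ENV = {"main": "production", "staging": "test"}


def environment_to_deploy(event: MergeEvent, required_approvals: int = 1) -> Optional[str]:
    """Return the environment to deploy to, or None if nothing should deploy."""
    if not event.merged:
        return None  # an open or closed-without-merge request never deploys
    if event.approvals < required_approvals:
        return None  # the review/approval is the change-management gate
    return BRANCH_TO_ENV.get(event.target_branch)  # unknown branches deploy nowhere
```

The point of the design is that the only way to change an environment is to merge an approved change to the branch that describes it, so the review doubles as the approval record.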
We are not as mature with feature flags as we could be. We do have pockets of it, but I think it's a problem we need to solve better for our teams.
Where do feature flags fit in the priority of needs on your transformation journey? Curious what would factor into your decision to build your own vs. look at third parties?
Would you say it's a short term or long term priority? This is my first DOES and I'm new to the DevOps space as well so had assumed most large enterprise companies who had undergone digital transformation were already implementing FF at least with basic on/off capabilities but I'm learning that isn't necessarily true so this has been a great learning experience.
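For readers new to the term, a minimal sketch of what "basic on/off" feature flags with a kill switch could look like. The `FeatureFlags` class and the flag name are hypothetical; third-party flag services add targeting, auditing, and persistence on top of this idea.

```python
class FeatureFlags:
    """In-memory on/off flags; unknown flags are treated as off."""

    def __init__(self, defaults):
        self._flags = dict(defaults)

    def is_enabled(self, name: str) -> bool:
        return self._flags.get(name, False)

    def kill(self, name: str) -> None:
        """Turn a feature off without redeploying the application."""
        self._flags[name] = False


def checkout(flags: FeatureFlags) -> str:
    # The new code path ships dark and is only taken while the flag is on.
    if flags.is_enabled("new_checkout"):
        return "new checkout flow"
    return "old checkout flow"
```

Usage: `flags = FeatureFlags({"new_checkout": True})` serves the new flow until someone calls `flags.kill("new_checkout")`, after which `checkout(flags)` falls back to the old flow.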
@andy.hinegardner.slw2 @x95castle1 do you have challenges with too much alerting or alerting that isn’t actionable? Do you have best practices for managing those challenges?
Just a random comment, if you are looking for an open source environment, check this out: https://openapm.io/landscape, and OpenTelemetry is a standard, so you could integrate with private tools like Instana
Good question... We work closely with the teams responsible for the solutions to build out alerts and thresholds. We really try to only deliver actionable alerts. If you can't take action they just end up getting ignored. Alerts are different from Notifications! 🙂
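One way to picture the "alerts are different from notifications" rule above is a tiny router. This is only a sketch: the `Signal` fields and the idea of using a runbook link as the test for "actionable" are assumptions for illustration, not something described in the talk.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Signal:
    """Hypothetical monitoring signal."""
    name: str
    threshold_breached: bool
    runbook_url: Optional[str] = None  # assumed marker that someone can act on it


def route(signal: Signal) -> str:
    """Return 'page', 'notify', or 'drop' for a signal."""
    if not signal.threshold_breached:
        return "drop"    # nothing happened, so nobody needs to hear about it
    if signal.runbook_url:
        return "page"    # actionable: a human can follow the runbook right now
    return "notify"      # informational: goes to a channel or dashboard, not a pager
```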
Jeremy or Andy, you talk about being a large organization with a mainframe. Does State Farm run legacy batch Cobol? The organization I work for has many .Net applications that work very well with all the new technologies but trying to determine how we get our Legacy batch Cobol programs more into a CICD Pipeline.
We do have legacy batch Cobol. I can get you in contact with our architect in charge of Mainframe Cobol if you want?
@x95castle1, that would be great. we are always looking for recommendations on how to move legacy forward. thanks
which observability tools have helped you all achieve the goals for reliability? and also how have you trained developers to instrument code for enabling observability?
large enterprises have lot of tools (tool fatigue) have u driven standardization or let teams use what they want?
great preso guys, good follow-up to last year!
Thanks @x95castle1 @andy.hinegardner.slw2 great case study
@scott.dedoes We pretty much use all of them. jk We do have a bunch from open source to Vendor products. The stack my team maintains is on the Open source side. Prometheus, Grafana etc. We also have tools like Dynatrace, Datadog and some other big legacy solutions.
Thanks for the great talk @x95castle1 and @andy.hinegardner.slw2 This has been very informative into the insight behind large organizations mindset and steps along their digital transformation journey!
@archana_kataria We use our LATTES solution I mentioned as well as Dynatrace. We do have some growing to do in showing increased reliability. Right now we are focusing on reducing outages or impact duration. We are still working on getting Devs onboard but we help them by delivering easy to use/digest code snippets or packaged solutions around observability. Great ? on the lots of tooling question! We recently put a Director and team on point to help with tool consolidation. We are still early in that journey but my SRE team is heavily involved in providing input. 🙂
@x95castle1 @andy.hinegardner.slw2 great talk guys! I too work on the Ops side, but for a managed platform that hosts systems for multiple insurance companies. I'm curious, have you guys been able to move away from off-business hours deployments? Is it even possible in the insurance industry?
We do a very modest monthly release cycle, with the occasional weekly based on urgency. Better than what it used to be, but lots of room for improvement still. Thanks!
Ooof, moving from mainframe to cloud, no small undertaking
appreciate the mainframe reference! architecture so matters.
@linoe13 We have moved away from off hours deployment for most solutions. Small/frequent deployments with the ability to roll back quickly is key. We have some apps that are heavily integrated so they are more challenging. Our goal is to get folks to A/B or Canary deployments where we can bleed traffic to the new app until we fully transition to it. With that said though, to do that the app really needs to design for that approach.
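The "bleed traffic to the new app" approach above can be sketched as a deterministic percentage split. This is an illustration only: hashing a user ID into 100 buckets is an assumed technique, and `pick_version` is a hypothetical name; in practice the split usually lives in a load balancer or service mesh rather than in application code.

```python
import hashlib


def pick_version(user_id: str, canary_percent: int) -> str:
    """Route a user to 'canary' or 'stable' based on a stable hash bucket."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Map the user to one of 100 buckets (approximately uniform).
    bucket = int.from_bytes(digest[:2], "big") % 100
    return "canary" if bucket < canary_percent else "stable"
```

Because the bucket depends only on the user ID, raising `canary_percent` step by step only ever moves users from stable to canary, and setting it back to 0 is the quick rollback.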
it's completely right, architecture and design matter. we need to have modern applications to get all the benefit from DevOps
Agreed! Parts of our solution + architecture are unable to avoid 3-5 hours of downtime for deployments. My gut tells me we should first strive for no downtime, then work on reducing the time it takes to deploy. Lots of improvements still, perhaps the simplest is simplifying our deployment process so that apps that don't require this downtime can be deployed separately outside of these monthly "ceremonies" we essentially perform
Great improvements! Do the developers also deploy their own code? (like they test their own code)
yes, the developers have responsibility for their tests, their code and infrastructure
yes, we included policies, gates and different practices for quality assurance in the pipeline 😄
2 years ago, we needed 18 days to test our ATMs, so we decided to include it in the pipeline
and a CapitalOne/Hygieia/FOSS shout out. such a great story
I wonder if any of my friends at AA have seen that? They could use robots for testing their airport kiosks.
We're looking to use Robots to test other devices like IVR, PAC, POS, etc.
Do you provide the developers with "blessed" base Docker images?
Yes, we have our “blessed” images in Artifactory, and we block traffic to cloud registries like Docker Hub
It's a decision from the dev team, accepting the risk for life or temporarily! You will see in the next part that the Forces agent in CI blocks or doesn't block according to your risk criteria. But the main point with Drills is that they are real hackers and confirmed vulnerabilities.
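A rough sketch of a CI gate that blocks or passes a build according to risk criteria, in the spirit of the message above. This is not the Forces agent's actual interface: the `Finding` fields, the 0-10 severity scale, and `should_block` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Finding:
    """Hypothetical confirmed vulnerability reported against a build."""
    identifier: str
    severity: float   # assumed 0-10 scale, higher is worse
    accepted: bool    # True if the dev team has formally accepted the risk


def should_block(findings: Iterable[Finding], max_tolerated_severity: float) -> bool:
    """Block when any open (not accepted) finding exceeds the team's tolerance."""
    return any(
        not f.accepted and f.severity > max_tolerated_severity
        for f in findings
    )
```

Accepted findings never block, which mirrors the idea that the team owns the decision to accept a risk, whether permanently or temporarily.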
Square used to do something similar to Integrates, good to see it happening elsewhere
we need to promote the communication between developers and hackers...and of course, security to shift left
thanks a lot for watching this breakout session. I hope you enjoyed it and appreciate our work 😄
Ooof, now you have to secure the executive visibility mobile app! 🙂
Yes! It's the same with the applications that store your vulnerabilities, whatever it is!
I thought the stopping deployment based on risk profile was 💯 !
It is, in the end, devops teams should have ownership, and that includes accepting the risk themselves and confronting the implications with everyone
Greetings! Welcome to "Low Context DevOps: 3 Ways to End Knowledge Frustration"!
"How was I supposed to know that" !!!! You don't know how many times that has come out of my mouth.
subliminal messages in the interview process!! 😆
it still puzzles me that the person talking is answering questions at the same time 😄
The terror of speaking in front of an audience is now replaced by the horror of watching yourself on video 😂
I just keep nodding in agreement, so hard to see virtually
In defense of Penn Station, they are rolling out all new signage. (30 years late)
There are many IT organizations giving this same excuse "We're working on it for a release in Q4..."
My wiki has an onboarding page. I have a conversation with everyone who comes into the team and the statement is: • This page should get you going. • You are responsible for fixing whatever you find is wrong with it (“because I no longer care”). Moderately successful, though I infrequently walk through and update it to my standards.
Snover told me that at Microsoft they call this "Make right easy".
But I think @tal made a good point too, as the leader you own replacing the light bulbs. The onus is not on the new people (or the other team or ...)
BTW: my SO was surprised I would use the term "lazy" as a good thing. But... I got into computers because I wanted a robot that does my job for me. Isn't that being lazy???
Isn't the famous Larry Wall (creator of Perl) quote - The three virtues of a programmer are Laziness, Impatience, and Hubris
hi, sorry for the off-topic, I'm trying to wrap my head around a C implementation of OpenSSL TLS Resumption, would there be anyone kind enough to share some advice on where should I start? My goal is to implement it in a Gnome library.
They are great! And nice examples of information overload that can be helped by some context.
"A.B.A.: Always Be Documenting"... am I missing something here?
I like the ABA acronym. It needs a Glengarry Glen Ross meme to go with it 😎
@tal are you going to talk about how to maintain documentation to keep up with changes?
ABA comes from my article "Manual Work is a Bug"
I love documentation... but I also realize that I am not in the norm at all 😛
Ironically, that article has my best explanation of what NOT to automate.
I consider a good procedure doc to be "automation lite". Often that's good enough. When it isn't enough, it becomes the spec that the engineer will need to write the automation.
For halloween I'm going dressed as a blank screen.
URL referring to BSS in schools. https://www.hercampus.com/school/uprm/blank-page-syndrome-causes-symptoms-and-treatments
“4am” — Times when I’m “stupid” — Account for that loss of cognition
Or 2am playing console games on a personal day when you get called into a bridge… ;-}
"4am Tom" is the target audience for the docs I write. He's a cool dude but not nearly as cool as "10am Tom".
Thoughts on keeping documentation with the code vs. document repository?
with code creates a barrier to entry if people are accustomed to a Wiki?
But it keeps all the updates in one place instead of trying to make two different things match.
there should be a bot to automate these suggestions
The Stackoverflow for Teams product has a Slack integration that notices someone asking a question and will say, "Would you like to ask that on SO4Teams?" and present a button that will post it for you.
didn't know there was this kind of integration for SO
8 years ago I developed a Firefox extension to rewrite Stack Overflow links to a local proxy to do stats on which questions and tags people did look for in the company
it was pretty cool but people were not that comfortable using it, it still was fun to understand what some of us were trying to learn
These are great tips @tal. I feel like I studied computer science/math explicitly to avoid writing. The struggle is real.
Does the 70-20-10 ratio apply here? Roughly 10% will write new docs, Roughly 20% might update, the rest are happy it is there...
Yes! That's why writing in small batches is so important. If everyone is always doing small updates, you've covered the 70.
in a similar vein, from Black Hawk Down: "I hate being dependable!"
I love when I can answer an e-mail with "well if you look in the documentation on page X you'll find the answer you want"
first follower is also important 🙂 https://www.youtube.com/watch?v=V_qO7NFp4-s
I haven't seen this video used in a talk but I saw the original when it came out and loved it.
i think i saw it as part of a keynote at devops pro moscow 2018
Good question. I'm not sure it is. Engineers love a challenge.
@tommd - I think it is. But I would rather rephrase it to "make wrong boring" or "make wrong expensive".
@tal this talk was 💯 . I'll be sharing this one with my team for sure!
Thank you @tal for voicing the idea that Documentation is a desirable deliverable.
Two things (1) create a culture of constant updates so that things don't get stale. People should "pull the andon cord" to update docs. Many places make it hard to update a doc... you need approvals, readers don't have access to edit a doc so they have to file a bug. Instead, you have to make it friction-free to update docs. (2) Old docs need to be visible. Display the "last updated" date, change the color if a doc is old, etc.
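Point (2) above, making old docs visible, could be sketched as a small freshness check that a wiki or docs site runs when rendering a page. The thresholds of 180 and 365 days and the label names are assumptions chosen for illustration.

```python
from datetime import date


def freshness_label(last_updated: date, today: date,
                    stale_after_days: int = 180,
                    old_after_days: int = 365) -> str:
    """Classify a doc by age so the page can show a colored 'last updated' badge."""
    age_days = (today - last_updated).days
    if age_days >= old_after_days:
        return "old"     # e.g. render the badge in red
    if age_days >= stale_after_days:
        return "stale"   # e.g. render the badge in yellow
    return "fresh"
```

Passing `today` explicitly keeps the function easy to test; a caller would normally supply `date.today()`.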
We're working on it. https://stackoverflow.blog/2020/09/28/migration-wiki-documentation-articles/
@tal Your early story was a reason my spouse dropped out of engineering entirely - it's hard and you have to figure it out
Yeah, I hate that about our industry. People conflate learning the hard way with the joy of learning. People need guard rails, not a ... what's that thing where you have to go through many challenges? A confidence course?
I like to have a yearly "documentation fixit day" where the team stops and reviews all docs. We give prizes to people that find the most obsolete docs, update the most, etc.
Thank YOU for giving me the opportunity to speak!
Nice talk @tal, and thank you again for making it out to Madison last year to speak at our https://itproconf.wisc.edu/!
@tal you mentioned templates as one of the helpful techniques for better documentation. Do you have some advice on how to create these templates? Some links that may help?
Thanks @john.roe. Question around the feature teams - it seems like you all moved to micro services before aligning the teams. Is that true? And if so, would you do it the same if you did it again? Mostly what I see done is a reverse Conway which would change the team alignment first. I am not being critical, I am interested in the learnings.
It's true that for the most part, we started with teams, regardless of size.. most were setup in a "Scrum" like fashion, but managed many, many micro services
team alignment is something we are undertaking now, to better organize us around capabilities
To expand on the capability concept, we partnered with a business architect to break down tax preparation into business capabilities - expanding both our DIY and Assisted experience
I'm a long time user of TaxCut and this presentation has helped me appreciate all of the work done behind this program.
Thanks for being a long time customer...I think you'll appreciate the experience investment you'll see as we mature on this journey
Looking forward to it. It's a great product. And excellent presentation - I'm impressed with how you approach all the changes you need to make each year and how fields cascade.
@john.roe, were your CoEs staffed exclusively? Or were they just voluntary positions of stakeholders from feature teams?
We've come A LONG WAY since TaxCut (still think I have a t-shirt I've worn during the pandemic)
Our pleasure! Thank you for allowing H&R Block to be participants in the 2020 DOES conference!
👋 Hey everyone. I’m happy to answer your questions on Chaos Engineering.
@john.roe and @tony.ogden Great talk. Did you find that your testing needs were unique to your organization? Or did you find yourself aligning to more industry guidelines?
We utilized industry guidelines to help us frame up how we should attack the problem. Taxes give us a unique situation, but not necessarily more than others in a heavily regulated industry.
We continue to challenge ourselves as well to not think of Block as being unique and to leverage industry guidelines
The necessary automation of that must be pretty intense. I’m in the insurance space, so I can relate to a degree.
Absolutely! It's a game changer for us given the late breaking regulatory changes that we have to support each tax season.
Any idea of roughly how many tests you need to run before a release?
Great question and while it bothers me to reply with 'it depends', it does hold true. For example, states and federal are broken out separately so it depends on which states and federal entities are included in a particular release. For a single state, we typically run 15-20 regression state returns and hundreds of test cases specific to the regulatory changes. We've shifted though to running the entire test suite in an automated fashion and it has expedited our turn around.
Did you suggest to them possible solutions, or were you focused on detecting "ugly" solutions?
Our findings were always paired with a suggested course of action. We wanted teams to take away something more than just shame. 🙂
Did the teams call you guys or were you assigned to teams based on some criteria - and if so which? (If you're going to answer in the presentation, I will wait 🙂 )
We were assigned to work with teams. At first, we tried to base this on a composite metric that tried to quantify and weigh impact to our end users. Shortly after we started operating, however, our executives basically said "Don't we know who the bad actors are? Don't we know which teams need help?"
We had to admit that, yes, we had a good idea of which teams already needed the most help, but we made them promise us to go back to eventually using a real data-focused approach.
Okay... that makes sense to get the ball rolling. Did you eventually evolve to a data-driven approach to find teams that needed help (and with what)?
The experiment essentially ran its course -- I'll cover this in the section on "failures"
We basically kept a :shit: list, whose contents were curated by leaders within R&D. We were often deployed reactively -- teams moving up the list because they had recent quality issues that were relatively visible.
I really wanted to. I sort of address this in the upcoming "failures" section.
what do you think about transversal scenarios? for example, killing a complete Kubernetes cluster... it isn't continuous chaos because we'll generate unavailability of the platform. scheduled chaos?
I’m not quite sure what you mean by “transversal scenarios”. Do you mean doing things like killing a cluster to test zone/region/cloud failover?
I think that it’s extremely valuable. I think a lot of teams don’t test it enough and usually have poor understanding of how long it actually takes.
I mean, AWS would tell you that it’s fast and easy. But until you do it a few times, it’s never going to be as smooth as you’ll need to be during an incident
All that said, I advise that teams start small. You have to do the basic, e.g. can your service automatically restart, before you try on the large scale.
of course, and they can say it... but we need to configure many things and it's better to be prepared
@jyee @matthew.simons so many great one liners... I'm going to have to go back and watch to capture some of them better.
I can only think of one that didn't love us afterwards. The short version is that we highlighted problem areas that the tech lead on the team got his ego wrapped in too much. It got pretty political.
Not the inquisition but I know that @matthew.simons does have ruthless efficiency. 😉
I have missed the interesting part. Did you do Chaos testing in Prod or Staging?
We have a sort of hybrid prod/staging environment, and we've focused our chaos efforts there.
We use our own platform extensively internally in the course of our actual business, so we target this internal environment where real users are doing real business-value-add activities, but paying customers aren't in it.
Was there a specific Jedi mind trick you used to convince senior leadership it was a good idea to randomly take production down?
At Workiva we haven't unleashed chaos on paying customers yet. That's in the roadmap, but it may be a ways away. We're also lucky in that we have a sort of prod environment that real users are using, but that doesn't have paying customers on it. So we can do chaos in an environment that really closely mirrors our main prod environments without actually impacting paying customers.
I imagine this is the hardest sell. What senior leader thinks it's a good idea to randomly take down production? The sell must be around the value we get from that ... it makes us better, stronger, faster. And we practice in non-prod first. But still, there's a chance that taking down production won't be like taking down non-prod.
Yeah, I struggle enough, even with data and research, to get senior leadership to give things a chance that would be direct cost benefits... let alone invest in something that could cost money in revenue at the promise of making up for it in the long run.
@craig.larsen In working with our customers, the first step is to move away from the idea of random take downs and be very methodical and precise.
But also yes, practice in pre-prod/staging first. Build up confidence before going into prod
Makes sense. Do you find that you can build a lot of resiliency in pre-prod environments?
There's an industry risk axis that's important, too. I sort of generalize the ends of the spectrum as Netflix and NASA. If Netflix has a blip in prod, someone doesn't watch a show (sad). If NASA has a small blip in "prod", people can die and enormous amounts of capital and public trust go down the toilet. If you are NASA, you probably don't want chaos in "production", but you better damn sure be running chaos in as close to prod as you can.
For Workiva, we deal with pre-release financial data for 70% of the Fortune 500, and mistakes in the data we handle in prod could literally cost livelihoods and trust in our platform would evaporate.
On resiliency in pre-prod: for most enterprises that I work with, it's just about getting visibility. E.g. I’ve worked with teams where we spend a good chunk of early work using Chaos Engineering just to ensure their service is emitting useful metrics so they can simply know when it’s gone down or something’s gone wrong.
Yep. I've certainly been involved in monitoring systems, visualizing workflows, value streams, performance metrics, etc. I've never had the opportunity to play with a Chaos Monkey.
At the risk of sounding like a shill, Gremlin does have a http://gremlin.com/free if you ever want to try it. Though honestly, I started my Chaos Engineering work just using stress-ng and Linux command line tools.
I have this feeling that the cultural aspects of your work were harder/more interesting than the chaos testing itself. It would be for me.
Thank you @matthew.simons @jyee . If you ever see my software - please do tell me in great detail why exactly it sucks 🙂
I think there's an implicit contract that doing so would make us friends, right? 🙂