ask-the-speaker-track-2 2020-10-15 | Devops Enterprise Summit Slack Archive

It was a pretty big struggle culturally because folks are pretty used to working a certain way with mainframe. One thing that really made a difference for us has been moving all our mainframe code into Git and building automation against mainframe code using IDz and Jenkins.

Matt Masuda - Quicken Loans18:10:00

Can you guys give me some examples of what kind of things you are doing with GitOps? I've been hearing a lot about it but I'm not super clear on how I would apply it to a typical dev team.

Lucas Melo (American Airlines Architect)18:10:50

One thing we've been doing more is focusing collaboration on GitHub PRs/Issues and using the review/approval process of a PR as the trigger to kick off deployments: PR Merge = Deploy.

Jeremy Castle18:10:12

that's exactly what we've been trying to do

Matt Masuda - Quicken Loans18:10:28

Okay. I think we are moving in the same direction. My team has one repo where a PR merge results in a deploy to an environment, with the explicit understanding that the review/approval means it's good to go.

Matt Masuda - Quicken Loans18:10:51

This slide is hilarious.

Scott Dedoes18:10:04

As you released features into production were you using feature flags to kill features?

Jeremy Castle18:10:46

Sure. We've decided to make GitOps our primary change management mechanism for our Kubernetes, PCF/Tanzu, and AWS based platforms. What that means is git also stores how the configuration of the environment looks in a config repo and is tied to TFE. We use GitLab Runners to power the pipelines. When changes are ready to go to a test environment or production the merge request serves as approval by management. That means teams no longer have to leave GitLab.

👍 3
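A minimal sketch of the "merge request = approval = deploy" idea described above. The event fields (`action`, `merged`, `target_branch`, `approvals`) and the branch-to-environment map are hypothetical and do not reflect the actual GitLab or GitHub webhook schema; a real pipeline would read these from the platform's own event payload.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical mapping from a protected branch to the environment it deploys to.
BRANCH_TO_ENV = {"main": "production", "staging": "test"}


@dataclass
class MergeEvent:
    action: str         # e.g. "opened" or "closed" (assumed values)
    merged: bool        # True only when the request was actually merged
    target_branch: str
    approvals: int      # number of reviewers who approved


def environment_to_deploy(event: MergeEvent, required_approvals: int = 1) -> Optional[str]:
    """Return the environment to deploy to, or None if nothing should deploy."""
    # Only a closed-and-merged request counts; a closed-unmerged one does not.
    if event.action != "closed" or not event.merged:
        return None
    # The review/approval is the change-management sign-off.
    if event.approvals < required_approvals:
        return None
    # Branches without a mapped environment never deploy.
    return BRANCH_TO_ENV.get(event.target_branch)
```

Keeping the decision in one pure function makes the approval rule easy to test and audit separately from whatever actually runs the deployment.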

Jeremy Castle18:10:14

We are not as mature with feature flags as we could be. We do have pockets of it, but I think it's a problem we need to solve better for our teams.

Scott Dedoes18:10:49

Where do feature flags fit in the priority of needs on your transformation journey? Curious what would factor into your decision to build your own vs. look at third parties?

Jeremy Castle18:10:56

I'd be open to buying something to solve that problem vs rolling our own.

Scott Dedoes18:10:29

Would you say it's a short term or long term priority? This is my first DOES and I'm new to the DevOps space as well so had assumed most large enterprise companies who had undergone digital transformation were already implementing FF at least with basic on/off capabilities but I'm learning that isn't necessarily true so this has been a great learning experience.

Derick Stenftenagel - Director - Cloud and Platform Svcs, Edward Jones18:10:19

@andy.hinegardner.slw2 @x95castle1 do you have challenges with too much alerting or alerting that isn’t actionable? Do you have best practices for managing those challenges?

Scott Dedoes18:10:37

What APM tools are in your tech stack for alerting and monitoring?

EmanuelMedina - Bancolombia18:10:41

Just a random comment: if you are looking for an open-source environment, check out https://openapm.io/landscape. OpenTelemetry is a standard, so you could integrate with private tools like Instana.

👍 1

andyhinegardner18:10:55

Good question... We work closely with the teams responsible for the solutions to build out alerts and thresholds. We really try to only deliver actionable alerts. If you can't take action they just end up getting ignored. Alerts are different from Notifications! 🙂

👍 2
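One way to picture the "alerts are different from notifications" distinction is a small routing rule. This is only a sketch: it assumes "actionable" can be modeled as "has a runbook", and the `Signal` type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    name: str
    breached: bool         # True when the agreed threshold is currently exceeded
    runbook_url: str = ""  # empty means nobody has defined an action to take


def route(signal: Signal) -> str:
    """Classify a signal as 'alert', 'notification', or 'drop'."""
    if not signal.breached:
        return "drop"          # nothing is wrong, so nobody needs to hear about it
    if signal.runbook_url:
        return "alert"         # someone can take action, so interrupt a human
    return "notification"      # informational only, so keep it out of the pager
```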

Scott Preister18:10:05

Jeremy or Andy, you talk about being a large organization with a mainframe. Does State Farm run legacy batch Cobol? The organization I work for has many .Net applications that work very well with all the new technologies but trying to determine how we get our Legacy batch Cobol programs more into a CICD Pipeline.

Jeremy Castle18:10:37

We do have legacy batch Cobol. I can get you in contact with our architect in charge of Mainframe Cobol if you want?

Scott Preister18:10:31

@x95castle1, that would be great. We are always looking for recommendations on how to move legacy forward. Thanks.

archana kataria18:10:09

Which observability tools have helped you all achieve the goals for reliability? And also, how have you trained developers to instrument code for enabling observability?

archana kataria18:10:38

Large enterprises have a lot of tools (tool fatigue). Have you driven standardization or let teams use what they want?

Rikard Ottosson - Psychological Safety (People Not Tech Ltd)18:10:23

Thanks 👋

Derick Stenftenagel - Director - Cloud and Platform Svcs, Edward Jones18:10:54

great preso guys, good follow-up to last year!

James (TeamForm) - helping teams at scale18:10:55

Thanks @x95castle1 @andy.hinegardner.slw2 great case study

andyhinegardner18:10:00

@scott.dedoes We pretty much use all of them. jk We do have a bunch, from open source to vendor products. The stack my team maintains is on the open source side: Prometheus, Grafana, etc. We also have tools like Dynatrace, Datadog and some other big legacy solutions.

👍 1

Scott Dedoes18:10:08

Thanks for the great talk @x95castle1 and @andy.hinegardner.slw2 This has been very informative into the insight behind large organizations mindset and steps along their digital transformation journey!

andyhinegardner18:10:02

@archana_kataria We use our LATTES solution I mentioned as well as Dynatrace. We do have some growing to do in showing increased reliability. Right now we are focusing on reducing outages or impact duration. We are still working on getting Devs onboard but we help them by delivering easy to use/digest code snippets or packaged solutions around observability. Great ? on the lots of tooling question! We recently put a Director and team on point to help with tool consolidation. We are still early in that journey but my SRE team is heavily involved in providing input. 🙂

Lino E Carrillo18:10:10

@x95castle1 @andy.hinegardner.slw2 great talk guys! I too work on the Ops side, but for a managed platform that hosts systems for multiple insurance companies. I'm curious, have you guys been able to move away from off-business hours deployments? Is it even possible in the insurance industry?

Jeremy Castle18:10:55

We actually deploy 100's of times a day to production at all times.

Lino E Carrillo19:10:42

We do a very modest monthly release cycle, with the occasional weekly based on urgency. Better than what it used to be, but lots of room for improvement still. Thanks!

Dave Mangot - DevOps transformation professional18:10:46

:flag-co: !

❤️ 1

👻 1

Dave Mangot - DevOps transformation professional18:10:08

Ooof, moving from mainframe to cloud, no small undertaking

Virginia Laurenzano NSA18:10:12

appreciate the mainframe reference! architecture so matters.

andyhinegardner18:10:55

@linoe13 We have moved away from off hours deployment for most solutions. Small/frequent deployments with the ability to roll back quickly is key. We have some apps that are heavily integrated so they are more challenging. Our goal is to get folks to A/B or Canary deployments where we can bleed traffic to the new app until we fully transition to it. With that said though, to do that the app really needs to be designed for that approach.
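A rough sketch of what "bleeding traffic" to a canary can look like, assuming hash-based bucketing of a request or user ID and a simple error-rate guard. The function names, the 1% error threshold, and the 10-point step are illustrative assumptions, not anything described in the talk.

```python
import hashlib


def pick_version(request_id: str, canary_percent: int) -> str:
    """Deterministically route a request to 'canary' or 'stable'."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    # Hash the ID so the same caller always lands in the same bucket (0-99).
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return "canary" if bucket < canary_percent else "stable"


def next_canary_percent(current: int, error_rate: float,
                        max_error_rate: float = 0.01, step: int = 10) -> int:
    """Shift more traffic to the canary, or roll back to 0 if it misbehaves."""
    if error_rate > max_error_rate:
        return 0  # quick rollback: all traffic returns to the stable version
    return min(100, current + step)
```

Deterministic bucketing keeps a given caller on one version while the percentage grows, which is part of why the app has to be designed to run both versions side by side.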

Camilo Piedrahita - Bancolombia - IT Manager18:10:49

it's completely right, architecture and design matter. we need to have modern applications to get all the benefit from DevOps

Lino E Carrillo19:10:03

Agreed! Parts of our solution + architecture are unable to avoid 3-5 hours of downtime for deployments. My gut tells me we should first strive for no downtime, then work on reducing the time it takes to deploy. Lots of improvements still, perhaps the simplest is simplifying our deployment process so that apps that don't require this downtime can be deployed separately outside of these monthly "ceremonies" we essentially perform

Dave Mangot - DevOps transformation professional18:10:52

Great improvements! Do the developers also deploy their own code? (like they test their own code)

Camilo Piedrahita - Bancolombia - IT Manager18:10:18

yes, the developers have the responsibility for their tests, their code and infrastructure

Dave Mangot - DevOps transformation professional18:10:54

To production? Are code reviews mandatory?

Camilo Piedrahita - Bancolombia - IT Manager18:10:59

yes, we included policies, gates and different practices for quality assurance in the pipeline 😄

Dave Mangot - DevOps transformation professional18:10:19

:robot_face:

Matt Masuda - Quicken Loans18:10:27

Whoa!!

Virginia Laurenzano NSA18:10:39

robots. jealous!

Camilo Piedrahita - Bancolombia - IT Manager18:10:08

2 years ago, we needed 18 days to test our ATMs, so we decided to include it in the pipeline

Virginia Laurenzano NSA18:10:44

makes me think of Ikea. I love it.

Virginia Laurenzano NSA18:10:46

and a CapitalOne/Hygieia/FOSS shout out. such a great story

Matt Masuda - Quicken Loans18:10:54

I wonder if any of my friends at AA have seen that? They could use robots for testing their airport kiosks.

😁 1

Santiago Cardona18:10:20

We're looking to use Robots to test other devices like IVR, PAC, POS, etc.

🔥 1

Dave Mangot - DevOps transformation professional18:10:23

Do you provide the developers with "blessed" base Docker images?

EmanuelMedina - Bancolombia18:10:32

Yes, we have our “blessed” images in Artifactory, and we block traffic to cloud registries like Docker Hub

❤️ 1
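The "blessed images only" policy can be approximated with a registry allow-list check, for example in a CI step. This is a sketch under assumptions: the registry hostname is made up, and the parsing follows the usual Docker convention that the first path component is a registry only if it contains a dot or colon or is `localhost`; otherwise the image resolves to Docker Hub.

```python
# Hypothetical internal registry; substitute the real Artifactory hostname.
ALLOWED_REGISTRIES = {"artifactory.example.internal"}


def registry_of(image: str) -> str:
    """Return the registry host an image reference resolves to."""
    first, sep, _rest = image.partition("/")
    # "nginx:1.19" or "library/nginx" carry no registry host, so they
    # implicitly resolve to Docker Hub.
    if sep and ("." in first or ":" in first or first == "localhost"):
        return first
    return "docker.io"


def is_blessed(image: str) -> bool:
    """True when the image comes from an approved internal registry."""
    return registry_of(image) in ALLOWED_REGISTRIES
```

A check like this only complements the network-level block mentioned above; it gives developers an early, readable failure instead of a pull timeout.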

Dave Mangot - DevOps transformation professional18:10:49

Are the drills blockers for deployment?

Rafael Alvarez [Fluid Attacks] CTO - Co-Founder18:10:26

It's a decision from the dev team, accepting the risk for life or temporarily! You will see in the next part that the Forces agent in the CI can block or not block according to your risk criteria. But the main point with Drills is that they are real hackers and confirmed vulnerabilities.

✅ 1
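To illustrate the "block or non-block according to your risk criteria" idea in general terms, here is a hedged sketch of a CI gate. It is not the Forces agent's actual interface or behavior; the `Finding` type, the severity scale, and the `strict` flag are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Finding:
    identifier: str
    severity: float         # assumed CVSS-like score from 0.0 to 10.0
    accepted: bool = False  # True when the team has explicitly accepted the risk


def should_break_build(findings: Iterable[Finding], max_severity: float,
                       strict: bool = True) -> bool:
    """Decide whether confirmed, unaccepted findings should stop the pipeline."""
    if not strict:
        return False  # non-blocking mode: report findings but never fail the build
    # Break only on findings above the team's threshold that nobody has accepted.
    return any(f.severity > max_severity and not f.accepted for f in findings)
```

The `accepted` flag is what keeps the decision with the dev team: accepting a risk is an explicit, recorded choice rather than a silent bypass.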

Dave Mangot - DevOps transformation professional18:10:36

Square used to do something similar to Integrates, good to see it happening elsewhere

Camilo Piedrahita - Bancolombia - IT Manager19:10:22

we need to promote the communication between developers and hackers...and of course, security to shift left

Camilo Piedrahita - Bancolombia - IT Manager19:10:14

thanks a lot for watching this breakout session. I hope you enjoyed it and appreciate our work 😄

👏 2

👍 1

Dave Mangot - DevOps transformation professional19:10:20

Ooof, now you have to secure the executive visibility mobile app! 🙂

🙌 1

Rafael Alvarez [Fluid Attacks] CTO - Co-Founder19:10:07

Yes! It's the same with the applications that store your vulnerabilities, whatever they are!

💯 1

Matt Masuda - Quicken Loans19:10:32

👏

🙌 1

Rafael Alvarez [Fluid Attacks] CTO - Co-Founder19:10:27

Thanks Matt

TomLimoncelli (he/him) Speaker Op Best Practices for April Fools19:10:52

Great talk, folks! Tough act to follow!

Dave Mangot - DevOps transformation professional19:10:36

I thought the stopping deployment based on risk profile was 💯 !

Rafael Alvarez [Fluid Attacks] CTO - Co-Founder19:10:44

It is. In the end, DevOps teams should have ownership, and that includes accepting the risk themselves and confronting the implications with everyone

TomLimoncelli (he/him) Speaker Op Best Practices for April Fools19:10:46

Greetings! Welcome to "Low Context DevOps: 3 Ways to End Knowledge Frustration"!

🎉 1

👍 1

👋 1

Jess Meyer - IT Revolution (she/her)19:10:54

Welcome @tal!

Jack Vinson - flow19:10:19

"How was I supposed to know that" !!!! You don't know how many times that has come out of my mouth.

👍 1

Dave Mangot - DevOps transformation professional19:10:17

subliminal messages in the interview process!! 😆

Dave Mangot - DevOps transformation professional19:10:46

like the Matrix, "I know Kung Fu!"

Paula Thrasher - PagerDuty19:10:28

😂 Penn Station observation spot on

Joy from Stack Overflow19:10:59

(that was my favorite story)

➕ 1

Roman Pickl - technical pm - Elektrobit19:10:44

it still puzzles me that the person talking is answering questions at the same time 😄

Paula Thrasher - PagerDuty19:10:26

The terror of speaking in front of an audience is now replaced by the horror of watching yourself on video 😂

😂 2

🎯 1

Joy from Stack Overflow19:10:55

snort

Dave Mangot - DevOps transformation professional19:10:08

I just keep nodding in agreement, so hard to see virtually

TomLimoncelli (he/him) Speaker Op Best Practices for April Fools19:10:02

In defense of Penn Station, they are rolling out all new signage. (30 years late)

Paula Thrasher - PagerDuty19:10:41

There are many IT organizations giving this same excuse "We're working on it for a release in Q4..."

Frotz Faatuai (Cisco IT - he/him)19:10:06

My wiki has an onboarding page. I have a conversation with everyone who comes into the team and the statement is: • This page should get you going. • You are responsible for fixing whatever you find is wrong with it (“because I no longer care”). Moderately successful, though I infrequently walk through and update it to my standards.

TomLimoncelli (he/him) Speaker Op Best Practices for April Fools19:10:12

Snover told me that at Microsoft they call this "Make right easy".

❤️ 1

Dave Mangot - DevOps transformation professional19:10:50

I say "Make the right way the easy way"

❤️ 1

Dave Mangot - DevOps transformation professional19:10:08

OpenSSL vs. LibreSSL?? Shots fired! 💥

😁 1

Paula Thrasher - PagerDuty19:10:18

But I think @tal made a good point too, as the leader you own replacing the light bulbs. The onus is not on the new people (or the other team or ...)

TomLimoncelli (he/him) Speaker Op Best Practices for April Fools19:10:13

BTW: my SO was surprised I would use the term "lazy" as a good thing. But... I got into computers because I wanted a robot that does my job for me. Isn't that being lazy???

Paula Thrasher - PagerDuty19:10:10

Isn't the famous Larry Wall (creator of Perl) quote that the three virtues of a programmer are Laziness, Impatience, and Hubris?

João Acabado - Principal Engineer - Sky UK19:10:18

hi, sorry for the off-topic, I'm trying to wrap my head around a C implementation of OpenSSL TLS Resumption, would there be anyone kind enough to share some advice on where should I start? My goal is to implement it in a Gnome library.

Dave Mangot - DevOps transformation professional19:10:10

I think @tal likes train stations 😂

👍 1

Jennifer Velasquez19:10:06

“not too much or too little”, exactly what I need.

Jack Vinson - flow19:10:18

They are great! And nice examples of information overload that can be helped by some context.

👍 1

Dave Mangot - DevOps transformation professional19:10:19

Lagom in Swedish

Matt Masuda - Quicken Loans19:10:16

"A.B.A.: Always Be Documenting"... am I missing something here?

😆 1

Paula Thrasher - PagerDuty19:10:45

I like the ABA acronym. It needs a Glengarry Glen Ross meme to go with it 😎

Dave Mangot - DevOps transformation professional19:10:19

@tal are you going to talk about how to maintain documentation to keep up with changes?

Matt Masuda - Quicken Loans19:10:24

:raised_hand:

TomLimoncelli (he/him) Speaker Op Best Practices for April Fools19:10:27

ABA comes from my article "Manual Work is a Bug"

👍 3

Frotz Faatuai (Cisco IT - he/him)19:10:28

🙌

Ian Silverwood (IT Manager at Ubisoft)19:10:48

I love documentation... but I also realize that I am not in the norm at all 😛

TomLimoncelli (he/him) Speaker Op Best Practices for April Fools19:10:49

Ironically, that article has my best explanation of what NOT to automate.

TomLimoncelli (he/him) Speaker Op Best Practices for April Fools19:10:39

I consider a good procedure doc to be "automation lite". Often that's good enough. When it isn't enough, it becomes the spec that the engineer will need to write the automation.

👍 1

🤯 1

Matt Masuda - Quicken Loans19:10:32

Yep, that's what I was missing.

Jack Vinson - flow19:10:46

aaaaaaaa

Rob Ables19:10:54

LOL, the blank screen. "Where do I start"....

Shavant Thomas19:10:00

i wasn't ready I was standing up

😆 1

👀 1

TomLimoncelli (he/him) Speaker Op Best Practices for April Fools19:10:06

For halloween I'm going dressed as a blank screen.

😆 2

😂 4

😱 1

😁 1

😅 1

Andrew Hughes - Manager, DevOps Service Delivery QA (TRIMEDX)19:10:09

😱

Tish19:10:25

HAHAHA... that was great!

Phil Jochimsen (UW-Madison)19:10:38

glad I was sitting down 😉

Tish19:10:38

I'm a VIM blank screen user. hehe

Matt Masuda - Quicken Loans19:10:42

YESS DARK MODE

Jack Vinson - flow19:10:44

URL referring to BSS in schools. https://www.hercampus.com/school/uprm/blank-page-syndrome-causes-symptoms-and-treatments

Frotz Faatuai (Cisco IT - he/him)19:10:47

“4am” — Times when I’m “stupid” — Account for that loss of cognition

🕓 1

😂 1

Joy from Stack Overflow19:10:08

I call that "Thursday"

Frotz Faatuai (Cisco IT - he/him)19:10:43

Or 2am playing console games on a personal day when you get called into a bridge… ;-}

TomLimoncelli (he/him) Speaker Op Best Practices for April Fools19:10:50

"4am Tom" is the target audience for the docs I write. He's a cool dude but not nearly as cool as "10am Tom".

😂 3

Dave Mangot - DevOps transformation professional19:10:03

Thoughts on keeping documentation with the code vs. document repository?

João Acabado - Principal Engineer - Sky UK19:10:11

with code creates a barrier to entry if people are accustomed to a Wiki?

Dave Mangot - DevOps transformation professional19:10:14

But it keeps all the updates in one place instead of trying to make two different things match.

João Acabado - Principal Engineer - Sky UK19:10:06

there should be a bot to automate these suggestions

😄 1

TomLimoncelli (he/him) Speaker Op Best Practices for April Fools19:10:18

The Stack Overflow for Teams product has a Slack integration that notices someone asking a question and will say, "Would you like to ask that on SO4Teams?" and present a button that will post it for you.

🔥 1

João Acabado - Principal Engineer - Sky UK19:10:48

didn't know there was this kind of integration for SO

João Acabado - Principal Engineer - Sky UK19:10:55

8 years ago I developed a Firefox extension to rewrite Stack Overflow links to a local proxy to do stats on which questions and tags people did look for in the company

❤️ 1

João Acabado - Principal Engineer - Sky UK19:10:09

it was pretty cool but people were not that comfortable using it, it still was fun to understand what some of us were trying to learn

Matt Masuda - Quicken Loans19:10:31

Or at least a template!!

Paula Thrasher - PagerDuty19:10:23

These are great tips @tal. I feel like I studied computer science/math explicitly to avoid writing. The struggle is real.

😆 1

👍 1

Jack Vinson - flow19:10:17

Does the 70-20-10 ratio apply here? Roughly 10% will write new docs, Roughly 20% might update, the rest are happy it is there...

TomLimoncelli (he/him) Speaker Op Best Practices for April Fools19:10:23

Yes! That's why writing in small batches is so important. If everyone is always doing small updates, you've covered the 70.

Frotz Faatuai (Cisco IT - he/him)19:10:37

“It’s always me”… 😉

Matt Masuda - Quicken Loans19:10:18

in a similar vein, from Black Hawk Down: "I hate being dependable!"

Tish19:10:22

I love when I can answer an e-mail with "well if you look in the documentation on page X you'll find the answer you want"

Roman Pickl - technical pm - Elektrobit19:10:04

first follower is also important 🙂 https://www.youtube.com/watch?v=V_qO7NFp4-s

❤️ 2

Adam Eury - Nike - Release Deploy Lead19:10:55

I haven't seen this video used in a talk but I saw the original when it came out and loved it.

Roman Pickl - technical pm - Elektrobit19:10:06

i think i saw it as part of a keynote at devops pro moscow 2018

Thomas DuBuisson19:10:02

Is "making wrong hard" a suitable inverse of "making right easy"?

👍 3

TomLimoncelli (he/him) Speaker Op Best Practices for April Fools19:10:37

Good question. I'm not sure it is. Engineers love a challenge.

Nick - developer at BNPP21:10:15

@tommd - I think it is. But I would rather rephrase it to "make wrong boring" or "make wrong expensive".

Thomas DuBuisson19:10:18

I heard "make right easy" and thought "tests! analysis!"

Tish19:10:20

Great talk. thanks.

Stephen Magill [Sonatype]19:10:27

Great talk, @tal!

Jack Vinson - flow19:10:33

👏

Matt Masuda - Quicken Loans19:10:39

@tal this talk was 💯 . I'll be sharing this one with my team for sure!

Frotz Faatuai (Cisco IT - he/him)19:10:42

Thank you @tal for voicing the idea that Documentation is a desirable deliverable.

pcn19:10:44

Have you also made maintenance of docs easier somehow?

TomLimoncelli (he/him) Speaker Op Best Practices for April Fools19:10:05

What kind of maintenance?

pcn19:10:32

Keeping them up-to-date, discoverable, make sure they're as testable as possible, etc.

TomLimoncelli (he/him) Speaker Op Best Practices for April Fools19:10:07

Two things: (1) Create a culture of constant updates so that things don't get stale. People should "pull the andon cord" to update docs. Many places make it hard to update a doc: you need approvals, or readers don't have access to edit a doc so they have to file a bug. Instead, you have to make it friction-free to update docs. (2) Old docs need to be visible. Display the "last updated" date, change the color if a doc is old, etc.

🙌 3
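The "make old docs visible" suggestion could be sketched as a small freshness classifier that a wiki template or docs build might call. The 180-day and 365-day thresholds and the color names are arbitrary assumptions for illustration.

```python
from datetime import date


def freshness(last_updated: date, today: date,
              warn_days: int = 180, stale_days: int = 365) -> str:
    """Map a doc's age to a banner color: 'green', 'yellow', or 'red'."""
    age_days = (today - last_updated).days
    if age_days >= stale_days:
        return "red"     # likely obsolete: a candidate for the next fixit day
    if age_days >= warn_days:
        return "yellow"  # getting old: nudge readers to verify and update
    return "green"
```

Passing `today` in explicitly, rather than calling `date.today()` inside, keeps the function deterministic and easy to test.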

Chris Hunt, SRE at Stack Overflow20:10:30

We're working on it. https://stackoverflow.blog/2020/09/28/migration-wiki-documentation-articles/

Jack Vinson - flow19:10:31

@tal Your early story was a reason my spouse dropped out of engineering entirely - it's hard and you have to figure it out

TomLimoncelli (he/him) Speaker Op Best Practices for April Fools19:10:10

Yeah, I hate that about our industry. People conflate learning the hard way with the joy of learning. People need guard rails, not a ... what's that thing where you have to go through many challenges? A confidence course?

TomLimoncelli (he/him) Speaker Op Best Practices for April Fools19:10:45

I like to have a yearly "documentation fixit day" where the team stops and reviews all docs. We give prizes to people that find the most obsolete docs, update the most, etc.

inactive19:10:40

I loved your talk, @tal!!!! It’s such an important concept!! Thank you!!!!

🙌 2

TomLimoncelli (he/him) Speaker Op Best Practices for April Fools19:10:55

Thank YOU for giving me the opportunity to speak!

Phil Jochimsen (UW-Madison)20:10:53

Nice talk @tal, and thank you again for making it out to Madison last year to speak at our https://itproconf.wisc.edu/!

❤️ 1

👍 1

Nick - developer at BNPP20:10:57

@tal you mentioned templates as one of the helpful techniques for better documentation. Do you have some advice on how to create these templates? Some links that may help?

Mark Fuller21:10:49

No chat for this session?

John Roe21:10:37

we're here

Tony Ogden21:10:07

ready for any questions you may have

Mark Fuller21:10:00

Thanks @john.roe. Question around the feature teams - it seems like you all moved to micro services before aligning the teams. Is that true? And if so, would you do it the same if you did it again? Mostly what I see done is a reverse Conway which would change the team alignment first. I am not being critical, I am interested in the learnings.

John Roe21:10:16

It's true that for the most part, we started with teams, regardless of size.. most were setup in a "Scrum" like fashion, but managed many, many micro services

John Roe21:10:02

team alignment is something we are undertaking now, to better organize us around capabilities

👍 1

Tony Ogden21:10:21

To expand on the capability concept, we partnered with a business architect to break down tax preparation into business capabilities - expanding both our DIY and Assisted experience

Mark Fuller21:10:06

Thanks!

James Feudo21:10:01

I'm a long time user of TaxCut and this presentation has helped me appreciate all of the work done behind this program.

👏 1

💯 1

Tony Ogden21:10:56

Thanks for being a long time customer...I think you'll appreciate the experience investment you'll see as we mature on this journey

👍 1

James Feudo21:10:28

Looking forward to it. It's a great product. And excellent presentation - I'm impressed with how you approach all the changes you need to make each year and how fields cascade.

Rajat Sud (DevOps Evangelist - SBPASC, an affiliate of CareFirst) (Speaker)21:10:34

@john.roe, were your CoEs staffed exclusively? Or were they just voluntarily positions of stakeholders from feature teams?

John Roe21:10:31

volunteers only.

Tony Ogden21:10:50

We've come A LONG WAY since TaxCut (still think I have a t-shirt I've worn during the pandemic)

👍 1

James Feudo21:10:27

@tony.ogden I've used it for so long I still think of it as that.

🙂 1

Jess Meyer - IT Revolution (she/her)21:10:14

Thank you @john.roe and @tony.ogden!

👍 2

Tony Ogden21:10:10

Our pleasure! Thank you for allowing H&R Block to be participants in the 2020 DOES conference!

Jess Meyer - IT Revolution (she/her)21:10:16

Wonderful to have you as presenters.

Tony Ogden21:10:04

Hoping to be back next year and share more details of our transformation

Jason Yee - Speaker - Gremlin21:10:35

👋 Hey everyone. I’m happy to answer your questions on Chaos Engineering.

Jason Yee - Speaker - Gremlin21:10:00

Also lol. You can tell how much we didn't rehearse this talk. 😆

Nick - developer at BNPP21:10:08

How did you know?!!! )))

Garrin Ball - DevOps Leader - DDMI21:10:26

@john.roe and @tony.ogden Great talk. Did you find that your testing needs were unique to your organization? Or did you find yourself aligning to more industry guidelines?

John Roe21:10:18

We utilized industry guidelines to help us frame up how we should attack the problem. Taxes give us a unique situation, but not necessarily more than others in a heavily regulated industry.

Tony Ogden21:10:21

We continue to challenge ourselves as well to not think of Block as being unique and to leverage industry guidelines

Garrin Ball - DevOps Leader - DDMI21:10:03

The necessary automation of that must be pretty intense. I’m in the insurance space, so I can relate to a degree.

Tony Ogden21:10:04

Absolutely! It's a game changer for us given the late breaking regulatory changes that we have to support each tax season.

Garrin Ball - DevOps Leader - DDMI21:10:34

Any idea of roughly how many tests you need to run before a release?

Tony Ogden21:10:20

Great question and while it bothers me to reply with 'it depends', it does hold true. For example, states and federal are broken out separately so it depends on which states and federal entities are included in a particular release. For a single state, we typically run 15-20 regression state returns and hundreds of test cases specific to the regulatory changes. We've shifted though to running the entire test suite in an automated fashion and it has expedited our turn around.

Garrin Ball - DevOps Leader - DDMI21:10:31

very interesting. Thanks for sharing!

Matthew Simons21:10:23

👋

👋 1

Ricardo Viana21:10:58

Did you suggest to them possible solutions, or were you focused on detecting "ugly" solutions?

Matthew Simons21:10:11

Our findings were always paired with a suggested course of action. We wanted teams to take away something more than just shame. 🙂

👍 1

Ricardo Viana21:10:23

🙂

Ricardo Viana21:10:04

Did the teams call you guys or were you assigned to teams based on some criteria - and if so which? (If you're going to answer in the presentation, I will wait 🙂 )

Matthew Simons21:10:42

We were assigned to work with teams. At first, we tried to base this on a composite metric that tried to quantify and weigh impact to our end users. Shortly after we started operating, however, our executives basically said "Don't we know who the bad actors are? Don't we know which teams need help?"

Matthew Simons21:10:12

We had to admit that, yes, we had a good idea of which teams already needed the most help, but we made them promise us to go back to eventually using a real data-focused approach.

Matthew Simons21:10:23

so basically, our targets were chosen for us by R&D leadership

Ricardo Viana21:10:05

Okay... that makes sense to get the ball rolling. Did you eventually evolve to a data-driven approach to find teams that needed help (and with what)?

Matthew Simons21:10:31

We did not. 😞

Matthew Simons21:10:47

The experiment essentially ran its course -- I'll cover this in the section on "failures"

Ricardo Viana21:10:04

Okay, cool. Thanks for being candid about that 🙂

Chrystina Nguyen, Rhythmic Technologies21:10:05

Who came here for the title of the talk....? 👋

✋ 3

➕ 4

Scott Harris21:10:14

@matthew.simons at what point would you engage with teams?

Matthew Simons21:10:50

We basically kept a :shit: list, whose contents were curated by leaders within R&D. We were often deployed reactively -- teams moving up the list because they had recent quality issues that were relatively visible.

Scott Harris21:10:28

ok…thanks…ever get to a more proactive engagement??

Matthew Simons21:10:16

I really wanted to. I sort of address this in the upcoming "failures" section.

Scott Harris21:10:32

👍

Camilo Piedrahita - Bancolombia - IT Manager21:10:01

What do you think about transversal scenarios? For example, killing a complete Kubernetes cluster... It isn't continuous chaos because we'd generate unavailability of the platform. Scheduled chaos?

Jason Yee - Speaker - Gremlin21:10:13

I’m not quite sure what you mean by “transversal scenarios”. Do you mean doing things like killing a cluster to test zone/region/cloud failover?

Camilo Piedrahita - Bancolombia - IT Manager21:10:40

yes...

Jason Yee - Speaker - Gremlin21:10:48

I think that it’s extremely valuable. I think a lot of teams don’t test it enough and usually have poor understanding of how long it actually takes.

Jason Yee - Speaker - Gremlin21:10:39

I mean, AWS would tell you that it’s fast and easy. But until you do it a few times, it’s never going to be as smooth as you’ll need to be during an incident

Jason Yee - Speaker - Gremlin21:10:42

All that said, I advise that teams start small. You have to do the basics, e.g. can your service automatically restart, before you try on the large scale.
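A "start small" experiment such as "can your service automatically restart" can be framed as: check steady state, inject one fault, watch for recovery, always clean up. The sketch below is a generic harness under assumptions and does not use Gremlin's API; `FakeService` is a stand-in so the example runs without a real system.

```python
from typing import Callable


def run_experiment(is_healthy: Callable[[], bool],
                   inject_fault: Callable[[], None],
                   remove_fault: Callable[[], None],
                   checks: int = 3) -> str:
    """Run one small chaos experiment; return 'skipped', 'passed', or 'failed'."""
    if not is_healthy():
        return "skipped"  # never inject a fault into an already unhealthy system
    inject_fault()
    try:
        # Poll up to `checks` times; stop as soon as the service is healthy again.
        recovered = any(is_healthy() for _ in range(checks))
    finally:
        remove_fault()    # always clean up, even if a health check raises
    return "passed" if recovered else "failed"


class FakeService:
    """Stand-in service that restarts itself after N health polls (or never)."""

    def __init__(self, restarts_after=None):
        self.up = True
        self.restarts_after = restarts_after
        self.polls = 0

    def kill(self) -> None:
        self.up = False

    def healthy(self) -> bool:
        if not self.up:
            self.polls += 1
            if self.restarts_after is not None and self.polls >= self.restarts_after:
                self.up = True  # simulates a supervisor restarting the process
        return self.up
```

In a real setup the three callables would wrap an actual health endpoint and a controlled fault with a small blast radius, practiced in pre-prod first.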

Camilo Piedrahita - Bancolombia - IT Manager21:10:10

of course, and they can say it...but we need to configure many things and it's better to be prepared

Brad Nelson21:10:00

@jyee @matthew.simons so many great one liners... I'm going to have to go back and watch to capture some of them better.

👍 2

Nick - developer at BNPP21:10:01

Tell us about the teams that you didn't make friends with.

Matthew Simons21:10:31

I can only think of one that didn't love us afterwards. The short version is that we highlighted problem areas that the tech lead on the team got his ego wrapped in too much. It got pretty political.

Jason Yee - Speaker - Gremlin21:10:17

Not the inquisition but I know that @matthew.simons does have ruthless efficiency. 😉

Nick - developer at BNPP21:10:45

I have missed the interesting part. Did you do Chaos testing in Prod or Staging?

Matthew Simons21:10:41

We have a sort of hybrid prod/staging environment, and we've focused our chaos efforts there.

Matthew Simons21:10:39

We use our own platform extensively internally in the course of our actual business, so we target this internal environment where real users are doing real business-value-add activities, but paying customers aren't in it.

Nick - developer at BNPP21:10:46

makes sense

Ricardo Viana21:10:37

"Empathy, collaboration and automation"

🙂 1

💯 1

❤️ 1

Brad Nelson21:10:48

Was there a specific Jedi mind trick you used to convince senior leadership it was a good idea to randomly take production down?

😀 1

Matthew Simons21:10:25

At Workiva we haven't unleashed chaos on paying customers yet. That's in the roadmap, but it may be a ways away. We're also lucky in that we have a sort of prod environment that real users are using, but that doesn't have paying customers on it. So we can do chaos in an environment that really closely mirrors our main prod environments without actually impacting paying customers.

👍 1

Craig Larsen - he/him - Solution Design Group Mpls21:10:06

I imagine this is the hardest sell. What senior leader thinks it's a good idea to randomly take down production? The sell must be around the value we get from that ... it makes us better, stronger, faster. And we practice in non-prod first. But still, there's a chance that taking down production won't be like taking down non-prod.

Brad Nelson21:10:34

Yeah, I struggle enough, even with data and research, to get senior leadership to give things a chance that would be direct cost benefits... let alone invest in something that could cost money in revenue at the promise of making up for it in the long run.

Jason Yee - Speaker - Gremlin21:10:40

@craig.larsen In working with our customers, the first step is to move away from the idea of random take downs and be very methodical and precise.

Jason Yee - Speaker - Gremlin21:10:20

But also yes, practice in pre-prod/staging first. Build up confidence before going into prod

👍 2

Brad Nelson21:10:59

Makes sense. Do you find that you can build a lot of resiliency in pre-prod environments?

Matthew Simons21:10:57

There's an industry risk axis that's important, too. I sort of generalize the ends of the spectrum as Netflix and NASA. If Netflix has a blip in prod, someone doesn't watch a show (sad). If NASA has a small blip in "prod", people can die and enormous amounts of capital and public trust go down the toilet. If you are NASA, you probably don't want chaos in "production", but you better damn sure be running chaos in as close to prod as you can.

💯 2

Brad Nelson21:10:37

LOL, yeah, makes sense. Lives vs money.

Brad Nelson21:10:11

Well, thanks for sharing your story!

😉 1

Matthew Simons21:10:17

For Workiva, we deal with pre-release financial data for 70% of the Fortune 500, and mistakes in the data we handle in prod could literally cost livelihoods and trust in our platform would evaporate.

Matthew Simons21:10:33

So we go as close as we can.

👍 1

Jason Yee - Speaker - Gremlin21:10:31

Even before resiliency in pre-prod, the first step for most enterprises that I work with is just getting visibility. E.g., I’ve worked with teams where we spend a good chunk of early work using Chaos Engineering just to ensure their service is emitting useful metrics so they can simply know when it’s gone down or something's gone wrong.

Brad Nelson21:10:06

Yep. I've certainly been involved in monitoring systems, visualizing workflows, value streams, performance metrics, etc. I've never had the opportunity to play with a Chaos Monkey.

Jason Yee - Speaker - Gremlin21:10:56

At the risk of sounding like a shill, Gremlin does have a http://gremlin.com/free if you ever want to try it. Though honestly, I started my Chaos Engineering work just using stress-ng and linux command line tools.

👍 1

Brad Nelson21:10:52

Good to know. If I get an opportunity I may check it out 🙂

Nick - developer at BNPP21:10:48

I have this feeling that the cultural aspects of your work were harder and more interesting than the chaos testing itself. It would be for me.

Jason Yee - Speaker - Gremlin21:10:36

Absolutely 💯 %

💯 1

Nick - developer at BNPP21:10:20

Thank you @matthew.simons @jyee . If you ever see my software - please do tell me in great detail why exactly it sucks 🙂

🙏 2

💯 1

Matthew Simons21:10:07

I think there's an implicit contract that doing so would make us friends, right? 🙂

😆 2

Nick - developer at BNPP21:10:27

I wish I could say "Yes". But I know myself too well 🙂
