#discussion-more
2022-06-07
Steve Spear  04:06:29

@nickeggleston @mik @mr.denver.martin @david.sol-llaven @jeff.gallimore @james.moverley @adam @ahunt @genek @nicole.forsythe @abhibansal60

FINDING THE SOURCE OF A TECHNICAL GLITCH: ADVICE NEEDED

BACKGROUND: upgrading software applications
> A client is doing "routine" digital transformation: cloud versions of office products, upgrades to Outlook, Teams, Zoom, etc.

PROBLEM: glitchy issues with internal and external email
> They have just experienced problems whose cause is not obvious. For example, for confidentiality purposes, some emails should be encrypted, especially those going outside the firm. BUT… sometimes what should be encrypted is arriving (outside) unencrypted. Sometimes messages arrive encrypted but can't be decrypted.

DIAGNOSING TO ROOT CAUSE: data mining, tracking, and recreating
> Question: How do you find the root cause so this can be fixed? We discussed several ideas. Are they reasonable, and what are we missing?
>
> DATA MINING: Since these exchanges are all digital and occur in very large numbers, could the equivalent of labels or tags be identified, so you can find attributes of the messages that associate with defect-free delivery or not?
>
> TRACKING: Presumably, as these messages travel, handoffs are marked as they transit from one system to another.
> QUESTION: Is there a way to dig into the details of a defective transmission and find out where it went from fine to broken? e.g.,
> Handoff A to B: fine, B to C: fine, C to D: busted…
>
> RECREATING: The idea is that data mining and tracking give insights into causality, sufficient that the failure can be reliably recreated (and then reliably prevented).
> QUESTION: Other ideas?
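To make the DATA MINING idea concrete, here is a minimal sketch in Python. It assumes a hypothetical CSV export of message metadata; the file name and column names (sender_domain, gateway, client_version, encrypted_ok) are illustrative, not from any real system. It simply counts how often each attribute value co-occurs with a failed delivery, which is one way to surface the "labels or tags" described above.

```python
# Minimal sketch of the "data mining" idea: given a hypothetical export of
# message metadata, count how often each attribute value co-occurs with a
# failed encryption/delivery, to find attributes that associate with defects.
import csv
from collections import Counter

failure_counts = Counter()
total_counts = Counter()

with open("message_log.csv", newline="") as f:   # hypothetical export file
    for row in csv.DictReader(f):
        ok = row["encrypted_ok"] == "true"
        for attr in ("sender_domain", "gateway", "client_version"):
            key = (attr, row[attr])
            total_counts[key] += 1
            if not ok:
                failure_counts[key] += 1

# Attribute values with the highest failure rates are candidate "labels"
# worth investigating first.
rates = {k: failure_counts[k] / total_counts[k] for k in total_counts}
for key, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(key, f"{rate:.1%}")
```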

David Sol  13:06:13

In my experience, you have to proceed scientifically. You gather all the data you can, study it, form hypotheses, and test them until you find the cause and arrive at a solution. You try to avoid preconceived ideas, and staying blameless helps information flow so the solution can be applied sooner and more easily. And every case is different.

Jim Moverley  14:07:44

Hi @steve773, interesting issue here 😄 As a long-time systems engineer and designer, I would use a process of elimination to try to isolate the issue. Among the failure cases there will be commonality 🙂 but one needs to consider all the moving parts in between.

1: If the encrypted nature of the email is important, shouldn't the encryption be verified (i.e., checked that it is in place) before the message is allowed to transit out of the organisation?
2: What email systems are involved, and are we sure all configuration is consistent across the enterprise mail system?
3: Regarding defective email being received: YES, the SMTP protocol leaves a readable header trail (most modern mail clients hide this, but you can find an option to show the "original header" in most clients). This should show the systems the mail has transited (the headers won't be encrypted, only the mail body!). See https://datatracker.ietf.org/doc/html/rfc2076. A sketch of reading these headers follows below.
4: With respect to point 3 above, the enterprise mail servers should be able to inject header tokens if you wish to add more info/markers for tracing.

Again, from my point of view, the more examples of it happening, the better sense of where to start looking and testing to verify/reproduce. It's also really interesting to consider issues at "layer 8", aka the human side 😉
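Point 3 above can be illustrated with Python's standard email library. This is a rough sketch: the file name broken_message.eml and the X-Trace-Token header are assumptions for illustration, not details from the thread.

```python
# Sketch of reading the hop-by-hop trail from a saved raw message: each
# relay prepends a Received header, so even when the body is encrypted
# you can see where the message travelled (handoff A -> B -> C ...).
from email import policy
from email.parser import BytesParser

with open("broken_message.eml", "rb") as f:      # hypothetical saved example
    msg = BytesParser(policy=policy.default).parse(f)

# Received headers are prepended by each relay, so reversing them gives
# the path in transit order.
for i, hop in enumerate(reversed(msg.get_all("Received", [])), start=1):
    print(f"hop {i}: {' '.join(str(hop).split())}")

# Any custom tracing tokens injected by the enterprise mail servers
# (point 4) would show up as extra headers here too; the name below is
# a hypothetical example.
print(msg.get("X-Trace-Token", "no custom trace header present"))
```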

Denver Martin, Dir DevSecOps, he/him  20:07:46

@steve773 I think you have been tracking me... those are my everyday issues. We created a bug/"defect" tracker where we capture all the things that are not working or that seem to have degraded in performance. We then track issues down using at least 5 whys (I know this is what SRE teams do); if we need to go deeper than 5, we do that. We are trying to see whether it is a symptom or the real issue. We look for any dependencies that will act differently if we fix the issue; in complex systems we have sometimes found that compensations were made in other areas, and if we fix one thing another item will break. We try to test in a lower environment at similar load and scale if possible. Sometimes that is not possible: if there is a client involved and they are willing to let us redirect a small percentage of load, we will do that; if there is no client involved, we will redirect 50% of the load and check whether the fix worked, then keep adding until we feel we have solved the issue. (Having said that, we always engage the engineers and developers to see if they noticed anything different: jobs running long, builds needing more time, performance going faster or slower than normal, etc.) They often think of something that was a bit off but not enough to raise an alarm, and that has often given us insight. Like one time a new secret key was updated, and not all the systems were pointing to the new key; some were still using the old key at build time, so nothing was seen as an issue until go-live.
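The gradual redirect described here (a small slice of load, then 50%, then everything, checking between steps) can be sketched as a simple weighted routing loop. This is an illustration only; route_via_fix and route_via_current are hypothetical placeholders, not a real routing API.

```python
# Illustrative sketch of ramping traffic onto a candidate fix: route a
# configurable fraction of requests through the fix and increase it as
# confidence grows, checking error rates / the defect tracker at each step.
import random

def route_via_fix(request):
    return f"fix path handled {request}"

def route_via_current(request):
    return f"current path handled {request}"

def handle_request(request, fix_fraction):
    """Route one request; fix_fraction is the share sent through the fix."""
    if random.random() < fix_fraction:
        return route_via_fix(request)
    return route_via_current(request)

# Ramp schedule: small slice first, then 50%, then everything.
for fraction in (0.05, 0.5, 1.0):
    results = [handle_request(f"msg-{i}", fraction) for i in range(100)]
    fixed_share = sum("fix path" in r for r in results) / len(results)
    print(f"target {fraction:.0%}, observed {fixed_share:.0%}")
```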