Monday morning, all of us still recovering from the weekend. My coworker calls everyone over to show a bug he found and how it was breaking our application. He asks me to open our retry implementation, and then he says:
- ► We don't need this retry mechanism, like, at all. We only make internal calls and they always work, so the only way our code can crash is from failed authentication. LET'S DELETE IT
I kid you not, I was in shock.
How do you argue with a point that diverges from common sense? This was something I wasn't prepared for. That's not what retry mechanisms are for... and at the same time, what are retry mechanisms even for?
The most obvious reason is that HTTP is really not that reliable. Failed requests due to infrastructure issues make up the majority of our errors. It's nothing we can control; infrastructure operating at a large scale of load shows us that 99% uptime is not that much.
So even if you have a very stable application, the connection to the database, the ingress controller, the load balancer, the gateway: all those pieces of software that make up our modern infrastructure introduce chaos into the system.
How can we control that chaos and improve the reliability of an application? The simplest solution, the one we have been relying on for ages, is to simply retry. What is the alternative here? Your application takes 3 seconds longer to load, or it shows an error page to the user?
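In its naive form, the idea fits in a few lines. A minimal sketch (the function name and signature are mine, not from our actual codebase):

```typescript
// Naive retry: just run the operation again a fixed number of times,
// swallowing every failure along the way.
async function naiveRetry<T>(fn: () => Promise<T>, attempts = 3): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err; // ignore the failure and try again
    }
  }
  // Every attempt failed; surface the last error to the caller.
  throw lastError;
}
```

As we'll see further down, retrying blindly like this has problems of its own.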
Well, that's my understanding of the problem, how I see the issue. But how do you explain it in a discussion where New Relic was not finding errors, so why would we need to retry? I tried to bring the argument that we don't see the errors precisely because we DO retry right now, so they never become visible.
By the way, this is even harder to track when your UI is making the requests: your users might be facing issues that you are completely unaware of.
Having retries in the frontend is paramount, and it's the reason why useQuery is growing in popularity: it handles caching, retries, and request state out of the box. It's the difference between a polished application and a side project, that little detail that makes all the difference ✨
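With TanStack Query, the retry behavior is just configuration: the `retry` and `retryDelay` options on `useQuery` are real, though the thresholds below are my own choices, and the error shape is a sketch that assumes you attach the HTTP status when your query function throws:

```typescript
// Retry policy helpers you could pass to TanStack Query's useQuery.

// Retry at most 3 times, and only for server-side (5xx) errors;
// a 4xx will fail the same way no matter how often we retry.
const shouldRetry = (failureCount: number, error: { status?: number }): boolean =>
  failureCount < 3 && (error.status ?? 0) >= 500;

// Exponential backoff: 1s, 2s, 4s... capped at 30s
// (this happens to be TanStack Query's default delay curve).
const retryDelay = (attempt: number): number => Math.min(1000 * 2 ** attempt, 30_000);

// Usage (sketch):
// useQuery({ queryKey: ["orders"], queryFn: fetchOrders, retry: shouldRetry, retryDelay });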

Currently, the main pain we are feeling in our project is connecting to the database; somehow Azure is not the best at it (??), so most of the time a retry fixes the problem. We can ignore the error and move along, a few seconds slower but still kicking.
Again, how do you measure it? Could it be a skill issue? Definitely, it should not have been so hard to track. But sometimes you need to be reasonable about where you spend your money, or credits. I'm not here to track something that did not fail.
So I researched reasons not to remove the most common pattern in software development. What struck me was an article saying: "not all errors deserve to be retried". That was a point I was overlooking: simply retrying is not enough. I've applied some rules to our retry:
- ► Only retry on 5xx errors
- ► Always increase the waiting time between attempts, to give the server time to recover
- ► Do not retry more than three times
- ► Attach a context to the retries: if the request times out, cancel the retry
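Here is a sketch of how those four rules might look together. This is not our actual implementation; it assumes the global fetch API (Node 18+ or a browser), and the names and timeouts are my own:

```typescript
const MAX_RETRIES = 3;

async function fetchWithRetry(url: string, timeoutMs = 10_000): Promise<Response> {
  // Rule 4: one overall deadline for the whole operation; when it fires,
  // the in-flight request is aborted and no further retries happen.
  const controller = new AbortController();
  const deadline = setTimeout(() => controller.abort(), timeoutMs);

  try {
    for (let attempt = 0; ; attempt++) {
      const res = await fetch(url, { signal: controller.signal });

      // Rule 1: only 5xx responses are worth retrying; 4xx errors
      // (bad request, failed auth) will fail the same way every time.
      if (res.status < 500) return res;

      // Rule 3: give up after three attempts.
      if (attempt >= MAX_RETRIES - 1) return res;

      // Rule 2: exponential backoff (0.5s, 1s, 2s...), so we give the
      // server room to recover instead of attacking it.
      const wait = 500 * 2 ** attempt;
      await new Promise<void>((resolve, reject) => {
        const timer = setTimeout(resolve, wait);
        controller.signal.addEventListener("abort", () => {
          clearTimeout(timer);
          reject(new Error("deadline exceeded, retries cancelled"));
        }, { once: true });
      });
    }
  } finally {
    clearTimeout(deadline);
  }
}
```

Note that the backoff wait also listens for the abort signal, so a timeout cancels a sleeping retry immediately instead of letting it fire one last pointless request.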
Simple changes that prevent us from attacking our own server, and good practices in general. I really like the idea of challenging predefined concepts in our industry, but only when we are open to discussing them, like in How I fixed TDD and how you can too.
Now, I know we can make this even better with circuit breakers and other microservice patterns, but how to implement those is a subject for another post.