Hystrix to the rescue
A few months ago I read Release It! - a highly recommended book no matter if you are doing dev, ops, or devops. One key concept described in this book is the so-called circuit breaker. In essence, a circuit breaker is a handy kill switch wrapped around one of your interfaces, watching it for trouble.
Imagine an application like a news aggregator site which shows a user their aggregated news feed. This means you display a lot of content from various news pages (and of course you have lots of caching going on). But the moment your caches invalidate and you have to hit the source news feeds (within the request), you mostly do not know what will happen with those interfaces in the next (milli)seconds while the user loads your page. A few scenarios:
- Everything works fine: *yeah* nothing to do here…
- Your app fails before you get to the news feeds: This article won’t help you.
- The interface will have some problems:
- Takes longer than expected to load
- Answers with 1 byte/second
- Answers with nonsense content
- (many other reasons)
Let your client suffer
There are so many ways an interface can fail in real life. If you think your client is rock solid, you may check out bane or hamms to simulate a malformed server. Both are explicitly misbehaving servers. But in 99% of cases the servers out there don't intend to harm you; they have other internal problems, and the symptoms you encounter look just like the behaviour these tools simulate.
Play nice with others
By checking the health of a connection you not only make your application more stable, you also care about the system on the other end of the line. Another rule you get from the book: never, never, never hit a system that is already down or well on its way there. It's like what we developers say about the user (maybe your customer): 'The web page does not respond… wait, I'll just click refresh again, and again (and again…)'. The same happens here. Ever thought about the slightly significant possibility that the system does not answer because it is at 99% load? Yep, it won't help anybody if we wrap our request code in a try/catch block and fire the request again after it failed, or, even better, keep retrying for as long as it takes. Just don't, please.
(in the meantime)
What's going on with our user? Hello, are you still waiting there? The page takes much longer to respond. I see you are continuously pressing refresh because our site does not respond and there is no cached entry. Let me ask the news page if… Now our application is under attack from the user side, and every request to our app triggers a new request to the news server. From the user's perspective, we f.. are to blame. It is our site that does not respond. They never learn that our app is slow (aka dead) because some other server is on a straight way to hell.
How do we get out of this situation?
Use a circuit breaker, Luke. That's all. Just kill the connection to the server if something unexpected happens, for <some> amount of time. After this time you check whether it's safe to use the connection again. It's that easy, but honestly, sometimes you just need one hint (the book) or two (a worst-case-scenario system on the other side) to get you on track. At first we tried some own-production-counting-errors-and-opening-the-circuit-breaker stuff, but shifted to Hystrix by Netflix due to several concerns about the time we would have to invest to get a rock-solid circuit breaker library of our own.
Taking our sample from above, we want to request remote servers for fresh news. First we start with our simple model, Ticker. One Ticker represents the result of one news page.
In order to get the news to our web app we have a service called *drumroll* TickerService.
This service calls some kind of repository (TickerReader).
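The original sample code is not reproduced in this post, so here is a minimal sketch of these three types; only the names Ticker, TickerService, TickerReader, and grabNews() come from the text, the rest (fields, getters) are my assumptions.

```java
import java.util.List;

// One Ticker represents the result of one news page.
// The headline field is an assumption for illustration.
class Ticker {
    private final String headline;
    Ticker(String headline) { this.headline = headline; }
    String getHeadline() { return headline; }
}

// The repository abstraction the service talks to.
interface TickerReader {
    List<Ticker> grabNews();
}

// The service only delegates; it knows nothing about the transport.
class TickerService {
    private final TickerReader reader;
    TickerService(TickerReader reader) { this.reader = reader; }
    List<Ticker> getNews() { return reader.grabNews(); }
}
```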
The TickerReader fetches the fresh news from a news page and wraps them in a list of Tickers. Under normal circumstances this reader works like the following sample code. Just imagine that there is some kind of crazy heavy network request/response handling going on; for our samples we just use local code.
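A plain reader along those lines might look like this sketch (the class name and the canned headlines are made up; in the real application grabNews() would perform the network call):

```java
import java.util.Arrays;
import java.util.List;

// minimal stand-ins for the model and reader described in the text
class Ticker {
    private final String headline;
    Ticker(String headline) { this.headline = headline; }
    String getHeadline() { return headline; }
}
interface TickerReader { List<Ticker> grabNews(); }

// The "happy path" reader: pretend there is heavy network
// request/response handling here; we simply return local data.
class WorkingTickerReader implements TickerReader {
    @Override
    public List<Ticker> grabNews() {
        return Arrays.asList(
                new Ticker("breaking news"),
                new Ticker("even more news"));
    }
}
```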
In order to simulate some failing behaviour I introduced a failing…
…and a long-running TickerReader.
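Since the original snippets are missing here, both misbehaving readers can be sketched like this (class names and the concrete delay are made up for illustration):

```java
import java.util.Collections;
import java.util.List;

// minimal stand-ins for the model and reader described in the text
class Ticker {
    private final String headline;
    Ticker(String headline) { this.headline = headline; }
    String getHeadline() { return headline; }
}
interface TickerReader { List<Ticker> grabNews(); }

// Simulates a broken backend: every call blows up.
class FailingTickerReader implements TickerReader {
    @Override
    public List<Ticker> grabNews() {
        throw new RuntimeException("news page is down");
    }
}

// Simulates a slow backend: the answer arrives, but far too late.
class LongRunningTickerReader implements TickerReader {
    @Override
    public List<Ticker> grabNews() {
        try {
            Thread.sleep(10_000); // pretend the server dribbles out bytes
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return Collections.singletonList(new Ticker("finally some news"));
    }
}
```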
These are two very common problems you can run into when using external systems. You can handle the first one by wrapping the call in a try/catch block, but the second one is harder. We are not just talking about a timeout you might set on your connection, but about a slow server that sends response bytes over a long period of time. One solution might be to spawn a thread that watches how long the execution of grabNews() takes and kills it. But you are here for Hystrix. Let's get it on.
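For comparison, that do-it-yourself thread-watching idea can be sketched with plain java.util.concurrent; the class and method names here are made up, and this is roughly what Hystrix does under the hood (plus much more):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutGuard {
    // Runs grabNews-style work in its own thread and gives up after the
    // deadline, returning an empty string instead of hanging forever.
    public static String callWithTimeout(Callable<String> work, long millis) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<String> future = pool.submit(work);
            try {
                return future.get(millis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                future.cancel(true); // interrupt the hanging call
                return "";           // poor man's fallback
            }
        } finally {
            pool.shutdownNow();
        }
    }
}
```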
Using hystrix for timeout and exceptions
Hystrix offers HystrixCommand to wrap your interface code in. The idea is that you don't call the TickerReader directly but through a HystrixCommand#execute() method. Here is a simple command wrapping our reader.
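A sketch of such a command, together with the service calling it; the group key name and the empty-list fallback are my choices, not prescribed by Hystrix:

```java
import java.util.Collections;
import java.util.List;

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

// minimal stand-ins for the model and reader described in the text
class Ticker {
    private final String headline;
    Ticker(String headline) { this.headline = headline; }
    String getHeadline() { return headline; }
}
interface TickerReader { List<Ticker> grabNews(); }

class TickerReaderCommand extends HystrixCommand<List<Ticker>> {
    private final TickerReader reader;

    TickerReaderCommand(TickerReader reader) {
        // the group key bundles the statistics of all commands of this kind
        super(HystrixCommandGroupKey.Factory.asKey("TickerGroup"));
        this.reader = reader;
    }

    @Override
    protected List<Ticker> run() {
        return reader.grabNews(); // executed on a Hystrix thread
    }

    @Override
    protected List<Ticker> getFallback() {
        // fits our use case: no fresh news beats no page at all
        return Collections.emptyList();
    }
}

// The service creates a fresh command per call (commands are single-use).
class TickerService {
    private final TickerReader reader;
    TickerService(TickerReader reader) { this.reader = reader; }
    List<Ticker> getNews() { return new TickerReaderCommand(reader).execute(); }
}
```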
The execute() call triggers the run() method. If something goes wrong, the getFallback() method gets called, and you can react to the failure by giving an appropriate answer that fits your use case. Hystrix collects statistics internally; therefore a group key is given in the super constructor to group all commands of the same kind. As you can see, the TickerService now calls the TickerReaderCommand instead of the TickerReader and always creates a new command object.
The HystrixCommand also accepts a property _Setter_ in the constructor args to customize e.g. timeout, failure thresholds, the measuring window… I don't want to copy the good documentation here.
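To give one example, the same command could be configured through the Setter like this; the concrete values (a 500 ms timeout and a request volume threshold of 2) are illustrative, and depending on your Hystrix version the timeout property may be called withExecutionIsolationThreadTimeoutInMilliseconds instead:

```java
import java.util.Collections;
import java.util.List;

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixCommandProperties;

// minimal stand-ins for the model and reader described in the text
class Ticker {
    private final String headline;
    Ticker(String headline) { this.headline = headline; }
    String getHeadline() { return headline; }
}
interface TickerReader { List<Ticker> grabNews(); }

class ConfiguredTickerReaderCommand extends HystrixCommand<List<Ticker>> {
    private final TickerReader reader;

    ConfiguredTickerReaderCommand(TickerReader reader) {
        super(Setter
                .withGroupKey(HystrixCommandGroupKey.Factory.asKey("TickerGroup"))
                .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                        .withExecutionTimeoutInMilliseconds(500)       // cut off slow servers
                        .withCircuitBreakerRequestVolumeThreshold(2))); // trip early for the demo
        this.reader = reader;
    }

    @Override
    protected List<Ticker> run() {
        return reader.grabNews();
    }

    @Override
    protected List<Ticker> getFallback() {
        return Collections.emptyList();
    }
}
```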
Be nice, the hystrix way
At the current state we have made our application more resilient to problems in other systems. But we are still trying to get data from the remote server even when we should know better. So let's introduce a circuit breaker in our command. Happy news: just by using HystrixCommand we already get a circuit breaker in its default configuration, which trips once at least 20 requests arrive within the ten-second rolling window and more than half of them fail. To be more explicit, I will show a simple test setup with explicit waits to see the circuit breaker at work. Additionally I decreased the request volume threshold to 2.
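A self-contained version of such a test setup might look like the following; the command, key name, and thresholds are illustrative stand-ins for the post's sample:

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixCommandProperties;

// A command that always fails, with the breaker threshold lowered to 2
// so we can watch it trip quickly.
class AlwaysFailingCommand extends HystrixCommand<String> {
    AlwaysFailingCommand() {
        super(Setter
                .withGroupKey(HystrixCommandGroupKey.Factory.asKey("BreakerDemo"))
                .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                        .withCircuitBreakerRequestVolumeThreshold(2)));
    }

    @Override
    protected String run() {
        throw new RuntimeException("remote side is down");
    }

    @Override
    protected String getFallback() {
        return "fallback";
    }
}

public class CircuitBreakerDemo {
    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < 6; i++) {
            AlwaysFailingCommand cmd = new AlwaysFailingCommand();
            cmd.execute(); // always answers via the fallback
            System.out.println("circuit open? " + cmd.isCircuitBreakerOpen());
            Thread.sleep(500); // explicit wait: let the health snapshot catch up
        }
    }
}
```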
The output will show some false values and switch to true after four calls. The exact point depends on the metrics window timeout and the number of buckets you use. In short, buckets divide the metrics window into chunks of equal size to create measuring points. Again I will point to the documentation for further reading on the arguments you can provide to your command.
With these few steps of configuration we have created an interface to another system that provides our app a fallback, without try/catch handling, when something is wrong on the wire, and that trips the circuit if too many failures (you define how many) happen. When the circuit is open our app responds much faster because we don't even check the remote site, and in the end… we don't hit the guy lying on the ground.
You can find the complete hystrix sample code on github.