What if an actor fails to restart?

Akka.NET is a great toolkit for building highly concurrent, distributed, and fault-tolerant event-driven applications. Hmm, I may have accidentally quoted their page. You can start writing fun stuff with it very quickly, but it is easy to miss something important along the way.

Writing your first Windows Service with Akka.NET isn’t difficult at all; the documentation has an article about it [ANET00]. So I got to work and wrote a service that consumes messages from RabbitMQ and saves the information they carry to the ‘right places’. It was obvious to me that this app needed a few actors: one working as a RabbitMQ listener and another as a writer handling those ‘right places’. I assumed it would be beneficial to give them coordinators. My thinking was that a coordinator actor tells its child workers what to do and manages them [ANET01], and that if a timeout or some other unexpected problem occurs, the coordinator should restart its children. But… aha, there is a but: RabbitMQ may be down, or my ‘right places’ may be unavailable. In that case the service should go down. If it doesn’t do its work, it should not be running! Implementing this with Akka.NET is very easy; you can read about it here [ANET02].

// coordinator for our child-workers
public class ProcessorsCoordinator : ReceiveActor
{
	public ProcessorsCoordinator()
	{
		Receive<Models.MonitoringLog>(msg => {
			foreach (var child in Context.GetChildren())
				child.Tell(msg);
		});
	}
	protected override void PreStart() 
	{
		Context.ActorOf(Context.DI().Props<ProcessActor>());
	}
	protected override SupervisorStrategy SupervisorStrategy() {
		return new OneForOneStrategy(
			maxNrOfRetries: 5,
			withinTimeRange: TimeSpan.FromMinutes(5),
			decider: Decider.From(x =>
			{
				if (x is ActorInitializationException) 
				{
					return Directive.Escalate; 
				} 
				else 
				{
					return Directive.Restart;
				}
			})
		);
	}
}

Great, now the coordinator will escalate the problem when my worker actor can’t start, and in my case the actor system will, by default, be shut down. But… (yes, it’s the second but) what if exceptions occur during work? The worker throws an exception, and the supervisor decides to restart that actor. Let’s be nasty: the problem doesn’t resolve itself, so the exception is thrown again. According to our SupervisorStrategy, if the child fails more than 5 times within 5 minutes, the supervising actor will stop the failing actor. And then what, you may ask. The answer is: nothing. Our child worker is stopped and the rest of the system is still running, maybe doing something, maybe not. If the stopped actor was crucial for the application, there is no reason for the service to continue its work. That’s not good. I want to know when my service is not processing my messages, and for me the best solution in this case is to stop the service. So I modify the PreStart method a little. By calling Context.Watch, the coordinator registers itself to receive a Terminated message when the watched actor stops. In the constructor I add code to handle the situation when any of the watched actors has stopped; this time I shut down the actor system. You can find more information about DeathWatch here [ANET03].

// coordinator for our child-workers
public class ProcessorsCoordinator : ReceiveActor
{
	public ProcessorsCoordinator()
	{
		Receive<Models.MonitoringLog>(msg => {
			foreach (var child in Context.GetChildren())
				child.Tell(msg);
		});
		// A watched child has been stopped for good, so shut the whole actor system down.
		Receive<Terminated>(msg => {
			Context.System.Shutdown();
		});
	}
	protected override void PreStart() 
	{
		Context.Watch(Context.ActorOf(Context.DI().Props<ProcessActor>()));
	}
	protected override SupervisorStrategy SupervisorStrategy() {
		return new OneForOneStrategy(
			maxNrOfRetries: 5,
			withinTimeRange: TimeSpan.FromMinutes(5),
			decider: Decider.From(x =>
			{
				if (x is ActorInitializationException) 
				{
					return Directive.Escalate; 
				} 
				else 
				{
					return Directive.Restart;
				}
			})
		);
	}
}

Great. I tested it: I ran my app as a console app, and after a few messages I took down my RabbitMQ server and my program stopped. So I changed it a little to work as a Windows Service. I compiled it, installed it, ran it, and repeated the test. And… the service was still on. Hmmm… why? It’s quite simple: my actor system shuts down ‘gracefully’, but that doesn’t mean anything to the service. A service starts, stops, sometimes pauses, and does its job; as long as it doesn’t crash or receive a stop command, it stays on. OK, so maybe I need to throw an exception in my topmost actor, or just escalate it all the way up, so the Guardian can handle it by shutting down the system (more about the Guardian: [ANET02]). But the question of how to stop a Windows Service from inside the actor system remains. In Akka there is registerOnTermination [Akk00], which lets us register a piece of code to run once all actors in the system have stopped after a shutdown was issued. Akka.NET offers the WhenTerminated Task [ANET04], which we can ContinueWith.

[...]
_system = ActorSystem.Create("ActorSystem");
var propsResolver = new AutoFacDependencyResolver(container, _system);
var props = _system.DI().Props<MonitoringCoordinator>();
var coordinator = _system.ActorOf(props, "RootActor");

Task termination = _system.WhenTerminated;
termination.ContinueWith(task => {
	requestStopingHosting();
});
[...]
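
What the stop callback does depends on how the service is hosted. As a rough sketch only, assuming a classic .NET Framework service built on System.ServiceProcess.ServiceBase (the snippet above does not say which host it uses), the continuation could simply call Stop(). The names TerminationHook, StopHostWhenTerminated and MonitoringService are made up for illustration; Terminate() is assumed to be available alongside WhenTerminated.

```csharp
using System;
using System.ServiceProcess;
using System.Threading.Tasks;
using Akka.Actor;

// Hypothetical helper: runs stopHost once the actor system has fully terminated.
public static class TerminationHook
{
    public static Task StopHostWhenTerminated(ActorSystem system, Action stopHost)
    {
        return system.WhenTerminated.ContinueWith(_ => stopHost());
    }
}

// Hypothetical ServiceBase host; the class and system names are illustrative only.
public class MonitoringService : ServiceBase
{
    private ActorSystem _system;

    protected override void OnStart(string[] args)
    {
        _system = ActorSystem.Create("ActorSystem");
        // ... create the coordinator actor here ...

        // If the actor system goes down on its own, stop the Windows Service too.
        TerminationHook.StopHostWhenTerminated(_system, Stop);
    }

    protected override void OnStop()
    {
        // Stop requested from outside: terminate the system if it is still up
        // and give it up to 30 seconds to finish.
        if (!_system.WhenTerminated.IsCompleted)
            _system.Terminate().Wait(TimeSpan.FromSeconds(30));
    }
}
```

Such a service would be started with ServiceBase.Run(new MonitoringService()). Keeping the hook in a separate helper makes it possible to try the wiring from a console app by passing any Action in place of Stop.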

Knowing this makes writing Windows Services with Akka.NET a lot easier. Try it for yourself.

Follow-up: working lately with nightly builds of Akka.NET, I noticed that RegisterOnTermination has been implemented. We can use it, but remember that the RegisterOnTermination action runs after shutdown is requested, while _system.WhenTerminated.ContinueWith(…) runs after the system is down. It’s a substantial difference.
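
To check when each hook fires in the Akka.NET build you are running, a small console probe like the one below can log both. It is only a sketch: it assumes RegisterOnTermination takes an Action and that Terminate() is available; the TerminationProbe name and the log strings are made up. It does not assert any particular order between the two hooks; it just records what happens.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using Akka.Actor;

// Hypothetical probe: records the order in which the termination hooks fire.
public static class TerminationProbe
{
    public static string[] Run()
    {
        var log = new ConcurrentQueue<string>();
        var done = new CountdownEvent(2);
        var system = ActorSystem.Create("ProbeSystem");

        // Hook 1: callback registered on the actor system itself.
        system.RegisterOnTermination(() =>
        {
            log.Enqueue("RegisterOnTermination callback");
            done.Signal();
        });

        // Hook 2: continuation on the termination task.
        system.WhenTerminated.ContinueWith(_ =>
        {
            log.Enqueue("WhenTerminated continuation");
            done.Signal();
        });

        log.Enqueue("termination requested");
        system.Terminate();

        // Wait until both hooks have fired (or give up after 10 seconds).
        done.Wait(TimeSpan.FromSeconds(10));
        return log.ToArray();
    }
}
```

Printing the result, for example with foreach (var line in TerminationProbe.Run()) Console.WriteLine(line);, shows the sequence for the installed version.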

####Links####

  • [ANET00]: http://getakka.net/docs/deployment-scenarios/Windows%20Service
  • [ANET01]: http://getakka.net/docs/Working%20with%20actors
  • [ANET02]: http://getakka.net/docs/concepts/supervision
  • [ANET03]: http://getakka.net/docs/working-with-actors/Actor%20lifecycle#lifecycle-monitoring-aka-deathwatch
  • [TM]: http://taskmatics.com/blog/run-dnx-applications-windows-service/
  • [Akk00]: http://doc.akka.io/api/akka/2.4.1/index.html#akka.actor.ActorSystem@registerOnTermination(code:Runnable):Unit
  • [ANET04]: http://api.getakka.net/docs/stable/html/FA8BCE63.htm