Replication became inactive
If you notice sometimes your replication became inactive and you never came to know about if there is any error or distribution agent job is failing. Even if you have a SCOM configured you never get a Job failure alert.
In this scenario if you notice the properties of Distribution Agent Job and Log Reader Agent job closely you will find that in step 2 for both the jobs “Run Agent” in advance properties you will find Retry Attempt option with value =2147483647.
This is Replication Agent default value to retry 2147483647 times, once every minute. That is over 4000 years, isn’t it strange?
2147483647 = 35791394 Hours = 1491308 Days = 4085 Years.
How does this impact Replication?
When you have this kind of setup in case of any replication agent failure and it is configured for continuous retry, then agent will not report back to SQL Agent Service for this faliure, but will wait 1 minute, then try again for 4000 years. Hence you will never get any kind of alert even if you have SCOM configured.
Example: IF the Distribution Agent, again running continuous, is unable to delivery data because of any error like 20598 or a 2627 error etc., the Agent will go into an infinite retry logic.
20598=Row was not found at the Subscriber when applying the replicated command.
2627 =Violation of PRIMARY KEY constraint ‘PK__A’. Cannot insert duplicate key in object ‘dbo.A’.
In this scenario if undetected, the Agent will stay in retry mode until the Distribution Cleanup job removes the “cached” transactions from the Distribution Database. At this point the Distribution Agent will no longer have the data needed to “catch up” the subscriber. As result, the Subscriber will be marked “inactive” and post the following error in the repl_errors table in the Distribution database.
18854=One or more subscriptions have been marked inactive. Drop and re-create all subscriptions for this node that are failing with this error.
Which Replication Agent are marked to retry for this default value?
You can look at the individual Agent job properties in SQL Server Management Studio or run the following query in the MSDB database.
Select * from msdb.dbo.sysjobsteps where retry_attempts=2147483647
How to fix this issue?
I would recommend setting all of your Replication Agent retry_attempts at smaller interval, say 10 (fail after 10 minutes).
Note: The actual value you select is dependent on your business needs and criticality of replication. For extra measure, you can also add notification ALERT that the JOB has failed.
This should give you enough time to detect the problem, resolve the issue, and get the Replication Agent up and running before the subscriber is marked as “expired”.