Introduction
In part 1, we simulated a DAG member failure and took a look at the troubleshooting tools that we could use to get more information about what is happening behind the scenes in an IPless DAG.
In this part, we’ll cover DAG network basics, then simulate a REPLICATION network failure on one of the DAG members to see how the DAG reacts and how to recover from the failure.
To go to other parts in this series:
- Exchange 2016 Database Availability Group Troubleshooting (Part 1)
- Exchange 2016 Database Availability Group Troubleshooting (Part 3)
DAG Cluster Network Basics
First, let’s cover some of the background on DAG cluster networks. We’ll look at these aspects:
- Cluster network types
- MAPI and REPLICATION network failover
Cluster network types
A failover cluster has different types of network. If we had set up a DAG that uses a cluster with an administrative access point (as in Windows Server 2012 and earlier), we would be able to use Failover Cluster Manager to view the cluster networks and the role assigned to each one. This screenshot is from one of my test scale-out file server clusters:
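There are three different cluster network roles:
- 0 (None): the network is not used by the cluster
- 1 (Cluster only): the network is used for internal cluster communication only
- 3 (Cluster and client): the network is used for both cluster communication and client connectivity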
This is all very interesting, but we can’t use Failover Cluster Manager for IP-less DAGs and this is not a post about scale-out file servers. Let’s get back to our DAG and remind ourselves of our lab servers and network configuration:
Now let’s use PowerShell on our two-node IP-less DAG to get information about the cluster networks:
Get-ClusterNetwork | ft Name,Address,AddressMask,Role,State -AutoSize
Now that we have this information, we can look up the roles in the cluster network roles above. The MAPI network (10.2.0.0/16, Cluster Network 2) is configured for cluster and client communication (Role = 3), whereas the REPLICATION network (172.16.0.0/16, Cluster Network 1) is configured for cluster-only communication (Role = 1), as we’d expect.
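To avoid looking up the role numbers by hand, the description can be added as a calculated column. This is only a sketch: the RoleValue and RoleDescription column names are my own, and the cast to an integer assumes the Role property is either a number or an enumeration, which depends on the Windows Server version.

```powershell
# Sketch: show a readable description next to each cluster network role.
# Role values: 0 = None, 1 = Cluster only, 3 = Cluster and client.
# RoleValue and RoleDescription are illustrative column names, not built-in properties.
Get-ClusterNetwork |
    Select-Object Name, Address, AddressMask, State,
        @{ Name = 'RoleValue'; Expression = { [int]$_.Role } },
        @{ Name = 'RoleDescription'; Expression = {
            switch ([int]$_.Role) {
                0 { 'None (not used by the cluster)' }
                1 { 'Cluster only' }
                3 { 'Cluster and client' }
                default { 'Unknown' }
            }
        } } |
    Format-Table -AutoSize
```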
We can also confirm that our MAPI network has replication disabled while our REPLICATION network has MAPI access disabled:
Get-DatabaseAvailabilityGroupNetwork | ft Name,Subnets,MapiAccessEnabled,ReplicationEnabled -AutoSize
MAPI and REPLICATION network failover
If there is a failure of the REPLICATION network when we’re using separate MAPI and REPLICATION networks, the MAPI network will then be used for replication, even if it has ReplicationEnabled set to false. Although the REPLICATION network is used for cluster communication, its failure shouldn’t cause a node down scenario, as the MAPI network can also be used for cluster communication and therefore the cluster heartbeat is not lost.
To find out which network is currently in use for replication, you can run the Get-MailboxDatabaseCopyStatus cmdlet against each DAG member and look at each mailbox database copy. The IncomingLogCopyingNetwork property shows the server that the database copy is receiving logs from and the network it is using. This property only has a value for passive copies:
Get-MailboxDatabaseCopyStatus -Server LITEX01 | sort Name | ft Name,Status,CopyQueueLength,ContentIndexState,IncomingLogCopyingNetwork -AutoSize
Get-MailboxDatabaseCopyStatus -Server LITEX02 | sort Name | ft Name,Status,CopyQueueLength,ContentIndexState,IncomingLogCopyingNetwork -AutoSize
Here we can see that we are indeed using our REPLICATION network (ReplicationDagNetwork01) for all mailbox database copies.
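Since only passive copies report a log copying network, both servers can be queried in one pass and the active copies filtered out. A minimal sketch, assuming the two lab server names used in this article and that it is run from the Exchange Management Shell:

```powershell
# Sketch: list the log copying network for passive copies on both lab servers.
# The server names come from this article's lab; replace them with your DAG members.
$dagMembers = 'LITEX01', 'LITEX02'

$dagMembers |
    ForEach-Object { Get-MailboxDatabaseCopyStatus -Server $_ } |
    Where-Object { -not $_.ActiveCopy } |   # passive copies only
    Sort-Object Name |
    Format-Table Name, Status, CopyQueueLength, IncomingLogCopyingNetwork -AutoSize
```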
As for the MAPI network, a failure on this network should cause a DAG failover. This makes sense because if this network fails, the CAS services on other Exchange servers will not be able to proxy the client connections through to the Exchange server which hosts the active mailbox database copy. The REPLICATION network cannot be used for MAPI connections.
REPLICATION Network Failure Test
Now that we know a bit more about what should happen, we can start our test. We’ll cause a failure of the REPLICATION network on LITEX01 by disconnecting the virtual machine’s virtual network card that is associated with the REPLICATION network.
We can confirm that LITEX01 was not marked as down in the cluster after the failure:
Get-ClusterNode
Here we see that both nodes have an UP status.
Get-DatabaseAvailabilityGroup -Status
Here we see that both nodes are operational.
Using the command below, we can confirm that Exchange has marked the interface as failed:
Get-DatabaseAvailabilityGroupNetwork | fl
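The cluster’s own view of each network interface can be checked as well. A minimal sketch, assuming the FailoverClusters PowerShell module is available on the DAG member where it is run:

```powershell
# Sketch: show the state of every cluster network interface, per node and network.
# After the test above, the REPLICATION interface on LITEX01 is the one expected
# to report a state other than Up.
Get-ClusterNetworkInterface |
    Format-Table Name, Node, Network, State -AutoSize
```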
We can also see this in the system event log, where event 1127 shows that the REPLICATION network interface has failed:
“Cluster network interface ‘litex01 – REPLICATION’ for cluster node ‘litex01’ on network ‘Cluster Network 1’ failed. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.”
In the Network High Availability Crimson Channel Log (Event Viewer > Applications and Services Logs > Microsoft > Exchange > HighAvailability > Network), we find event 1002 logged when there’s a failed connection on the REPLICATION network:
“A client-side attempt to connect to LITEX02 using ‘172.16.0.22:64327’ from ‘172.16.0.21:0’ failed: System.Net.Sockets.SocketException (0x80004005): A socket operation was attempted to an unreachable host 172.16.0.22:64327
at System.Net.Sockets.Socket.EndConnect(IAsyncResult asyncResult)
at Microsoft.Exchange.EseRepl.TcpClientChannel.TryOpenChannel(NetworkPath netPath, Int32 timeoutInMs, TcpClientChannel& channel, NetworkTransportException& networkEx)”
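Both events can also be pulled with Get-WinEvent instead of browsing Event Viewer. This is a sketch under assumptions: the channel name 'Microsoft-Exchange-HighAvailability/Network' is what I would expect for the Event Viewer path given above, so list the channels first and adjust the name if it differs on your server.

```powershell
# Sketch: query the two events discussed above on the local DAG member.

# Event 1127 from the failover clustering provider in the System log.
Get-WinEvent -FilterHashtable @{
    LogName      = 'System'
    ProviderName = 'Microsoft-Windows-FailoverClustering'
    Id           = 1127
} -MaxEvents 5 -ErrorAction SilentlyContinue |
    Format-List TimeCreated, Id, Message

# List the Exchange high availability channels to confirm the exact log name.
Get-WinEvent -ListLog 'Microsoft-Exchange-HighAvailability*' |
    Format-Table LogName, RecordCount -AutoSize

# Event 1002 from the Network channel (log name is an assumption, see above).
Get-WinEvent -FilterHashtable @{
    LogName = 'Microsoft-Exchange-HighAvailability/Network'
    Id      = 1002
} -MaxEvents 5 -ErrorAction SilentlyContinue |
    Format-List TimeCreated, Id, Message
```

The -ErrorAction SilentlyContinue switch only suppresses the error that Get-WinEvent raises when no matching events exist.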
The cluster has not failed over, as cluster communication is still possible over the MAPI network, which can be used for both cluster and client communication.
The databases should not have failed over so let’s confirm this:
Get-MailboxDatabaseCopyStatus -Server LITEX01 | sort Name | ft Name,Status,CopyQueueLength,ReplayQueueLength,ContentIndexState,IncomingLogCopyingNetwork -AutoSize
Get-MailboxDatabaseCopyStatus -Server LITEX02 | sort Name | ft Name,Status,CopyQueueLength,ReplayQueueLength,ContentIndexState,IncomingLogCopyingNetwork -AutoSize
Note that the IncomingLogCopyingNetwork no longer includes a REPLICATION network name and that we now have an error on database copy MDB03\LITEX01. To see the error message in full, we can run this command:
(Get-MailboxDatabaseCopyStatus MDB03\LITEX01).IncomingLogCopyingNetwork | fl
The error is:
“An error occurred while communicating with server ‘litex02.litwareinc.com’. Error: Unable to read data from the transport connection: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.”
The error doesn’t appear to have any practical impact, as the databases are still ‘in sync’: CopyQueueLength and ReplayQueueLength are both 0 for all passive copies on each server.
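The same ‘in sync’ check can be expressed as a filter that only returns passive copies with a non-zero queue. A minimal sketch, assuming the lab server names from this article; no output means every passive copy has empty copy and replay queues:

```powershell
# Sketch: report passive copies whose copy or replay queue is not empty.
# The server names come from this article's lab.
'LITEX01', 'LITEX02' |
    ForEach-Object { Get-MailboxDatabaseCopyStatus -Server $_ } |
    Where-Object {
        (-not $_.ActiveCopy) -and
        (($_.CopyQueueLength -gt 0) -or ($_.ReplayQueueLength -gt 0))
    } |
    Format-Table Name, Status, CopyQueueLength, ReplayQueueLength -AutoSize
```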
Test-ReplicationHealth
The Test-ReplicationHealth cmdlet is also a useful troubleshooting tool here. We can use it to run a number of replication health checks on each server:
Test-ReplicationHealth -Server LITEX01
Test-ReplicationHealth -Server LITEX02
Here we can see that the two checks ‘DBLogCopyKeepingUp’ and ‘DBLogReplayKeepingUp’ pass, which confirms that the log copy process is still working. As the REPLICATION network is down, this means that it is using the MAPI network.
We can also get more information about the checks that have failed:
Test-ReplicationHealth -Server LITEX01 | ? {$_.Result -match "FAILED"} | fl
The error is just a confirmation that the REPLICATION network interface on LITEX01 has failed:
“Node ‘litex01’ has a network interface that is down. The IP address is ‘172.16.0.21’. Current state is ‘Failed’.”
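To run the same check against both DAG members and keep only the failures, the two commands can be combined. A minimal sketch, assuming the lab server names from this article and the same "FAILED" match used above:

```powershell
# Sketch: collect failed replication health checks from both lab servers.
'LITEX01', 'LITEX02' |
    ForEach-Object { Test-ReplicationHealth -Server $_ } |
    Where-Object { $_.Result -match 'FAILED' } |
    Format-List Server, Check, Result, Error
```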
Fail back to REPLICATION network
Once the networking issue is resolved (in this case, by reconnecting the virtual network adapter), we can get Exchange to fail back to the REPLICATION network. Note that this is not automatic. To fail back to the REPLICATION network, we have two options:
- Restart the Microsoft Exchange Replication service
- Suspend and resume the mailbox database copies
We’ll choose the second option as this is less disruptive. First, suspend the passive mailbox database copies:
Suspend-MailboxDatabaseCopy MDB01\LITEX02 -Confirm:$false
Suspend-MailboxDatabaseCopy MDB02\LITEX02 -Confirm:$false
Suspend-MailboxDatabaseCopy MDB03\LITEX01 -Confirm:$false
Suspend-MailboxDatabaseCopy MDB04\LITEX01 -Confirm:$false
...then resume the passive mailbox database copies:
Resume-MailboxDatabaseCopy MDB01\LITEX02
Resume-MailboxDatabaseCopy MDB02\LITEX02
Resume-MailboxDatabaseCopy MDB03\LITEX01
Resume-MailboxDatabaseCopy MDB04\LITEX01
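With more than a handful of databases, the suspend and resume steps above can be scripted rather than typed per copy. This is a sketch only, assuming the lab server names from this article; the variable names are illustrative.

```powershell
# Sketch only: suspend and then resume every passive copy on both lab servers.
# Server names come from this article's lab; adjust them for your DAG.
$dagMembers = 'LITEX01', 'LITEX02'

foreach ($server in $dagMembers) {
    # ActiveCopy is $false for passive copies, so active copies are left alone.
    $passiveCopies = Get-MailboxDatabaseCopyStatus -Server $server |
        Where-Object { -not $_.ActiveCopy }

    foreach ($copy in $passiveCopies) {
        # $copy.Name has the form Database\Server, which both cmdlets accept as an identity.
        Suspend-MailboxDatabaseCopy -Identity $copy.Name -Confirm:$false
        Resume-MailboxDatabaseCopy -Identity $copy.Name
    }
}
```

Note that this resumes every passive copy it finds, including any that were suspended deliberately before the test, so check the copy status first if that matters in your environment.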
Once done, we can check whether the mailbox database copies are now using the REPLICATION network for log copy operations:
Get-MailboxDatabaseCopyStatus -Server LITEX01 | sort Name | ft Name,Status,CopyQueueLength,ReplayQueueLength,ContentIndexState,IncomingLogCopyingNetwork -AutoSize
Get-MailboxDatabaseCopyStatus -Server LITEX02 | sort Name | ft Name,Status,CopyQueueLength,ReplayQueueLength,ContentIndexState,IncomingLogCopyingNetwork -AutoSize
As you can see in the two screenshots above, the IncomingLogCopyingNetwork now has a value of ReplicationDagNetwork01, which means that our fail back to the REPLICATION network was successful.
Conclusion
In this post, we’ve gone through a few basics about DAG networks. We’ve also seen what happens when a REPLICATION network fails and then we’ve looked at how to fail it back.
In part 3, I’ll demonstrate what happens when the MAPI network fails.