Introduction
In part 1, we simulated a DAG member failure and in part 2, we simulated a REPLICATION network failure.
In this part, we'll move on to simulating a MAPI network failure, how to identify it, and how to recover from it.
To go to other parts in this series:
- Exchange 2016 Database Availability Group Troubleshooting (Part 1)
- Exchange 2016 Database Availability Group Troubleshooting (Part 2)
Cluster Network Roles
In part 2, we discussed the cluster network roles (None, Cluster only, Cluster and Client). If you glossed over that bit and you now have no idea what I’m talking about then click here to go back and read it.
Basically, the MAPI network is used for both client and cluster network communication, while the REPLICATION network is used for cluster network communication only. In Exchange terms, the MAPI network handles incoming client connections from the CAS services on that server or on a different server, and it's also used for connections to the LAN and other subnets for AD, internet access and so on. As we saw in part 2, it can also be used for DAG replication if the REPLICATION network fails.
The REPLICATION network is used only for cluster network communication, which means no CAS services can connect to it to provide access for clients. It's also on an isolated subnet with no route to the LAN, so it can't be reached from there anyway.
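If you want to see how these roles map onto the DAG network objects in your own environment, the MapiAccessEnabled and ReplicationEnabled flags on each DAG network show which type of traffic it carries; for example:
Get-DatabaseAvailabilityGroupNetwork | ft Identity,MapiAccessEnabled,ReplicationEnabled,Subnets -AutoSize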
When the MAPI network fails, we lose all networks that can be used for client communication, so any mailbox databases active on that particular server become inaccessible. The server itself should remain 'up' in the cluster, however, as the REPLICATION network can still be used for the cluster heartbeat.
Let’s see what happens in our testing. Well, in fact, I know what will happen but I won’t spoil it.
MAPI Network Failure Test
We start off with a healthy DAG with our two Exchange 2016 servers ‘up’ in the cluster:
- LITEX01
- LITEX02
For these tests, the two Exchange servers are configured to use DNS round robin load balancing for the CAS services. For more information on DNS round robin load balancing, see here.
There are four mailbox databases, MDB01, MDB02, MDB03 and MDB04, and we have a copy of each on each mailbox server.
- MDB01 and MDB02 are active on LITEX01 and there are passive copies on LITEX02
- MDB03 and MDB04 are active on LITEX02 and there are passive copies on LITEX01
We can confirm this by running the commands below:
Get-MailboxDatabaseCopyStatus -Server litex01 | Sort Name
Get-MailboxDatabaseCopyStatus -Server litex02 | Sort Name
We can confirm that our cluster networks are up also:
Get-ClusterNetwork | ft Name,Address,AddressMask,Role,State -AutoSize
Now let’s go ahead and disconnect the virtual network adapter for the MAPI network on LITEX01 and see what happens.
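In this lab the DAG members are virtual machines, so the simplest way to simulate the failure is to disconnect the vNIC from the host. If your lab runs on Hyper-V, that can be scripted along these lines (the VM name and adapter name here are assumptions about your environment):
Disconnect-VMNetworkAdapter -VMName LITEX01 -Name "MAPI"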
After a few seconds, the databases are mounted onto LITEX02:
Get-MailboxDatabaseCopyStatus -Server litex02 | Sort Name
Check DAG members are up
Let’s check the cluster nodes are up:
Get-ClusterNode
Yes, both nodes are up, as we expected, since the REPLICATION network is now providing the cluster communication between the nodes.
We can also use the Get-DatabaseAvailabilityGroup cmdlet to check this:
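A command along these lines shows the operational servers (the -Status switch is needed to populate that information; DAG01 is the DAG name used later in this post):
Get-DatabaseAvailabilityGroup DAG01 -Status | fl Name,Servers,OperationalServers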
As you can see, both servers are operational.
Check DAG networks
Let’s check that our MAPI network (10.2.0.0/16) is down but our REPLICATION network is still up:
Get-ClusterNetwork | ft Name,Address,AddressMask,Role,State -AutoSize
Now you can see that our MAPI network is down (ahem, up). Yes, that’s right, our MAPI network is up despite the MAPI interface being disconnected on LITEX01.
So, how do we actually find out that our MAPI network interface is in fact down? To check the network interfaces themselves, we could use the Windows Failover Cluster Manager if we had an Administrative Access Point (i.e. not an IP-less DAG), but in our case, in line with Microsoft's recommendations, we are using an IP-less DAG. Instead, we can use the Get-DatabaseAvailabilityGroupNetwork or the Get-ClusterNetworkInterface cmdlets to check the individual interfaces:
Get-DatabaseAvailabilityGroupNetwork | ft Identity,Subnets,Interfaces -Autosize
Get-ClusterNetworkInterface | ft -AutoSize
Now this makes a bit more sense – both cmdlets show that our MAPI network on LITEX01 has failed.
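On a larger DAG with more nodes and networks, it can help to filter straight down to the interfaces that aren't up, something like this:
Get-ClusterNetworkInterface | ? {$_.State -ne "Up"} | ft Node,Network,Name,State -AutoSize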
Application event log
In the application event log on LITEX02, we see the event below which lets us know that the MAPI network on LITEX01 has failed:
Log Name: Application
Source: MSExchangeRepl
Date: 11/01/2016 12:42:26
Event ID: 3190
Task Category: Service
Level: Error
Keywords: Classic
User: N/A
Computer: litex02.litwareinc.com
Description:
The MAPI network for server litex01 has failed.
This is then followed by event 2090 for each mailbox database that was active on LITEX01. This lets us know that the database will be mounted on LITEX02:
Log Name: Application
Source: MSExchangeRepl
Date: 11/01/2016 12:43:15
Event ID: 2090
Task Category: Action
Level: Information
Keywords: Classic
User: N/A
Computer: litex02.litwareinc.com
Description:
Database: MDB01\LITEX02
Mailbox server: litex02.litwareinc.com
A failover or switchover occurred. Database MDB01\LITEX02 will be mounted with no data loss.
It’s then followed by event 40018 to let us know that each mailbox database that was mounted on LITEX01 is now mounted on LITEX02:
Log Name: Application
Source: MSExchangeIS
Date: 11/01/2016 12:43:23
Event ID: 40018
Task Category: High Availability
Level: Information
Keywords: Classic
User: N/A
Computer: litex02.litwareinc.com
Description:
Active database MDB01 (fa0c03ad-3b51-45fe-b779-b689a3936aed) was successfully mounted.
Process id: 19448
Mount time: 7.4872661 seconds
Log generation: 0x3
Previous state: 3
Finally, event 3169 is logged for each mailbox database that was active on LITEX01, which confirms that the databases were mounted on LITEX02:
Log Name: Application
Source: MSExchangeRepl
Date: 11/01/2016 12:43:23
Event ID: 3169
Task Category: Service
Level: Information
Keywords: Classic
User: N/A
Computer: litex02.litwareinc.com
Description:
(Active Manager) Database MDB01 was successfully moved from litex01.litwareinc.com to litex02.litwareinc.com. Move comment: None specified.
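Rather than browsing through the Application log in Event Viewer, these MSExchangeRepl events can also be pulled out with Get-WinEvent; a query along these lines (the event IDs are the ones covered above) will do it:
Get-WinEvent -FilterHashtable @{LogName='Application'; ProviderName='MSExchangeRepl'; Id=3190,2090,3169} -MaxEvents 20 | ft TimeCreated,Id,Message -AutoSize -Wrap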
Crimson channel logs
In the High Availability Operational log (Applications and Services Logs/Microsoft/Exchange/HighAvailability/Operational), we can see more information about what is happening on the DAG. If you find that your DAG takes a long time to fail over, the best places to troubleshoot this are this Operational log and the failover cluster logs, which we'll look at later in this post.
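The same crimson channel can be queried from PowerShell if you prefer, for example:
Get-WinEvent -LogName 'Microsoft-Exchange-HighAvailability/Operational' -MaxEvents 50 | ft TimeCreated,Id,TaskDisplayName,Message -AutoSize -Wrap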
At the moment the MAPI network interface on LITEX01 fails (or some mad scientist decides to unplug it to show you what happens), we find that a server failover is initiated:
Log Name: Microsoft-Exchange-HighAvailability/Operational
Source: Microsoft-Exchange-HighAvailability
Date: 11/01/2016 12:42:26
Event ID: 137
Task Category: Automounter
Level: Information
Keywords:
User: SYSTEM
Computer: litex02.litwareinc.com
Description:
Initiating server failover (Node=litex01.litwareinc.com).
We then see the best copy selection, where Exchange works through an algorithm to find the best passive copy to activate. It states that the copy on LITEX02 is the best (only!?) copy to activate:
Log Name: Microsoft-Exchange-HighAvailability/Operational
Source: Microsoft-Exchange-HighAvailability
Date: 11/01/2016 12:43:08
Event ID: 150
Task Category: Database Action
Level: Information
Keywords:
User: SYSTEM
Computer: litex02.litwareinc.com
Description:
Database copy ‘MDB01’ on server ‘litex02.litwareinc.com’ has been selected for possible activation. The following checks were run: ‘IsHealthyOrDisconnected, ReplayQueueLength, IsPassiveCopy, IsPassiveSeedingSource, TotalQueueLengthMaxAllowed, ManagedAvailabilityAllHealthy, ActivationEnabled, MaxActivesUnderHighestLimit’.
Once done, Exchange reports which server it will move the mailbox databases to:
Log Name: Microsoft-Exchange-HighAvailability/Operational
Source: Microsoft-Exchange-HighAvailability
Date: 11/01/2016 12:43:08
Event ID: 342
Task Category: Database Action
Level: Information
Keywords:
User: SYSTEM
Computer: litex02.litwareinc.com
Description:
Proceeding with move for database ‘MDB01’ (FromServer=litex01.litwareinc.com, ToServer=litex02.litwareinc.com, MoveComment=None specified.)
A few more events are logged to notify us of any lossy failovers, and then we see event 322, which tells us the database has been moved successfully:
Log Name: Microsoft-Exchange-HighAvailability/Operational
Source: Microsoft-Exchange-HighAvailability
Date: 11/01/2016 12:43:15
Event ID: 322
Task Category: Database Action
Level: Information
Keywords:
User: SYSTEM
Computer: litex02.litwareinc.com
Description:
Active server for database ‘MDB01’ changed from litex01.litwareinc.com to litex02.litwareinc.com
Okay, so that’s all there is in the application and crimson channel event logs. Let’s move on to the failover cluster log.
Failover cluster log
To generate the failover cluster log, we can run the Get-ClusterLog cmdlet. The TimeSpan value is in minutes, so the command below generates the last five hours of logs (adjust this to whatever period you need):
Get-ClusterLog -TimeSpan 300 -Destination C:\temp\ClusterLog
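The generated log files can be quite large, so it's often quicker to search them for the relevant entries than to read them end to end; for example (assuming the destination path used above):
Select-String -Path "C:\temp\ClusterLog\*.log" -Pattern "missed two consecutive heartbeats","unreachable","Marking Route"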
If we look through the entries from around the time of the failure, we can find these events, which are logged when heartbeats from LITEX01 (10.2.0.21) are missed:
00000888.00000b84::2016/01/11-12:42:24.025 INFO [IM] got event: LocalEndpoint 10.2.0.22:~3343~ has missed two consecutive heartbeats from 10.2.0.21:~3343
00000888.00000b84::2016/01/11-12:42:24.025 INFO [CHM] Received notification for two consecutive missed HBs to the remote endpoint 10.2.0.21:~3343~ from 10.2.0.22:~3343~
Eventually, LITEX01 is marked as unreachable and the route is marked as down:
00000888.00000bb4::2016/01/11-12:42:26.994 DBG [NETFTAPI] Signaled NetftRemoteUnreachable event, local address 10.2.0.22:3343 remote address 10.2.0.21:3343
00000888.00000b84::2016/01/11-12:42:26.994 INFO [IM] got event: Remote endpoint 10.2.0.21:~3343~ unreachable from 10.2.0.22:~3343~
00000888.00000b84::2016/01/11-12:42:26.994 INFO [IM] Marking Route from 10.2.0.22:~3343~ to 10.2.0.21:~3343~ as down
The cluster service then checks the other routes to see whether they are also down, so it knows whether the whole node has failed. We can see below that the REPLICATION network (172.16.0.0/16) is still up:
00000888.00000b84::2016/01/11-12:42:26.994 INFO [NDP] Checking to see if all routes for route (virtual) local fe80::79e5:232a:8668:43a9:~0~ to remote fe80::746d:ee43:5344:14e4:~0~ are down
00000888.00000b84::2016/01/11-12:42:26.994 INFO [NDP] Route local 172.16.0.22:~3343~ to remote 172.16.0.21:~3343~ is up
Cluster validation checks
Now we can attempt to validate the cluster using the Test-Cluster cmdlet:
Test-Cluster
The above error is because LITEX01 is not responding on the MAPI network. You won’t be able to validate the cluster in this case.
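If you just want to re-run the network-related validation tests once the interface is back, you can scope the cmdlet, something along these lines (the test category name is an assumption and may vary between Windows versions):
Test-Cluster -Node LITEX01,LITEX02 -Include "Network"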
Test-ReplicationHealth
When we run this cmdlet, we get a few failures:
Test-ReplicationHealth
To view more information about the errors, we can filter for the failed checks and format the output as a list with fl:
Test-ReplicationHealth | ? {$_.Result -match "Failed"} | fl Check,Result,Error
Now, that's quite a lot of information; however, it really boils down to three issues:
- LITEX01 has a failed network interface, 10.2.0.21
- All databases fail the availability checks
- All databases fail the redundancy checks
The database checks fail because the Exchange Replication service on LITEX01 is reported as down. You can see this status by using the Get-MailboxDatabaseCopyStatus cmdlet:
Get-MailboxDatabaseCopyStatus -Server litex01 | Sort Name
CollectOverMetrics.ps1
We covered this built-in script in Part 1; essentially, it provides a report of the failovers since the date and time specified in the command. So, let's go ahead and run it and see what we find. Make sure you change to the Exchange scripts directory first:
cd $exscripts
.\CollectOverMetrics.ps1 -StartTime "01/11/2016 12:00" -GenerateHtmlReport -ShowHtmlReport -DatabaseAvailabilityGroup DAG01
When we look at our HTML report, we can see that the databases failed over and the “Action Reason” was MapiNetFailure (MAPI network failure):
How to recover from a MAPI network failure
Now that we've confirmed that a MAPI network failure on a DAG member causes a failover of all mailbox databases that were active on that member, we can recover the node. First, reconnect the MAPI network adapter or resolve the networking issue. Next, confirm that the cluster has marked the MAPI interface as up:
Get-ClusterNetworkInterface | ft -AutoSize
Once done, check the replication health:
Test-ReplicationHealth
…..and check that the mailbox database copies are healthy on LITEX01:
Get-MailboxDatabaseCopyStatus -Server litex01
Now we can rebalance our mailbox databases using the .\RedistributeActiveDatabases.ps1 script. You can find out more information about this script here.
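A commonly used form of the command balances the active copies by activation preference; run from the Exchange scripts directory, it looks something like this:
.\RedistributeActiveDatabases.ps1 -DagName DAG01 -BalanceDbsByActivationPreference -Confirm:$false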
Conclusion
This concludes part 3 of our DAG troubleshooting series. We’ve caused yet another failure – a MAPI network failure – and we’ve gone through the different troubleshooting steps and recovery from the failure.
Stay tuned for part 4 where we’ll look at DAG member storage failures.