Saturday, December 20, 2014

Connectivity issues with RD Gateway and partner organization

After several weeks of troubleshooting and scratching our heads, I just found the solution to a problem we had been having with RD Gateway. A partner organization had a direct connection into our datacenter, and we were trying to use that connection to give them access to our RDS-hosted virtual desktop environment. I had stood up a pair of RD Gateway/RD Web Access servers running on Server 2012, placed them behind a load balancer, and had our security team open the correct ports on the firewall. Everything should have been working, but apparently Murphy's law struck...

They could access the RDWeb page and successfully authenticate, but when actually attempting a connection to a VDI, the connection would hang for them. The odd thing was, I could see successful connections coming across our RD Gateway server. The sequence of events on the gateway server was as follows:
  1. The gateway server sees an incoming connection request, and allows the request based on the connection authorization policy.
  2. The client attempts to connect to our connection brokers, and the connection is allowed based on the resource authorization policy.
  3. The client successfully connects to the connection broker.
  4. About 8 seconds later, the session to the connection broker is destroyed (meaning the broker has done its part, and forwarded an available VDI to the client).
  5. Another incoming connection request, this time wanting to connect to the VDI specified by the connection broker.
  6. The client successfully connects to the VDI (the end-user even sees a certificate error based on the VDI's self-signed RDP certificate).
  7. Anywhere from 0-2 seconds later, the session is disconnected. The event is logged as informational rather than as an error or warning, almost as if the client is gracefully disconnecting.

Event showing the user gracefully disconnecting from the VDI after one second.

This is the series of events as I see it from the perspective of the gateway server. From the end-user's perspective, they receive a certificate error as the initial connection to the VDI is established, and at that point, the RDP client appears to hang trying to finish the connection. It was clearly not a graceful disconnect from the client's side; something was going wrong.

The other odd thing about this was that we had done extensive testing and proven that the gateway server was configured and functioning correctly. We had attached a laptop to the same switch as the partner organization's firewall, even going so far as copying the port config to ensure everything was identical. Of course we were able to connect to the VDI with no issues. Proves the issue must be on their side, right?

Contrary to that thought, I had given them access to a published RemoteApp, which they could launch and connect to fine. So on the one hand, we had a test that basically confirmed the problem was on their side, and on the other, we had a test that basically confirmed the issue resided in our VDIs.

After struggling with this issue for several weeks, with multiple calls between each organization's engineers, examining log files, and capturing network traces, we finally had a eureka moment. A network engineer from the partner organization had asked for the local IP address of one of the VDIs on our end, in order to search their network traces for that address. He found what appeared to be failed ICMP attempts. Looking at the IP address we gave him, he also noted that they use the same IP range in their own network.
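
For anyone wanting to check for this kind of address collision up front, here is a minimal sketch using Python's standard ipaddress module. The subnets and the VDI address are made-up placeholders, not our real ranges, and the check only illustrates the overlap itself; it is not a model of how the RDP client decides what counts as "local."

```python
import ipaddress


def is_in_any(addr, networks):
    """Return True if addr falls inside any of the given CIDR networks."""
    ip = ipaddress.ip_address(addr)
    return any(ip in ipaddress.ip_network(n) for n in networks)


# Hypothetical values for illustration only.
partner_networks = ["10.20.0.0/16", "192.168.50.0/24"]  # ranges the partner uses internally
vdi_address = "10.20.14.37"                             # address handed back by the broker

collides = is_in_any(vdi_address, partner_networks)
print(collides)
```

If this reports a collision, the remote side will treat the VDI address as one of its own, which is worth knowing before blaming firewalls.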

Now, there was already a NAT in place pointing to our load-balanced gateway servers, but it appeared that when connecting to the VDI, their RDP client was not tunneling the traffic through the NAT'ed gateway server, but was instead trying to connect directly to the IP address of our VDI. That's when it hit me.
The default option to bypass the RD gateway server.

In RDS 2012, when you set up a gateway server and add it to the RDS farm through Server Manager, it automatically edits some settings in the farm properties. In Server Manager, if you go into the farm properties and click the RD Gateway tab, it will have filled in the DNS address of your gateway server, as well as some other information. There is a small checkbox toward the bottom that states "Bypass gateway server for local addresses." This was enabled by default. So here's what was happening:
  • When the partner organization attempted a connection to our VDI, it first tunneled through the gateway server, and contacted the connection brokers in order to find a VDI.
  • Once the connection brokers had found a suitable VDI, they passed back the local IP address of the VDI.
  • Since our VDI was on a subnet that they also used on their internal network, their RDP client saw that it was a local address and bypassed the gateway server, instead trying to connect directly to the IP address.
I proved this theory correct by making a copy of the RDP file that is generated, opening it in Notepad, and changing the gatewayusagemethod attribute from a value of 2 (bypass for local addresses) to a value of 1 (always use gateway server, even for local addresses). I sent them this modified RDP file, had them launch it, and voila! They connected to our VDI just fine.
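
I made that edit by hand, but the same change can be sketched in a few lines of Python for anyone who needs to patch more than one file. This is only an illustration: it assumes the setting appears in the usual name:type:value form (gatewayusagemethod:i:2), the host names in the sample are invented, and the function works on text that has already been read from the file, so reading and writing the file with the right encoding is left to the caller.

```python
def force_gateway(rdp_text: str) -> str:
    """Return RDP file text with gatewayusagemethod set to 1 (always use the gateway)."""
    key = "gatewayusagemethod:i:"
    lines = rdp_text.splitlines()
    found = False
    for i, line in enumerate(lines):
        # Setting names are matched case-insensitively, ignoring stray whitespace.
        if line.strip().lower().startswith(key):
            lines[i] = key + "1"
            found = True
    if not found:
        # If the setting is missing entirely, add it explicitly.
        lines.append(key + "1")
    return "\n".join(lines) + "\n"


# Hypothetical RDP file content for illustration only.
sample = (
    "full address:s:broker.example.local\n"
    "gatewayhostname:s:rdgw.example.com\n"
    "gatewayusagemethod:i:2\n"
)

patched = force_gateway(sample)
print(patched)
```

The other lines are passed through untouched, so the patched file differs from the original only in the gateway usage setting.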

Our next course of action is to disable the "Bypass gateway server for local addresses" option at the farm level. This will let us deliver a consistent user experience for the partner organization, giving them access to the environment via RDWeb the same way we have it internally. It's also a benefit to our help desk, as they only have to support a single access method. The only downside is that it will force all of our internal RDS traffic to tunnel through the gateway servers, even though those end-users have a direct route to the VDIs. I think we can manage that trade-off.

This was a tough one to crack. None of us knew where the issue lay. We had checked firewalls, logs, network traces, consulted with outside experts, yet despite all of that, we couldn't find anything wrong. I'm extremely glad to have finally figured out this issue as it had been plaguing me for several weeks. Not too shabby for a lazy Friday!

P.S. - This is the webpage I utilized to find all the RDP attributes, very handy indeed! - http://www.donkz.nl/files/rdpsettings.html
