I've not had much luck deploying Azure AD Connect and ADFS 3.0 in Azure for a client over the last few weeks. After some networking woes I moved on to the server provisioning and again got stuck. Now, I know IT is not meant to be easy, otherwise there wouldn't be the salaries paid out to the best and brightest. This install, though, was simple and nothing out of the ordinary: a standard deployment that I and many others have done before.
Let me paint the picture: ADFS is now running, although not working, in Azure compute across a load-balanced set of two servers, with a further load-balanced set of Web Application Proxy (WAP) servers in front. There are two domain controllers and an AAD Connect server, all spread across a couple of subnets in a VNET.
I installed ADFS through the AAD Connect wizard. It's a fairly new way of doing things and a completely streamlined process, and I wanted to see how it went. It shouldn't be that complicated to deploy and configure a Windows Server role anyway (ADFS 3.0 on Server 2012 R2 is a Windows Server role). So it was nice to sit back and enjoy the ride. Unfortunately, instead of enjoying the reward, I'm now enjoying the headache.
The following is the ADFS login page after an unsuccessful sign in:
I spent almost half a day troubleshooting this. I made some good inroads, only to fall short every time. The problem is that there is not much documentation online. So what were the steps I went through, and why didn't they work?
Authentication locally - Authentication via ADFS was actually working. I went to the URL https://sts.mydomain.com.au/adfs/ls/idpinitiatedsignon and successfully signed in. Essentially it's a dummy logon to see if ADFS is working. I signed on, was greeted with “you've successfully logged on”, and signed out. Next I moved on to ADFS itself.
Checked ADFS configuration - AAD Connect did the entire ADFS config for me. I entered the required information in the wizard and ran through the process. Great! However, this was my first port of call, as automation can be prone to failure. It all checked out, though: the settings were correct and nothing was out of the ordinary.
Reset the relying party trust to Office 365 - Thinking that there may have been a problem somewhere in the relying party trust, I deleted it. This can be re-created by running the following PowerShell cmdlet:
Unfortunately re-creating the relying party trust didn't work either.
Checking the event logs on the primary ADFS server - I know, I know. Everyone always says to check the event logs first to see what's what. While I did look at the logs before doing any work, I overlooked a key line item, which sent me through the previous steps first. So, coming back to Event Viewer, I examined Event ID 364 and Event ID 111 in more detail rather than just looking at the obscure first couple of lines. The error I received was as follows:
I examined the errors in more detail and found a line in Event ID 364 that looked significant in that it referenced something I thought would have been fine: ADDS. The line in question is as follows:
Next stop: Google.
Googled the error - This process is always like flipping a coin. You can find great success in the first article on page 1 of the search results. Or, on the other side (where I was), you find fewer than 10 articles and limited know-how about the problem. I persisted and ran through various blogs, sites and support articles.
This is rather tricky and probably won't be the case for everyone. In my situation the problem stemmed from the existing on-premises ADDS environment. It's a single forest with a single domain, with domain controllers ranging from Server 2003 R2 all the way up to Server 2012 R2. The problem was not so much the ADDS environment itself as its maintenance and management. The client has about 25 sites, and there have been updates, changes and failures of DCs across the board, which left stale records for failed or “decommissioned” DCs. The solution was to run through an in-depth remediation of ADDS, ADDS-integrated DNS, ADDS Sites and Services and finally the NTDS database to remove the stale records for old DCs.
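As a rough first pass before a cleanup like this, a short script can flag which recorded DC names no longer answer at all. The following Python sketch is only an illustration under stated assumptions: the function name find_unreachable is made up for this example, the hostnames in the usage note are hypothetical, and it only tests whether something accepts a TCP connection on the standard LDAP port 389. A host that answers is not proven healthy, and a host that does not answer may simply be firewalled rather than stale.

```python
import socket

LDAP_PORT = 389  # standard LDAP port that domain controllers listen on


def find_unreachable(hosts, port=LDAP_PORT, timeout=2.0):
    """Return the hosts that do not accept a TCP connection on the given port.

    This is a connectivity check only. It does not query ADDS, DNS or the
    NTDS database, so treat the result as a list of names worth a closer look.
    """
    unreachable = []
    for host in hosts:
        try:
            # Open and immediately close a TCP connection to the host.
            with socket.create_connection((host, port), timeout=timeout):
                pass
        except OSError:
            # Covers name resolution failures, refusals and timeouts.
            unreachable.append(host)
    return unreachable
```

Called with a list of names taken from DNS or Sites and Services, for example find_unreachable(["dc01.mydomain.com.au", "dc02.mydomain.com.au"]) (both names are placeholders), it returns the names that did not respond, which are candidates for a proper stale record investigation.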
From what I've found, ADFS can't be forced to query a single DC. When ADFS looks up ADDS information, the queries are made behind the scenes, and if there's a problem anywhere along the chain, such as a stale DC record, this error is the result.
I've been lucky in that all of the ADFS 2.0, ADFS 2.1 and ADFS 3.0 deployments I've completed thus far were on domains that didn't have any problems. There were no stale DCs, and for the most part maintenance and management had things under control.
Driving a car is second nature. Typing on your keyboard without looking at the keys is something you do every day. Even when you're proficient at something, or it's something you can do subconsciously, it's always a good idea to do some due diligence.
I'll be making sure to ask whether ADDS has been well managed from here on in. Stale records and domain controllers that have long since been taken out of active service need to be removed correctly, to keep ADDS performing well and to keep future service expansions, such as ADFS, from being impacted.