Background

Our company needed a unified authentication and authorization service for front-end applications and back-end microservices. It had to support the OAuth 2.0 standard and Role-Based Access Control (RBAC), and it had to work across multiple clouds. After evaluating the options, we chose KeyCloak to meet these requirements.

About KeyCloak

Keycloak is open source identity and access management software that provides single sign-on (SSO) and supports the OpenID Connect, OAuth 2.0, and SAML 2.0 standard protocols. Keycloak provides a customizable user interface for login, registration, and account management. As of March 2018, Red Hat manages this JBoss community project as the upstream project for its RH-SSO offering. Conceptually, the tool is intended to make it easy to secure applications and services with little or no coding.

Infrastructure architecture

The diagram above shows the architecture we used to deploy KeyCloak on AWS. We run the Amazon ECS cluster on AWS Fargate, a serverless compute service for containers. With Fargate we do not need to manage servers, and we can isolate applications, allocate resources to each application individually, and improve security. To keep the system highly available, the Amazon ECS service runs two tasks in different Availability Zones (AZs), so if one task cannot serve requests, the other continues to do so. External requests reach ECS through an Application Load Balancer (ALB). The database is Amazon RDS for MySQL.

Amazon ECS runs Docker containers, so the Keycloak container image can be deployed directly. We use the image provided by the JBoss community, version 16.1.1.

Basic configuration

Database Configuration

KeyCloak uses the H2 database by default. To use Amazon RDS for MySQL, we have added the following configuration to the environment variables:

Generic variable names can be used to configure any Database type, defaults may vary depending on the Database.

  • DB_ADDR: Specify hostname of the database (optional). For postgres only, you can provide a list of hostnames separated by comma to failover alternative host. The hostname can be the host only or pair of host and port, as example host1,host2 or host1:5421,host2:5436 or host1,host2:5000. And keycloak will append DB_PORT (if specify) to the hosts without port, otherwise it will append the default port 5432, again to the address without port only.
  • DB_PORT: Specify port of the database (optional, default is DB vendor default port)
  • DB_DATABASE: Specify name of the database to use (optional, default is keycloak).
  • DB_USER: Specify user to use to authenticate to the database (optional, default is an empty string).
  • DB_PASSWORD: Specify user’s password to use to authenticate to the database (optional, default is an empty string).
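
The port rule quoted above for DB_ADDR can be sketched in a few lines of Python. This is only an illustration of the described behavior, not the image's actual startup script; the function name normalize_db_addr is hypothetical, and IPv6 literals are not handled.

```python
def normalize_db_addr(db_addr, db_port=None, default_port=5432):
    """Append a port to every host in a comma-separated list that lacks one.

    Mirrors the rule quoted from the image documentation: hosts that already
    carry a port are left alone; the others get DB_PORT if it is set, or the
    vendor default otherwise. Hypothetical helper, IPv6 is not handled.
    """
    port = db_port if db_port is not None else default_port
    hosts = []
    for entry in db_addr.split(","):
        entry = entry.strip()
        if not entry:
            continue  # ignore empty items such as a trailing comma
        hosts.append(entry if ":" in entry else f"{entry}:{port}")
    return ",".join(hosts)
```

For example, normalize_db_addr("host1,host2:5000") yields host1:5432,host2:5000, which matches the third example in the quoted documentation.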

Enabling statistics

To let the ALB and ECS know whether the container is running properly, we need to enable the endpoint that exposes statistics. Following the image documentation, we set KEYCLOAK_STATISTICS=all.

Keycloak image can collect some statistics for various subsystem which will then be available in the management console and the /metrics endpoint. You can enable it with the KEYCLOAK_STATISTICS environment variables which take a list of statistics to enable.

Cluster configuration

As shown in the diagram above, we run multiple KeyCloak instances at the same time. These instances need to discover each other, form a cluster, and share in-memory data. Following the image documentation, we set the environment variable JGROUPS_DISCOVERY_PROTOCOL to dns.DNS_PING.

Clustering

Replacing the default discovery protocols (PING for the UDP stack and MPING for the TCP one) can be achieved by defining some additional environment variables:

  • JGROUPS_DISCOVERY_PROTOCOL – name of the discovery protocol, e.g. dns.DNS_PING
  • JGROUPS_DISCOVERY_PROPERTIES – an optional parameter with the discovery protocol properties in the following format: PROP1=FOO,PROP2=BAR
  • JGROUPS_TRANSPORT_STACK – an optional name of the transport stack to use udp or tcp are possible values. Default: tcp
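
To make the PROP1=FOO,PROP2=BAR format of JGROUPS_DISCOVERY_PROPERTIES concrete, here is a minimal Python sketch that splits such a string into key/value pairs. The function parse_discovery_properties is a hypothetical helper written for this article; it does not come from the image or from JGroups.

```python
def parse_discovery_properties(raw):
    """Parse a 'PROP1=FOO,PROP2=BAR' string into a dict.

    Hypothetical helper that only illustrates the documented format.
    Raises ValueError for an item that is not of the form KEY=VALUE.
    """
    props = {}
    for pair in raw.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate empty items
        key, sep, value = pair.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"expected KEY=VALUE, got {pair!r}")
        props[key.strip()] = value.strip()
    return props
```

A check like this can be handy in a deployment script to catch a malformed value before the container starts.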

Importing initial settings

Some initial settings need to be imported automatically when the container starts, so that we don't have to log in to the admin console and configure them by hand. Following the documentation, we set KEYCLOAK_IMPORT.

Importing a realm

To create an admin account and import a previously exported realm run:

docker run -e KEYCLOAK_USER=<USERNAME> -e KEYCLOAK_PASSWORD=<PASSWORD> \
    -e KEYCLOAK_IMPORT=/tmp/example-realm.json -v /tmp/example-realm.json:/tmp/example-realm.json jboss/keycloak

Running into pitfalls

With everything in place, we deployed the image to AWS expecting to see the KeyCloak login page, only to see a pile of errors instead. We had not expected a pitfall, and we certainly had not expected a whole chain of them.

Pitfall 1: Database connection failed

At first, the ECS task could not start. The logs showed that the database connection had failed. Strangely, the same image and configuration worked fine in our local Docker environment but failed on AWS. A closer look at the following log lines suggested that the problem was related to the SSL settings.

Caused by: com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure
Caused by: javax.net.ssl.SSLHandshakeException: No appropriate protocol (protocol is disabled or cipher suites are inappropriate)
Caused by: java.sql.SQLException: javax.resource.ResourceException: IJ000453: Unable to get managed connection for java:jboss/datasources/KeycloakDS
Caused by: javax.resource.ResourceException: IJ000453: Unable to get managed connection for java:jboss/datasources/KeycloakDS

The solution

Because ECS and RDS are in the same VPC, we decided that the connection between them did not need SSL encryption. After we added the following setting to the environment variables, the problem was resolved.

- JDBC_PARAMS: useSSL=false
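
As a rough illustration of where a parameter string like useSSL=false ends up, the sketch below builds a MySQL JDBC URL with the parameters as its query string. How the image assembles the URL internally is not shown in this article, so treat build_mysql_jdbc_url and the host name as assumptions made for the example.

```python
def build_mysql_jdbc_url(host, port=3306, database="keycloak", jdbc_params=""):
    """Return a MySQL JDBC URL, appending jdbc_params as the query string.

    Hypothetical helper: it shows the general jdbc:mysql://host:port/db?params
    shape, not the image's own startup logic.
    """
    url = f"jdbc:mysql://{host}:{port}/{database}"
    params = jdbc_params.lstrip("?")  # accept the value with or without '?'
    return f"{url}?{params}" if params else url


# db.example.internal is a placeholder host name, not a real endpoint.
example_url = build_mysql_jdbc_url("db.example.internal", jdbc_params="useSSL=false")
```

Additional parameters can be chained in the same string with &, for example useSSL=false&characterEncoding=UTF-8.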

Pitfall 2: Unable to import initial settings

The ECS task started successfully, but the initial settings were not imported. After some investigation, we found that the script upload feature had to be switched on with upload_scripts=enabled. We also found that with KEYCLOAK_IMPORT the settings are imported only on the first startup; the import is skipped if the realm already exists in the database. What we wanted was for the configuration file to be imported into the AWS environment every time it changed. So, after consulting the official KeyCloak documentation, we removed the KEYCLOAK_IMPORT setting and switched to the following migration settings.

The solution

- JAVA_OPTS_APPEND=-Dkeycloak.profile.feature.upload_scripts=enabled \
-Dkeycloak.migration.action=import \
-Dkeycloak.migration.provider=singleFile \
-Dkeycloak.migration.realmName=my-realm \
-Dkeycloak.migration.usersExportStrategy=REALM_FILE \
-Dkeycloak.migration.file=/tmp/my-realm.json
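
A long JAVA_OPTS_APPEND value is easy to mistype, so it can help to generate it from a table of system properties. The following Python sketch does that for the properties listed above; build_java_opts is a hypothetical helper, and the realm name and file path are simply the ones used in this article.

```python
def build_java_opts(properties):
    """Join a mapping of JVM system properties into '-Dkey=value' flags."""
    return " ".join(f"-D{key}={value}" for key, value in properties.items())


# The same properties as in the setting above, in the same order.
IMPORT_PROPERTIES = {
    "keycloak.profile.feature.upload_scripts": "enabled",
    "keycloak.migration.action": "import",
    "keycloak.migration.provider": "singleFile",
    "keycloak.migration.realmName": "my-realm",
    "keycloak.migration.usersExportStrategy": "REALM_FILE",
    "keycloak.migration.file": "/tmp/my-realm.json",
}

java_opts_append = build_java_opts(IMPORT_PROPERTIES)
```

The resulting string can then be written into the task's environment as the value of JAVA_OPTS_APPEND.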

Pitfall 3: Health check failed

Shortly after starting successfully, the ECS task was forcibly stopped. The cause was a failing health check: the ALB considered the task unhealthy, so the task was stopped and a new one was started in its place.

The solution

Although we had set KEYCLOAK_STATISTICS=all, there was also a hidden option that had to be turned on. First, port 9990 must be opened in the ECS task configuration. Then add the following to the JVM system properties:

- JAVA_OPTS_APPEND= ... -Djboss.bind.address=0.0.0.0 ...

Pitfall 4: Infinite redirects after login

Now we could finally see the KeyCloak login page, but it was too early to relax: another pitfall was waiting. After entering a user name and password, the browser redirected several times and then settled on a blank white page.

WTF ? !

After a long investigation, we found a clue in an obscure corner of the official documentation, which describes exactly this symptom of infinite redirects.

If you try to authenticate with Keycloak to your application, but authentication fails with an infinite number of redirects in your browser and you see the errors like this in the Keycloak server log:

2017-11-27 14:50:31,587 WARN [org.keycloak.events] (default task-17) type=LOGIN_ERROR, realmId=master, clientId=null, userId=null, ipAddress=aa.bb.cc.dd, error=expired_code, restart_after_timeout=true

it probably means that your load balancer needs to be set to support sticky sessions.

The solution

Edit the AWS ALB Target Group settings and turn on Stickiness.

Pitfall 5: Cluster mode cannot be enabled

We could now log in to the KeyCloak admin console. When the front-end page was opened, users who were not logged in were redirected to the KeyCloak login page. At this point the front-end page obtained an access token. Just when I thought we were done, another problem appeared: when calling the back-end API with the token, some requests succeeded and some failed with 401 Unauthorized. Stranger still, the same API call with the same parameters would fail once and succeed the next time!

This problem could not be reproduced in the local Docker environment either. Searching Google turned up no clues, so I had to rely on my own reasoning. Since the failures were random and the local environment was fine, I suspected the multi-node deployment. Suppose KeyCloak has two nodes, A and B, and B is the faulty one. Because the ALB distributes requests between them, an error occurs only when a request lands on B.

To test this hypothesis, I shut down one ECS task (one KeyCloak instance). With only one task running, requests should either all succeed or all fail, with no random failures. Verifying from the front end confirmed the hypothesis: with a single node running, every request succeeded!

But wait: hadn't we enabled sticky sessions on the ALB? Requests from the same client should be routed to the same instance, so why were they spread across nodes A and B? The reason is that a back-end API request is first routed by the ALB to a Gateway instance, and the Gateway then calls KeyCloak to validate the token. If we started only one Gateway instance, the problem would not arise.
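
The reasoning above can be illustrated with a toy simulation. It is a deliberately simplified model, not KeyCloak's real token validation: each node keeps its own set of known sessions unless the nodes share one store, the login happens on node A, and validation calls alternate between the two nodes. All class and function names are hypothetical.

```python
import itertools


class Node:
    """Toy KeyCloak node that knows a session only if it is in its store."""

    def __init__(self, name, shared_store=None):
        self.name = name
        # Clustered nodes share one store; isolated nodes each get their own.
        self.sessions = shared_store if shared_store is not None else set()

    def login(self, token):
        self.sessions.add(token)

    def validate(self, token):
        return 200 if token in self.sessions else 401


def simulate(clustered, requests=6):
    """Log in on node A, then validate the token on A and B alternately."""
    shared = set() if clustered else None
    nodes = [Node("A", shared), Node("B", shared)]
    nodes[0].login("token-1")  # the browser's sticky session lands on A
    round_robin = itertools.cycle(nodes)  # validation calls are not sticky
    return [next(round_robin).validate("token-1") for _ in range(requests)]
```

With clustered=False the status codes alternate between 200 and 401, which is the random-looking failure pattern described above; with clustered=True every call returns 200.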

The solution

The root cause was that the KeyCloak nodes had not formed a cluster and were not operating as a whole. With JGROUPS_DISCOVERY_PROTOCOL=dns.DNS_PING configured, the nodes should have been able to discover each other through DNS. While I was puzzling over this, a sentence in the Keycloak cluster setup article gave me a hint.

TCPPING use TCP protocol with 7600 port. This can be used when multicast is not available, e.g. deployments cross DC, containers cross host.

The DNS_PING we use is also based on TCP, so does it need port 7600 as well? We opened port 7600 in the ECS task and allowed inbound TCP traffic on port 7600 in the AWS Security Group.
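
When debugging this kind of problem, a quick way to confirm that a port is reachable from another machine in the VPC is a plain TCP connection attempt. The sketch below uses only the Python standard library; can_connect is a hypothetical helper, and the host you pass in would be the private IP of the other node.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

A False result for port 7600 between two nodes points at the task port settings or the Security Group rules rather than at KeyCloak itself.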

We then saw the following log on node A:

INFO [org.infinispan.CLUSTER] (thread-7,ejb,ip-10-11-193-82) ISPN000093: Received new, MERGED cluster view for channel ejb: MergeView::[ip-10-11-193-82|1] (2) [ip-10-11-193-82, ip-10-11-193-51], 2 subgroups: [ip-10-11-193-82|0] (1) [ip-10-11-193-82], [ip-10-11-193-51|0] (1) [ip-10-11-193-51]
INFO [org.infinispan.CLUSTER] (thread-7,ejb,ip-10-11-193-82) ISPN100000: Node ip-10-11-193-51 joined the cluster

The logs of node B are as follows:

INFO [org.infinispan.CLUSTER] (thread-5,null,ip-10-11-193-51) ISPN000093: Received new, MERGED cluster view for channel ejb: MergeView::[ip-10-11-193-82|1] (2) [ip-10-11-193-82, ip-10-11-193-51], 2 subgroups: [ip-10-11-193-82|0] (1) [ip-10-11-193-82], [ip-10-11-193-51|0] (1) [ip-10-11-193-51]
INFO [org.infinispan.CLUSTER] (thread-5,null,ip-10-11-193-51) ISPN100000: Node ip-10-11-193-82 joined the cluster

The cluster formed successfully, and the problem was solved!

Pitfall 6: Single point of failure

The system was finally working. Think that's the end? That would be too naive.

While running availability tests, we found that if we shut down one KeyCloak node, some requests failed and only succeeded again after logging back in. Was this the multi-node problem again? Wasn't cluster mode already working?

After reading the official documentation closely, we learned that KeyCloak does not keep backup copies of cached data by default, so if a node fails, the cached data it held is lost.

By default Keycloak does NOT replicate caches like sessions, authenticationSessions, offlineSessions, loginFailures and a few others (See Eviction and Expiration for more details), which are configured as distributed caches when using a clustered setup. Entries are not replicated to every single node, but instead one or more nodes is chosen as an owner of that data. If a node is not the owner of a specific cache entry it queries the cluster to obtain it. What this means for failover is that if all the nodes that own a piece of data go down, that data is lost forever. By default, Keycloak only specifies one owner for data. So if that one node goes down that data is lost.

The solution

Add the following environment variables to set the number of cache owners to 2 or more:

- CACHE_OWNERS_COUNT=2
- CACHE_OWNERS_AUTH_SESSIONS_COUNT=2
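
The effect of the owner count can be seen in a small simulation. This is a simplified model of a distributed cache, not Infinispan's actual placement algorithm: each entry is assigned to a fixed number of consecutive nodes, one node fails, and we count which entries still have a living owner. The function name and the placement rule are assumptions made for the example.

```python
def surviving_entries(keys, node_count, owners, failed_node):
    """Return the keys that still have at least one owner after a node fails.

    Toy placement rule: entry i is owned by nodes i, i+1, ... (mod node_count),
    one per owner. Real caches use consistent hashing instead.
    """
    survivors = []
    for index, key in enumerate(keys):
        owner_nodes = {(index + offset) % node_count for offset in range(owners)}
        if owner_nodes - {failed_node}:  # some owner is still alive
            survivors.append(key)
    return survivors


sessions = [f"session-{i}" for i in range(10)]
with_one_owner = surviving_entries(sessions, node_count=2, owners=1, failed_node=0)
with_two_owners = surviving_entries(sessions, node_count=2, owners=2, failed_node=0)
```

In this model, with two nodes and one owner, half of the sessions disappear when node 0 fails, while with two owners all ten sessions survive.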

Conclusion

This article summarizes some of the pitfalls I ran into while deploying KeyCloak on AWS. There is not much information available online. Writing this up took effort, so if you found it useful, please like, follow, and share. Thank you!

References

  • Keycloak
  • Keycloak Docker image
  • Keycloak Cluster Setup