Tuesday, 14 April 2015

Keycloak on Kubernetes with OpenShift 3

This is the second of two articles about clustered Keycloak running with Docker and Kubernetes. In the first article we manually started one PostgreSQL Docker container and a cluster of two Keycloak Docker containers. In this article we’ll use the same Docker images from DockerHub, and configure Kubernetes pods to run them on OpenShift 3.


See the first article for more detailed instructions on how to set up Docker with VirtualBox.

Let’s get started with Kubernetes.


Installing OpenShift 3


OpenShift 3 is based on Kubernetes, so we'll use it as our Kubernetes provider. If you have a ready-made Linux system with a Docker daemon running, the easiest way to install OpenShift 3 is by installing Fabric8.


Open a native shell, and make sure your Docker client can be used without sudo - you either need your environment variables set up properly, or you need to be root.


To set up the shell environment, determine your Docker host’s IP and set the following:


 export DOCKER_IP=192.168.56.101
 export DOCKER_HOST=tcp://$DOCKER_IP:2375


Make sure to replace the IP with the one of your Docker host.


Now run the following one-liner. It downloads and executes the Fabric8 installation script, which among other things installs OpenShift 3 as a Docker container and sets up your networking using iptables and route:


 bash <(curl -sSL https://bit.ly/get-fabric8) -f


It will take a few minutes for various Docker images to be downloaded, and started. At the end a browser window may open up - if you are in a desktop environment. You can safely close it as we won’t need it.


The next thing to do is to set up an alias for executing the OpenShift client tool:


 alias osc="docker run --rm -i -e KUBERNETES_MASTER=https://$DOCKER_IP:8443 --entrypoint=osc --net=host openshift/origin:v0.3.4 --insecure-skip-tls-verify"


Note: OpenShift development moves fast. By the time you're reading this the version may not be v0.3.4 any more. You can use docker ps to identify the current version used.

Every time we execute the osc command in the shell, a new Docker container is created for one-time use from the OpenShift image, and its local copy of osc is executed.
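
One caveat: bash does not expand aliases in non-interactive scripts by default, and because the alias is defined in double quotes, $DOCKER_IP is fixed at the moment the alias is created. If you want to call osc from a script, a shell function is one alternative. The following is only a sketch that wraps the same docker run command as the alias; OPENSHIFT_VERSION is a variable name made up for this example, not something OpenShift or Fabric8 defines.

```shell
# Hypothetical variable: set it to the image tag that `docker ps` shows.
OPENSHIFT_VERSION="${OPENSHIFT_VERSION:-v0.3.4}"

# Drop the alias if it exists, so it cannot interfere with the function.
unalias osc 2>/dev/null || true

# Same docker invocation as the alias; DOCKER_IP is read at call time.
osc() {
  docker run --rm -i \
    -e KUBERNETES_MASTER="https://${DOCKER_IP}:8443" \
    --entrypoint=osc --net=host \
    "openshift/origin:${OPENSHIFT_VERSION}" \
    --insecure-skip-tls-verify "$@"
}
```

With this in place, osc get pods behaves the same way as with the alias, in scripts too.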


Let’s make sure that it works:

 osc get pods


We should get back a list of several pods created by OpenShift and Fabric8.


Kubernetes basics



Kubernetes is a technology for provisioning and managing Docker containers.


While the scope of Docker is one host running one Docker daemon, Kubernetes works at the level of many hosts, each running a Docker daemon and a Kubernetes agent called Kubelet. There is also a Kubernetes master node running a Kubernetes daemon, which provides central management, monitoring, and provisioning of components.


There are three basic types of components in Kubernetes:

Pods

A pod is a virtual server composed of one or more Docker containers, which act like processes in this virtual server. Each pod gets a newly allocated IP address, hostname, port space, and process space, all of which are shared by the Docker containers of that pod (they can even communicate via SystemV IPC or POSIX message queues).


Services

A service is a front-end portal to a set of pods providing the same service. Each service gets a newly allocated IP, where it listens on a specific port and tunnels established connections to backend pods in round-robin fashion.


Replication controllers

These are agents that monitor pods. Each agent ensures that a specified number of instances of its monitored pod is available at any one time. If there are more, it will randomly delete some; if there are fewer, it will create new ones.
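
The decision a replication controller makes boils down to comparing a desired count with an actual count. The toy function below models only that decision; it is not how Kubernetes implements it, and reconcile is a name invented for this illustration.

```shell
# Toy model of a replication controller's decision; not Kubernetes code.
# reconcile DESIRED ACTUAL prints the action needed to reach DESIRED pods.
reconcile() {
  if [ "$2" -lt "$1" ]; then
    echo "create $(( $1 - $2 ))"
  elif [ "$2" -gt "$1" ]; then
    echo "delete $(( $2 - $1 ))"
  else
    echo "nothing"
  fi
}

reconcile 2 1   # one of two pods died
reconcile 2 3   # one pod too many
reconcile 2 2   # steady state
```

The three calls print "create 1", "delete 1", and "nothing" respectively.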



Every action requested of Kubernetes operates on pods. While you can still interact directly with Docker containers using the docker client tool, the idea of Kubernetes is that you shouldn’t. Any operation at the Docker container level is supposed to be performed automatically by Kubernetes as necessary. If a Docker container started and monitored by Kubernetes dies, Kubernetes will create a new one.


When one pod needs to connect to another - as in our case, where Keycloak needs to connect to PostgreSQL - that should be done through a service. While individual pods come and go, constantly changing their IP addresses, the service is a more permanent component, so its IP address is more stable.


Armed with that knowledge we can now define and create a new Keycloak cluster that uses PostgreSQL.


Creating a cluster using Kubernetes


There is an example container definition file available on GitHub that makes use of the same Docker images we used in the previous article.


In this example configuration we define three services:
  • postgres-service … listens on port 5432 and tunnels to port 5432 on postgres pods
  • keycloak-http-service … listens on port 80 and tunnels to port 8080 on keycloak pods
  • keycloak-https-service … listens on port 443 and also tunnels to keycloak pods, but to port 8443


We then define two replication controllers:
  • postgres-controller … monitors postgres pods, and makes sure exactly one pod is available at any one time
  • keycloak-controller … monitors keycloak pods, and makes sure exactly two pods are available at any one time


And we define two pods:
  • postgres-pod … contains one Docker container based on the latest official ‘postgres’ image
  • keycloak-pod … contains one Docker container based on the latest jboss/keycloak-ha-postgres image


With this file we can now create and start up our whole cluster with one line:


 osc create -f - < keycloak-kube.json


We can monitor the progress of new pods coming up by first listing the pods:


$ osc get pods

POD                         IP            CONTAINER(S)                
keycloak-controller-559a8   172.17.0.12   keycloak-container
keycloak-controller-zorqg   172.17.0.13   keycloak-container
postgres-controller-exkqq   172.17.0.11   postgres-container          


(there are more columns, but I did not include them here)


What we are interested in here are the exact pod ids so we can attach to their output.
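
If you'd rather not copy the ids by hand, they can be pulled out of the listing with awk. This is a small sketch that assumes the pod id stays in the first column, as in the output above; pods_of is a helper name made up here.

```shell
# pods_of PREFIX reads `osc get pods` output on stdin and prints the ids
# (first column) of the pods whose id starts with PREFIX.
pods_of() {
  awk -v prefix="$1" 'index($1, prefix) == 1 { print $1 }'
}

# Usage against a live cluster:
#   osc get pods | pods_of keycloak-controller-
```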


We can check how PostgreSQL is doing:


 osc log -f postgres-controller-exkqq


And then make sure each of the Keycloak containers started up properly, and established a cluster:

 osc log -f keycloak-controller-559a8


In my case the first container has started up without error, and I can see the log line that tells us a cluster of two Keycloak instances has been established:


2015-04-02T09:26:18.683827888Z 09:26:18,678 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-1,shared=udp) ISPN000094: Received new cluster view: [keycloak-controller-559a8/keycloak|1] (2) [keycloak-controller-559a8/keycloak, keycloak-controller-zorqg/keycloak]
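
Rather than reading the log by eye, the member count can be extracted from that line. The sketch below relies only on the ISPN000094 message format shown above; cluster_size is an invented helper, and the usage comment assumes that osc log without -f prints the log so far and exits.

```shell
# cluster_size reads log lines on stdin and prints the member count from the
# most recent ISPN000094 "Received new cluster view" line, or nothing if
# no such line is present.
cluster_size() {
  sed -n 's/.*ISPN000094: Received new cluster view: \[[^]]*] (\([0-9][0-9]*\)).*/\1/p' |
    tail -n 1
}

# Usage against a live pod (assumes `osc log` without -f exits at end of log):
#   osc log keycloak-controller-559a8 | cluster_size
```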


When things go wrong


Let’s also check the other instance:


 osc log -f keycloak-controller-zorqg



In my case I see a problem with the second instance - there is a nasty error:


2015-04-02T09:26:37.124344660Z 09:26:37,074 ERROR [org.keycloak.connections.jpa.updater.liquibase.LiquibaseJpaUpdaterProvider] (MSC service thread 1-1) Change Set META-INF/jpa-changelog-1.1.0.Final.xml::1.1.0.Final::sthorger@redhat.com failed.  Error: Error executing SQL ALTER TABLE public.EVENT_ENTITY RENAME COLUMN TIME TO EVENT_TIME: ERROR: column "time" does not exist: liquibase.exception.DatabaseException: Error executing SQL ALTER TABLE public.EVENT_ENTITY RENAME COLUMN TIME TO EVENT_TIME: ERROR: column "time" does not exist

...


What’s going on?


It turns out that the Keycloak version we used for the Docker image at the time of writing contains a bug that appears when multiple Keycloak instances connecting to the same PostgreSQL database start up at the same time. The bug can be tracked in the project’s JIRA.


In my case all I have to do is kill the problematic instance, and Kubernetes will create a new one:

 osc delete pod keycloak-controller-zorqg

The proper handling would be for Kubernetes to detect that one pod has failed to start up properly, and kill it. But then Kubernetes would have to understand how to detect a fatal startup condition in a still running Keycloak process. As an alternative we could have Keycloak exit the JVM with an error code when it detects an improper start up. In that case Kubernetes would create another pod instance automatically.
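
Until Kubernetes or Keycloak handles this case, the manual check-and-delete can be scripted. The following is a rough sketch that looks for the specific Liquibase updater error from the log above and nothing else; startup_failed is an invented helper, and the usage comment assumes that osc log without -f prints the log so far and exits.

```shell
# startup_failed reads a Keycloak log on stdin and succeeds (exit status 0)
# when it contains an ERROR from the Liquibase JPA updater.
startup_failed() {
  grep -qF 'ERROR [org.keycloak.connections.jpa.updater.liquibase.LiquibaseJpaUpdaterProvider]'
}

# Usage against a live pod (assumes `osc log` without -f exits at end of log):
#   pod=keycloak-controller-zorqg
#   osc log "$pod" | startup_failed && osc delete pod "$pod"
```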


Kubernetes will immediately determine that it should create another keycloak-pod to bring the count back to two.


$ osc get pods

POD                         IP            CONTAINER(S)                
keycloak-controller-559a8   172.17.0.12   keycloak-container
keycloak-controller-xkq43   172.17.0.14   keycloak-container
postgres-controller-exkqq   172.17.0.11   postgres-container          


We can see another pod instance: keycloak-controller-xkq43 with a new IP address.


Let’s make sure it starts up:


 osc log -f keycloak-controller-xkq43


This time the instance starts up without errors, and we can also see that a new JGroups cluster is established:


2015-04-02T10:09:32.615783260Z 10:09:32,615 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (MSC service thread 1-1) ISPN000094: Received new cluster view: [keycloak-controller-559a8/keycloak|3] (2) [keycloak-controller-559a8/keycloak, keycloak-controller-xkq43/keycloak]


Making sure things work


We can now try to access Keycloak through each of the pods, just to make sure they respond, although the proper way to access Keycloak now is through one of the keycloak services.


In my case the following two pod URLs work (they can be accessed from the Docker host): http://172.17.0.12:8080, and http://172.17.0.14:8080


The ultimate test is to use the IP address of a keycloak service.


Let’s list the running services:


$ osc get services


NAME                    SELECTOR           IP              PORT
keycloak-http-service   name=keycloak-pod  172.30.17.192   80
keycloak-https-service  name=keycloak-pod  172.30.17.62    443
postgres-service        name=postgres-pod  172.30.17.246   5432


(there are more columns, but I did not include them here)


We can see all our services listed, along with their IP addresses. Here we’re interested in keycloak-http-service, so let’s try to access Keycloak through it from the Docker host: http://172.30.17.192


Note that if you want to access this IP address from another host (not the one hosting the Docker daemon) you would have to set up routing or port forwarding.


For example, when using boot2docker on OS X with a VirtualBox instance running the Docker daemon, I have to open a native Terminal on OS X and type:


sudo route -n add 172.30.0.0/16 $DOCKER_IP
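
On a Linux client the same route can be added with ip route from iproute2. The helper below only prints the command for a given OS so you can inspect it before running it; route_cmd is a made-up name, and the subnet is the one our service IPs fall into.

```shell
# route_cmd OS GATEWAY prints (does not run) the command that routes the
# service subnet 172.30.0.0/16 through GATEWAY on the given OS.
route_cmd() {
  case $1 in
    Darwin) echo "sudo route -n add 172.30.0.0/16 $2" ;;
    Linux)  echo "sudo ip route add 172.30.0.0/16 via $2" ;;
    *)      echo "unsupported OS: $1" >&2; return 1 ;;
  esac
}

# Usage:
#   route_cmd "$(uname -s)" "$DOCKER_IP"
```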


When the browser establishes a TCP connection to port 80 of the Keycloak service’s IP address, a tunneling proxy there creates another connection to one of the Keycloak pods (chosen in round-robin fashion) and tunnels all the traffic through to it. Each Keycloak instance will therefore see the service IP as the client IP. Also, during our browser session many connections will be established - roughly half of them will be tunneled to one pod, and the rest to the other. Since we have set up Keycloak in clustered mode, it doesn’t matter which pod a request hits - both use the same distributed cache and consequently generate the same response, without a need for sticky sessions.
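
To make the round-robin idea concrete, here is a toy model of how successive connections are spread over the two pods. It is only an illustration, not the actual proxy logic; pod_for and PODS are names invented for it, and the addresses are the pod IPs from the listing earlier.

```shell
# Toy round-robin model; not the real service proxy.
PODS="172.17.0.12 172.17.0.14"

# pod_for N prints the pod that the Nth connection (counting from 0) lands on.
pod_for() {
  n=$1
  set -- $PODS              # positional parameters now hold the pod IPs
  shift $(( n % $# ))       # skip ahead to the pod whose turn it is
  echo "$1"
}

for conn in 0 1 2 3; do
  echo "connection $conn -> $(pod_for "$conn"):8080"
done
```

Connections 0 and 2 land on 172.17.0.12, and connections 1 and 3 on 172.17.0.14.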


Conclusion


We used an example Kubernetes configuration to start up PostgreSQL and a cluster of two Keycloak instances on OpenShift 3, using the same DockerHub images as in the previous article, where we created the same kind of cluster using Docker directly.

For a production-quality, scalable cloud deployment we still need to provide a monitoring mechanism that detects when a Keycloak instance isn't operational.

Also worth noting is that the clustered Keycloak setup used in our image requires multicast, and will only work when Keycloak pods are deployed on the same Docker host - the same Kubernetes worker node. Multicast is generally not available in production cloud environments, and the fact that it works here is a side effect of the current implementation of OpenShift 3, which may change in the future. For a more proper cloud setup, a Kubernetes-aware direct TCP discovery mechanism should be configured on JGroups. One candidate solution for that is the kubeping project.

Finally, for real high availability we should also make sure the database is highly available. In this example we used PostgreSQL, which can be made highly available in multiple ways, with different tradeoffs between data consistency and performance. Maybe a topic for another post.