Kubernetes store timeouts

landor · April 20, 2020, 10:24am

Hi.

We are having problems with the Kubernetes store. We experience random store timeouts, and therefore DB disconnects, in our cluster. It happens under load but also at random times during the night without any load.
Do you have any idea what could cause the disconnects when there is no load?

E0420 09:39:49.035064       1 leaderelection.go:331] error retrieving resource lock bms-databases/stolon-cluster-bms-postgres-stolon: Get https://10.43.0.1:443/api/v1/namespaces/bms-databases/configmaps/stolon-cluster-bms-postgres-stolon?timeout=5s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)

I saw the new --store-timeout parameter in v0.16. I’ll try it out as soon as the parameter is available in the helm chart.

Thanks

sgotti · April 20, 2020, 11:27am

@landor That means that your k8s API is responding slowly. Increasing the store timeout isn’t a real fix, since the proxies will time out anyway after the proxy timeout interval, so you’d have to increase that as well. My suggestion is to use a different, dedicated store like etcd. See also the doc listing all the other downsides of using the k8s API:

https://github.com/sorintlab/stolon/blob/master/doc/architecture.md#kubernetes-store-backend

## Stolon Architecture and Requirements

### Components

Stolon is composed of 3 main components

* keeper: it manages a PostgreSQL instance converging to the clusterview provided by the sentinel(s).
* sentinel: it discovers and monitors keepers and calculates the optimal clusterview.
* proxy: the client's access point. It enforces connections to the right PostgreSQL master and forcibly closes connections to old masters.

![Stolon architecture](architecture_small.png)

### Requirements

#### Keepers

Every keeper MUST have a different UID that can be manually provided (`--uid` option) or will be generated. After the first start the keeper id (provided or generated) is saved inside the keeper data directory.

Every keeper MUST have a persistent data directory (no ephemeral volumes like k8s `emptyDir`) or you'll lose your data if all the keepers are stopped at the same time (since at restart no valid standby to fail over to will be available).

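One way to check whether the API server really is responding slowly is to repeat the same kind of request that shows up in the log, using the same 5s timeout, and record how long each attempt takes. The following is only a rough sketch, not something taken from stolon: the `probe`, `summarize`, and `fetch_configmap` helpers are made-up names, and it assumes the script runs inside a pod whose service account is allowed to read the configmap (otherwise every request is counted as a failure, even if the API answers quickly).

```python
import ssl
import time
import urllib.request

# Standard in-cluster service account locations. Assumption: the script
# runs inside a pod with a mounted service account.
TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"
CA_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"


def fetch_configmap(url, timeout):
    """Issue one authenticated GET against the API server."""
    with open(TOKEN_PATH) as f:
        token = f.read().strip()
    ctx = ssl.create_default_context(cafile=CA_PATH)
    req = urllib.request.Request(
        url, headers={"Authorization": "Bearer " + token}
    )
    with urllib.request.urlopen(req, timeout=timeout, context=ctx) as resp:
        resp.read()


def probe(url, attempts=10, timeout=5.0, pause=1.0, fetch=fetch_configmap):
    """Return a list of (seconds, ok) tuples, one per request."""
    results = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            fetch(url, timeout)
            ok = True
        except Exception:  # timeouts, connection errors, HTTP errors
            ok = False
        results.append((time.monotonic() - start, ok))
        time.sleep(pause)
    return results


def summarize(results):
    """Count failed requests and report the slowest one."""
    failures = sum(1 for _, ok in results if not ok)
    slowest = max((secs for secs, _ in results), default=0.0)
    return {"requests": len(results), "failures": failures, "slowest_s": slowest}


# Example call, with the URL taken from the log line in the first post:
# url = ("https://10.43.0.1:443/api/v1/namespaces/bms-databases/"
#        "configmaps/stolon-cluster-bms-postgres-stolon")
# print(summarize(probe(url)))
```

Leaving something like this running overnight and comparing the slowest request time with the 5s timeout should show whether the API is slow even when the cluster is idle.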

landor · April 22, 2020, 11:44am

@sgotti Thanks for the info.
We are considering moving the kubernetes etcd storage out of the main cluster where our workloads run.
What do you think about that?
Will that also fix our problem, or do you think we still need a dedicated etcd store for our stolon instances?
Thank you