Stolon on baremetal with consul - can't launch first keeper

Architecture:

  • node1 -> Consul Server [192.168.100.104]
  • sql1 -> Consul Agent + Postgres14 [192.168.100.101]
  • sql2 -> Consul Agent + Postgres14 [192.168.100.102]

Consul v1.6.9
Debian 11
Stolon v0.17.0

My steps:

  1. I installed Postgres14 on sql2 server and I created a custom database with some data.

  2. I ran following commands in psql:

    ALTER USER postgres WITH PASSWORD 'password';
    CREATE USER repluser WITH PASSWORD 'password';
    ALTER USER repluser REPLICATION;
    
  3. Successfully initialized cluster with:

    stolonctl init --cluster-name sql --store-backend consul
    
  4. I ran sentinel:

    stolon-sentinel --cluster-name sql --store-backend consul
    
    output - click to expand
    2022-03-23T15:08:26.164+0100    INFO    cmd/sentinel.go:2000    sentinel uid   {"uid": "0cd099f8"}
    2022-03-23T15:08:26.166+0100    INFO    cmd/sentinel.go:82      Trying to acquire sentinels leadership
    2022-03-23T15:08:26.190+0100    INFO    cmd/sentinel.go:89      sentinel leadership acquired
    2022-03-23T15:08:26.209+0100    INFO    cmd/sentinel.go:837     waiting for db {"db": "d27faa20", "keeper": "sql1"}
    2022-03-23T15:08:31.239+0100    WARN    cmd/sentinel.go:276     no keeper info available        {"db": "d27faa20", "keeper": "sql1"}
    2022-03-23T15:08:31.243+0100    INFO    cmd/sentinel.go:837     waiting for db  {"db": "d27faa20", "keeper": "sql1"}
    2022-03-23T15:08:36.315+0100    WARN    cmd/sentinel.go:276     no keeper info available        {"db": "d27faa20", "keeper": "sql1"}
    
  5. And keeper (postgres service stopped):

    stolon-keeper --cluster-name sql \
        --store-backend=consul --uid sql1 \
        --data-dir /var/lib/postgresql/14/main
        --pg-su-username=postgres
        --pg-su-password=password
        --pg-repl-username=repluser
        --pg-repl-password=password
        --pg-listen-address=192.168.100.101
        --pg-bin-path /usr/lib/postgresql/14/bin
    
    output errors - click to expand
    2022-03-23T15:16:00.443+0100    INFO    cmd/keeper.go:2091      exclusive lock on data dir taken
    2022-03-23T15:16:00.444+0100    INFO    cmd/keeper.go:569       keeper uid      {"uid": "sql1"}
    2022-03-23T15:16:00.447+0100    DEBUG   cmd/keeper.go:822       cd dump: (*cluster.ClusterData)(0xc000318a40)({
    FormatVersion: (uint64) 1,
    ChangeTime: (time.Time) 2022-03-23 15:15:55.977848705 +0100 CET,
    Cluster: (*cluster.Cluster)(0xc000366240)({
    UID: (string) (len=8) "645cc58f",
    Generation: (int64) 1,
    ChangeTime: (time.Time) 2022-03-23 15:13:34.622224551 +0100 CET,
    Spec: (*cluster.ClusterSpec)(0xc00038f440)({
    SleepInterval: (*cluster.Duration)(<nil>),
    RequestTimeout: (*cluster.Duration)(<nil>),
    ConvergenceTimeout: (*cluster.Duration)(<nil>),
    InitTimeout: (*cluster.Duration)(<nil>),
    SyncTimeout: (*cluster.Duration)(<nil>),
    DBWaitReadyTimeout: (*cluster.Duration)(<nil>),
    FailInterval: (*cluster.Duration)(<nil>),
    DeadKeeperRemovalInterval: (*cluster.Duration)(<nil>),
    ProxyCheckInterval: (*cluster.Duration)(<nil>),
    ProxyTimeout: (*cluster.Duration)(<nil>),
    MaxStandbys: (*uint16)(<nil>),
    MaxStandbysPerSender: (*uint16)(<nil>),
    MaxStandbyLag: (*uint32)(<nil>),
    SynchronousReplication: (*bool)(<nil>),
    MinSynchronousStandbys: (*uint16)(<nil>),
    MaxSynchronousStandbys: (*uint16)(<nil>),
    AdditionalWalSenders: (*uint16)(<nil>),
    AdditionalMasterReplicationSlots: ([]string) <nil>,
    UsePgrewind: (*bool)(<nil>),
    InitMode: (*cluster.ClusterInitMode)(0xc00033f830)((len=3) "new"),
    MergePgParameters: (*bool)(<nil>),
    Role: (*cluster.ClusterRole)(<nil>),
    NewConfig: (*cluster.NewConfig)(<nil>),
    PITRConfig: (*cluster.PITRConfig)(<nil>),
    ExistingConfig: (*cluster.ExistingConfig)(<nil>),
    StandbyConfig: (*cluster.StandbyConfig)(<nil>),
    DefaultSUReplAccessMode: (*cluster.SUReplAccessMode)(<nil>),
    PGParameters: (cluster.PGParameters) <nil>,
    PGHBA: ([]string) <nil>,
    AutomaticPgRestart: (*bool)(<nil>)
    }),
    Status: (cluster.ClusterStatus) {
    CurrentGeneration: (int64) 0,
    Phase: (cluster.ClusterPhase) (len=12) "initializing",
    Master: (string) (len=8) "49c40051"
    }
    }),
    Keepers: (cluster.Keepers) (len=1) {
    (string) (len=4) "sql1": (*cluster.Keeper)(0xc0003a07e0)({
    UID: (string) (len=4) "sql1",
    Generation: (int64) 1,
    ChangeTime: (time.Time) 2022-03-23 15:15:55.977864513 +0100 CET,
    Spec: (*cluster.KeeperSpec)(0x23eeb88)({
    }),
    Status: (cluster.KeeperStatus) {
        Healthy: (bool) false,
        LastHealthyTime: (time.Time) 2022-03-23 15:15:50.913820346 +0100 CET,
        BootUUID: (string) (len=36) "2f72e54a-9402-42a5-bbdd-36dbaeb7d52b",
        PostgresBinaryVersion: (cluster.PostgresBinaryVersion) {
        Maj: (int) 14,
        Min: (int) 2
        },
        ForceFail: (bool) false,
        CanBeMaster: (*bool)(0xc0003a624e)(true),
        CanBeSynchronousReplica: (*bool)(0xc0003a624f)(true)
    }
    })
    },
    DBs: (cluster.DBs) (len=1) {
    (string) (len=8) "49c40051": (*cluster.DB)(0xc00034e1e0)({
    UID: (string) (len=8) "49c40051",
    Generation: (int64) 1,
    ChangeTime: (time.Time) 2022-03-23 15:13:34.622225289 +0100 CET,
    Spec: (*cluster.DBSpec)(0xc000164460)({
        KeeperUID: (string) (len=4) "sql1",
        RequestTimeout: (cluster.Duration) 10s,
        MaxStandbys: (uint16) 20,
        SynchronousReplication: (bool) false,
        UsePgrewind: (bool) false,
        AdditionalWalSenders: (uint16) 5,
        AdditionalReplicationSlots: ([]string) <nil>,
        InitMode: (cluster.DBInitMode) (len=3) "new",
        NewConfig: (*cluster.NewConfig)(<nil>),
        PITRConfig: (*cluster.PITRConfig)(<nil>),
        PGParameters: (cluster.PGParameters) <nil>,
        PGHBA: ([]string) <nil>,
        Role: (common.Role) (len=6) "master",
        FollowConfig: (*cluster.FollowConfig)(<nil>),
        Followers: ([]string) {
        },
        IncludeConfig: (bool) true,
        SynchronousStandbys: ([]string) <nil>,
        ExternalSynchronousStandbys: ([]string) <nil>
    }),
    Status: (cluster.DBStatus) {
        Healthy: (bool) false,
        CurrentGeneration: (int64) 0,
        ListenAddress: (string) "",
        Port: (string) "",
        SystemID: (string) "",
        TimelineID: (uint64) 0,
        XLogPos: (uint64) 0,
        TimelinesHistory: (cluster.PostgresTimelinesHistory) <nil>,
        PGParameters: (cluster.PGParameters) <nil>,
        CurSynchronousStandbys: ([]string) <nil>,
        SynchronousStandbys: ([]string) <nil>,
        OlderWalFile: (string) ""
    }
    })
    },
    Proxy: (*cluster.Proxy)(0xc0003662a0)({
    UID: (string) "",
    Generation: (int64) 0,
    ChangeTime: (time.Time) 0001-01-01 00:00:00 +0000 UTC,
    Spec: (cluster.ProxySpec) {
    MasterDBUID: (string) "",
    EnabledProxies: ([]string) <nil>
    },
    Status: (cluster.ProxyStatus) {
    }
    })
    })
    
    2022-03-23T15:16:00.457+0100    DEBUG   postgresql/postgresql.go:625    execing cmd     {"cmd": "/usr/lib/postgresql/14/bin/postgres -V"}
    2022-03-23T15:16:00.459+0100    DEBUG   cmd/keeper.go:1045      cd dump: (*cluster.ClusterData)(0xc00007ce40)({
    FormatVersion: (uint64) 1,
    ChangeTime: (time.Time) 2022-03-23 15:15:55.977848705 +0100 CET,
    Cluster: (*cluster.Cluster)(0xc00007f6e0)({
    UID: (string) (len=8) "645cc58f",
    Generation: (int64) 1,
    ChangeTime: (time.Time) 2022-03-23 15:13:34.622224551 +0100 CET,
    Spec: (*cluster.ClusterSpec)(0xc00003d7a0)({
    SleepInterval: (*cluster.Duration)(<nil>),
    RequestTimeout: (*cluster.Duration)(<nil>),
    ConvergenceTimeout: (*cluster.Duration)(<nil>),
    InitTimeout: (*cluster.Duration)(<nil>),
    SyncTimeout: (*cluster.Duration)(<nil>),
    DBWaitReadyTimeout: (*cluster.Duration)(<nil>),
    FailInterval: (*cluster.Duration)(<nil>),
    DeadKeeperRemovalInterval: (*cluster.Duration)(<nil>),
    ProxyCheckInterval: (*cluster.Duration)(<nil>),
    ProxyTimeout: (*cluster.Duration)(<nil>),
    MaxStandbys: (*uint16)(<nil>),
    MaxStandbysPerSender: (*uint16)(<nil>),
    MaxStandbyLag: (*uint32)(<nil>),
    SynchronousReplication: (*bool)(<nil>),
    MinSynchronousStandbys: (*uint16)(<nil>),
    MaxSynchronousStandbys: (*uint16)(<nil>),
    AdditionalWalSenders: (*uint16)(<nil>),
    AdditionalMasterReplicationSlots: ([]string) <nil>,
    UsePgrewind: (*bool)(<nil>),
    InitMode: (*cluster.ClusterInitMode)(0xc00004b5f0)((len=3) "new"),
    MergePgParameters: (*bool)(<nil>),
    Role: (*cluster.ClusterRole)(<nil>),
    NewConfig: (*cluster.NewConfig)(<nil>),
    PITRConfig: (*cluster.PITRConfig)(<nil>),
    ExistingConfig: (*cluster.ExistingConfig)(<nil>),
    StandbyConfig: (*cluster.StandbyConfig)(<nil>),
    DefaultSUReplAccessMode: (*cluster.SUReplAccessMode)(<nil>),
    PGParameters: (cluster.PGParameters) <nil>,
    PGHBA: ([]string) <nil>,
    AutomaticPgRestart: (*bool)(<nil>)
    }),
    Status: (cluster.ClusterStatus) {
    CurrentGeneration: (int64) 0,
    Phase: (cluster.ClusterPhase) (len=12) "initializing",
    Master: (string) (len=8) "49c40051"
    }
    }),
    Keepers: (cluster.Keepers) (len=1) {
    (string) (len=4) "sql1": (*cluster.Keeper)(0xc0001966c0)({
    UID: (string) (len=4) "sql1",
    Generation: (int64) 1,
    ChangeTime: (time.Time) 2022-03-23 15:15:55.977864513 +0100 CET,
    Spec: (*cluster.KeeperSpec)(0x23eeb88)({
    }),
    Status: (cluster.KeeperStatus) {
        Healthy: (bool) false,
        LastHealthyTime: (time.Time) 2022-03-23 15:15:50.913820346 +0100 CET,
        BootUUID: (string) (len=36) "2f72e54a-9402-42a5-bbdd-36dbaeb7d52b",
        PostgresBinaryVersion: (cluster.PostgresBinaryVersion) {
        Maj: (int) 14,
        Min: (int) 2
        },
        ForceFail: (bool) false,
        CanBeMaster: (*bool)(0xc000040f4e)(true),
        CanBeSynchronousReplica: (*bool)(0xc000040f4f)(true)
    }
    })
    },
    DBs: (cluster.DBs) (len=1) {
    (string) (len=8) "49c40051": (*cluster.DB)(0xc0001d2000)({
    UID: (string) (len=8) "49c40051",
    Generation: (int64) 1,
    ChangeTime: (time.Time) 2022-03-23 15:13:34.622225289 +0100 CET,
    Spec: (*cluster.DBSpec)(0xc0001d4000)({
        KeeperUID: (string) (len=4) "sql1",
        RequestTimeout: (cluster.Duration) 10s,
        MaxStandbys: (uint16) 20,
        SynchronousReplication: (bool) false,
        UsePgrewind: (bool) false,
        AdditionalWalSenders: (uint16) 5,
        AdditionalReplicationSlots: ([]string) <nil>,
        InitMode: (cluster.DBInitMode) (len=3) "new",
        NewConfig: (*cluster.NewConfig)(<nil>),
        PITRConfig: (*cluster.PITRConfig)(<nil>),
        PGParameters: (cluster.PGParameters) <nil>,
        PGHBA: ([]string) <nil>,
        Role: (common.Role) (len=6) "master",
        FollowConfig: (*cluster.FollowConfig)(<nil>),
        Followers: ([]string) {
        },
        IncludeConfig: (bool) true,
        SynchronousStandbys: ([]string) <nil>,
        ExternalSynchronousStandbys: ([]string) <nil>
    }),
    Status: (cluster.DBStatus) {
        Healthy: (bool) false,
        CurrentGeneration: (int64) 0,
        ListenAddress: (string) "",
        Port: (string) "",
        SystemID: (string) "",
        TimelineID: (uint64) 0,
        XLogPos: (uint64) 0,
        TimelinesHistory: (cluster.PostgresTimelinesHistory) <nil>,
        PGParameters: (cluster.PGParameters) <nil>,
        CurSynchronousStandbys: ([]string) <nil>,
        SynchronousStandbys: ([]string) <nil>,
        OlderWalFile: (string) ""
    }
    })
    },
    Proxy: (*cluster.Proxy)(0xc00007f740)({
    UID: (string) "",
    Generation: (int64) 0,
    ChangeTime: (time.Time) 0001-01-01 00:00:00 +0000 UTC,
    Spec: (cluster.ProxySpec) {
    MasterDBUID: (string) "",
    EnabledProxies: ([]string) <nil>
    },
    Status: (cluster.ProxyStatus) {
    }
    })
    })
    
    2022-03-23T15:16:00.459+0100    INFO    cmd/keeper.go:1094      our db boot UID is different than the cluster data one, waiting for it to be updated    {"bootUUID": "c8626456-eed1-429d-81a6-492fd0c6aa03", "clusterBootUUID": "2f72e54a-9402-42a5-bbdd-36dbaeb7d52b"}
    2022-03-23T15:16:00.460+0100    ERROR   cmd/keeper.go:1096      failed to stop pg instance      {"error": "cannot get instance state: exit status 1"}
    2022-03-23T15:16:00.469+0100    DEBUG   cmd/keeper.go:799       pgstate dump: (*cluster.PostgresState)(0xc00023a6e0)({
    UID: (string) (len=8) "a0cb6ea7",
    Generation: (int64) 0,
    ListenAddress: (string) (len=15) "192.168.100.101",
    Port: (string) (len=4) "5432",
    Healthy: (bool) false,
    SystemID: (string) "",
    TimelineID: (uint64) 0,
    XLogPos: (uint64) 0,
    TimelinesHistory: (cluster.PostgresTimelinesHistory) <nil>,
    PGParameters: (common.Parameters) <nil>,
    SynchronousStandbys: ([]string) <nil>,
    OlderWalFile: (string) ""
    })
    
    ^C2022-03-23T15:16:04.467+0100  DEBUG   cmd/keeper.go:1894      got signal      {"signal": "interrupt"}
    2022-03-23T15:16:04.467+0100    DEBUG   cmd/keeper.go:841       stopping stolon keeper
    2022-03-23T15:16:04.468+0100    ERROR   cmd/keeper.go:843       failed to stop pg instance      {"error": "cannot get instance state: exit status 1"}
    
  6. Also I tried with started Postgresql service, but without luck.

  7. Configs:

    • postgresql.conf:
      listen_addresses = '*'
      password_encryption = md5
      
    • pg_hba.conf:
       local   all             postgres                                peer
       local   all             all                                     peer
       host    all             all             127.0.0.1/32            md5
       host    all             all             192.168.100.0/24        md5
       host    all             all             ::1/128                 md5
       host    all             all             127.0.0.1/32            scram-sha-256
       host    all             all             192.168.100.0/24        scram-sha-256
       host    all             all             10.0.0.0/16             scram-sha-256
       host    all             all             ::1/128                 scram-sha-256
       host    replication     all             127.0.0.1/32            md5
       host    replication     all             ::1/128                 md5
       host    replication     all             127.0.0.1/32            scram-sha-256
       host    replication     all             ::1/128                 scram-sha-256
      `
      
      

Am I doing something wrong?
Thanks for your help!