Db failed to initialize or resync

@sgotti

While starting the keeper I am getting the error below:

db failed to initialize or resync

Before running the keeper, the following commands were issued:

  1. ./bin/stolonctl --cluster-name stolon-cluster --store-backend=etcdv3 init -> No output
  2. ./bin/stolon-sentinel --cluster-name stolon-cluster --store-backend=etcdv3

I am running this on a Multipass Ubuntu 18.04 VM.

Thanks in advance.
Raghav

@Raghavsalotra Please provide the keeper logs

@sgotti here you go.

ubuntu@node1:~/stolon-v0.16.0-linux-amd64$ sudo ./bin/stolon-keeper --cluster-name stolon-cluster --store-backend=etcdv3 --uid postgres0 --data-dir data/postgres0 --pg-su-password=admin@123 --pg-repl-username=root --pg-bin-path /usr/lib/postgresql/10/bin/  --pg-repl-password=admin@123 --pg-listen-address=localhost

2020-03-19T11:39:56.262+0100 WARN cmd/keeper.go:1929 provided --pg-listen-address "localhost": is not an ip address but a hostname. This will be advertized to the other components and may have undefined behaviors if resolved differently by other hosts
2020-03-19T11:39:56.293+0100 WARN cmd/keeper.go:1937 provided --pg-listen-address "localhost" is a loopback ip. This will be advertized to the other components and communication will fail if they are on different hosts
2020-03-19T11:39:56.295+0100 WARN cmd/keeper.go:2008 superuser name and replication user name are the same. Different users are suggested.
2020-03-19T11:39:56.298+0100 INFO cmd/keeper.go:2039 exclusive lock on data dir taken
2020-03-19T11:39:56.302+0100 INFO cmd/keeper.go:525 keeper uid {"uid": "postgres0"}
2020-03-19T11:39:56.425+0100 INFO cmd/keeper.go:1047 our db boot UID is different than the cluster data one, waiting for it to be updated {"bootUUID": "15bd2811-e3d1-4a6b-8b2b-4537aad371af", "clusterBootUUID": "92f70be6-1eed-4596-9fc2-5a0cc2823beb"}
2020-03-19T11:39:56.438+0100 ERROR cmd/keeper.go:1049 failed to stop pg instance {"error": "cannot get instance state: exit status 1"}

2020-03-19T11:40:01.528+0100 ERROR cmd/keeper.go:1063 db failed to initialize or resync
2020-03-19T11:40:01.530+0100 ERROR cmd/keeper.go:1066 failed to stop pg instance {"error": "cannot get instance state: exit status 1"}
2020-03-19T11:40:06.547+0100 ERROR cmd/keeper.go:1063 db failed to initialize or resync
2020-03-19T11:40:06.590+0100 ERROR cmd/keeper.go:1066 failed to stop pg instance {"error": "cannot get instance state: exit status 1"}
2020-03-19T11:40:11.628+0100 ERROR cmd/keeper.go:1063 db failed to initialize or resync
2020-03-19T11:40:11.655+0100 ERROR cmd/keeper.go:1066 failed to stop pg instance {"error": "cannot get instance state: exit status 1"}

2020-03-19T11:40:16.672+0100 ERROR cmd/keeper.go:1063 db failed to initialize or resync
2020-03-19T11:40:16.680+0100 ERROR cmd/keeper.go:1066 failed to stop pg instance {"error": "cannot get instance state: exit status 1"}
^C2020-03-19T11:40:17.143+0100 ERROR cmd/keeper.go:796 failed to stop pg instance {"error": "cannot get instance state: exit status 1"}

The various postgres commands are failing without an error message. You should verify that the provided pg bin path is correct and try to execute them by hand to see if they work.
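
As a rough sketch of that check, the shell function below reports which commands are missing or not executable in a given directory. The helper name `check_pg_bin` is hypothetical, and the list of binaries is an assumption about what is needed, not something taken from the stolon source.

```shell
# check_pg_bin DIR: report any expected postgres binary that is missing
# or not executable in DIR. The list of binaries below is an assumption.
check_pg_bin() {
    dir=$1
    missing=0
    for bin in initdb pg_ctl postgres pg_basebackup pg_rewind pg_controldata; do
        if [ ! -x "$dir/$bin" ]; then
            echo "missing or not executable: $dir/$bin"
            missing=1
        fi
    done
    return "$missing"
}
```

It could be called as `check_pg_bin /usr/lib/postgresql/10/bin/ && echo "all binaries found"`. A clean result only shows that the files are there; each command still has to be run by hand, as the same user that runs the keeper, to see whether it actually works.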

The provided pg bin path is correct.

Which commands need to be run by hand?
The first command we used to init the db did not print anything on the console; it just completed. Are there other steps that need to be run?

And when you say various commands, can you be more specific?

initdb should print multiple lines of log. You should check its exit code.

Any clue why it does not print anything? Where do I check the exit code? Below is the output I got from the init command.

ubuntu@node1:~/stolon-v0.16.0-linux-amd64$ ./bin/stolonctl --cluster-name stolon-cluster --store-backend=etcdv3 init
WARNING: The current cluster data will be removed
WARNING: The databases managed by the keepers will be overwritten depending on the provided cluster spec.
Are you sure you want to continue? [yes/no] yes
ubuntu@node1:~/stolon-v0.16.0-linux-amd64$
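
On the exit code question: in a shell, the exit status of the last command is available in `$?` right after it returns. A minimal sketch, assuming initdb is run by hand as a non-root user (initdb refuses to run as root) against a scratch directory. The helper name `run_and_report` and the `/tmp/initdb-test` path are only illustrative.

```shell
# run_and_report CMD [ARGS...]: run a command, then print its exit status.
run_and_report() {
    status=0
    "$@" || status=$?
    echo "exit status: $status"
    return "$status"
}

# Example (commented out so the snippet has no side effects):
# run_and_report /usr/lib/postgresql/10/bin/initdb -D /tmp/initdb-test
```

An exit status of 0 means the command succeeded; anything else means it failed, even if nothing was printed.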

Forget stolon for a second and check manually that your postgres installation works. Check that the path you’re using contains the right files (ls output) and that initdb works. What distribution are you using? How did you install postgres?

I am using Ubuntu 18.04 and Postgres 10.3.
The db is working fine: I am able to log in and create tables.
Below are the contents of pg-bin-path:

    ubuntu@node1:~/stolon-v0.16.0-linux-amd64$ ls /usr/lib/postgresql/10/bin/
    clusterdb  createdb  createuser  dropdb  dropuser  initdb  oid2name  pg_archivecleanup  pg_basebackup  pg_controldata  pg_ctl  pg_dump  pg_dumpall  pg_isready  pg_receivewal  pg_recvlogical  pg_resetwal  pg_restore  pg_rewind  pg_standby  pg_test_fsync  pg_test_timing  pg_upgrade  pg_waldump  pgbench  postgres  postmaster  psql  reindexdb  vacuumdb  vacuumlo

@sgotti

I was able to get past that error, which was caused by a permission issue. Now I get the following:
Success. You can now start the database server using:

/usr/lib/postgresql/10/bin/pg_ctl -D /home/ubuntu/data/postgres0/postgres -l logfile start

2020-03-19T17:07:25.626+0100	INFO	postgresql/postgresql.go:319	starting database
2020-03-19 17:07:25.651 CET [31246] LOG:  could not bind IPv4 address "127.0.0.1": Address already in use
2020-03-19 17:07:25.651 CET [31246] HINT:  Is another postmaster already running on port 5432? If not, wait a few seconds and retry.
2020-03-19 17:07:25.652 CET [31246] WARNING:  could not create listen socket for "127.0.0.1"
2020-03-19 17:07:25.652 CET [31246] FATAL:  could not create any TCP/IP sockets
2020-03-19 17:07:25.653 CET [31246] LOG:  database system is shut down
2020-03-19T17:07:25.863+0100	ERROR	cmd/keeper.go:1140	failed to start instance	{"error": "postgres exited unexpectedly"}

Note:
The db is running on the machine and listening on 127.0.0.1. When it says "postgres exited unexpectedly", the db is actually still running in the background.
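
The log above shows the instance started by the keeper failing to bind 127.0.0.1 on port 5432, which is consistent with the system postgres already holding that port. As a hedged sketch of how one could confirm this, the hypothetical helper `port_listening` below scans `ss -ltn` style output on stdin for a listener on a given port. It assumes the local address is the fourth column, as in the default `ss` output.

```shell
# port_listening PORT: read `ss -ltn` style output on stdin and succeed
# if some listening local address ends in :PORT.
# Assumes column 1 is the state and column 4 is "Local Address:Port".
port_listening() {
    awk -v port="$1" '
        $1 == "LISTEN" && $4 ~ (":" port "$") { found = 1 }
        END { exit found ? 0 : 1 }
    '
}
```

For example, `ss -ltn | port_listening 5432 && echo "port 5432 is already in use"`. If it is in use, the options would be to stop the system postgres service or to start the keeper's instance on a different port. I believe stolon-keeper has a `--pg-port` option for that, but check `stolon-keeper --help` to confirm.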