Skip to content
This repository was archived by the owner on Jan 30, 2020. It is now read-only.
This repository was archived by the owner on Jan 30, 2020. It is now read-only.

Container not restarted on remaining nodes after reboot #1621

@jeanfabrice

Description

@jeanfabrice

Hi,

I'm trying to have my containers survive a node failure, being rescheduled and restarted on surviving nodes in a 3-nodes Coreos beta channel cluster (1068.3.0).

I'm facing the following issue when shutting down a node member: Some containers get randomly restarted, some others don't. According to the fleet log, the root cause seems to be that the corresponding sidekick unit is not available on disk when Fleet decides to start a service unit.

Here is a unit and its sidekick counterpart (I'm using -announce to suffix the sidekick unit name) :

core@coreos41 ~ $ fleetctl cat grafana.service
[Unit]
Description=Grafana
After=docker.service
After=dockerdata.mount
After=flanneld.service
Requires=docker.service
Requires=dockerdata.mount
Requires=flanneld.service
Requires=grafana-announce.service

[Service]
EnvironmentFile=/etc/environment
KillMode=none
Restart=on-failure
RestartSec=5s
TimeoutStartSec=0
ExecStartPre=-/usr/bin/docker kill grafana
ExecStartPre=-/usr/bin/docker rm grafana
ExecStartPre=/usr/bin/docker pull grafana/grafana
ExecStart=/usr/bin/docker run    --env="GF_USERS_ALLOW_SIGN_UP=false"    --env="GF_AUTH_ANONYMOUS_ENABLED=true"    --env="GF_AUTH_ANONYMOUS_ORG_ROLE=Admin"    --env="GF_ANALYTICS_REPORTING_ENABLED=false"    --hostname=grafana    --name=grafana    --publish=${COREOS_PUBLIC_IPV4}:3000:3000    --volume=/dockerdata/grafana/var/lib/grafana:/var/lib/grafana    --volume=/dockerdata/grafana/var/lib/grafana/plugins:/var/lib/grafana/plugins    --volume=/dockerdata/grafana/etc/grafana:/etc/grafana    --volume=/dockerdata/grafana/var/log/grafana:/var/log/grafana    grafana/grafana
ExecStop=/usr/bin/docker stop grafana

core@coreos41 ~ $ fleetctl cat grafana-announce.service
[Unit]
Description=Register Grafana service
BindsTo=grafana.service
After=grafana.service

[Service]
EnvironmentFile=/etc/environment
ExecStart=/bin/sh -c "while true; do etcdctl set /skydns/home/skydns/grafana '{\"host\": \"${COREOS_PUBLIC_IPV4}\", \"port\":3000}' --ttl 60; sleep 45; done"
ExecStop=/usr/bin/etcdctl rm /skydns/home/skydns/grafana

[X-Fleet]
MachineOf=grafana.service

And here is the fleet log:

Jun 29 06:24:24 coreos41 fleetd[3547]: INFO manager.go:246: Writing systemd unit registry-announce.service (399b)
Jun 29 06:24:24 coreos41 fleetd[3547]: INFO manager.go:246: Writing systemd unit gitlab.service (823b)
Jun 29 06:24:24 coreos41 fleetd[3547]: INFO manager.go:246: Writing systemd unit zabbix-db.service (762b)
Jun 29 06:24:24 coreos41 fleetd[3547]: INFO manager.go:246: Writing systemd unit grafana.service (1061b)
Jun 29 06:24:24 coreos41 fleetd[3547]: INFO manager.go:246: Writing systemd unit registry.service (813b)
Jun 29 06:24:24 coreos41 fleetd[3547]: INFO manager.go:246: Writing systemd unit haproxy.service (737b)
Jun 29 06:24:24 coreos41 fleetd[3547]: INFO manager.go:246: Writing systemd unit gitlab-announce.service (380b)
Jun 29 06:24:24 coreos41 fleetd[3547]: INFO manager.go:182: Instructing systemd to reload units
Jun 29 06:24:24 coreos41 fleetd[3547]: ERROR manager.go:129: Failed to trigger systemd unit grafana.service start: Unit grafana-announce.service not found.
Jun 29 06:24:24 coreos41 fleetd[3547]: INFO manager.go:127: Triggered systemd unit registry.service start: job=1702
Jun 29 06:24:24 coreos41 fleetd[3547]: INFO manager.go:127: Triggered systemd unit gitlab.service start: job=1787
Jun 29 06:24:25 coreos41 fleetd[3547]: INFO manager.go:127: Triggered systemd unit zabbix-db.service start: job=1872
Jun 29 06:24:25 coreos41 fleetd[3547]: INFO manager.go:127: Triggered systemd unit haproxy.service start: job=1956
Jun 29 06:24:25 coreos41 fleetd[3547]: INFO reconcile.go:330: AgentReconciler completed task: type=LoadUnit job=registry-announce.service reason="unit scheduled here but not loaded"
Jun 29 06:24:25 coreos41 fleetd[3547]: INFO reconcile.go:330: AgentReconciler completed task: type=LoadUnit job=gitlab.service reason="unit scheduled here but not loaded"
Jun 29 06:24:25 coreos41 fleetd[3547]: INFO reconcile.go:330: AgentReconciler completed task: type=LoadUnit job=zabbix-db.service reason="unit scheduled here but not loaded"
Jun 29 06:24:25 coreos41 fleetd[3547]: INFO reconcile.go:330: AgentReconciler completed task: type=LoadUnit job=grafana.service reason="unit scheduled here but not loaded"
Jun 29 06:24:25 coreos41 fleetd[3547]: INFO reconcile.go:330: AgentReconciler completed task: type=LoadUnit job=registry.service reason="unit scheduled here but not loaded"
Jun 29 06:24:25 coreos41 fleetd[3547]: INFO reconcile.go:330: AgentReconciler completed task: type=LoadUnit job=haproxy.service reason="unit scheduled here but not loaded"
Jun 29 06:24:25 coreos41 fleetd[3547]: INFO reconcile.go:330: AgentReconciler completed task: type=LoadUnit job=gitlab-announce.service reason="unit scheduled here but not loaded"
Jun 29 06:24:25 coreos41 fleetd[3547]: INFO reconcile.go:330: AgentReconciler completed task: type=ReloadUnitFiles job=N/A reason="always reload unit files"
Jun 29 06:24:25 coreos41 fleetd[3547]: INFO reconcile.go:330: AgentReconciler completed task: type=StartUnit job=grafana.service reason="unit currently loaded but desired state is launched"
Jun 29 06:24:25 coreos41 fleetd[3547]: INFO reconcile.go:330: AgentReconciler completed task: type=StartUnit job=registry.service reason="unit currently loaded but desired state is launched"
Jun 29 06:24:25 coreos41 fleetd[3547]: INFO reconcile.go:330: AgentReconciler completed task: type=StartUnit job=gitlab.service reason="unit currently loaded but desired state is launched"
Jun 29 06:24:25 coreos41 fleetd[3547]: INFO reconcile.go:330: AgentReconciler completed task: type=StartUnit job=zabbix-db.service reason="unit currently loaded but desired state is launched"
Jun 29 06:24:25 coreos41 fleetd[3547]: INFO reconcile.go:330: AgentReconciler completed task: type=StartUnit job=haproxy.service reason="unit currently loaded but desired state is launched"
Jun 29 06:24:26 coreos41 fleetd[3547]: INFO manager.go:246: Writing systemd unit grafana-announce.service (386b)
Jun 29 06:24:26 coreos41 fleetd[3547]: INFO manager.go:246: Writing systemd unit zabbix-announce.service (380b)
Jun 29 06:24:26 coreos41 fleetd[3547]: INFO manager.go:246: Writing systemd unit zabbix.service (968b)
Jun 29 06:24:26 coreos41 fleetd[3547]: INFO manager.go:182: Instructing systemd to reload units
Jun 29 06:24:26 coreos41 fleetd[3547]: INFO manager.go:127: Triggered systemd unit zabbix.service start: job=2112
Jun 29 06:24:26 coreos41 fleetd[3547]: INFO reconcile.go:330: AgentReconciler completed task: type=LoadUnit job=grafana-announce.service reason="unit scheduled here but not loaded"
Jun 29 06:24:26 coreos41 fleetd[3547]: INFO reconcile.go:330: AgentReconciler completed task: type=LoadUnit job=zabbix-announce.service reason="unit scheduled here but not loaded"
Jun 29 06:24:26 coreos41 fleetd[3547]: INFO reconcile.go:330: AgentReconciler completed task: type=LoadUnit job=zabbix.service reason="unit scheduled here but not loaded"
Jun 29 06:24:26 coreos41 fleetd[3547]: INFO reconcile.go:330: AgentReconciler completed task: type=ReloadUnitFiles job=N/A reason="always reload unit files"
Jun 29 06:24:26 coreos41 fleetd[3547]: INFO reconcile.go:330: AgentReconciler completed task: type=StartUnit job=zabbix.service reason="unit currently loaded but desired state is launched"

I have read many issues about similar behaviours, but all are now closed and seem to be related to Fleet v0.9.1.
I'm running the following :

core@coreos41 ~ $ cat /etc/os-release 
NAME=CoreOS
ID=coreos
VERSION=1068.3.0
VERSION_ID=1068.3.0
BUILD_ID=2016-06-25-0103
PRETTY_NAME="CoreOS 1068.3.0 (MoreOS)"
ANSI_COLOR="1;32"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://github.com/coreos/bugs/issues"
core@coreos41 ~ $ fleet --version
fleetd version 0.11.7
core@coreos41 ~ $ docker --version 
Docker version 1.10.3, build 3cd164c

Are there some misconfiguration in my units ?

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions