hydra_pmi_proxy returns non-zero exit code when task is launched #2

@rstevens011

This project hasn't been updated in some time, and I am trying to get it to work with newer versions of Mesos, MPICH, and ZooKeeper. I had to make several adjustments to the mrun.py script to get it to work with the newer eggs that Mesos provides. The changes were:

diff --git a/mrun.py b/mrun.py
index fa77978..c5e92b9 100644
--- a/mrun.py
+++ b/mrun.py
@@ -1,7 +1,8 @@
-#!/usr/bin/env python
+7#!/usr/bin/env python

-import mesos
-import mesos_pb2
+import mesos.interface as mesos
+import mesos.interface.mesos_pb2 as mesos_pb2
+import mesos.scheduler

 import os
 import logging
@@ -68,7 +73,7 @@ def finalizeSlaves(callbacks):

   logging.info("Done finalizing slaves")

-class HydraScheduler(mesos.Scheduler):
+class HydraScheduler(mesos.interface.Scheduler):
@@ -250,7 +257,7 @@ if __name__ == "__main__":

   work_dir = tempfile.mkdtemp()

-  driver = mesos.MesosSchedulerDriver(
+  driver = mesos.scheduler.MesosSchedulerDriver(
     scheduler,
     framework,
     args[0])

In addition, the binaries in export/bin and the libraries in export/lib were updated to match the new versions listed below. My test environment runs ZooKeeper, the Mesos master and slave, Hadoop, and this project all on the same host. Because I installed the mesos.interface and mesos.scheduler eggs manually via easy_install, I did not need to install the version provided with this project in the Makefile. I was able to compile the hello world app and upload it to the HDFS name node. This is the result of running the mrun.py script (the debug flag is set in the mrun shell wrapper script):

[root@ip-10-206-2-108 mesos-hydra]# ./mrun -N 1 -n 1 "zk://10.206.2.108:2181/mesos" ./hello_world
INFO:root:Connecting to Mesos master zk://10.206.2.108:2181/mesos
INFO:root:Total processes 1
INFO:root:Total nodes 1
INFO:root:Procs per node 1
INFO:root:Cores per node 1
2017-02-28 11:27:53,867:2948(0x7fdc0abcc700):ZOO_INFO@log_env@726: Client environment:zookeeper.version=zookeeper C client 3.4.8
2017-02-28 11:27:53,867:2948(0x7fdc0abcc700):ZOO_INFO@log_env@730: Client environment:host.name=ip-10-206-2-108.ec2.internal
2017-02-28 11:27:53,867:2948(0x7fdc0abcc700):ZOO_INFO@log_env@737: Client environment:os.name=Linux
2017-02-28 11:27:53,868:2948(0x7fdc0abcc700):ZOO_INFO@log_env@738: Client environment:os.arch=3.10.0-514.6.2.el7.x86_64
2017-02-28 11:27:53,868:2948(0x7fdc0abcc700):ZOO_INFO@log_env@739: Client environment:os.version=#1 SMP Fri Feb 17 19:21:31 EST 2017
2017-02-28 11:27:53,868:2948(0x7fdc0abcc700):ZOO_INFO@log_env@747: Client environment:user.name=ec2-user
2017-02-28 11:27:53,868:2948(0x7fdc0abcc700):ZOO_INFO@log_env@755: Client environment:user.home=/root
2017-02-28 11:27:53,868:2948(0x7fdc0abcc700):ZOO_INFO@log_env@767: Client environment:user.dir=/home/hadoop/mesos-hydra
2017-02-28 11:27:53,868:2948(0x7fdc0abcc700):ZOO_INFO@zookeeper_init@800: Initiating client connection, host=10.206.2.108:2181 sessionTimeout=10000 watcher=0x7fdc1646cb2a sessionId=0 sessionPasswd=<null> context=0x7fdbf8000c20 flags=0
I0228 11:27:53.868440  2948 sched.cpp:226] Version: 1.1.0
2017-02-28 11:27:53,869:2948(0x7fdc091b6700):ZOO_INFO@check_events@1728: initiated connection to server [10.206.2.108:2181]
2017-02-28 11:27:53,871:2948(0x7fdc091b6700):ZOO_INFO@check_events@1775: session establishment complete on server [10.206.2.108:2181], sessionId=0x15a856e6d320005, negotiated timeout=10000
I0228 11:27:53.872299  2952 group.cpp:340] Group process (zookeeper-group(1)@10.206.2.108:35453) connected to ZooKeeper
I0228 11:27:53.872354  2952 group.cpp:828] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0228 11:27:53.872369  2952 group.cpp:418] Trying to create path '/mesos' in ZooKeeper
I0228 11:27:53.873971  2951 detector.cpp:152] Detected a new leader: (id='14')
I0228 11:27:53.874215  2953 group.cpp:697] Trying to get '/mesos/json.info_0000000014' in ZooKeeper
I0228 11:27:53.876013  2956 zookeeper.cpp:259] A new leading master (UPID=master@10.206.2.108:5050) is detected
I0228 11:27:53.876211  2954 sched.cpp:330] New master detected at master@10.206.2.108:5050
I0228 11:27:53.876564  2954 sched.cpp:341] No credentials provided. Attempting to register without authentication
I0228 11:27:53.878914  2954 sched.cpp:743] Framework registered with 5cf46fd1-92b2-4f16-b92b-f434c853e2c7-0001
INFO:root:Registered with framework ID 5cf46fd1-92b2-4f16-b92b-f434c853e2c7-0001
Traceback (most recent call last):
  File "/usr/lib64/python2.7/logging/__init__.py", line 851, in emit
    msg = self.format(record)
  File "/usr/lib64/python2.7/logging/__init__.py", line 724, in format
    return fmt.format(record)
  File "/usr/lib64/python2.7/logging/__init__.py", line 464, in format
    record.message = record.getMessage()
  File "/usr/lib64/python2.7/logging/__init__.py", line 328, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Logged from file mrun.py, line 94
INFO:root:Launching proxy on offer value: "5cf46fd1-92b2-4f16-b92b-f434c853e2c7-O1"
 from 10.206.2.108
INFO:root:Replying to offer: launching proxy 0 on host 10.206.2.108
INFO:root:Call-back at 10.206.2.108:31000
Traceback (most recent call last):
  File "/usr/lib64/python2.7/logging/__init__.py", line 851, in emit
    msg = self.format(record)
  File "/usr/lib64/python2.7/logging/__init__.py", line 724, in format
    return fmt.format(record)
  File "/usr/lib64/python2.7/logging/__init__.py", line 464, in format
    record.message = record.getMessage()
  File "/usr/lib64/python2.7/logging/__init__.py", line 328, in getMessage
    msg = msg % self.args
TypeError: not all arguments converted during string formatting
Logged from file mrun.py, line 94
INFO:root:Finalize slaves
INFO:root:about to execute mpiexec
INFO:root:in slave loop
'HYDRA_LAUNCH: /tmp/tmpeHgagc/./export/bin/hydra_pmi_proxy --control-port 10.206.2.108:38010 --rmk user --launcher manual --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 0 \n'
INFO:root:None
INFO:root:Done finalizing slaves
ERROR:root:A task finished unexpectedly: Command exited with status 1
I0228 11:27:59.428092  2951 sched.cpp:1995] Asked to stop the driver
I0228 11:27:59.428184  2951 sched.cpp:1187] Stopping framework 5cf46fd1-92b2-4f16-b92b-f434c853e2c7-0001
2017-02-28 11:27:59,438:2948(0x7fdc0c3cf700):ZOO_INFO@zookeeper_close@2526: Closing zookeeper sessionId=0x15a856e6d320005 to [10.206.2.108:2181]

(Note: I added some additional logging here and there, as you can see.)
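The two `TypeError: not all arguments converted during string formatting` tracebacks in the output above are what Python's logging module raises when a call passes extra arguments without matching `%s` placeholders in the message. They are reported as coming from mrun.py, line 94, so they are probably caused by one of the added logging calls and may be unrelated to the proxy failure. The exact call is not shown here, so the message text below is hypothetical; this is only a minimal sketch of the failure and of the usual fix.

```python
import logging


def render(msg, *args):
    """Format a message the same way the logging module does (msg % args)."""
    # The path and line number are placeholders for illustration only.
    record = logging.LogRecord("root", logging.INFO, "mrun.py", 94, msg, args, None)
    return record.getMessage()


# Hypothetical broken call: one argument, but no placeholder in the message.
try:
    render("Launching proxy on offer", "some-offer-id")
except TypeError as err:
    print("broken call fails with: %s" % err)

# Fixed call: one %s placeholder per argument.
print(render("Launching proxy on offer %s from %s", "some-offer-id", "10.206.2.108"))
```

With the standard `logging.info(...)` call the same rule applies: write `logging.info("... %s from %s", offer_id, host)` rather than passing the values as extra arguments to a message that has no placeholders.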

The problem is around 'HYDRA_LAUNCH: /tmp/tmpeHgagc/./export/bin/hydra_pmi_proxy --control-port 10.206.2.108:38010 --rmk user --launcher manual --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 0 \n'. This command returns a non-zero exit code. The hydra_pmi_proxy binary is the one that ships with MPICH 3.2. Documentation on this command is scarce, and it appears to be an internal helper used by mpiexec.

I tried to add some debugging to the hydra-proxy.py script in order to see the stderr and/or stdout of this command, but could not find a way to get its output onto the stdout/stderr of my tty.
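One possible way to capture that output is to run the command through `subprocess` with stderr merged into stdout and then log or save the result. This is a sketch only: `run_and_capture` is a hypothetical helper, not something that exists in hydra-proxy.py, and how it would be wired into that script is an assumption. Note also that because the proxy runs as a Mesos task, anything it prints normally ends up in the task's sandbox `stdout`/`stderr` files on the slave rather than on the tty that launched mrun.

```python
import subprocess
import sys


def run_and_capture(argv):
    """Run argv and return (exit_code, combined stdout and stderr as text)."""
    proc = subprocess.Popen(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into the same pipe as stdout
    )
    output, _ = proc.communicate()  # waits for the process to exit
    return proc.returncode, output.decode("utf-8", "replace")


if __name__ == "__main__":
    # Stand-in command that writes to stderr and exits with status 1;
    # in practice this would be the hydra_pmi_proxy argument list.
    demo = [sys.executable, "-c",
            "import sys; sys.stderr.write('proxy failed\\n'); sys.exit(1)"]
    code, text = run_and_capture(demo)
    sys.stderr.write("exit code %d, output:\n%s" % (code, text))
```

Writing the captured text to stderr (or to a file in the working directory) from inside the task means it should be visible in the sandbox even when nothing reaches the launching terminal.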

Zookeeper: 3.4.9
Mesos: 1.1.0
MPICH: 3.2
Hadoop: 2.6.0
CentOS 7
Linux 3.10.0-514.6.2.el7.x86_64
