[Python] RP fails on Titan [radical.pilot]

jdakka 2017-10-9 233

I ran a test example using 1 pilot, 16 cores on Titan-ortelib (RP version 0.46)
I get the following errors:

 Getting Started (RP version 0.46)                                              
================================================================================

new session: [rp.session.titan-ext7.jdakka.017339.0001]                        \
database   : [mongodb://rp:rp@ds015335.mlab.com:15335/rp]
Traceback (most recent call last):
  File "pilot.py", line 38, in <module>
    session = rp.Session()
  File "/lustre/atlas2/csc230/proj-shared/mskcc/rp_experiments/venv/lib/python2.7/site-packages/radical/pilot/session.py", line 257, in __init__
    self._components = ruc.start_components(self._cfg, self, self._log)
  File "/lustre/atlas2/csc230/proj-shared/mskcc/rp_experiments/venv/lib/python2.7/site-packages/radical/pilot/utils/component.py", line 252, in start_components
    comp.start()
  File "/lustre/atlas2/csc230/proj-shared/mskcc/rp_experiments/venv/lib/python2.7/site-packages/radical/utils/process.py", line 462, in start
    timeout))
RuntimeError: unexpected child message (alive) [5.0]

And this message from root:

PBS Job Id: 3458802
Job Name:   pilot.0000
Exec host:  2342/0-15
An error has occurred processing your job, see below.
request to copy stageout files failed on node '2342' for job 3458802

Unable to copy file /autofs/nccs-svm1_home1/jdakka/3458802.OU to /lustre/atlas/scratch/jdakka/csc230/radical.pilot.sandbox/rp.session.titan-ext4.jdakka.017339.0000/pilot.0000/bootstrap_1.out, error 1
*** error from copy
/bin/cp: cannot stat `/autofs/nccs-svm1_home1/jdakka/3458802.OU': No such file or directory
*** end error output

Unlink of stage out file /autofs/nccs-svm1_home1/jdakka/3458802.ER failed
Comments (20)
aa919 2017-10-9
1

I have the same problem. I tried to run experiments, but the unit folders are not copied to Lustre. I also tried deleting the pilot's virtualenv, but RP is not able to create a new one. Titan had a downtime yesterday; they probably updated the system, and the update is what is breaking RP.

vivekbala 2017-10-9
2

This might not be the root cause of the problem. But I think you need a minimum of 2 nodes (=32 cores on Titan) if you are using orterun/ortelib.
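
The node arithmetic in this comment can be sketched as follows. This is only an illustration: the function name and its parameters are hypothetical, the 16 cores per node follow from "2 nodes = 32 cores" above, and the two-node minimum for orterun/ortelib is the commenter's recollection rather than a confirmed RP rule.

```python
def min_orte_cores(requested_cores, cores_per_node=16, min_nodes=2):
    """Round a core request up to whole nodes, with a lower bound on nodes.

    Hypothetical helper: the defaults encode the assumptions stated in the
    thread (16 cores per Titan node, at least 2 nodes for orterun/ortelib).
    """
    if requested_cores < 1:
        raise ValueError("requested_cores must be positive")
    # Ceiling division without floats: -(-a // b) == ceil(a / b).
    nodes = -(-requested_cores // cores_per_node)
    return max(nodes, min_nodes) * cores_per_node


if __name__ == "__main__":
    for cores in (16, 32, 33):
        print("%d -> %d" % (cores, min_orte_cores(cores)))
```

Under these assumptions, a 16-core request would be raised to 32 cores.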

aa919 2017-10-9
3

Yes but I asked for the extra node and the problem is still there.

vivekbala 2017-10-9
4

OK, just wanted to mention it since I noticed Jumana is using only 16 cores.

jdakka 2017-10-9
5

Ran Ortelib with 32 cores...same stalling issue as @AA919

andremerzky 2017-10-9
6

Yeah, RP fails for me too, even earlier - we seem to have hit a Python/pip compatibility problem again :/

andremerzky 2017-10-9
7

The Titan FS behaves very inconsistently right now, and I can't get debug runs started. Sorry, but this has to wait a little longer :/

andremerzky 2017-10-9
8

Sooo, the virtualenv version we use in the pilot bootstrapper does not cooperate with Titan's Python setup anymore. More specifically, the particular combination of python, virtualenv, and the easy_install that is packaged in virtualenv and used to install pip fails due to a mismatch in setuptools. I don't think we can fix this in our stack, and I don't think we can ask Titan support, as we use our own virtualenv version and they are not going to support that.

So, we'll have to touch the bootstrapper and fall back to the native Titan virtualenv/pip combo. Now, we did not choose our own virtualenv version for no reason, and we may have to check what other problems will pop up again - or maybe not.

Either way, I guess I am just saying: stay tuned, this will take a while... :/

andremerzky 2017-10-9
9

#1378 fixes a part of the problem. For this to kick in, you will have to remove the pilot VE on titan. Alas, more work is needed - the system upgrade also screwed with pyzmq deployment, which is now unusable, so this ticket remains open.
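
One cautious way to take the pilot VE out of service is sketched below. The thread does not say where the pilot VE lives on Titan, so the path is left to the caller, and `retire_ve` is a hypothetical helper, not part of RP. It renames the directory instead of deleting it, which is a choice made for this sketch so that the old VE stays available for inspection.

```python
import os
import time


def retire_ve(ve_path):
    """Rename a virtualenv directory out of the way and return the new path.

    Returns None if `ve_path` is not an existing directory. Hypothetical
    helper: it renames rather than removes, so nothing is lost.
    """
    ve_path = os.path.abspath(ve_path)
    if not os.path.isdir(ve_path):
        return None
    # Suffix with a timestamp so repeated runs do not collide in practice.
    backup = "%s.old.%d" % (ve_path, int(time.time()))
    os.rename(ve_path, backup)
    return backup
```

Once the directory is gone from its original location, the bootstrapper should no longer find the old VE there; the renamed copy can be deleted when it is no longer needed.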

andremerzky 2017-10-9
10

zeromq/pyzmq#946

This seems unresolved at the pyzmq wheel level. I guess we will have to install/compile/maintain zmq ourselves to resolve this.

marksantcroos 2017-10-9
11

You are probably chasing a ghost in your test ... try with export LD_PRELOAD=/lib64/librt.so.1

andremerzky 2017-10-9
12
marksantcroos 2017-10-9
13

Given that you reply by mail I'm assuming you didn't try? :)

andremerzky 2017-10-9
14
marksantcroos 2017-10-9
15

I can reproduce your error if I unset LD_PRELOAD.
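
To compare both cases side by side, the import can be run in a fresh interpreter with and without LD_PRELOAD. The following is a minimal sketch: `try_import` is a hypothetical helper, and the library path is the one quoted earlier in this thread, which may not exist on other systems.

```python
import os
import subprocess
import sys


def try_import(module, preload=None):
    """Import `module` in a child interpreter; return (ok, stderr text).

    LD_PRELOAD is removed from the child's environment unless `preload`
    is given, so the parent's own setting does not leak into the test.
    """
    env = dict(os.environ)
    env.pop("LD_PRELOAD", None)
    if preload:
        env["LD_PRELOAD"] = preload
    proc = subprocess.Popen(
        [sys.executable, "-c", "import " + module],
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    _, err = proc.communicate()
    return proc.returncode == 0, err


if __name__ == "__main__":
    for preload in (None, "/lib64/librt.so.1"):
        ok, err = try_import("zmq", preload)
        print("LD_PRELOAD=%s: %s" % (preload, "ok" if ok else "failed"))
        if not ok:
            # Show only the final line, e.g. the ImportError message.
            lines = err.strip().splitlines()
            print("  " + (lines[-1] if lines else "(no stderr output)"))
```

Running the import in a child process matters here because LD_PRELOAD is only honored when a process starts; changing it inside an already running interpreter has no effect on that interpreter.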

andremerzky 2017-10-9
16
merzky1@titan-ext1:~/sandbox $ env | grep PREL
LD_PRELOAD=/lib64/librt.so.1

merzky1@titan-ext1:~/sandbox $ module list 2>&1 | grep python
 26) python/2.7.9
 27) python_setuptools/21.0
 28) python_pip/8.1.2
 29) python_virtualenv/12.0.7

merzky1@titan-ext1:~/sandbox $ virtualenv ve_test 2>&1 > /dev/null
merzky1@titan-ext1:~/sandbox $ source ve_test/bin/activate

(ve_test)merzky1@titan-ext1:~/sandbox $ which python pip
/lustre/atlas2/csc230/scratch/merzky1/radical.pilot.sandbox/ve_test/bin/python
/lustre/atlas2/csc230/scratch/merzky1/radical.pilot.sandbox/ve_test/bin/pip

(ve_test)merzky1@titan-ext1:~/sandbox $ pip install pyzmq                                                                                                                                                         
Collecting pyzmq
  Using cached pyzmq-16.0.2.tar.gz
Installing collected packages: pyzmq
  Running setup.py install for pyzmq ... done
Successfully installed pyzmq-16.0.2
You are using pip version 8.1.2, however version 9.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

(ve_test)merzky1@titan-ext1:~/sandbox $ python -c 'import zmq'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/lustre/atlas2/csc230/scratch/merzky1/radical.pilot.sandbox/ve_test/lib/python2.7/site-packages/zmq/__init__.py", line 34, in <module>
    from zmq import backend
  File "/lustre/atlas2/csc230/scratch/merzky1/radical.pilot.sandbox/ve_test/lib/python2.7/site-packages/zmq/backend/__init__.py", line 40, in <module>
    reraise(*exc_info)
  File "/lustre/atlas2/csc230/scratch/merzky1/radical.pilot.sandbox/ve_test/lib/python2.7/site-packages/zmq/backend/__init__.py", line 27, in <module>
    _ns = select_backend(first)
  File "/lustre/atlas2/csc230/scratch/merzky1/radical.pilot.sandbox/ve_test/lib/python2.7/site-packages/zmq/backend/select.py", line 26, in select_backend
    mod = __import__(name, fromlist=public_api)
  File "/lustre/atlas2/csc230/scratch/merzky1/radical.pilot.sandbox/ve_test/lib/python2.7/site-packages/zmq/backend/cython/__init__.py", line 6, in <module>
    from . import (constants, error, message, context,
ImportError: /lustre/atlas2/csc230/scratch/merzky1/radical.pilot.sandbox/ve_test/lib/python2.7/site-packages/zmq/backend/cython/error.so: undefined symbol: zmq_strerror

(ve_test)merzky1@titan-ext1:~/sandbox $ env | grep PREL
LD_PRELOAD=/lib64/librt.so.1

Am I reading your comment correctly that you still can run stuff on titan? Or, in other words, were you unable to reproduce without LD_PRELOAD?
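
The "undefined symbol: zmq_strerror" error above can be narrowed down a little by asking whether a given shared library exports that symbol. The sketch below uses the standard ctypes module; `exports_symbol` is a hypothetical helper. Note the limitation: it inspects whichever libzmq the loader finds on its default search path, which is not necessarily the library that pyzmq's error.so was built against.

```python
import ctypes
import ctypes.util


def exports_symbol(library, symbol):
    """Return True or False if `library` loads and does or does not export
    `symbol`; return None if the library cannot be loaded at all.

    `library` is a path or soname. None means the symbols that are already
    loaded into the current process.
    """
    try:
        lib = ctypes.CDLL(library)
    except OSError:
        return None
    # Attribute lookup on a CDLL resolves the symbol and raises
    # AttributeError when it is missing, which hasattr turns into False.
    return hasattr(lib, symbol)


if __name__ == "__main__":
    name = ctypes.util.find_library("zmq")  # None if no libzmq is found
    if name is None:
        print("no libzmq found on the default search path")
    else:
        print("%s exports zmq_strerror: %s"
              % (name, exports_symbol(name, "zmq_strerror")))
```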

andremerzky 2017-10-9
17

It turned out that some update in either pip or pyzmq led to a different env being used when installing pip. I decided not to track this down any further, but instead created a static VE on Titan which uses a custom pyzmq installation and is functional for our purposes. The respective config changes are in the fix/titan_pip branch - I would appreciate it if somebody could confirm that this works for more than just me (permissions, env, etc.); then we can merge this to devel.
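
For the permissions part of that confirmation, a quick check along the following lines may help. It is a sketch, not part of RP: `check_ve` is a hypothetical helper, the location of the static VE is not given in this thread and has to be supplied by the caller, and the check only assumes the usual virtualenv layout (bin/python and bin/activate) seen in the session above.

```python
import os


def check_ve(ve_path):
    """Report whether the current user can use the virtualenv at `ve_path`."""
    python = os.path.join(ve_path, "bin", "python")
    activate = os.path.join(ve_path, "bin", "activate")
    return {
        # Reading and traversing the VE directory itself.
        "ve_dir_accessible": os.access(ve_path, os.R_OK | os.X_OK),
        # The interpreter must exist and be executable by this user.
        "python_executable": os.path.isfile(python)
        and os.access(python, os.X_OK),
        # The activate script is sourced, so read access is enough.
        "activate_readable": os.access(activate, os.R_OK),
    }
```

Any False entry in the result points at a permission or layout problem for the user who runs the check; it says nothing about whether pyzmq inside the VE actually imports.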

vivekbala 2017-10-9
18

I didn't check whether I can reproduce the initial issue, but I tried the fix/titan_pip branch and my scripts currently run and terminate successfully. Note that this only tests the aprun mode.

andremerzky 2017-10-9
19

Thanks for the feedback, Vivek. I'll be merging it now - devel for sure is broken on titan, so we can't make it worse ;)

aa919 2017-10-9
20

That's the spirit :)
