Rocks reinstall on compute node, after disk rebuild, fails

https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2012-January/056284.html

Bart Brashers bbrashers at environcorp.com
Mon Jan 23 13:36:27 PST 2012

I think one key tidbit of information would help you here. You only run insert-ethers when you are trying to install a NEW appliance (e.g. a compute node). If you just want to re-install an existing node, you do not use insert-ethers.

Insert-ethers is a tool to intercept PXE boots, detect MAC addresses, and create entries in the rocks database for new nodes.  It does have a few (leftover) features like removing database entries, but those have been largely moved to "/opt/rocks/bin/rocks" commands.

So in your case, to rename compute-0-4 to compute-0-0, you can do one of two things:

Version one:

1. Run "insert-ethers --replace compute-0-4"
2. Pick "compute" from list.
3. Make that node PXE boot.
4. When the "( )" turns into a "(*)", indicating that the node has received a kickstart file, exit.

Version two:

1. Run "rocks remove host compute-0-4"
2. Run "rocks sync config; rocks sync users"
3. Run "insert-ethers --cabinet 0 --rank 0"
4. Pick "compute" from list.
5. Make that node PXE boot.
6. When the "( )" turns into a "(*)", indicating that the node has received a kickstart file, exit.
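
The two procedures above boil down to a short list of commands. Below is a minimal Python sketch that only builds and prints them (dry run by default). The helper names (`version_one`, `version_two`, `cabinet_rank`, `run`) are hypothetical, and it assumes the usual Rocks naming compute-<cabinet>-<rank> to derive the insert-ethers arguments from the name you want. The steps that need a human (picking "compute" from the list, PXE booting the node, exiting once you see "(*)") are not automated.

```python
import re
import shlex
import subprocess

# Rocks names compute appliances "compute-<cabinet>-<rank>", e.g. compute-0-0.
HOST_RE = re.compile(r"^compute-(\d+)-(\d+)$")


def cabinet_rank(host):
    """Return (cabinet, rank) parsed from a name like 'compute-0-0'."""
    m = HOST_RE.match(host)
    if m is None:
        raise ValueError(f"not a compute-<cabinet>-<rank> name: {host!r}")
    return int(m.group(1)), int(m.group(2))


def version_one(old_host):
    """Version one: let insert-ethers replace the existing entry."""
    return [["insert-ethers", "--replace", old_host]]


def version_two(old_host, new_host):
    """Version two: remove the old entry, resync, then insert at the new cabinet/rank."""
    cabinet, rank = cabinet_rank(new_host)
    return [
        ["rocks", "remove", "host", old_host],
        ["rocks", "sync", "config"],
        ["rocks", "sync", "users"],
        ["insert-ethers", "--cabinet", str(cabinet), "--rank", str(rank)],
    ]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            # insert-ethers is interactive: pick "compute", PXE boot the node,
            # and exit once "( )" becomes "(*)" before the script continues.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run(version_two("compute-0-4", "compute-0-0"))
```

Keeping the dry run as the default means you can eyeball the exact commands before anything touches the database.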

To re-install a node that already has an entry in the rocks database (i.e. a "known" node):

1. Run "rocks set host boot compute-0-0 action=install"
2. Make that node PXE boot.
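
For a known node, the only command is the one above. A minimal sketch, assuming you want to flag several nodes at once and simply loop over them with one `rocks set host boot` call per host. `reinstall_command` and `reinstall` are hypothetical helpers, and making each node PXE boot is still up to you.

```python
import shlex
import subprocess


def reinstall_command(host):
    """Flag a known node so that its next PXE boot reinstalls it."""
    return ["rocks", "set", "host", "boot", host, "action=install"]


def reinstall(hosts, dry_run=True):
    """Set action=install for each known host; print, and run only if dry_run is False."""
    commands = [reinstall_command(h) for h in hosts]
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    reinstall(["compute-0-0"])
```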

When the node is either installed or re-installed, the OS is completely new.  This includes things like the node's SSH ID (values you would put in ~/.ssh/known_hosts or /etc/ssh/ssh_known_hosts).
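
Because the SSH ID changes, ssh on the machine you connect from may refuse the reinstalled node with a changed-host-key warning. One way to clear the stale entry is `ssh-keygen -R <host>`, which removes that host's keys from a known_hosts file (`-f` selects the file). A minimal sketch that wraps it, dry run by default. `forget_host_key` is a hypothetical helper, and it defaults to the per-user ~/.ssh/known_hosts, so pass /etc/ssh/ssh_known_hosts instead if that is where the old key lives.

```python
import os
import shlex
import subprocess


def forget_host_key(host, known_hosts="~/.ssh/known_hosts", dry_run=True):
    """Remove the stale SSH host key for `host` from a known_hosts file."""
    path = os.path.expanduser(known_hosts)  # subprocess does not expand "~" itself
    cmd = ["ssh-keygen", "-R", host, "-f", path]
    print(shlex.join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    forget_host_key("compute-0-0")
```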

For now you can still do "ssh compute-0-4", and once it has been reinstalled as compute-0-0, you can do "ssh compute-0-0".

I think you can detect a hardware problem by reading `dmesg` and/or looking in the logs. No need to do a filesystem check (fsck) as suggested by Luca. If you really want to be sure, you can boot your FE (frontend) from a LiveCD and run "fsck" on each partition of your FE. Keep in mind that if you have a large user-data partition, it could take a really, really long time to run. For now, run fsck only on the /, /var, /boot, etc. partitions. I can't give you an exact list, because I don't know how you partitioned your FE. But you get the idea: check the local OS partitions.
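
A minimal sketch of both checks in Python. The regular expression is an assumption (a few common kernel messages for failing disks and ext filesystems), not an exhaustive list, and `OS_MOUNTS` and the device names in the usage are hypothetical; substitute your FE's actual partition layout. Feed `suspicious_lines()` the text of `dmesg` or a log file, and run the resulting fsck commands by hand from the LiveCD.

```python
import re

# Hypothetical patterns: adjust them to the messages your kernel and disk driver print.
SUSPICIOUS = re.compile(
    r"I/O error|medium error|EXT[234]-fs error|\bata\d+.*(failed|error)",
    re.IGNORECASE,
)

# Local OS mount points to check first; the large user-data partition is left out.
OS_MOUNTS = ("/", "/var", "/boot")


def suspicious_lines(dmesg_text):
    """Return the kernel-log lines that look like disk or filesystem trouble."""
    return [line for line in dmesg_text.splitlines() if SUSPICIOUS.search(line)]


def fsck_commands(mounts):
    """mounts maps mount point -> device; return fsck commands for OS partitions only.

    Run these from the LiveCD, with the partitions unmounted.
    """
    return [["fsck", mounts[m]] for m in OS_MOUNTS if m in mounts]


if __name__ == "__main__":
    sample = "end_request: I/O error, dev sda, sector 12345\nusb 1-1: new high speed USB device\n"
    print(suspicious_lines(sample))
    print(fsck_commands({"/": "/dev/sda1", "/var": "/dev/sda2", "/export": "/dev/sda5"}))
```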

I suspect that most of your problems stem from running insert-ethers too many times when you should have used "rocks set host boot compute-0-0 action=install".

Bart
  • Posted 2020-12-29 13:01
  • Category: linux