!!! Nomad

[{TableOfContents }]


!! Resources

* [Project home|https://www.nomadproject.io]
* [AWS-related stuff in gitblit|http://www.computerhok.nl/gitblit/tree/~metskem!aws-userdata.git]

!! Miscellaneous commands 

! nomad server-members
{{{
[root@ip-172-31-22-13 logs]# nomad server-members
Name                                                  Addr          Port  Status  Proto  Build  DC    Region
ip-172-31-22-13.eu-central-1.compute.internal.global  172.31.22.13  4648  alive   2      0.2.3  Best  global
ip-172-31-29-38.global                                172.31.29.38  4648  alive   2      0.2.3  Best  global
ip-172-31-29-39.global                                172.31.29.39  4648  alive   2      0.2.3  Best  global
}}}

! nomad node-status
{{{
[root@ip-172-31-22-13 logs]# nomad node-status
ID                                    DC      Name              Class   Drain  Status
0030d4f7-c0ae-5f48-3c01-3939eade8a42  Boxtel  ip-172-31-18-153  <none>  false  ready
8718fa79-f468-9bd8-117b-ce87403920b4  Best    ip-172-31-30-25   <none>  false  ready
a90f5ffb-334a-b9f7-105f-ce078b181098  Best    ip-172-31-30-24   <none>  true   down
ccb36d24-4681-326c-b08e-42aa56a4bf67  Best    ip-172-31-29-208  <none>  false  ready
}}}

! nomad validate
{{{
[ec2-user@ip-172-31-22-13 jobs]$ nomad validate SimpleHTTPServer.nomad 
Job validation successful
}}}
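
The job file itself is not listed on this page. A minimal sketch that is consistent with the output shown here (job SimpleHTTPServer, group cache, task web, datacenter Boxtel, type service, priority 50, exec driver) could look like the following. The Python path, the port and the resource values are assumptions for illustration, not the original file:

{{{
# Hypothetical reconstruction of SimpleHTTPServer.nomad
job "SimpleHTTPServer" {
  datacenters = ["Boxtel"]
  type        = "service"
  priority    = 50

  group "cache" {
    count = 1

    task "web" {
      driver = "exec"

      config {
        # Assumed interpreter path and port
        command = "/usr/bin/python"
        args    = ["-m", "SimpleHTTPServer", "8000"]
      }

      # Assumed values, tune to the instance size
      resources {
        cpu    = 500
        memory = 128
      }
    }
  }
}
}}}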

! nomad run
{{{
[ec2-user@ip-172-31-22-13 jobs]$ nomad run SimpleHTTPServer.nomad 
==> Monitoring evaluation "edfc16cd-0a57-edaf-472b-a42ea346407f"
    Evaluation triggered by job "SimpleHTTPServer"
    Allocation "a51dec6c-7416-fc09-037c-6cc617136209" created: node "0030d4f7-c0ae-5f48-3c01-3939eade8a42", group "cache"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "edfc16cd-0a57-edaf-472b-a42ea346407f" finished with status "complete"
}}}

! nomad status
{{{
[ec2-user@ip-172-31-22-13 jobs]$ nomad status
ID                Type     Priority  Status
SimpleHTTPServer  service  50        <none>
}}}


{{{
[ec2-user@ip-172-31-22-13 jobs]$ nomad status -short SimpleHTTPServer
ID          = SimpleHTTPServer
Name        = SimpleHTTPServer
Type        = service
Priority    = 50
Datacenters = Boxtel
Status      = <none>
}}}


{{{
[ec2-user@ip-172-31-22-13 jobs]$ nomad status SimpleHTTPServer
ID          = SimpleHTTPServer
Name        = SimpleHTTPServer
Type        = service
Priority    = 50
Datacenters = Boxtel
Status      = <none>

==> Evaluations
ID                                    Priority  TriggeredBy   Status
edfc16cd-0a57-edaf-472b-a42ea346407f  50        job-register  complete

==> Allocations
ID                                    EvalID                                NodeID                                TaskGroup  Desired  Status
a51dec6c-7416-fc09-037c-6cc617136209  edfc16cd-0a57-edaf-472b-a42ea346407f  0030d4f7-c0ae-5f48-3c01-3939eade8a42  cache      run      failed
}}}

! nomad stop

{{{
[ec2-user@ip-172-31-22-13 jobs]$ nomad stop SimpleHTTPServer
==> Monitoring evaluation "0a6b662b-a5e9-9495-1e77-429306da45a7"
    Evaluation triggered by job "SimpleHTTPServer"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "0a6b662b-a5e9-9495-1e77-429306da45a7" finished with status "complete"
[ec2-user@ip-172-31-22-13 jobs]$ nomad status SimpleHTTPServer
Error querying job: Unexpected response code: 404 (job not found)
}}}


!! Errors

When submitting a simple Python web server as a job (exec driver), we get this in the Nomad client log:
{{{
* Failed to join spawn-daemon to the cgroup (&{Name:02b1290b-0a2d-43b2-eee8-fcc096e86a14 Parent: ScopePrefix: Resources:0xc8200eef00}): Error found less than 3 fields post '-' in "24 21 0:6 / /home/ec2-user/nomad/data/alloc/54e3a869-e78c-f2cf-368a-9374460bea37/web/dev ro,relatime - devtmpfs  rw,size=500712k,nr_inodes=125178,mode=755"
    2016/01/12 10:19:27 [DEBUG] client: updated allocations at index 519 (2 allocs)
    2016/01/12 10:19:27 [DEBUG] client: allocs: (added 0) (removed 0) (updated 2) (ignore 0)
}}}

Googling it gives exactly one hit, which is a 404 on GitHub. Digging further in that GitHub repo leads to the source at [https://github.com/hashicorp/nomad/blob/master/client/driver/executor/exec_linux.go] :-) :-(

I fired up a RHEL7 AMI (instead of the Amazon Linux AMI) to see if that helps, but it gives the same error.

Assuming it has something to do with the driver, I changed the driver from "exec" to "raw_exec", but that results in a "missing drivers" scheduling error:

{{{
[ec2-user@ip-172-31-22-13 jobs]$ nomad run SimpleHTTPServer.nomad 
==> Monitoring evaluation "306cb11a-b012-c89f-aa36-69c9a0e44464"
    Evaluation triggered by job "SimpleHTTPServer"
    Scheduling error for group "cache" (failed to find a node for placement)
    Allocation "7a0350ea-8396-96f7-308d-1aee4a500619" status "failed" (1/1 nodes filtered)
      * Constraint "missing drivers" filtered 1 nodes
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "306cb11a-b012-c89f-aa36-69c9a0e44464" finished with status "complete"
}}}

The Nomad client log confirms that raw_exec is not among the available drivers:
{{{
2016/01/12 07:59:19 [DEBUG] client: available drivers [exec java]
}}}
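
As far as I know, raw_exec is disabled by default and has to be switched on explicitly in the client configuration. A minimal sketch, assuming the client is started with an HCL config file and that this Nomad version honours the {{driver.raw_exec.enable}} client option (not verified on the 0.2.3 build used here):

{{{
# Fragment of the Nomad client config file
client {
  enabled = true

  options {
    # Assumption: opt in to the raw_exec driver (no isolation, runs as the Nomad user)
    "driver.raw_exec.enable" = "1"
  }
}
}}}

After changing this, the client presumably needs a restart, and the "available drivers" debug line should then include raw_exec.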


! Looping PID 1 ?!

After running a Nomad job like this:
{{{
      driver = "exec"
      config {
        command = "/bin/bash"
        args = [
          "-c",
          "mkdir ff && cd ff && curl --silent --show-error --remote-name 'http://www.computerhok.nl/tmp/dropwizardtest-1.2-assembly.zip' && unzip *.zip && cd * && java -jar dropwizardtest*.jar server helloworld.yaml"]
      }
}}}

the host became barely responsive, and just before the reboot this message appeared:
{{{
[root@ip-172-31-19-2 log]# 
Message from syslogd@ip-172-31-19-2 at Jan 17 09:25:51 ...
 kernel:BUG: soft lockup - CPU#0 stuck for 23s! [nomad:9073]
}}}

!! chroot for exec task ?

According to the documentation, an "exec" task runs chrooted. We do like chrooted environments, but it looks like this requires quite some disk space (about 730 MB for one Python command). I am not sure that figure is accurate, given all the mounts listed here:

{{{
[root@ip-172-31-29-208 alloc]# pwd
/home/ec2-user/nomad/data/alloc

[root@ip-172-31-29-208 alloc]# ls -l
total 8
drwx------ 3 root root 4096 Jan 13 07:05 651023c2-5bcc-9997-8350-c179eb38e73d
drwx------ 4 root root 4096 Jan 13 07:06 c6c34727-2e79-7112-674f-c02fba46209a

[root@ip-172-31-29-208 alloc]# du -cms . 2>/dev/null
731	.
731	total

[root@ip-172-31-29-208 alloc]# ls -l c6c34727-2e79-7112-674f-c02fba46209a/web
total 36
drwxrwxrwx  5 nobody nobody  4096 Jan 13 07:06 alloc
dr-xr-xr-x  2 root   root    4096 Jan 13 07:06 bin
drwxr-xr-x 16 root   root    2720 Jan 13 06:43 dev
drwxr-xr-x 75 root   root    4096 Jan 13 07:06 etc
dr-xr-xr-x  7 root   root    4096 Jan 13 07:06 lib
dr-xr-xr-x 10 root   root   12288 Jan 13 07:06 lib64
drwxrwxrwx  2 nobody nobody  4096 Jan 13 07:06 local
dr-xr-xr-x 86 root   root       0 Jan 13 06:43 proc
dr-xr-xr-x  5 root   root    4096 Jan 13 07:06 usr

[root@ip-172-31-29-208 proc]# mount
mount: /proc/self/mountinfo: parse error: ignore entry at line 9.
mount: /proc/self/mountinfo: parse error: ignore entry at line 10.
mount: /proc/self/mountinfo: parse error: ignore entry at line 12.
mount: /proc/self/mountinfo: parse error: ignore entry at line 13.
mount: /proc/self/mountinfo: parse error: ignore entry at line 15.
mount: /proc/self/mountinfo: parse error: ignore entry at line 16.
proc on /proc type proc (rw,relatime)
sysfs on /sys type sysfs (rw,relatime)
/dev/xvda1 on / type ext4 (rw,noatime,data=ordered)
devtmpfs on /dev type devtmpfs (rw,relatime,size=500712k,nr_inodes=125178,mode=755)
devpts on /dev/pts type devpts (rw,relatime,gid=5,mode=620,ptmxmode=000)
tmpfs on /dev/shm type tmpfs (rw,relatime)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw,relatime)
/dev/xvda1 on /home/ec2-user/nomad/data/alloc/651023c2-5bcc-9997-8350-c179eb38e73d/web/alloc type ext4 (rw,noatime,data=ordered)
none on /home/ec2-user/nomad/data/alloc/651023c2-5bcc-9997-8350-c179eb38e73d/web/dev
none on /home/ec2-user/nomad/data/alloc/651023c2-5bcc-9997-8350-c179eb38e73d/web/proc
/dev/xvda1 on /home/ec2-user/nomad/data/alloc/c6c34727-2e79-7112-674f-c02fba46209a/web/alloc type ext4 (rw,noatime,data=ordered)
none on /home/ec2-user/nomad/data/alloc/c6c34727-2e79-7112-674f-c02fba46209a/web/dev
none on /home/ec2-user/nomad/data/alloc/c6c34727-2e79-7112-674f-c02fba46209a/web/proc
/dev/xvda1 on /home/ec2-user/nomad/data/alloc/a3cc556f-eafe-4c30-a479-0fd63d8d63fd/fulltest-task/alloc type ext4 (rw,noatime,data=ordered)
none on /home/ec2-user/nomad/data/alloc/a3cc556f-eafe-4c30-a479-0fd63d8d63fd/fulltest-task/dev
none on /home/ec2-user/nomad/data/alloc/a3cc556f-eafe-4c30-a479-0fd63d8d63fd/fulltest-task/proc

}}}