ATOM Cluster

This page is obsolete. Please visit the newer page from HERE.

Three chassis of NEC's ATOM servers arrived in April 2014. This page covers their setup and an initial performance evaluation.

Table of Contents)

  1. RAMCloud in a Box (Prototype 1) at Stanford
    1. SEDCL Forum 2014 Presentation: "RAMCloud in a Box": pdf
    2. Server Overview:
      1. Three chassis of NEC's Micro Modular Server (44 ATOM servers in 2U chassis) : 132 ATOM servers : announced May 2014
    3. Management Server:
  2. Presentations:
    1. Poster session at SEDCL retreat June 2014 - RAMCloud performance with userland driver on ATOM server
    2. Poster session at SEDCL forum January 2014 - Overview of ATOM server
  3. Current Status
  4. Setup
    1. Terminology
    2. Server Hardware Setup
    3. rcmaster (host) Setup
    4. Install and boot CentOS on ATOM Servers
    5. Build RAMCloud and run it on ATOM servers
  5. Performance evaluation:
    1. Ping with kernel/tcp
    2. RAMCloud performance:
      1. Clusterperf performance with kernel/tcp (tuned)
      2. Clusterperf with userland (kernel-bypass) driver (tuning is still in progress)
  6. References:

1. Server Overview)

 

  1. The server announcement
    1. NEC America: http://www.nec-enterprise.com/News/Latest-press-releases/NEC-raises-the-bar-for-high-density-IT-solution-platforms-for-the-public-and-private-cloud-698
    2. Article: http://www.otcmarkets.com/news/otc-market-headline?id=16150276
    3. ASCII.jp (in Japanese): http://ascii.jp/elem/000/000/895/895712/
  2. (Photo) Three chassis installed in Stanford Server Room: (two chassis installed on top of existing rcmaster rack)

2. Presentations)

  1. SEDCL Retreat Poster Session on June 5, 2014:
    1. pdf: satoshi_poster.pdf
    2. Draft (multiple pages, easier to view), rev. 1.01, June 4, 2014
      1. pdf: 20140605-RAMCloudOnMicroServer_r1_01s.pdf
      2. ppt: 20140605-RAMCloudOnMicroServer_r1_01s.ppt
      3. Mac keynote source: 20140605-RAMCloudOnMicroServer_r1_01s.key.zip (zip compressed)
      4. pdf with appendix: 20140605-RAMCloudOnMicroServer_r1_01.pdf
  2. Poster session at the SEDCL forum in January 2014 - Overview of ATOM server ... included in the introduction of the June 5, 2014 presentation.

3. Current Status)

  • At Stanford)

    • 88 ATOM servers are running CentOS 6.5.
    • To do: set up NIS, NFS mounts, and a DNS server to enable ssh login.
    • To do: compile RAMCloud and provide Python scripts for testing it.
  • At NEC Japan)

    • Performance improvement)
      1. Without replica/backup) - single-thread mode.
        1. 100B read (30B key): 12.4 us (min. 11.6 us) (7.3 MB/s); was 13.8 us (6.9 MB/s) at the SEDCL retreat presentation.
        2. 100B write (30B key): 15.3 us (6.2 MB/s); was 18.2 us (5.2 MB/s) at the SEDCL retreat presentation.
        3. Note) 8.7 us of this is spent in the Intel igb driver. We are contacting Intel for investigation and improvement.
      2. With replica/backup) - Multithreading must be enabled so that the collocated backup service can answer backup requests while the master is responding to write requests.
        1. 100B read (30B key): 13.2 us (min. 12.6 us) (7.2 MB/s)  vs. 5.1 us (18.7 MB/s) for RAMCloud with 32Gbps Infiniband (see: clusterperf August 12, 2013)
        2. 100B write (30B key): 38.0 us (2.5 MB/s)  vs. 15.7 us (6.1 MB/s) for RAMCloud with 32Gbps Infiniband
        3. Performance is the same on the 1.7GHz ATOM Avoton (C2730) and the 2.4GHz ATOM (C2750): the time goes to L2 cache/main-memory access, the network, etc., which do not run at core clock.
    • Further evaluation
      • Preparing a performance evaluation on the latest Xeon machine with a 10G NIC, using the same kernel-bypass driver.
    • An additional host for remote maintenance and RAMCloud upgrades.

4. Setup)

  Terminology)

  1. Management components:
    1. One instance in each server blade:
      1. MMC: each server's BMC (Baseboard Management Controller), which controls power, boot, etc.
    2. For Chassis control:
      1. CMM: a BMC which controls the chassis functions. Two instances per chassis: one master, one slave.
      2. ONS: a control port for each of the two switch boards in a chassis. Two instances for the two network cards. The ONS can be accessed through either of the two microUSB ports on the front panel.

  Server HW Setup)

  1. Setup Document: 20140806-ATOMServerSetup_r1_00.pdf
  2. IP address assignment:
    1. Host IP assignment list – with information about chassis, slot, and MAC address
    2. Raw data) MAC addresses and slot information.
  3. Configuration scripts on rcmaster: 
    1. Note)
      1. We strongly recommend using hostnames instead of IP addresses, since the IP address assignment may change at any time. IPMItool accepts either a hostname or an IP address.
      2. The IPMItool password is read from '~/.atom/ipmi_password.txt'. Access permission on this file should be restricted to a limited set of administrators.
    2. AllCheck.sh <CMM-Master>  // Acquires the 'MAC address and slot information' referenced in item 2 above.
      1. try.sh // Tries ping and AllCheck.sh over a range of IP addresses.
    3. NoVLAN.sh <ether_CMM> // First step of the VLAN reconfiguration: merges VLAN 4092 into VLAN 1 for all the MMCs on the LAN managed by the CMM. It must be run before reconfiguring the ONS; otherwise we lose the ability to talk to the MMCs on that LAN.
    4. ipmiaw <range_of_servers> <ipmi_blobs> : IPMI ATOM wrapper: you can use 1,2,3-10 to specify a range of ATOM server MMCs. For a dry run, check the generated commands by passing '-d' as the first parameter. 'Range' is extended from the original ipmirw (a sketch of the expansion appears after this list):
      1. Range elements:
        1. M : single element
        2. M-N : M to N
        3. M+D : M to M+D-1 (D elements)
      2. A range element can be either:
        1. a hostname (atom002) or a number
        2. an IP address (192.168.5.3); if the upper 24 bits are omitted, as in '.2', the default subnet for ATOM IPMI is used.
      3. Range elements are concatenated with ',' (note: no spaces in the list!)
        1. E.g.) 1,5-7,10+3,20,25-27,50+2,60          Try)  ipmiaw -d 1,5-7,10+3,20 foo bar
        2. E.g.) .1,5-7,10+3  or  192.168.5.1+44      Try)  ipmiaw -d .1,5-7,10+3,20 foo bar
    5. The tools below call the ipmiaw wrapper:
      1. pxeboot.sh <range_of_servers>     // PXE boot at next boot. Once a PXE boot has been performed, the boot image is automatically saved on the server's local SSD, and subsequent boots use the local copy.
      2. atom_up.sh <range_of_servers>     // Power up a range of servers.
      3. atom_down.sh <range_of_servers>   // Power down a range of servers.
      4. atom_boot.sh <range_of_servers>   // Boot the OS on a range of servers.
                  // Special key sequences: `~.` to quit, `~^z` to suspend. Under an ssh login, '~' needs to be escaped with another '~', so type '~~.' to quit.
  4. Configuration commands, in TeraTerm command-script (TTL) format, sent over a serial terminal connected to the ONS through a microUSB port:
    1. noVLAN.ttl  // Second step of the VLAN reconfiguration, applied to the ONS. Includes the initONS.ttl sequence.
    2. initONS.ttl // Initializes the ONS: a serial-terminal command sequence that reconfigures the LAN switch managed by the ONS.
    3. lan40G.ttl  // Reconfigures the external switch ports from the default 10G x 4 mode to 40G x 1 mode, for the LAN controlled by the ONS.
    4. lan10G.ttl  // Vice versa.
  5. For power up/down and activation (booting the OS), see the latter half of the section 'Install and boot CentOS on ATOM servers)'.
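The ipmiaw range expansion can be illustrated with a short shell sketch. This is not the shipped ipmiaw code, only a minimal reconstruction of the numeric range elements under the 'atomNNNm' naming used in the DHCP section below; hostname and IP-address elements are omitted:

    #!/bin/bash
    # Sketch of ipmiaw-style range expansion (numeric elements only).
    expand_range() {
      local out=() elem i
      IFS=',' read -ra elems <<< "$1"      # split on ',' -- no spaces allowed in the list
      for elem in "${elems[@]}"; do
        case "$elem" in
          *-*) for ((i=${elem%-*}; i<=${elem#*-}; i++)); do out+=("$i"); done ;;           # M-N
          *+*) for ((i=${elem%+*}; i<${elem%+*}+${elem#*+}; i++)); do out+=("$i"); done ;; # M+D
          *)   out+=("$elem") ;;                                                           # M
        esac
      done
      printf 'atom%03dm\n' "${out[@]}"     # MMC hostnames, per the naming tables below
    }

    expand_range 1,5-7,10+3    # -> atom001m atom005m atom006m atom007m atom010m atom011m atom012m

The wrapper tools above take the same ranges, e.g. 'pxeboot.sh 1,5-7,10+3' followed by 'atom_up.sh 1,5-7,10+3'.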


  rcmaster Setup)

       1. DHCP setup

Hostname assignment proposal)
    Precaution) Because of the IP address holes at xx.xx.xx.0 and xx.xx.xx.255, it is hard to map the numbers in hostnames directly onto numbers in IP addresses. We should hide IP addresses and use hostnames instead, which keeps the naming flexible when the system configuration changes.

1. Kernel driver ports, normally used for host communication, connected through NIC1 (eth0: 'InternalMAC1' in the ATOM Server MAC Address table):

                      chassis1 (root)   chassis2 (leaf1)   chassis3 (leaf2)
    eth0 (NIC1) port  atom001 to 044    atom045 to 088     atom089 to 132

2. MMC (BIOS port for server management: 'MAC Address' in the ATOM Server MAC Address table):

                      chassis1 (root)    chassis2 (leaf1)   chassis3 (leaf2)
    MMC port          atom001m to 044m   atom045m to 088m   atom089m to 132m

3. Userland driver ports (eth1: 'InternalMAC2' in the ATOM Server MAC Address table)

No IP address is assigned, to avoid loops; this is fine because these ports are used only at L2 (by MAC address).

4. Control ports for each chassis. We refer to a chassis by the ATOM server's development code name, 'mercury'.

                       chassis1 (root)   chassis2 (leaf1)   chassis3 (leaf2)
    CMM (Master) port  mercury1cmm       mercury2cmm        mercury3cmm
    CMM (Slave) port   mercury1cms       mercury2cms        mercury3cms
    ONS (Master) port  mercury1onm       mercury2onm        mercury3onm
    ONS (Slave) port   mercury1ons       mercury2ons        mercury3ons

5. If we later modify the userland driver to use L3, we can name the additional ports with suffixes starting from 'a'. Reserving 'a' through 'e' should be enough; the suffixes never reach 'm', which is assigned to the MMC.

                                chassis1 (root)    chassis2 (leaf1)   chassis3 (leaf2)
    eth1 (NIC2) port            atom001a to 044a   atom045a to 088a   atom089a to 132a
    eth2 (NIC3), if extended    atom001b to 044b   atom045b to 088b   atom089b to 132b
    ...                         ...                ...                ...

     1.1 Using Multiple Subnets)

The current ramcloud cluster has 80 servers. We are going to assign hosts/DHCP to the subnets as follows (covering both the OS port and the IPMI port of the RAMCloud servers, including rcmaster, rcnfs, rctest, and rcmonster):

             subnet        name             IP addresses  usage
    current  192.168.0/24  rc**             176           both OS port and IPMI port on each server
    current  192.168.1/24  infiniband       88            IP address for Infiniband
    current  192.168.2/24  inf eth          88            10G Ethernet on Infiniband card
    new      192.168.3/24  atom***          132           OS port (eth0) of ATOM server
    new      192.168.4/24  atom***a         132           eth1 of ATOM server (for future use)
    new      192.168.5/24  atom management  132+12        MMC of servers, CMMs and ONS of chassis
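For illustration, the corresponding host entries in rcmaster's DHCP configuration might look like the fragment below. This is a sketch for ISC dhcpd; the MAC address and the fixed address are placeholders, and the real values come from the AllCheck.sh output above:

    subnet 192.168.3.0 netmask 255.255.255.0 {
      # OS ports (eth0) of the ATOM servers; no dynamic range, static hosts only
    }
    host atom001 {
      hardware ethernet 02:00:00:00:00:01;   # placeholder; use 'InternalMAC1' from the table
      fixed-address 192.168.3.11;            # placeholder; assignments skip .0 and .255
    }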

     2. Setup tftp server for PXE boot)
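This subsection is an outline only; the sketch below shows the usual steps on a CentOS 6 host, assuming ISC dhcpd and the stock tftp-server/syslinux packages (package names and file paths are assumptions; the base directory matches the next section):

    $ yum install -y tftp-server syslinux
    $ vi /etc/xinetd.d/tftp           <- set 'disable = no' and 'server_args = -s /tftpboot'
    $ cp /usr/share/syslinux/pxelinux.0 /tftpboot/
    $ mkdir -p /tftpboot/pxelinux.cfg /tftpboot/images/centos/x86_64/6.5
    $ vi /etc/dhcp/dhcpd.conf         <- add: next-server <rcmaster-ip>; filename "pxelinux.0";
    $ service xinetd restart && service dhcpd restart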

  Install and boot CentOS on ATOM servers)

 

        ramcloud cluster        ATOM cluster
  OS    RHEL 6.0 (2.6.32-71)    CentOS 6.5 (2.6.32-431)
  gcc   4.4.7                   4.4.7

 

  1. Creating or downloading CentOS image
  2. Partitioning SSD
    1. Block device '/dev/sda2' is used as RAMCloud backup space. More than 100 GB of space is expected.
  3. Locating boot image for PXE 
    1. Base directory is  /tftpboot/images/centos/x86_64/6.5
    2. Modification needed
      1. Disable SELinux:

        # vi /etc/selinux/config      <- set the following line in the file:
         SELINUX=disabled
        # setenforce 0                <- also disable SELinux immediately, without a reboot
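        # getenforce                  <- optional check (a suggestion, not in the original procedure): should now report Permissive, and Disabled after the next reboot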
      2. Update the igb driver to 5.1.2 from the in-kernel igb-5.0.5-k:

        cd ..../source/igb/igb/src                    <- build directory of the igb 5.1.2 source
        cp igb.ko /lib/modules/2.6.32-431.el6.x86_64/kernel/drivers/net/igb/igb.ko
        rmmod igb                                     <- unload the old driver
        modprobe igb RSS=8 InterruptThrottleRate=1    <- reload with 8 RSS queues
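        ethtool -i eth0 | grep ^version               <- optional check (suggestion): should report 5.1.2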
  4. Insert the two 200V power cables into each chassis. (There is no power switch on the ATOM server chassis.)
  5. Enable PXE boot on the respective ATOM server with IPMItool ...
    If this step is skipped and a PXE boot has been performed before, the OS boots from the server's local copy.
         $ pxeboot.sh <atomXXXm>  // See 'Server HW Setup)' --> 'Configuration scripts' for the command reference.
  6. Power up and boot the respective server with IPMItool: >> See: 'Short Cut for whole chassis nodes)' below.
         $ ipmitool -I lanplus -U <admin_user> -P <password> -H atomXXXm power on
  7. The OS console of a server can be reached with IPMI or ssh.
         $ ipmitool -I lanplus -U <admin_user> -P <password> -H atomXXXm sol activate
    Note) Special ipmitool key sequences: `~.` to exit, `~^z` to suspend. If you are logged in over ssh, '~' needs to be escaped with another '~', so type '~~.' to exit.

System Shutdown) - Skip any step whose corresponding power-up step was not executed.

  1. Shutdown OS
          $ ssh root@atomXXX shutdown -h now   // note: the OS hostname (atomXXX), not the MMC (atomXXXm)
  2. Power down each server
      $ ipmitool -I lanplus -U <admin_user> -P <password> -H atomXXXm power off    
  3. Remove the power cables 
     Sample scripts are provided in '<ConfigurationScriptDirectory>/PowerCtrlExample/{up,down,nec*}.sh' (nec*.sh for activation).

Short Cut for whole chassis nodes)

  1. Start all servers in the chassis)
    $ ipmiaw mercury?cmm power on    // Boots all servers by sending 'power on' to the chassis CMM.
  2. Shut down and power off all servers in the chassis)
    $ ipmiaw mercury?cmm power soft  // 'power soft' waits for the OS to shut down: sending it to the CMM initiates an OS shutdown on every server in the chassis, waits for the shutdowns to finish, and then powers the servers off. Do not use 'power off' while the OS is running; it also initiates an OS shutdown but forces power-down after four seconds regardless of OS status (use it only to force a power-down).

  Build RAMCloud and run it)

  1. Allocate a dedicated workspace for the ATOM cluster on rcmaster
    1. It will be merged into the existing ramcloud work tree once the source tree is merged into the existing ramcloud source repository.
  2. Compile RAMCloud for the ATOM server
  3. Compile the DPDK module
    1. Link or insmod
  4. Run clusterperf.py (a sketch of the whole cycle follows this list)
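A minimal sketch of this cycle on rcmaster (the workspace and DPDK module paths are assumptions; 'make DEBUG=no' and the clusterperf.py options are the ones used in the performance sections below):

    $ cd ~/atom/ramcloud                  <- dedicated ATOM workspace (assumed path)
    $ make -j8 DEBUG=no                   <- optimized build, as used for the measurements
    $ insmod dpdk/build/kmod/igb_uio.ko   <- load the DPDK kernel module (assumed path)
    $ scripts/clusterperf.py --verbose --transport=tcp \
          --clients=1 --servers=1 --numBackups=0 --replicas=0 --disjunct basic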


5. Performance evaluation)

  • Results from May 27, 2014; summarized in the SEDCL retreat presentation of June 2014.
  • Tuning is still in progress: the 100B read has since improved to 11.5 us (as of Aug. 5, 2014).
  1. Peak Performance Calculation Sheet: ATOMServerPeakPerformances.xls
    1. clusterperf.py 100B read with kernel tcp – 67.8 us
  2. With the userland (kernel-bypass) driver – tentative, through a 1-hop LAN switch (FM5224 chip)
    1. ping: 7 us
    2. clusterperf.py (listed in the next section): average and best/worst over a 100 ms trial period (7,000 samples for the 100B read); the note after this list shows how the bandwidth figures are derived.
      1. with cut-through switch mode
      2. with store-and-forward switch mode - almost the same as cut-through mode
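The bandwidth lines in the listings below follow directly from the average latencies, with MB read as 2^20 bytes. For basic.read100, for example:

    100 B / 13.8 us = 7.25 x 10^6 B/s;  7.25 x 10^6 / 2^20 = 6.9 MB/s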

Clusterperf.py with cut through switch mode.

basic.read100   13.8 us, Best 13.3 us, Worst 32.2 us
    6.9 MB/s bandwidth reading 100B object with 30B key
basic.read1K    17.9 us, Best 17.3 us, Worst 29.0 us
   53.4 MB/s bandwidth reading 1KB object with 30B key
basic.read10K   48.6 us, Best 47.8 us, Worst 55.9 us
  196.4 MB/s bandwidth reading 10KB object with 30B key
basic.read100K  369.0 us, Best 367.3 us, Worst 376.1 us
  258.4 MB/s bandwidth reading 100KB object with 30B key
basic.read1M    3.8 ms, Best 3.8 ms, Worst 3.8 ms
  251.4 MB/s bandwidth reading 1MB object with 30B key

basic.write100  18.1 us, Best 17.4 us, Worst 35.2 us
    5.3 MB/s bandwidth writing 100B object with 30B key
basic.write1K   22.7 us, Best 21.8 us, Worst 120.8 us
   42.0 MB/s bandwidth writing 1KB object with 30B key
basic.write10K  60.1 us, Best 58.2 us, Worst 100.3 us
  158.6 MB/s bandwidth writing 10KB object with 30B key
basic.write100K 428.3 us, Best 418.9 us, Worst 470.9 us
  222.7 MB/s bandwidth writing 100KB object with 30B key
basic.write1M   4.6 ms, Best 4.5 ms, Worst 4.7 ms
  206.8 MB/s bandwidth writing 1MB object with 30B key

Clusterperf.py with store and forward switch mode.

basic.read100   13.8 us, Best 13.3 us, Worst 32.7 us
    6.9 MB/s bandwidth reading 100B object with 30B key
basic.read1K    20.7 us, Best 20.0 us, Worst 37.7 us
   46.1 MB/s bandwidth reading 1KB object with 30B key
basic.read10K   52.8 us, Best 52.1 us, Worst 68.6 us
  180.8 MB/s bandwidth reading 10KB object with 30B key
basic.read100K  373.2 us, Best 371.3 us, Worst 379.0 us
  255.5 MB/s bandwidth reading 100KB object with 30B key
basic.read1M    3.9 ms, Best 3.8 ms, Worst 3.9 ms
  247.2 MB/s bandwidth reading 1MB object with 30B key

basic.write100  18.2 us, Best 17.4 us, Worst 43.6 us
    5.2 MB/s bandwidth writing 100B object with 30B key
basic.write1K   25.6 us, Best 24.7 us, Worst 64.1 us
   37.2 MB/s bandwidth writing 1KB object with 30B key
basic.write10K  64.2 us, Best 62.5 us, Worst 95.5 us
  148.6 MB/s bandwidth writing 10KB object with 30B key
basic.write100K 431.4 us, Best 423.2 us, Worst 463.0 us
  221.0 MB/s bandwidth writing 100KB object with 30B key
basic.write1M   4.7 ms, Best 4.6 ms, Worst 4.8 ms
  204.6 MB/s bandwidth writing 1MB object with 30B key

 Comparison: clusterperf.py with 32Gbps Infiniband, Aug 12, 2013

basic.read100          5.1 us     read single 100B object with 30B key
basic.readBw100       18.7 MB/s   bandwidth reading 100B object with 30B key
basic.read1K           6.9 us     read single 1KB object with 30B key
basic.readBw1K       137.6 MB/s   bandwidth reading 1KB object with 30B key
basic.read10K         10.4 us     read single 10KB object with 30B key
basic.readBw10K      914.1 MB/s   bandwidth reading 10KB object with 30B key
basic.read100K        47.2 us     read single 100KB object with 30B key
basic.readBw100K       2.0 GB/s   bandwidth reading 100KB object with 30B key
basic.read1M         420.8 us     read single 1MB object with 30B key
basic.readBw1M         2.2 GB/s   bandwidth reading 1MB object with 30B key

Performance comparison with clusterperf.py)

RAMCloud was ported to the ATOM server in mid-April 2014.
Environment and differences)

Both clusters use the same Linux kernel series and the same gcc version.

        ramcloud cluster        ATOM cluster
  OS    RHEL 6.0 (2.6.32-71)    CentOS 6.5 (2.6.32-431)
  gcc   4.4.7                   4.4.7
Initial performance evaluation on tcp)

Updated May 22, 2014; the user-mode driver for the ATOM servers is still under development.
Compiled with 'make DEBUG=no'.

Reference: clusterperf on TCP (about --disjunct option and process/port assignment of clusterperf+tcp)

                    transport  replica  size (B)  read (us)  write (us)
  ramcloud cluster  tcp        no       100       25.1       25.1
                                        100K      103        145
                    tcp        1        100       24.0       102.1
                    InfRc      no       100       5.2        6.2
  ATOM cluster      tcp        no       100       70         -
                                        100K      -          -

  clusterperf.py options used:
    ramcloud cluster, tcp, no replica:
      --verbose --transport=tcp --clients=1 --servers=1 --numBackups=0 --replicas=0 --disjunct basic
    ramcloud cluster, tcp, 1 replica:
      --verbose --transport=tcp --clients=1 --servers=2 --numBackups=1 --replicas=1 --disjunct basic
    ramcloud cluster, InfRc, no replica:
      --verbose --clients=1 --servers=1 --numBackups=0 --replicas=0 --disjunct basic
    ATOM cluster, tcp, no replica:
      --verbose --transport=tcp --server=1 --client=1 --numBackups=0 --replicas=0 --disjunct basic

Ping Comparison) with kernel TCP driver

                    ping (us)    condition
  ramcloud cluster  220          ping to rc01 on rcmaster
  ATOM cluster      120 to 150
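RTTs like these can be reproduced with a plain ping from the measuring host (the exact flags here are an illustration, not the original command):

    $ ping -c 100 -q atom001    <- run on rcmaster; read the avg field of the 'rtt min/avg/max/mdev' line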

 

6. References)

  1. Existing switch, 1G x 48 ports: HP ProCurve 2510G-48: http://h17007.www1.hp.com/us/en/networking/products/switches/HP_2510_Switch_Series/index.aspx#.U-OXKYBdUqk
  2. New switch, 1G x 24 ports: HP 2920-24G: http://www8.hp.com/h20195/v2/GetDocument.aspx?docname=c04111401
  3. Information about the ATOM NIC (Intel Avoton C2xxx) and the TOR board (FM5224 chip) in the chassis: SF13_CLDS006_101.pdf
    (Downloaded from the Intel IDF2013 presentation: https://intel.activeevents.com/sf13/connect/fileDownload/session/A02B7458AF93EB0153BB728308E30F99/SF13_CLDS006_101.pdf)