As performance requirements for cloud services grew, ITGLOBAL.COM needed to migrate Serverspace customers from an outdated cluster to a new site. The company decided to design and launch a new high-performance cluster that would provide higher virtual machine density, lower latency, and room for further scaling.
This article describes ITGLOBAL.COM’s experience in selecting hardware, building a vStack HCP-based architecture, migrating virtual infrastructure, and testing the new cluster’s fault tolerance.
Project profile
Company name: Serverspace
Industry: Cloud infrastructure and virtualization
ITGLOBAL.COM’s role: Design, implementation, and operation of infrastructure
Project objectives
The previous cluster, built on second-generation Intel Xeon Scalable processors, had reliably served thousands of virtual machines for several years. However, analysis of operational metrics revealed limits to further growth:
– growing customer requirements for IOPS, latency, and network bandwidth;
– disk subsystem response times approaching their limits under peak load;
– limited ability to scale and increase virtual machine density;
– the need for better performance for databases and high-load web applications.
To address these challenges, it was necessary to create a new infrastructure platform with higher performance, fault tolerance, and efficient resource utilization.
Solution from ITGLOBAL.COM
Hardware platform selection
ITPOD-SL201-D24R-NV-G4 servers were chosen as the basis for the new cluster because their characteristics met the performance and scalability requirements.
Configuration of each node:
Processors: 2 × Intel Xeon 6526Y (Scalable Gen5)
Memory: 16 × 64 GB RDIMM 5600 MHz (1 TB RAM per node)
System disks: 2 × 960 GB U.2 NVMe
Data disks: 7 × 1920 GB U.2 NVMe
Network: 2 × dual-port 25Gb Ethernet SFP28 (OCP 3.0)
A key feature of the platform is that the NVMe disks connect directly to the motherboard without intermediate controllers, which minimizes latency and improves the performance of the software-defined storage layer.
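The per-node totals implied by this configuration can be checked with simple arithmetic. The sketch below uses only the figures from the configuration list. The variable names are illustrative, and the storage figure is raw capacity: usable capacity depends on the redundancy scheme, which this article does not specify.

```python
# Per-node totals derived from the configuration listed above.
# Raw figures only: usable storage depends on the redundancy scheme,
# which is not stated in this article.

DIMM_COUNT, DIMM_GB = 16, 64
DATA_DISKS, DATA_DISK_GB = 7, 1920
NICS, PORTS_PER_NIC, PORT_GBIT = 2, 2, 25

ram_gb = DIMM_COUNT * DIMM_GB
raw_data_gb = DATA_DISKS * DATA_DISK_GB
network_gbit = NICS * PORTS_PER_NIC * PORT_GBIT

print(f"RAM per node: {ram_gb} GB")
print(f"Raw data capacity per node: {raw_data_gb} GB")
print(f"Aggregate network bandwidth per node: {network_gbit} Gbit/s")
```

The aggregate network figure is the sum of all four ports. How much of it is available to any one traffic class depends on how the ports are bonded and allocated.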
vStack HCP-based architecture
The cluster is based on the vStack HCP platform, which combines computing resources, data storage, and network infrastructure into a single system. This approach provides a shared pool of resources instead of separate storage and network components.
– computing resources are provided by ITPOD servers;
– NVMe disks ensure stable operation of applications under high load;
– the built-in vStack HCP network architecture simplifies management of routing and traffic balancing;
– the system maintains performance even with a sharp increase in load.
Virtual infrastructure migration
After the cluster was deployed at IXcellerate, the Serverspace virtual infrastructure was migrated. Virtual machines were transferred:
– between different virtualization environments;
– between vStack clusters.
The Infrastructure as Code (IaC) approach was used for data-level migration. An identical information system was deployed in the target environment without user data, after which the data was copied and the applications were started on the new site.
We also withdrew nodes from the original infrastructure in phases after integrating the new servers, which allowed the migration to proceed without stopping services.
Testing and fault tolerance
The cluster architecture was designed from the outset to tolerate single-node failures. ITGLOBAL.COM conducted a series of load and failure tests.
Scenario 1: Single-node failure
Under active load (100 MB/s write, 300 MB/s read), one server was shut down:
– after 8 seconds, fencing was triggered and the node was removed from the cluster;
– the virtual machines automatically restarted on other nodes;
– degradation of SDS performance was recorded.
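The fencing step in this scenario can be illustrated with a minimal heartbeat-timeout sketch. This is not vStack HCP's actual mechanism, which the article does not describe. The class, its method names, the node names, and the use of the observed 8 seconds as the timeout threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

FENCING_TIMEOUT_S = 8.0  # assumption: matches the 8 s observed in the test


@dataclass
class ClusterMonitor:
    """Hypothetical heartbeat tracker; not a vStack HCP API."""
    timeout_s: float = FENCING_TIMEOUT_S
    last_seen: dict[str, float] = field(default_factory=dict)
    fenced: set[str] = field(default_factory=set)

    def heartbeat(self, node: str, now: float) -> None:
        # Record the latest heartbeat; fenced nodes are ignored.
        if node not in self.fenced:
            self.last_seen[node] = now

    def check(self, now: float) -> list[str]:
        # Fence every node whose last heartbeat is older than the timeout.
        expired = [n for n, t in self.last_seen.items()
                   if now - t >= self.timeout_s]
        for node in expired:
            del self.last_seen[node]
            self.fenced.add(node)
        return sorted(expired)


monitor = ClusterMonitor()
for name in ("node-1", "node-2", "node-3"):
    monitor.heartbeat(name, now=0.0)

# node-3 stops responding; the others keep sending heartbeats.
monitor.heartbeat("node-1", now=5.0)
monitor.heartbeat("node-2", now=5.0)

newly_fenced = monitor.check(now=8.0)
print(newly_fenced)
```

In a real cluster, the fencing decision would be followed by restarting the affected virtual machines on the surviving nodes, as described in the list above.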
Scenario 2: Sequential failure of two nodes
After the first node failed, the second server was disabled:
– the system correctly handled the double failure;
– the availability of services and data was preserved;
– the performance decrease was less than 15% during the rebalancing;
– after ~20 minutes, the performance recovered to 100%.
Network architecture
The cluster uses 25-gigabit network adapters because, in the vStack HCP architecture, all of the following traffic passes over Ethernet:
– client traffic;
– storage traffic;
– virtual machine migration traffic;
– platform management traffic.
Two network adapters connected to different CPU sockets provide fault tolerance and distribute load optimally with respect to the NUMA architecture.
Results
Launching the new cluster produced the following results:
– the performance of the disk subsystem tripled;
– the number of CPU cores per node doubled with the same power consumption;
– the cost of ownership per virtual machine decreased by 35%;
– fault tolerance was confirmed even with multiple failures;
– the cost of the solution increased by only 15% with a 3-fold increase in performance.
Overcommit support allowed us to increase the density of virtual machines and use resources more efficiently without expanding the physical infrastructure.
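The cost and performance figures in the results list can be combined into a single cost-per-performance ratio. The calculation below is a derived illustration and not a figure reported by ITGLOBAL.COM. It assumes that the 15% cost increase and the 3-fold performance increase refer to the same baseline, and the variable names are illustrative.

```python
# Relative figures from the results list; the old cluster is normalised to 1.0.
cost_ratio = 1.15         # solution cost: +15 %
performance_ratio = 3.0   # performance: 3x

cost_per_perf = cost_ratio / performance_ratio
reduction_pct = (1 - cost_per_perf) * 100

print(f"Cost per unit of performance: {cost_per_perf:.2f} of the old cluster")
print(f"Reduction: {reduction_pct:.0f} %")
```

This ratio is distinct from the 35% reduction in cost of ownership per virtual machine, which also depends on virtual machine density and operating costs.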
Development plans
The next stage of architecture development will add a classic ITPOD Storage system with HDDs to implement tiered storage:
– NVMe — for hot data;
– HDD — for cold data.
This will make it possible to support backup scenarios, S3-compatible storage, file shares, and Dev/Test environments while maintaining unified management through vStack HCP.