VMware vSAN ESA/MAX Architecture (Deep Dive)

Hello. With VMware vSphere 8, many important changes have been made to the vSAN architecture. In addition to the ESA architecture that arrived with vSphere 8, v8 U2 introduced the vSAN Max architecture, which brings significant enhancements to SDS/HCI platforms. Another new feature is VMware core-based subscription licensing, which lets you build a more modular platform with even more flexibility and functionality. Personally, I like the core-based subscription licensing.

Having started using it a few months ago, I found it more cost-effective than the previous socket-based per-CPU licensing. VMware has bundled many of its products here, allowing you to run more integrated and functional environments. With add-on licensing, you can grow very flexibly, adding features as you need them.

VMware VCF and VVF product bundles

vSAN is very flexible compared to other HCI solutions and has very low overhead because it runs as a native service. The beauty of vSAN is that it is very easy to install, maintain, upgrade, manage, and monitor. Even one day is enough to learn it, which makes life very easy for operations teams compared to open-source SDS solutions. Architecturally, there is no storage controller VM or container-based service to run and operate, which makes it very attractive; otherwise you get both troubleshooting difficulties and compute overhead.

What is vSAN ESA Architecture?

What makes vSAN ESA different from the vSAN OSA architecture?

It comes with a completely new log-structured filesystem (vSAN LFS).
vSAN ESA provides very flexible metadata management, using a B-tree structure for more efficient metadata handling.

As shown in the figure below, metadata pages are kept in memory for a performance-oriented, scalable structure: actively accessed blocks that would otherwise have to be retrieved from disk can be served from memory, providing higher performance.

vSAN ESA Object

vSAN ESA uses an adaptive write path, selecting the most appropriate path based on the incoming I/O request. The default path handles small I/O, while a second path is prioritized for larger incoming write requests. This allows write performance to scale with the type and size of the workload, rather than using the same write operation for every workload.

This ensures high IOPS, high bandwidth, and low latency, especially for write-intensive workloads.
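To make the idea concrete, here is a minimal Python sketch of size-based path selection; the 64 KiB threshold and the path names are my own assumptions for illustration, not ESA internals:

```python
# Minimal sketch of the adaptive write path idea (my illustration, not VMware code).

LARGE_WRITE_THRESHOLD = 64 * 1024  # hypothetical cutoff between "small" and "large" writes

def route_write(io_size_bytes: int) -> str:
    """Pick a write path based on the size of the incoming I/O request."""
    if io_size_bytes < LARGE_WRITE_THRESHOLD:
        # Small writes take the default, latency-optimized path and are later
        # coalesced by the log-structured filesystem into full-stripe writes.
        return "default path (small I/O)"
    # Larger writes are handled by the second path as full-stripe writes,
    # so throughput scales with the workload instead of one-size-fits-all handling.
    return "large-write path (full stripe)"

for size in (4 * 1024, 32 * 1024, 256 * 1024):
    print(f"{size // 1024} KiB -> {route_write(size)}")
```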

With ESA and the LFS filesystem, a RAID 6 write request performs almost the same as a RAID 1 write request. This lets you write at close to RAID 1 performance while consuming roughly 50% less capacity.

LFS

For read requests, the request is first processed by the Distributed Object Manager (DOM) client. The DOM client checks its cache of recently used blocks; if the block is in the cache, the read is served immediately. If not, it proceeds to the second step: the LFS checks whether the requested data is still held in an in-memory buffer; if not, it goes to the third step.

A lookup is performed on the metadata, held in a B-tree, to locate the data. The result is sent back to the DOM client, the checksum is verified, and if the data is compressed it is decompressed before the read request is completed.
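The three steps can be summarized in a hedged Python sketch; the data structures and names here are stand-ins of mine, not vSAN internals:

```python
import zlib

def read_block(block_id, dom_cache, lfs_buffer, btree, disk):
    # Step 1: the DOM client checks its cache of recently used blocks.
    if block_id in dom_cache:
        return dom_cache[block_id]
    # Step 2: the LFS checks whether the block still sits in an in-memory buffer.
    if block_id in lfs_buffer:
        return lfs_buffer[block_id]
    # Step 3: locate the block via the B-tree metadata, read it from disk,
    # verify the checksum, decompress if needed, and return it to the DOM client.
    offset, checksum, compressed = btree[block_id]
    raw = disk[offset]
    if zlib.crc32(raw) != checksum:
        raise IOError(f"checksum mismatch for block {block_id}")
    return zlib.decompress(raw) if compressed else raw

# Tiny usage example with stand-in structures:
payload = zlib.compress(b"hello vSAN")
print(read_block("blk-1", dom_cache={}, lfs_buffer={},
                 btree={"blk-1": (0, zlib.crc32(payload), True)}, disk={0: payload}))
```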

DOM

With ESA and the Native Key Provider, you can set up cluster encryption without using an external KMS. The Data Encryption Key (DEK) is shared by all nodes in the cluster, so data can be moved or read across nodes without having to be decrypted and re-encrypted. Compared to the OSA architecture, the encryption overhead per I/O operation is greatly reduced. In addition, compression is not affected by encryption.

RAID

ESA also provides native snapshot support on vSAN clusters. ESA snapshots use the B-tree structure, just like write and read requests, instead of the traditional snapshot chain. The LFS provides metadata about which data belongs to which snapshot, and snapshot deletion is nearly 100 times faster. When a snapshot is deleted on ESA, the operation is largely a metadata deletion: the delete is logical, and the metadata and data are removed at the appropriate time. With version 8, a maximum of 32 snapshots can be taken per object.

vSphere datastores

The ESA architecture claims each disk independently. In the OSA architecture, if the cache disk of a disk group fails, all the capacity disks behind it are taken out of service. In the example below, there are 2 disk groups and a total of 12 x 8TB capacity disks. Assuming 50% utilization, when the cache disk of the second disk group fails, the rebuild process starts for 24TB of data and half of the host's capacity is affected. In the same scenario on ESA, since there is no cache tier, a single disk failure only affects as much data as that disk's utilization.
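A quick back-of-the-envelope check of this example, using only the figures from the paragraph above:

```python
disk_capacity_tb = 8
disks_per_group = 6       # 12 capacity disks split across 2 disk groups
utilization = 0.5

# OSA: losing one cache disk takes its whole disk group offline.
osa_rebuild_tb = disks_per_group * disk_capacity_tb * utilization
print(f"OSA cache-disk failure: rebuild {osa_rebuild_tb:.0f} TB "
      f"(half of the host's capacity offline)")                     # -> 24 TB

# ESA: disks are claimed independently, so only the failed disk's used data is affected.
esa_rebuild_tb = disk_capacity_tb * utilization
print(f"ESA single-disk failure: rebuild {esa_rebuild_tb:.0f} TB")  # -> 4 TB
```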

vCenter datastores

I mentioned that with the LFS filesystem, ESA's RAID 6 writes perform almost the same as RAID 1 writes. With the ESA architecture, the number of nodes required for erasure coding has also been updated, and a dedicated witness node is no longer required for the RAID 1 policy (in the OSA architecture we needed a witness for RAID 1). This means both RAID 1 and RAID 5 storage policies can be created in ESA with a minimum of 3 nodes. With 3 nodes, RAID 5 uses a 2+1 layout and saves about 50% of RAID 1's capacity overhead; with 5 nodes, RAID 5 switches to a 4+1 layout and saves about 75% of that overhead.
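Those percentages are just arithmetic on the layouts (RAID 1 keeps one extra full copy, RAID 5 keeps one parity unit per stripe); a small sketch:

```python
def overhead(data_units, extra_units):
    """Capacity overhead = redundant units divided by usable data units."""
    return extra_units / data_units

raid1 = overhead(1, 1)      # one full mirror copy   -> 100% overhead
raid5_3n = overhead(2, 1)   # 2+1 layout on 3 nodes  ->  50% overhead
raid5_5n = overhead(4, 1)   # 4+1 layout on 5 nodes  ->  25% overhead

for name, oh in [("RAID 1", raid1), ("RAID 5 (2+1)", raid5_3n), ("RAID 5 (4+1)", raid5_5n)]:
    print(f"{name}: {oh:.0%} overhead, saves {1 - oh / raid1:.0%} of RAID 1's overhead")
```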

With vSAN ESA, compression is enabled by default. If you want, however, you can create a different storage policy that turns compression off using vCenter Storage Policy Based Management (SPBM). Why might this be necessary? For workloads and file types that already compress their own data, such as PostgreSQL or video, it is better to rely on that native compression: there is no need to recompress already-compressed data, and you can use the CPU's compute power more effectively.

vSAN ESA delivers 4x the compression performance of the OSA architecture. While OSA has a theoretical compression ratio of 1:2, ESA can go up to 1:8 depending on the workload. In this article, I have used the most conservative 1:2 ratio in my calculations.

NOTE: The ESA architecture does not use cluster-level deduplication. Instead, you can use granular compression (cluster-level compression is no longer available), allowing you to apply a very flexible storage policy.

What are the VMware minimum disk and node requirements for a vSAN ESA deployment?

For vSAN ESA, all disks on the node must be NVMe. Unlike OSA, vSAN ESA does not use cache-tier and capacity-tier disk groups; with the cache tier removed, there is no need for cache disks. All disks on the node are therefore NVMe of the same type and capacity. In a way, when we use mixed-use disks with ESA, it is as if all of our disks are running in the cache tier, which is one of the main reasons the architecture delivers high compression and I/O.

Read-intensive disks are now supported in the vSAN ESA architecture. However, not just any read-intensive disk will do: there are requirements such as a high performance class, and the DWPD (endurance value) must be at least 1. If you ask whether there is a cost advantage over mixed-use NVMe, you will see that list prices are not very different; for this reason, I used mixed-use disks in the BOM list I created. The disk type really depends on your read/write profile on the cluster: if the workload is read-heavy, read-intensive NVMe can make a difference in performance.

  • Type: NVMe TLC
  • Performance Class: Class F (100,000–349,999) or higher
  • Endurance Class: 1 DWPD or higher
  • Capacity: 1.6TB or higher
  • vSAN ESA ReadyNode minimum # devices/host: 2
  • vSAN Max ReadyNode minimum # devices/host: 8

Can we use different capacities of NVMe disks on the same host in the ESA architecture? The answer is YES.

But my preference would still be to use symmetric disks. The nice thing about using different-capacity disks in an asymmetric vSAN cluster is that if disks of other capacities become available from another ESA cluster in the future, you can reuse them on different cluster nodes. (The ReadyNode vendor and server models must be compatible!)

With vSAN ESA, cluster-level object capacity has increased from 9,000 to 27,000. With vSAN 8 U2, you can go up to 500 VMs per node on ESA clusters; on the OSA side, the limit is still 200 VMs.

Capacity

With the new licensing packages, using at least 16 cores per CPU is the most logical choice. With core-based subscription, there is no longer a core limit per socket (we are no longer stuck with the 32-cores-per-socket limit). Now that core counts per socket have grown so much, it is more advantageous to use high-core-count sockets in clusters with high compute requirements.

To give an example, the cost of 2 x 64-core CPUs is almost the same as the cost of 4 x 32-core CPUs, just as the cost of two 32GB RDIMMs is close to that of one 64GB RDIMM.

In this way, you significantly reduce your white space, server, and infrastructure requirements. In server cost alone the gain may be around 20%, but if you operate as a large cloud provider, the savings in white space and infrastructure can be substantial (around 50%).

In terms of energy and BTUs, however, there is not much difference, because high-core-count CPUs have very high TDP values.

CPU models such as the new 5th Generation Intel Xeon Platinum 8592+ (350W) or 4th Generation AMD EPYC 9654 (360W) have very high power requirements. If the nodes are also AI-ready, there is an additional requirement of approximately 350W per GPU. With 2 sockets and 2 GPUs per node, that is roughly 1.4kW; adding disks and fiber NICs brings the PSU requirement to at least 1.8–2.4kW per node. Once you factor in the BTUs needed to cool these servers, the white space and infrastructure advantage is almost cancelled out by the energy and cooling costs.
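Rough math behind those numbers (all figures are the estimates from the paragraph above, not measurements):

```python
cpu_tdp_w = 350            # e.g. Xeon Platinum 8592+ (~350W); EPYC 9654 is ~360W
gpu_w = 350                # approximate draw per GPU on an AI-ready node
sockets, gpus = 2, 2

base_w = sockets * cpu_tdp_w + gpus * gpu_w
print(f"CPUs + GPUs: ~{base_w / 1000:.1f} kW")   # ~1.4 kW

# Adding NVMe disks, fiber NICs, fans and PSU headroom pushes the
# per-node PSU sizing to roughly 1.8-2.4 kW, as mentioned above.
print(f"Estimated PSU sizing: ~{(base_w + 400) / 1000:.1f}-{(base_w + 1000) / 1000:.1f} kW")
```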

In fact, in some server models today, liquid cooling has almost become a necessity instead of air cooling.

The IT world is always equalizing the units somehow, isn’t it? 😊

What are the vSAN ESA network requirements?

In an all-flash OSA architecture, at least a 10G network was sufficient, but I usually prefer 25G for the vSAN VMkernel. If you are managing a small OSA cluster of 3/4/6 nodes, 10G fiber might be enough for the vSAN network; above 6 nodes, or if you need high capacity, 25G makes more sense.

The cost of 10G/25G switches and SFP+/SFP28 modules is very close. For this reason, if you are investing in a new cluster, it may be worth choosing a dual-rate 10/25G switch.

In addition, even in the OSA architecture, some workloads can have very high read/write requirements. If you have such applications and want to run them on vSAN, a 25G network will give you headroom for the future.

Coming to the ESA side, VMware's guidance is very clear: if the number and capacity of NVMe disks in your ESA cluster is high, even a 25G network becomes risky. For a small ESA cluster, you need at least a 25G network; for a larger cluster, either use 25G+25G LACP/LAG or a 100G active/passive network.

It is very important that the vSAN VMkernel network is 100G, especially if you are using large numbers of 15TB NVMe disks in large environments!
If you are using vSAN Max, you will need at least a 50G network anyway, so it may be more appropriate to choose a 100G switch here; that way you don't need LACP/LAG.

What is vSAN MAX Architecture?

We actually saw the basic version of the vSAN Max architecture as HCI Mesh in vSAN 7 U1. It evolved over subsequent releases and finally matured and became available with vSAN 8 U2. With HCI Mesh, we could already share capacity between different vSAN clusters; with vSAN Max, we can now completely decouple storage and provide it to standard vSphere clusters. In a sense, we can fully decouple VMware environments from the SAN storage and SAN switch architecture, or provide both SAN and HCI resources to non-HCI vSphere clusters using hybrid datastores. (A SAN datastore may still require a SAN HBA depending on the situation.)

vSAN Capacity

vSAN MAX runs only on ESA architecture and scales as the primary storage resource for your vSphere clusters.

Each vSAN Max cluster can scale up to 8.6 petabytes.

What are the major use cases for vSAN Max?

Infrastructure cost optimization: vSAN Max enables right-sizing of resources to reduce licensing costs.

Unified storage: vSAN Max lets you make use of server resources that are not ideal for HCI (such as Gen1 blades or legacy hardware). If you want the benefits of vSAN while keeping resources independent of each other, this is the solution for you; you can easily deploy a shared storage platform in the datacenter.

Fast scaling for Cloud Native applications: vSAN Max can be an ideal storage resource for Cloud Native applications.

The minimum requirements for a vSAN Max cluster:

Other vSphere or HCI cluster types that are supported/not supported for vSAN Max:

Should vSAN Max Cluster and client vSphere Cluster use the same CPU manufacturer?

NO, processors from different manufacturers can be used on the vSAN Max cluster and the client vSphere cluster. For example, a client vSphere cluster using Intel can be connected to a vSAN Max cluster using AMD (or vice versa). The important thing is that the vSAN Max cluster uses hardware compatible with the ESA architecture (disks, NVMe controllers, NICs).

A maximum of 128 hosts can be connected to the vSAN Max cluster.

What are the vSAN Max network requirements?

vSAN Max requires high network capacity due to its architecture.

  • Datacenter-class redundant switches
  • 100Gb uplink for the vSAN Max cluster VMkernel (at least 10Gb is sufficient for client vSphere clusters, but at least 25Gb is recommended depending on workload)
  • It is recommended to use NIOC (vSphere Network I/O Control) on the vSphere Distributed Switch.
  • LACP can be used on 10G or 25G network infrastructures, but an active/passive connection is the most suitable alternative due to LACP's operational complexity.

Storage Policy Based Management (SPBM) is included for all vSphere clusters on vSAN Max.

Cross-cluster vMotion is supported for client vSphere clusters. VMs can be moved between client vSphere clusters that are connected to a vSAN Max cluster.

Storage Policy Based Management (SPBM)

How is vSAN ESA licensing calculated?

Licensing is the same as for the vSAN ESA architecture: you calculate the RAW capacity of the vSAN Max cluster in TiB and purchase it as an add-on for either VCF or VVF.

As a sample calculation

For vSAN Advanced/Enterprise add-on TiBs, the formula is: subscription capacity = raw TiB per host x number of ESXi hosts in each vSAN cluster.

In the example vSAN Max scenario, assume 16 x 6.4TB NVMe disks per node and 10 nodes per cluster:

16 x 6.4TB = 102.4TB per node; 102.4TB x 10 = 1024TB RAW, which corresponds to roughly 932 TiB of subscription capacity.
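The TB-to-TiB conversion behind that figure (licensing is sold in binary TiB, while disk sizes are quoted in decimal TB):

```python
disks_per_node, disk_tb, nodes = 16, 6.4, 10

raw_tb = disks_per_node * disk_tb * nodes   # 1024 TB raw (decimal terabytes)
raw_tib = raw_tb * 1e12 / 2**40             # convert to binary tebibytes
print(f"{raw_tb:.0f} TB raw -> {raw_tib:.1f} TiB, purchased as 932 TiB of subscription capacity")
```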

The vSphere version of the nodes in the vSphere cluster that will access vSAN Max must be at least 8. No additional licensing is required for the vSphere cluster servers that connect to the vSAN Max datastore (base vSphere licensing is still a must!).

Is vSAN ESA or vSAN Max suitable for every workload?

In my experience, one of the biggest mistakes made in HCI architectures is the positioning of individual workloads. If you are going to run workloads like Microsoft SQL AAG, Hadoop, or file servers on HCI, you should analyze and plan very carefully; otherwise it's more trouble than it's worth 🙂

Running workloads like Microsoft SQL AAG and Hadoop on HCI is not very logical, because those systems already have redundancy built in. If you insist on running them on HCI, I recommend using a RAID 0 storage policy for such workloads; that way you use the HCI capacity more effectively. It is not very sensible to apply a RAID 1/5/6 storage policy on the vSAN side to a system that already has node-to-node redundancy.

Especially if you are using vSAN ESA disks as archive disks for a service like SIEM, you are making a big mistake. Again, using a high-capacity file server on architectures like HCI or ESA is completely unnecessary.

This recommendation applies not only to vSAN or other HCI vendors, but also to legacy All-NVMe SAN storage.

Unstructured data workloads are a complete waste on HCI and NVMe SAN in my opinion; let them run the way they know how. That way you are not putting capital on the line.

Another recommendation: if you are going to run a large cluster, size your network and switch investment accordingly. Especially in the telco sector, if you are going to manage HCI as infrastructure, you should definitely go one tier above the minimum requirement.

What workloads is vSAN ESA suitable for?

First, I think VDI is perfect for HCI.

Second, standard VM workloads (e.g. application server, web server, standalone DB server)

Third, it is now a very suitable solution for Telco 5G RAN (there are very nice, rugged servers for 5G RAN, I will definitely do a review about them later)

Fourth, if you want to connect small ROBO (remote office/branch office) sites to the central office, it is a suitable solution for both the central site and the branches.

Fifth, cloud-native infrastructures. When we talk about VMware, Tanzu and vSAN are a very good duo and integrate very well with each other.

Sixth, if you are a service provider, it is ideal for private cloud and public cloud service, you just need to analyze and plan properly.
