[June 16, 2023] The demands of modern hyperscale and high-performance computing (HPC) and AI/ML workloads are pushing system architectures to their limits. CPU and GPU designs continue to deliver higher performance, enabling faster computation, and system memory capacities keep growing to meet the demands of these workloads.
Lower latency requirements are driving the need for greater density in system architectures. As workloads grow more complex, efficient data movement and communication between components become critical. This has led to innovations in packaging, such as chiplet technology, in which multiple smaller dies are combined in one package to act as a larger device.
However, these advances in processing power, memory capacity, and packaging also highlight the need for better off-package interconnects. The links between components play a crucial role in overall system performance, and as workloads grow more demanding, those links become the limiting factor.
Meeting this challenge requires off-package interconnects that deliver the bandwidth, low latency, and scalability modern HPC and AI/ML workloads need, while balancing performance, power efficiency, and cost.
Current system designs are struggling to keep pace with modern AI workloads. The largest AI workloads are already pushing the physical limits of standard electrical interconnects. As AI models and data sets grow larger and more complex, the volume of data that must move between components grows with them, straining the bandwidth and latency of electrical links and creating performance bottlenecks.
Beyond interconnect limits, memory capacity constraints and stranded resources within today's systems further amplify performance and efficiency losses at scale. Limited system memory can bottleneck demanding workloads, while stranded resources, where components or capacity sit underutilized because they are locked to a single host, waste capability the system already has.
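To make the stranded-resource problem concrete, here is a minimal back-of-envelope sketch in Python. Every capacity and demand figure in it is hypothetical; the point is only that memory fixed to one host cannot help a neighbor that needs it, while pooled capacity can:

```python
# Back-of-envelope stranded-memory math: DRAM fixed to each server versus
# the same total capacity drawn from a shared, disaggregated pool.
# All capacities and demands below are hypothetical, for illustration only.

per_server_gb = 512                    # DRAM attached to each server
demands_gb = [120, 480, 90, 510, 200]  # one workload per server (made up)

total_gb = per_server_gb * len(demands_gb)
stranded_gb = sum(max(per_server_gb - d, 0) for d in demands_gb)
print(f"fixed per-server: {stranded_gb} of {total_gb} GB sit idle "
      f"({100 * stranded_gb / total_gb:.0f}% stranded)")

# With a disaggregated pool, workloads draw only what they need and the
# remainder stays available to any host.
used_gb = sum(demands_gb)
print(f"disaggregated pool: {used_gb} GB in use, "
      f"{total_gb - used_gb} GB free for any workload")
```

On these made-up numbers, about 45% of installed DRAM sits idle in the fixed design, while a disaggregated pool leaves that same 1,160 GB available to whichever workload needs it next.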
These challenges highlight the need for rapid and significant technology innovations in scale-out systems. Architectural advancements are required to overcome the limitations of current system designs. This includes exploring alternative interconnect solutions that can provide higher bandwidth and lower latency, such as optical I/O. Optical interconnects have the potential to address the physical limitations of electrical interconnects and improve the overall performance and scalability of AI workloads.
Moreover, developments in memory technologies, such as high-bandwidth memory (HBM) and in-memory computing, can help alleviate memory capacity limitations and improve the efficiency of data processing.
The industry recognizes several key areas with the potential to address these system architectural challenges and drive significant impact in the next two to four years: optical I/O connectivity, in-memory computing, and physical interface standards for chiplets. There is also consensus among industry experts on the need to disaggregate system resources to enable workload-driven composable infrastructure.
The industry's alignment on these areas reflects a shared recognition of the challenge: system architectures need significant advances to meet the demands of modern workloads, particularly in AI and ML. By focusing on optical I/O, in-memory computing, physical interface standards for chiplets, and resource disaggregation, organizations can build more powerful, efficient, and flexible systems capable of handling the growing complexity of emerging workloads.
The progress in AI over the past few years has been remarkable, with significant increases in the size and complexity of AI models. This growth has led to a surge in the number of parameters used in AI models, a rough measure of their complexity and capacity to learn. For example, the Transformer model with 465 million parameters in 2019 and the GShard MoE model with over a trillion parameters in mid-2020 highlight the rapid growth in model size.
These models are becoming increasingly capable of processing and understanding complex data, especially in tasks like natural language processing (NLP), computer vision, and recommender systems. As AI models continue to grow, they have the potential to achieve human-level performance in certain areas.
However, to fully harness the potential of these large-scale AI models, it is crucial to have computing infrastructure that can support their computational requirements. The demands these models place on computing systems are significant, and existing system designs may not keep up. As the size and complexity of AI models increase, high-performance computing systems with optimized architectures, efficient interconnects, and sufficient memory capacity become paramount.
To meet the computational demands of future AI models with billions or even trillions of parameters, novel solutions in system architectures, interconnect technologies, memory capacity, and processing capabilities are necessary. This is where the industry's focus on technologies like optical I/O connectivity, in-memory computing, and chiplet-based solutions can play a crucial role in enabling the computing infrastructure required to support these massive AI models.
One approach to this challenge is to scale out: add more computing nodes or processing units to raise computational throughput. However, the effectiveness of adding nodes is bounded by the speed and capacity of the interconnect fabric that connects them. If fabric bandwidth stays low, the returns from each additional node diminish rapidly.
The problem becomes even more pronounced as AI researchers run more experiments and require more all-to-all connectivity. As information exchange between nodes increases, the interconnect fabric can become a significant bottleneck: the proportion of time spent on communication relative to computation rises, hampering overall system efficiency and scalability. The toy model below makes this trade-off concrete.
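This is a minimal sketch in Python, assuming a fixed total amount of compute split evenly across nodes plus an all-to-all exchange over a fixed-rate link; all figures are made up for illustration:

```python
# Toy strong-scaling model: a fixed amount of compute is split across N
# nodes, while each node also performs an all-to-all exchange over a
# fixed-bandwidth link. Every figure here is illustrative, not measured.

def step_time(nodes, total_flops=1e15, flops_per_node=1e13,
              bytes_per_peer=1e9, link_gbps=400):
    compute_s = total_flops / (nodes * flops_per_node)
    # All-to-all: each node sends a shard to every other node over one link.
    comm_s = bytes_per_peer * (nodes - 1) * 8 / (link_gbps * 1e9)
    return compute_s, comm_s

for n in (8, 64, 512):
    c, m = step_time(n)
    print(f"{n:4d} nodes: compute {c:7.3f} s, comm {m:7.3f} s, "
          f"comm share {100 * m / (c + m):5.1f}%")
```

On these numbers, communication is about 1% of a step at 8 nodes but roughly 98% at 512 nodes, which is exactly the diminishing return described above; raising link bandwidth pushes that crossover point further out.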
Scaling out nodes with copper-based components further exacerbates these challenges. Copper interconnects, while widely used, are limited in bandwidth, cost, power consumption, density, weight, and configuration flexibility, and those limits can restrict system scalability, particularly for large-scale AI workloads.
Overcoming these interconnect bottlenecks and enabling further scaling of AI systems requires new interconnect technologies. This includes exploring higher-bandwidth options, such as optical interconnects, that can provide faster and more efficient communication between nodes. Optical interconnects promise higher bandwidth, lower latency, and better energy efficiency than traditional copper links. By leveraging optical I/O, it may be possible to relieve the limits imposed by current interconnect fabrics and scale AI systems to meet the demands of rapidly growing AI models.
Vertically integrated vendors and hyperscalers are looking towards photonics, specifically optical I/O, as a promising solution to overcome interconnect bottlenecks and unlock the full potential of scale-out systems. Optical I/O offers several advantages that make it an attractive option for addressing the limitations of traditional interconnect technologies.
One major advantage of optical I/O is its potential for higher bandwidth and lower latency than electrical interconnects. Optical links can support much higher data rates, speeding communication between computing nodes and cutting the time spent on data transfer, which is crucial for AI workloads that depend on rapid data exchange between nodes. The rough calculation below illustrates what the rate difference means in practice.
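This short Python calculation compares the time to move one 10 GB model shard across links of different rates. Both rates are hypothetical placeholders chosen for illustration, not product figures:

```python
# Simple transfer-time arithmetic for moving a 10 GB model shard between
# nodes. The link rates are hypothetical placeholders, chosen only to
# contrast an electrical link with a higher-rate optical I/O port.

payload_bits = 10e9 * 8  # 10 GB payload

for name, gbps in (("electrical link", 100), ("optical I/O port", 1000)):
    seconds = payload_bits / (gbps * 1e9)
    print(f"{name:16s} @ {gbps:5.0f} Gb/s -> {seconds * 1e3:6.1f} ms")
```

At these assumed rates the same transfer takes 800 ms electrically and 80 ms optically; repeated across every exchange in a training step, that gap compounds quickly.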
Furthermore, optical I/O can provide increased platform flexibility for new architectures. In-package optical interconnect solutions, where optical components are integrated directly into the package of a computing node, offer benefits such as reduced power consumption, improved signal integrity, and higher interconnect density. This integration of optical interconnects within the package can enable more efficient communication between different components, such as accelerators, processors, and memory, leading to enhanced system performance and scalability.
In addition, the concept of disaggregated architectures, which involve decoupling memory from accelerators and processors, can also leverage optical interconnects to maximize the potential of scale-out systems in handling AI workloads. By separating memory and processing units and using optical interconnects to connect them, it becomes possible to overcome the limitations imposed by traditional electrical interconnects. This approach enables more flexible resource allocation and efficient scaling of computing resources, improving overall system performance.
Amphenol Network Solutions is one of the companies investing in the development of turn-key solutions for hyperscale data center applications. These solutions leverage optical I/O technologies to provide high-performance interconnects for large-scale data centers and address the interconnect bottlenecks of AI workloads.
In summary, the transition to photonics and optical I/O holds significant promise in overcoming interconnect bottlenecks and maximizing the potential of scale-out systems in handling AI workloads. By offering higher bandwidth, lower latency, and increased platform flexibility, optical interconnects can enable more efficient communication between computing nodes and support the continued growth and performance requirements of AI models in the future.