VSORA

Jotunn & Tyr

Generative AI

Highest Performance Inference
Lowest Deployment cost

What Is Generative AI

Generative AI refers to a subset of artificial intelligence (AI) techniques that create new data, content, or outputs resembling human-generated content. Unlike traditional AI systems, which are primarily used for classification, prediction, or optimization tasks, generative AI focuses on creating new content such as images, text, music, or even video.

Generative AI techniques typically involve deep learning models, particularly variants of neural networks like Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and autoregressive models. These models are trained on large datasets to learn the underlying patterns and structures of the data, enabling them to generate new, realistic content that is similar to the training data.
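As a concrete illustration of the autoregressive family mentioned above, the short sketch below generates text token by token with the Hugging Face transformers library. The small "gpt2" checkpoint is used purely as a stand-in for the far larger models discussed later; it is an assumption for illustration, not part of the original material.

# Minimal sketch: autoregressive text generation with a small pretrained model.
# Assumes the Hugging Face "transformers" and "torch" packages are installed;
# "gpt2" is only a small stand-in for the much larger LLMs discussed here.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Generative AI refers to"
inputs = tokenizer(prompt, return_tensors="pt")

# Tokens are generated one at a time; each step consumes all previously
# generated tokens, which is why inference latency and memory traffic grow
# with sequence length.
output_ids = model.generate(**inputs, max_new_tokens=40, do_sample=True)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))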

Latency is crucial for generative AI inference for several reasons:

  1. Real-Time Applications: In many real-time applications such as video games, virtual reality, or live video streaming, low latency is essential to maintain a smooth and immersive user experience. Generative AI models used in these applications need to generate content quickly to respond to user inputs or changes in the environment.
  2. Interactive Systems: Generative AI is increasingly being used in interactive systems where users expect immediate feedback or responses. For example, in chatbots or virtual assistants, low latency ensures that responses are generated quickly, maintaining the conversational flow and user engagement.
  3. Dynamic Environments: In dynamic environments where conditions change rapidly, such as autonomous vehicles or robotics, generative AI models must be able to adapt and generate appropriate responses in real time to ensure safe and effective operation.
  4. Scalability: Low latency becomes even more critical in systems with high scalability requirements, such as cloud-based services or distributed applications. Minimizing latency enables these systems to handle large numbers of concurrent users or requests efficiently.
  5. User Experience: Latency directly impacts the user experience, particularly in applications where users interact with generated content in real time. High latency can lead to delays, interruptions, or a sense of disconnection, ultimately diminishing the quality of the user experience.
  6. Feedback Loops: In some generative AI systems, such as those using reinforcement learning, low latency is necessary to maintain fast feedback loops between actions taken by the model and the resulting outcomes. This enables the model to learn and improve more rapidly.

Reducing latency in generative AI inference often involves optimizing the model architecture, leveraging hardware acceleration (such as specialized AI chips), implementing efficient algorithms, and optimizing the deployment infrastructure. Balancing the trade-offs between latency, model complexity, and computational resources is essential to achieve optimal performance in generative AI systems.
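Latency is easy to quantify at the level of a single request. The sketch below times one generation call and derives a per-token latency figure; the model choice and token counts are illustrative placeholders, not VSORA measurements.

# Illustrative sketch only: measuring end-to-end generation latency for a
# single request. The model and token counts are placeholder assumptions.
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("Hello, how can I help you today?", return_tensors="pt")

start = time.perf_counter()
output_ids = model.generate(**inputs, max_new_tokens=32)
elapsed = time.perf_counter() - start

new_tokens = output_ids.shape[1] - inputs["input_ids"].shape[1]
print(f"total latency: {elapsed * 1000:.1f} ms")
print(f"per-token latency: {elapsed / new_tokens * 1000:.1f} ms/token")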

 

The "Memory Wall"

Why Generative AI software is ready but the hardware is not

The “Memory Wall” was first described by Wulf and McKee in 1994. It posited that the development of the processing unit (CPU) far outpaced that of memory. As a result, the rate at which data can be transferred to and from memory forces the processor to wait until the data it needs is available. In traditional architectures the problem has been mitigated by a hierarchical memory structure built around multiple levels of cache that stage the data, minimizing the traffic to main memory or to external memory.

Recently introduced generative AI models (for example, ChatGPT, DALL-E, and diffusion models) have dramatically expanded the number of parameters needed to perform the task at hand. GPT-3.5 requires 175 billion parameters, and GPT-4, launched in 2023, reportedly requires almost 2 trillion. All of these parameters need to be accessed during inference or training, which poses a problem: traditional systems are not designed to handle such vast amounts of data without resorting to the hierarchical memory model. Unfortunately, the more levels that must be traversed to read or store data, the longer it takes. The processing elements are forced to wait longer and longer for data, lengthening latency and dropping implementation efficiency. Recent findings show that the efficiency of running GPT-4, the most recent GPT model, drops to around 3%. That is, the very expensive hardware designed to run these algorithms spends 97% of its time preparing data to be processed!

The flip side is that the amount of hardware required to reach reasonable compute throughput is staggering. In July 2023, EE Times reported that Inflection plans to use 22,000 Nvidia H100 GPUs in its supercomputer, an investment of roughly $800M. Assuming an average power consumption of 500 watts per H100, the total power draw would be an astounding 11 MW!

Based on a fundamentally new architecture, Jotunn allows data to be fed to the processing units 100% of the time, regardless of the number of compute elements. Algorithm efficiencies, even for large models like GPT-4, will exceed 50%. Jotunn8 significantly outperforms anything currently on the market!
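To make the scale of the problem concrete, the back-of-envelope arithmetic behind the figures above can be written out in a few lines of Python. The fp16 weight size is an illustrative assumption (the text does not state a numeric precision); the GPU count, per-GPU power, and efficiency figures are taken directly from the paragraph.

# Back-of-envelope arithmetic for the figures quoted above; values are either
# taken from the text or labeled assumptions, not measured results.

params_gpt35 = 175e9          # GPT-3.5 parameter count (from the text)
bytes_per_param = 2           # assumption: fp16 weights
weight_bytes = params_gpt35 * bytes_per_param
print(f"GPT-3.5 fp16 weights: {weight_bytes / 1e9:.0f} GB")        # ~350 GB

# Every parameter must be read for each inference pass, so the required
# memory traffic quickly dwarfs on-chip caches -- the memory wall.

gpus = 22_000                 # Inflection supercomputer (EE Times, July 2023)
watts_per_gpu = 500           # assumed average draw per H100 (from the text)
print(f"Cluster power draw: {gpus * watts_per_gpu / 1e6:.0f} MW")   # ~11 MW

# Implementation efficiency = achieved throughput / peak throughput.
# The text cites ~3% for GPT-4 on conventional hardware versus the >50%
# targeted by Jotunn.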

Jotunn8

Cloud and On-Premise Inference

Any Algorithm
Any Host processor
Fully programmable Companion Chip

Jotunn8
Core Architecture
Fully programmable
  • 16 cores
  • High-level programming throughout
  • Algorithm agnostic
  • GPT-3 processing on a single chip
AI & GP processing automatically selected layer-by-layer
  • Minimizes latency and power consumption
  • Increases flexibility
Very high performance
  • Close to theory implementation efficiency
  • Allows very large LLMs (e.g., GPT-4) to be deployed at <$0.002/query
Specifications
  • 6,400 Tflops* (fp8 Tensor Core)
  • 1,600 Tflops* (fp16 Tensor Core)
  • 100 Tflops (fp8)
  • 50 Tflops (fp16)
  • 25 Tflops (fp32)
  • 192 GB on-chip memory
  • 180W (peak power consumption)

* = sparsity

Tyr Family

Any Algorithm
Any Host processor
Fully programmable Companion Chip

Tyr4

Tyr4
Fully programmable
  • 8 cores
  • High-level programming throughout
AI & GP processing automatically selected layer-by-layer
Very high performance
  • Close to theory implementation efficiency
Specifications
  • 3,200 Tflops* (fp8 Tensor Core)
  • 800 Tflops* (fp16 Tensor Core)
  • 50 Tflops (fp8)
  • 25 Tflops (fp16)
  • 12 Tflops (fp32)
  • 16 GB on-chip memory
  • 60W (peak power consumption)

* = sparsity

Tyr2

Tyr2
Fully programmable
  • 4 cores
  • High-level programming throughout
AI & GP processing automatically selected layer-by-layer
Very high performance
  • Close to theory implementation efficiency
Specifications
  • 1,600 Tflops* (fp8 Tensor Core)
  • 400 Tflops* (fp16 Tensor Core)
  • 25 Tflops (fp8)
  • 12 Tflops (fp16)
  • 6 Tflops (fp32)
  • 16 GB on-chip memory
  • 30W (peak power consumption)

* = sparsity

Tyr1

Tyr1
Fully programmable
  • 2 cores
  • High-level programming throughout
AI & GP processing automatically selected layer-by-layer
Very high performance
  • Close to theory implementation efficiency
Specifications
  • 800 Tflops* (fp8 Tensor Core)
  • 200 Tflops* (fp16 Tensor Core)
  • 12 Tflops (fp8)
  • 6 Tflops (fp16)
  • 3 Tflops (fp32)
  • 16 GB on-chip memory
  • 10W (peak power consumption)

* = sparsity

be a part of something great

take the first step.
we will do the rest.
