6.4 CUDA-free Petaflops Generative AI
The "Memory Wall"
Why generative AI software is ready but the hardware is not
The “Memory Wall” was first described by Wulf and McKee in 1994. They observed that the performance of the processing unit (CPU) was improving far faster than that of the memory. As a result, the rate at which data can be transferred to and from memory forces the processor to wait until the data is available for processing.
In traditional architectures the problem has been mitigated by a hierarchical memory structure built around multiple levels of cache that stage data to minimize traffic to main memory or to external memory.
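The effect of such a cache hierarchy can be sketched with the standard average-memory-access-time (AMAT) recurrence. The latencies and hit rates below are illustrative assumptions, not measurements of any particular processor:

```python
# Average memory access time across a cache hierarchy:
# AMAT = L1_hit + L1_miss_rate * (L2_hit + L2_miss_rate * (...))
# All numbers below are illustrative assumptions.

def amat(levels):
    """levels: list of (hit_latency_ns, hit_rate); the last level always hits."""
    time, reach = 0.0, 1.0
    for latency, hit_rate in levels:
        time += reach * latency        # fraction of accesses that get this far
        reach *= (1.0 - hit_rate)      # fraction that miss and go deeper
    return time

# Assumed: L1 (1 ns, 90% hit), L2 (4 ns, 80%), L3 (15 ns, 70%), DRAM (100 ns)
print(round(amat([(1, 0.9), (4, 0.8), (15, 0.7), (100, 1.0)]), 2))  # 2.3 ns
```

With these assumed numbers the hierarchy hides a 100 ns DRAM access behind an effective ~2.3 ns average, which is why caching worked well when working sets fit in the upper levels.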
The recently introduced generative AI models (for example, ChatGPT, DALL-E, and diffusion models) have dramatically expanded the number of parameters required for the task at hand. GPT-3.5, for example, uses 175 billion parameters, and GPT-4, launched in March 2023, reportedly uses almost 2 trillion. All of these parameters need to be accessed during inference or training, which poses a problem: traditional systems are not designed to handle such vast amounts of data without resorting to the traditional hierarchical memory model, and the more levels that must be traversed to read or store data, the longer it takes. As a result, the processing elements are forced to wait longer and longer for data to process, increasing latency and reducing implementation efficiency.
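The scale of the problem follows directly from the parameter counts in the text. A rough footprint calculation (assuming the weights are stored in fp16 at 2 bytes per parameter; the byte width is a standard format size, not a figure from this page):

```python
# Rough memory footprint of model weights.
# Parameter counts are from the text; 2 bytes/parameter assumes fp16 storage.

def weight_bytes(params, bytes_per_param):
    return params * bytes_per_param

gpt35 = weight_bytes(175e9, 2)   # GPT-3.5: 175 billion parameters
gpt4  = weight_bytes(2e12, 2)    # GPT-4: reportedly ~2 trillion parameters

print(f"GPT-3.5: {gpt35 / 1e9:.0f} GB")   # 350 GB
print(f"GPT-4:   {gpt4 / 1e12:.0f} TB")   # 4 TB
```

Hundreds of gigabytes to terabytes of weights must be streamed through the processor for every pass, which is exactly the traffic pattern cache hierarchies were never designed for.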
Recent findings show that the efficiency running GPT-4, the most recent GPT algorithm, drops to around 3%. That is, the very expensive hardware designed to run these algorithms sits idle 97% of the time!
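A simple bandwidth-vs-compute model shows why efficiency collapses like this: the processor cannot start until the weights arrive, so the slower of the two paths dominates. All hardware numbers below are illustrative assumptions, not specifications of any particular GPU:

```python
# Single-digit efficiency from being bandwidth-bound: runtime is set by
# max(compute time, memory time). All hardware figures are assumptions.

def efficiency(flops, bytes_moved, peak_flops, bandwidth):
    t_compute = flops / peak_flops
    t_memory  = bytes_moved / bandwidth
    return t_compute / max(t_compute, t_memory)

# One token of a 175B-parameter model at batch size 1:
# ~2 FLOPs per parameter, weights streamed at 2 bytes each (fp16).
eff = efficiency(flops=2 * 175e9,
                 bytes_moved=2 * 175e9,
                 peak_flops=1e15,   # assumed 1 Pflop/s of compute
                 bandwidth=3e12)    # assumed 3 TB/s of memory bandwidth
print(f"{eff:.1%}")                 # 0.3%
```

At batch size 1 the assumed accelerator is idle more than 99% of the time; larger batches amortize the weight traffic, which is how real deployments claw their way back up to the single-digit efficiencies cited above.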
The flip side is that the amount of hardware required to reach reasonable compute throughput is staggering. In July 2023, EE Times reported that Inflection plans to use 22,000 Nvidia H100 GPUs in its supercomputer, an investment of ~$800M. Assuming an average power consumption of 500 W per H100, the total power draw would be an astounding 11 MW!
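The power figure is a straightforward back-of-envelope check on the numbers in the text:

```python
# Back-of-envelope check of the Inflection figures quoted above.
gpus = 22_000
watts_per_gpu = 500                    # assumed average draw per H100
total_mw = gpus * watts_per_gpu / 1e6  # watts -> megawatts
print(total_mw)                        # 11.0
```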
Based on a fundamentally new architecture, Jotunn allows data to be fed to the processing units 100% of the time, regardless of the number of compute elements. Algorithm efficiencies, even for large models like GPT-4, will exceed 50%. Jotunn4 will significantly outperform anything currently on the market!
Generative AI Platform
Fully programmable companion chip family for Generative AI inference.
- Algorithm and host processor agnostic
- Straightforward integration into existing s/w & h/w architectures
- Software oriented design flow
AI & GP processing, selectable layer-by-layer
- Minimizes latency and power consumption
- Increases flexibility
- High-level programming throughout
- Algorithm agnostic
- GPT-3 processing on a single chip
Very high performance
- <40W / Petaflops
- Close to theory implementation efficiency
- Sparsity on data and weights performed on the fly
IEEE754 floating point / Integer
- fp8 / fp16 / fp32
- int8 / int16 / int32
Performance numbers at 1.6 GHz
- 6,400 Tflops* (fp8 Tensorcore)
- 1,600 Tflops* (fp16 Tensorcore)
- 25 Tflops (fp32)
- 50 Tflops (fp16)
- 100 Tflops (fp8)
- 180W (peak)
- 192GB on-chip memory
Curious to find out more?
We would love to tell you more about our Generative AI solutions and how this can be of benefit to you.
Send us an email at firstname.lastname@example.org
We will take it from there!