🚨 VRAM consumption 🚨
The `Llama`, `Cohere`, and `Gemma` models no longer cache the triangular causal mask unless the `static` cache is used. This caching was reverted by 29753, which fixes the backward-compatibility issues with respect to speed and memory consumption while still supporting compile and the static cache. Small note: `fx` is not yet supported for these models; a patch will follow very soon!
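To see why caching the full triangular causal mask matters for VRAM, a quick back-of-the-envelope sketch (the helper name is illustrative, not part of the library): a causal mask is a `(seq_len, seq_len)` matrix, so a dense fp32 copy grows quadratically with context length.

```python
def causal_mask_bytes(seq_len: int, bytes_per_elem: int = 4) -> int:
    """Memory of a dense (seq_len, seq_len) causal mask, fp32 by default."""
    return seq_len * seq_len * bytes_per_elem

# At a 4k context the mask is modest, but at a 128k context
# a dense fp32 mask alone would occupy 64 GiB:
print(causal_mask_bytes(4 * 1024) / 2**20)    # MiB at 4k context
print(causal_mask_bytes(128 * 1024) / 2**30)  # -> 64.0 GiB at 128k context
```

This is why the mask is now materialized only when needed (or with the bounded `static` cache) rather than kept resident.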
New model addition
Cohere open-source model
Command-R is a generative model optimized for long context tasks such as retrieval augmented generation (RAG) and using external APIs and tools. It is designed to work in concert with Cohere's industry-leading Embed and Rerank models to provide best-in-class integration for RAG applications and excel at enterprise use cases. As a model built for companies to implement at scale, Command-R boasts:
- Strong accuracy on RAG and Tool Use
- Low latency, and high throughput
- Longer 128k context and lower pricing
- Strong capabilities across 10 key languages
- Model weights available on HuggingFace for research and evaluation
- Cohere Model Release by saurabhdash2512 in 29622