GPT-4o mini, Mistral NeMo and Llama 3 Lite

July 20th, 2024

It’s been a big week for small models!

GPT-4o mini:

$0.15/million input tokens and $0.60/million output tokens
Supports 128,000 input tokens and 16,000 output tokens
Appears to benchmark higher than Claude 3 Haiku and Gemini 1.5 Flash (both of which are more expensive)
Image inputs remain the same price as GPT-4o, so Haiku or Gemini 1.5 Flash seem more attractive for this use case
Has become the new default model for free users of ChatGPT. You still get a limited number of calls to GPT-4o, and once those are exhausted ChatGPT falls back to GPT-4o mini. GPT-3.5 Turbo is no longer available to free users.
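
To make those prices concrete, here is a rough sketch of the per-request arithmetic, assuming "m" means one million tokens and using the input and output prices listed above. The function name `estimate_cost` and its parameters are my own, for illustration only; they are not part of any API.

```python
# Rough cost estimate from the per-million-token prices quoted above.
# Assumption: prices are in US dollars per 1,000,000 tokens.

def estimate_cost(input_tokens, output_tokens,
                  input_price_per_m=0.15, output_price_per_m=0.60):
    """Return the estimated cost in dollars for one request."""
    input_cost = input_tokens / 1_000_000 * input_price_per_m
    output_cost = output_tokens / 1_000_000 * output_price_per_m
    return input_cost + output_cost


# A request that uses the full 128,000 input and 16,000 output tokens.
cost = estimate_cost(128_000, 16_000)
print(f"${cost:.4f}")  # prints $0.0288
```

So even a request that maxes out both limits comes in at just under three cents, ignoring image inputs, which are priced differently.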

Together AI are using INT4 quantization, as well as a number of other optimizations, to offer Llama 3 8B at $0.10/million tokens (the same price for both input and output)
I think this is the cheapest 8B model endpoint I have seen
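
For a sense of scale, here is a quick sketch comparing that flat rate against the split input/output pricing quoted earlier in this post, again assuming "m" means one million tokens. The workload size and the helper names `flat_rate_cost` and `split_rate_cost` are hypothetical, chosen only for illustration.

```python
# Compare a flat per-token price with separate input/output prices.
# Assumption: all prices are US dollars per 1,000,000 tokens.

def flat_rate_cost(input_tokens, output_tokens, price_per_m=0.10):
    """Cost when input and output tokens share one price."""
    return (input_tokens + output_tokens) / 1_000_000 * price_per_m


def split_rate_cost(input_tokens, output_tokens,
                    input_price_per_m=0.15, output_price_per_m=0.60):
    """Cost when input and output tokens are priced separately."""
    return (input_tokens / 1_000_000 * input_price_per_m
            + output_tokens / 1_000_000 * output_price_per_m)


# Hypothetical workload: one million tokens in, one million tokens out.
flat = flat_rate_cost(1_000_000, 1_000_000)
split = split_rate_cost(1_000_000, 1_000_000)
print(f"flat: ${flat:.2f}, split: ${split:.2f}")  # prints flat: $0.20, split: $0.75
```

This only compares price. It says nothing about quality, and these are different models with different capabilities.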