aicoolies logo

# llama

6 tools tagged

Showing 6 of 6 tools

Cerebras logo

Cerebras

Wafer-scale inference at thousands of tokens per second

Cerebras Inference serves open-weight LLMs like Llama, Qwen, and GPT-OSS on wafer-scale CS-3 chips through an OpenAI-compatible API, benchmarking between 1,800 and 2,600 output tokens per second on Llama 3.1 8B and several hundred on 70B models. A free tier offers one million tokens per day with no credit card, while paid pay-per-token pricing starts at $0.04 per million tokens for the smaller Llama models.

freemium
AWS Bedrock logo

AWS Bedrock

Managed foundation models on AWS

Fully managed AWS service providing enterprise access to 100+ foundation models from Anthropic, Meta, Mistral, Cohere, and Amazon's Nova family through a single API. Bedrock includes AgentCore for agent runtime, Knowledge Bases for RAG, Guardrails blocking 88% of harmful content, plus Model Distillation, Prompt Caching, and Intelligent Prompt Routing for cost optimization.

api-usage-based
Fireworks AI logo

Fireworks AI

Production-grade inference with serverless and on-demand GPUs

High-performance inference platform serving open-source and custom AI models at global scale, processing 13+ trillion tokens daily at ~180K requests per second. Fireworks AI delivers 1,000+ tokens per second on large models through quantization-aware tuning and adaptive speculation, with serverless, fine-tuning, and dedicated GPU options across text, image, and audio modalities.

freemium
Groq logo

Groq

Ultra-fast LPU inference for open-weight models

Groq is an AI inference provider built around custom Language Processing Unit (LPU) hardware for low-latency open-weight model serving. GroqCloud exposes an OpenAI-compatible API for Llama, GPT-OSS, Qwen, Kimi, DeepSeek, Gemma, Whisper, and related models, with high token-throughput positioning, model-specific rate limits, and usage-based pricing.

freemium
Together AI logo

Together AI

Open-weight inference, fine-tuning, and GPU-cloud platform

Together AI is a cloud platform for running, fine-tuning, batching, and training open-weight AI models. It supports serverless inference, dedicated endpoints, LoRA and full fine-tuning, GPU clusters, code-execution sandboxes, and async batch jobs up to 30B tokens per model. Current docs list fast-moving families such as Qwen, Kimi, GLM, GPT-OSS, DeepSeek, Llama, MiniMax, and Mistral.

api-usage-based
Ollama logo

Ollama

Run LLMs locally with one command

Tool for running large language models locally on your machine with a simple CLI interface. Download and run Llama 3, Mistral, Gemma, Phi, Code Llama, and dozens of other open-source models with a single command. Features model management, GPU acceleration (NVIDIA/AMD/Apple Silicon), OpenAI-compatible API server, Modelfile for customization, and multi-model switching. Ideal for offline AI development, privacy-sensitive use cases, and local testing. 120K+ GitHub stars.

open-sourceOpen Source