
Vision Agent

A multimodal voice assistant with vision capabilities that can see and discuss what users show through their camera using LiveKit's voice agents.

Overview

VisionAgent - A voice-enabled AI assistant that combines speech interaction with computer vision, allowing users to show objects, documents, or scenes through their camera and have natural conversations about what the agent sees.

Features

Computer Vision Integration: Processes video frames from the user's camera in real time
Multimodal Conversation: Combines visual context with voice interaction
Automatic Frame Capture: Continuously buffers the latest video frame and attaches it to the conversation when the user finishes speaking
Multi-Track Support: Handles video streams from remote participants
Voice-Enabled: Built using LiveKit's voice capabilities with support for:
- Speech-to-Text (STT) using Deepgram
- Large Language Model (LLM) using X.AI's Grok-2-Vision model
- Text-to-Speech (TTS) using Rime
- Voice Activity Detection (VAD) using Silero
Modern Web Interface: Next.js frontend with video sharing capabilities

How It Works

User connects to the LiveKit room through the web interface
User enables their camera to share video with the agent
The agent subscribes to the user's video track automatically
When the user finishes speaking, the agent captures the most recent video frame
The captured frame is added to the conversation context along with the transcribed speech
Grok-2-Vision processes both the visual and textual input
The agent responds by voice, describing and discussing what it sees
Users can show different objects or scenes and ask questions about them

Prerequisites

Python 3.10+
livekit-agents>=1.0
LiveKit account and credentials
API keys for:
- X.AI (for Grok-2-Vision model access)
- Deepgram (for speech-to-text)
- Rime (for text-to-speech)
Node.js and pnpm (for the frontend)

Installation

Clone the repository
Install dependencies:
```
pip install -r requirements.txt
```

Create a .env file in the parent directory with your API credentials:

LIVEKIT_URL=your_livekit_url
LIVEKIT_API_KEY=your_api_key
LIVEKIT_API_SECRET=your_api_secret
XAI_API_KEY=your_xai_key
DEEPGRAM_API_KEY=your_deepgram_key
RIME_API_KEY=your_rime_key
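
Before starting the agent, it can help to confirm that every variable listed above is actually set. The following is a minimal sketch using only the standard library; the helper name `missing_env_vars` is hypothetical and not part of the example repository. It reads the process environment, so it assumes the .env file has already been loaded or the variables have been exported in the shell.

```python
import os

# Variable names taken from the .env listing above.
REQUIRED_VARS = [
    "LIVEKIT_URL",
    "LIVEKIT_API_KEY",
    "LIVEKIT_API_SECRET",
    "XAI_API_KEY",
    "DEEPGRAM_API_KEY",
    "RIME_API_KEY",
]


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty.

    Hypothetical helper for illustration; pass a dict to check something
    other than the real process environment.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("All required environment variables are set.")
```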

Running the Agent

Start the agent:
In a separate terminal, navigate to the frontend directory and start the Next.js app:
```
cd agent-vision-frontend
pnpm install
pnpm dev
```

The application will be available at http://localhost:3000. Enable your camera when prompted to start showing things to the agent.

Architecture Details

Main Classes

VisionAgent: Core agent class that handles both voice and vision inputs
Video Stream Management: Automatically subscribes to video tracks from participants
Frame Buffering: Stores the latest video frame so it can be processed when the user finishes speaking

Vision Processing Flow

User's video track is detected when they join or publish video
Agent creates a VideoStream to receive frames
Latest frame is continuously buffered while video is streaming
When user completes their turn (stops speaking), the current frame is captured
Frame is added as ImageContent to the chat message
Grok-2-Vision processes the multimodal input (text + image)
Agent generates a response based on both visual and conversational context
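
The buffering and capture steps above can be sketched without any LiveKit dependency. This is a simplified illustration, not the repository's implementation: `FrameBuffer`, `ChatMessage`, and `fake_stream` are hypothetical stand-ins for the agent's frame storage, the chat message type, and the video stream. Clearing the buffer after attaching a frame is an assumption of this sketch, chosen so that a stale frame is not reused on a later turn.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, AsyncIterator, List, Optional


@dataclass
class ChatMessage:
    """Stand-in for a chat message; not the livekit-agents class."""
    role: str
    content: List[Any] = field(default_factory=list)


class FrameBuffer:
    """Keeps only the most recent frame from a video stream."""

    def __init__(self) -> None:
        self.latest: Optional[Any] = None

    async def consume(self, stream: AsyncIterator[Any]) -> None:
        # Overwrite on every frame so only one frame is held at a time.
        async for frame in stream:
            self.latest = frame

    def attach_to(self, message: ChatMessage) -> bool:
        """Append the buffered frame to the message, then clear the buffer."""
        if self.latest is None:
            return False
        message.content.append({"type": "image", "frame": self.latest})
        self.latest = None
        return True


async def fake_stream(n: int) -> AsyncIterator[str]:
    # Hypothetical frame source standing in for a real video track.
    for i in range(n):
        yield f"frame-{i}"
        await asyncio.sleep(0)


async def demo() -> ChatMessage:
    buffer = FrameBuffer()
    await buffer.consume(fake_stream(5))
    # Simulate the end of the user's turn: attach the latest frame.
    message = ChatMessage(role="user", content=["What am I holding?"])
    buffer.attach_to(message)
    return message


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In a real agent the stream would be consumed in a background task for as long as the track is subscribed, and the attach step would run from whatever hook fires when the user's turn ends.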

Frontend Features

Video input support with camera selection
Screen sharing capabilities
Chat interface for text input (optional)
Real-time transcription display
Modern, responsive UI with dark mode support

Multimodal Context

The agent maintains conversation context that includes:

User's spoken/typed messages
Captured video frames at the moment of each user utterance
Agent's responses
Full conversation history with visual context
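
One way to picture this context is as an ordered list of turns, where user turns may carry the frame captured with them. The sketch below is illustrative only; `Turn` and `ConversationHistory` are hypothetical names and do not reflect how livekit-agents stores its chat context.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Turn:
    """One entry in the conversation (hypothetical stand-in type)."""
    role: str                      # "user" or "assistant"
    text: str                      # spoken or typed message
    frame: Optional[bytes] = None  # encoded image captured with a user utterance


class ConversationHistory:
    """Ordered list of turns, some of which carry a captured frame."""

    def __init__(self) -> None:
        self.turns: List[Turn] = []

    def add_user_turn(self, text: str, frame: Optional[bytes] = None) -> None:
        self.turns.append(Turn("user", text, frame))

    def add_agent_turn(self, text: str) -> None:
        self.turns.append(Turn("assistant", text))

    def turns_with_frames(self) -> List[Turn]:
        """Return the turns that include visual context, in order."""
        return [t for t in self.turns if t.frame is not None]


if __name__ == "__main__":
    history = ConversationHistory()
    history.add_user_turn("What is this?", frame=b"<jpeg bytes>")
    history.add_agent_turn("It looks like a coffee mug.")
    history.add_user_turn("Thanks!")
    print(len(history.turns), len(history.turns_with_frames()))
```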

Customization

Change Vision Model: Replace Grok-2-Vision with other multimodal LLMs like GPT-4o or Claude 3
Modify Frame Capture Logic: Adjust when frames are captured (e.g., continuous vs. on-demand)
Add Visual Analysis Tools: Integrate specialized vision APIs for OCR, object detection, etc.
Enhance Agent Instructions: Update the prompt to specialize in specific visual tasks
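
As an example of modifying the frame capture logic, a continuous mode could sample frames at a fixed interval instead of only at the end of a user turn. The following is a minimal sketch of such a policy; `IntervalSampler` is a hypothetical helper, and the caller is assumed to supply timestamps in seconds (for instance from `time.monotonic()`).

```python
from typing import Optional


class IntervalSampler:
    """Decides whether to capture a frame, at most once per interval."""

    def __init__(self, interval_s: float) -> None:
        if interval_s <= 0:
            raise ValueError("interval_s must be positive")
        self.interval_s = interval_s
        self._last_capture: Optional[float] = None

    def should_capture(self, now: float) -> bool:
        """Return True if enough time has passed since the last capture."""
        if self._last_capture is None or now - self._last_capture >= self.interval_s:
            self._last_capture = now
            return True
        return False


if __name__ == "__main__":
    sampler = IntervalSampler(interval_s=1.0)
    for t in [0.0, 0.5, 1.0, 1.2, 2.5]:
        print(t, sampler.should_capture(t))
```

Sampling more often gives the model fresher visual context at the cost of more image data per conversation, so the interval is a trade-off to tune for the use case.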