Sendable Requests to Server
This server allows you to upload audio (.wav) or video (.mp4) recordings, transcribe them, extract emotions from audio and video, and merge these emotions with a large language model (LLM) to generate an annotated transcript.
All communication is done via POST requests to the server endpoint:
https://q0ki6holrvby0r-61016.proxy.runpod.net/upload
Each request uses FormData with a command key and optional additional fields.
Note: This server does not rely on session management. To start any workflow, first upload a .wav or .mp4 file. Re-uploading will overwrite existing recordings.
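As a minimal sketch of the request shape described above, the helper below builds a FormData body containing the command key and posts it to the upload endpoint. The names `buildCommandForm` and `sendCommand` are hypothetical and not part of the server. The sketch assumes a runtime with global `FormData` and `fetch` (a browser or Node 18+) and assumes the server always answers with a JSON body.

```javascript
// Endpoint taken from this document.
const UPLOAD_URL = "https://q0ki6holrvby0r-61016.proxy.runpod.net/upload";

// Build the FormData body for a command (hypothetical helper).
// `fields` holds optional extra entries such as { file: someFile }.
function buildCommandForm(command, fields = {}) {
  const formData = new FormData();
  formData.append("command", command);
  for (const [key, value] of Object.entries(fields)) {
    formData.append(key, value);
  }
  return formData;
}

// POST a command and return the parsed JSON body.
// `fetchImpl` is injectable so the function can be exercised without a network.
async function sendCommand(command, fields = {}, fetchImpl = fetch) {
  const response = await fetchImpl(UPLOAD_URL, {
    method: "POST",
    body: buildCommandForm(command, fields),
  });
  return response.json();
}
```

With this in place, a call such as `await sendCommand("upload_recording", { file })` would cover the first command below, where `file` is a File object chosen by the user.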
1. Upload Recording
Command: upload_recording
FormData:
formData.append("command", "upload_recording");
formData.append("file", <File Object>);
Status:
- Success:
{
"message": "Recording saved at: tmp/recording.wav: Success"
}
or
{
"message": "Recording saved at: tmp/recording.mp4: Success"
}
- Failure:
{"error": "No Recording Uploaded!"}
{"error": "Unsupported file format: <file_content_type>"}
1a. Upload Keyboard Video
Command: upload_keyboard_video
FormData:
formData.append("command", "upload_keyboard_video");
formData.append("file", <video File Object>);
formData.append("keyboard_video", JSON.stringify({ input_text: "...", video_timestamps: [...] }));
Status:
- Success:
{"message": "Video and keyboard-video correspondence saved successfully"}
- Failure:
{"error": "Missing video file or keyboard-video data"}
{"error": "Unsupported file format: <file_content_type>"}
{"error": "Invalid keyboard-video JSON: <error>"}
Notes:
- Saves video to tmp/recording.mp4
- Saves keyboard-video correspondence to tmp/keyboard_video.json
- Clears previous tmp files before saving
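The keyboard-video upload is the only command that sends a JSON string alongside the file, so a small sketch of assembling that body may help. The function name `buildKeyboardVideoForm` is hypothetical. This document does not specify the format of the timestamps or how they map onto `input_text`, so the sketch passes both through unchanged and only checks that the timestamps are an array.

```javascript
// Build the request body for upload_keyboard_video (hypothetical helper).
// `videoFile` is a File or Blob holding the .mp4; `inputText` and
// `videoTimestamps` are supplied by the caller and passed through as-is.
function buildKeyboardVideoForm(videoFile, inputText, videoTimestamps) {
  if (!Array.isArray(videoTimestamps)) {
    throw new TypeError("videoTimestamps must be an array");
  }
  const formData = new FormData();
  formData.append("command", "upload_keyboard_video");
  formData.append("file", videoFile);
  // The correspondence travels as one JSON string, not as separate fields.
  formData.append(
    "keyboard_video",
    JSON.stringify({ input_text: inputText, video_timestamps: videoTimestamps })
  );
  return formData;
}
```

Serializing on the client with `JSON.stringify` means the server should not return the "Invalid keyboard-video JSON" error for malformed syntax, although it may still reject the content.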
2. Transcribe Recording
Command: transcribe
FormData:
formData.append("command", "transcribe");
Status:
- Success:
{
"segments": [
{"interval": [0.0, 2.5], "text": "Hello everyone."},
{"interval": [2.5, 5.0], "text": "Today we will discuss emotions."}
]
}
- Failure:
{"error": "Error: No recordings avaliable"}
Notes:
- For audio, segments are saved in tmp/audio_segments/
- For video, segments are saved in tmp/video_segments/
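To illustrate how a client might consume the success response, the sketch below turns the segments array into printable lines. The function name `describeSegments` is hypothetical, and the sketch assumes the interval values are start and end times in seconds, which this document does not state explicitly.

```javascript
// Turn a successful transcribe response into printable lines.
// Assumption: each interval is [start, end] in seconds.
function describeSegments(response) {
  return response.segments.map(({ interval, text }) => {
    const [start, end] = interval;
    return `[${start.toFixed(1)}s - ${end.toFixed(1)}s] ${text}`;
  });
}

// Sample data copied from the success response above.
const sample = {
  segments: [
    { interval: [0.0, 2.5], text: "Hello everyone." },
    { interval: [2.5, 5.0], text: "Today we will discuss emotions." },
  ],
};
console.log(describeSegments(sample).join("\n"));
```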
2a. Transcribe Keyboard Video Input
Command: transcribe_keyboard_video_input
FormData:
formData.append("command", "transcribe_keyboard_video_input");
Status:
- Success:
{
"segments": [
{"text": "Hello everyone.", "interval": [0.0, 2.5]},
{"text": "Today we will discuss emotions.", "interval": [2.5, 5.0]}
]
}
- Failure:
{"error": "Failed to read tmp/keyboard_video.json"}
{"error": "Length of video_timestamps does not match input_text"}
{"error": "Failed during tokenization: <error>"}
Notes:
- Uses the keyboard_video.json timestamps to segment the video
- Saves each video segment in tmp/video_segments/
- Saves segment transcription to tmp/transcription.json
3. Emotion from Video
Command: emotion_from_video
FormData:
formData.append("command", "emotion_from_video");
Status:
- Success:
{
"segment_emotions": [
{"facial": "happy"},
{"facial": "neutral"}
]
}
- Failure:
{"error": "Error: Emotion from video not supported!"}
Notes:
- Only works if the uploaded recording is a .mp4 video.
- Uses sampled frames from each video segment to classify facial emotions.
4. Emotion from Audio
Command: emotion_from_audio
FormData:
formData.append("command", "emotion_from_audio");
Status:
- Success:
{
"segment_emotions": [
{"audio": "happy"},
{"audio": "sad"}
]
}
- Failure:
{"error": "Error: Emotion extraction not supported!"}
Notes:
- Works for both audio and video recordings.
- Uses audio segments from the transcription to classify emotions.
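Because both emotion commands return one entry per segment, a client may want to line them up with the transcript. The sketch below does that by position. The function name `attachEmotions` is hypothetical, and the pairing rule (entry i of segment_emotions describes entry i of segments) is an assumption drawn from the examples, not something the server is confirmed to guarantee.

```javascript
// Pair each transcript segment with the emotions reported for it.
// Assumption: entry i of segment_emotions describes entry i of segments.
function attachEmotions(segments, facialEmotions = [], audioEmotions = []) {
  return segments.map((segment, i) => ({
    ...segment,
    // Missing entries (for example, no video uploaded) become null.
    facial: facialEmotions[i]?.facial ?? null,
    audio: audioEmotions[i]?.audio ?? null,
  }));
}

// Sample data copied from the success responses above.
const segments = [
  { interval: [0.0, 2.5], text: "Hello everyone." },
  { interval: [2.5, 5.0], text: "Today we will discuss emotions." },
];
const facial = [{ facial: "happy" }, { facial: "neutral" }];
const audio = [{ audio: "happy" }, { audio: "sad" }];
console.log(attachEmotions(segments, facial, audio));
```

For a .wav upload, the facial list can simply be omitted and the corresponding field stays null.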
5. Merge Emotions with LLM
Command: merge_emotions_with_LLM
FormData:
formData.append("command", "merge_emotions_with_LLM");
Status:
- Success:
{
"Anotated Transcript": "Hello everyone [happy]. Today we will discuss emotions [neutral]."
}
- Failure:
{"error": "Error: No recordings avaliable."}
Notes:
- Combines segment-level facial and audio emotions and produces an annotated transcript using the LLM.
- Returns a plain string for the annotated transcript; the brackets contain the emotion inferred for each segment.
6. Annotate Pure Text
This command produces an annotated transcript directly from plain text input, without any uploaded audio or video or prior emotion extraction. It works like merge_emotions_with_LLM but infers emotions from the text alone.
Command: annotate_pure_text
FormData:
formData.append("command", "annotate_pure_text");
formData.append("text_input", "Your transcript text here");
Status:
- Success:
{
"Annotated Transcript": "Hello everyone [happy]. Today we will discuss emotions [neutral]."
}
- Failure:
{"error": "No text input."}
Notes:
- Segments the text into portions wherever the emotional tone changes.
- Annotates each segment with an inferred emotion in square brackets.
- Only one emotion per bracket, chosen from: ['angry', 'calm', 'disgust', 'fearful', 'happy', 'neutral', 'sad', 'surprised'].
- Returns a plain string of the annotated transcript; original text is preserved.
- No audio or video files are required.
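Since annotate_pure_text returns the annotations embedded in a plain string, a client that needs the labels themselves has to parse them out. The sketch below is one way to do that with a regular expression. The function name `extractEmotionTags` is hypothetical, and the handling of labels outside the documented list is a design choice of this sketch, not server behaviour.

```javascript
// Emotion labels listed in the annotate_pure_text notes.
const EMOTIONS = ["angry", "calm", "disgust", "fearful", "happy", "neutral", "sad", "surprised"];

// Return every bracketed label in an annotated transcript, in order.
// Labels outside EMOTIONS are reported separately rather than dropped,
// because the original text may itself contain square brackets.
function extractEmotionTags(annotated) {
  const known = [];
  const unknown = [];
  for (const match of annotated.matchAll(/\[([^\[\]]+)\]/g)) {
    const label = match[1].trim().toLowerCase();
    if (EMOTIONS.includes(label)) {
      known.push(label);
    } else {
      unknown.push(label);
    }
  }
  return { known, unknown };
}

// Sample string copied from the success response.
const annotated = "Hello everyone [happy]. Today we will discuss emotions [neutral].";
console.log(extractEmotionTags(annotated).known);
```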
7. Health Check
Endpoint: /health
Method: GET
Response:
{
"status": "healthy",
"server_status": "idle" // or "busy"
}
Notes:
- Indicates whether the server is running and whether it is currently processing a request.
8. Server Status
Endpoint: /status
Method: GET
Response:
{
"status": "idle" // or "busy"
}
Notes:
- Shows if the server is currently busy processing a command.
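Because the server reports "busy" while it is processing a command, a client may want to wait for "idle" before sending the next one. The sketch below polls the status endpoint for that purpose. The name `waitUntilIdle`, the polling interval, and the attempt limit are all hypothetical. The sketch also assumes /status is served from the same host as the upload endpoint, which this document implies but does not state.

```javascript
// Assumption: /status lives on the same host as the /upload endpoint.
const STATUS_URL = "https://q0ki6holrvby0r-61016.proxy.runpod.net/status";

// Poll GET /status until the server reports "idle" (hypothetical helper).
// Resolves to true once idle, or false if every poll reported otherwise.
async function waitUntilIdle({ fetchImpl = fetch, intervalMs = 1000, maxAttempts = 30 } = {}) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const response = await fetchImpl(STATUS_URL);
    const { status } = await response.json();
    if (status === "idle") return true;
    // Pause before asking again so the server is not flooded with requests.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return false;
}
```

Returning false instead of throwing on timeout leaves the decision to retry or abort with the caller.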
Error Handling
- All command errors return JSON in the format:
{"error": "<error message>"}
- Common error messages:
  - No Recording Uploaded!
  - Unsupported file format: <type>
  - Error: No recordings avaliable
  - Error: Emotion from video not supported!
  - Error: Emotion extraction not supported!
  - Missing video file or keyboard-video data
  - Invalid keyboard-video JSON: <error>
  - Length of video_timestamps does not match input_text
  - Failed during tokenization: <error>
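Given that every command failure shares the same JSON shape, a client can handle errors in one place. The sketch below shows one possible approach. The function name `unwrapResponse` is hypothetical, and converting the error into a thrown exception is a choice made for this sketch.

```javascript
// Throw if a parsed response has the documented {"error": ...} shape;
// otherwise hand the payload back unchanged (hypothetical helper).
function unwrapResponse(payload) {
  if (payload && typeof payload === "object" && "error" in payload) {
    throw new Error(payload.error);
  }
  return payload;
}

try {
  unwrapResponse({ error: "No Recording Uploaded!" });
} catch (err) {
  console.log("Server reported:", err.message);
}
```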
Notes on Usage
- Start your workflow by uploading a .wav or .mp4 file, or upload a keyboard-video file if using that workflow.
- Re-uploading a file will overwrite previous recordings.
- Transcription must be performed before emotion extraction.
- Emotions from video or audio can be merged into a final annotated transcript via the LLM.
- JSON outputs show segment-level results for transcription and emotion classification.
- merge_emotions_with_LLM outputs a plain annotated transcript string.
Recommended Workflow
+-----------------------+
|   Upload Recording    |
|    (.wav or .mp4)     |
+-----------------------+
            |
            v
+-----------------------+
| Transcribe Recording  |
| (command: transcribe) |
+-----------------------+
            |
            v
+-----------------------+
|   Extract Emotions    |
|    (Audio / Video)    |
+-----------------------+
            |
            v
+-----------------------+
|  Merge Emotions with  |
|          LLM          |
+-----------------------+
            |
            v
+-----------------------+
| Annotated Transcript  |
+-----------------------+
OR
+-----------------------+
| Upload Keyboard Video |
| (video + timestamps)  |
+-----------------------+
            |
            v
+-------------------------------+
|   Transcribe Keyboard Video   |
| (command: transcribe_keyboard_|
|          video_input)         |
+-------------------------------+
            |
            v
+-----------------------+
|   Extract Emotions    |
|    (Audio / Video)    |
+-----------------------+
            |
            v
+-----------------------+
|  Merge Emotions with  |
|          LLM          |
+-----------------------+
            |
            v
+-----------------------+
| Annotated Transcript  |
+-----------------------+
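The first workflow above can be sketched as a sequence of commands sent one after another. The function name `runRecordingWorkflow` and the `send` callback are hypothetical: `send(command, fields)` stands for any function that posts one command as FormData to the upload endpoint and resolves to the parsed JSON response.

```javascript
// Run the recording workflow in the documented order (hypothetical helper).
// `send(command, fields)` must POST one command and resolve to the parsed JSON.
async function runRecordingWorkflow(file, send) {
  const steps = [
    ["upload_recording", { file }],
    ["transcribe", {}],
    ["emotion_from_audio", {}],
    ["merge_emotions_with_LLM", {}],
  ];
  const results = {};
  for (const [command, fields] of steps) {
    const payload = await send(command, fields);
    // Stop at the first failure; later steps depend on earlier ones.
    if (payload.error) {
      throw new Error(`${command} failed: ${payload.error}`);
    }
    results[command] = payload;
  }
  return results;
}
```

For an .mp4 upload, an `emotion_from_video` step could be added before the merge. The keyboard-video workflow would follow the same pattern with `upload_keyboard_video` and `transcribe_keyboard_video_input` as its first two steps.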