FunctionGemma: How to Run & Fine-tune

Learn how to run and fine-tune FunctionGemma locally on your device and phone.

FunctionGemma is a new 270M-parameter model from Google designed for function calling and fine-tuning. It is based on Gemma 3 270M and trained specifically for text-only tool calling, and its small size makes it well suited to running on your own phone.

You can run the full-precision model in about 550MB of RAM on CPU, and you can now fine-tune it locally with Unsloth. Thank you to Google DeepMind for partnering with Unsloth for day-zero support!
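As a rough sanity check on the 550MB figure, the sketch below estimates the memory needed for the weights alone. It assumes exactly 270 million parameters (the nominal size from the model name) stored in BF16 at 2 bytes each. The helper name weight_memory_mb is made up for illustration, and the estimate ignores activations, the KV cache and runtime overhead.

```python
def weight_memory_mb(num_params: int, bytes_per_param: float) -> float:
    """Approximate memory for the weights alone, in megabytes (1 MB = 10**6 bytes)."""
    return num_params * bytes_per_param / 1e6

NUM_PARAMS = 270_000_000  # assumption: nominal size taken from the model name

bf16_mb = weight_memory_mb(NUM_PARAMS, 2)    # BF16 stores 2 bytes per parameter
q4_mb = weight_memory_mb(NUM_PARAMS, 0.5)    # idealized 4-bit, ignoring quantization metadata

print(f"BF16 weights:  ~{bf16_mb:.0f} MB")
print(f"4-bit weights: ~{q4_mb:.0f} MB")
```

The BF16 estimate of roughly 540 MB is in line with the ~550MB quoted above once a small amount of overhead is added.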

Google recommends these settings for inference:

top_k = 64
top_p = 0.95
temperature = 1.0
maximum context length = 32,768

You can see the chat template format by rendering a prompt with tokenizer.apply_chat_template:

def get_today_date():
    """ Gets today's date """
    return {"today_date": "18 December 2025"}
    
tokenizer.apply_chat_template(
    [
        {"role" : "user", "content" : "what is today's date?"},
    ],
    tools = [get_today_date], add_generation_prompt = True, tokenize = False,
)

FunctionGemma chat template format:

FunctionGemma requires the system or developer message to be: You are a model that can do function calling with the following functions. The Unsloth versions have this built in if you forget to pass one, so please use unsloth/functiongemma-270m-it.
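If you build the message list yourself, or use a template that does not add the prompt for you, a small guard can prepend it. The following is a minimal sketch: the helper name ensure_developer_message and the constant DEFAULT_DEVELOPER_PROMPT are hypothetical, and the check only looks at the role of the first message, not at its content.

```python
DEFAULT_DEVELOPER_PROMPT = (
    "You are a model that can do function calling with the following functions"
)

def ensure_developer_message(messages: list[dict]) -> list[dict]:
    """Return a copy of messages that starts with a developer/system message."""
    if messages and messages[0].get("role") in ("developer", "system"):
        return list(messages)  # already has one, leave it untouched
    return [{"role": "developer", "content": DEFAULT_DEVELOPER_PROMPT}, *messages]

chat = ensure_developer_message([{"role": "user", "content": "what is today's date?"}])
print([m["role"] for m in chat])
```

The function returns a new list rather than mutating its input, so it is safe to call repeatedly on the same conversation.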

See below for a local desktop guide or you can view our Phone Deployment Guide.

Llama.cpp Tutorial (GGUF):

Instructions to run in llama.cpp:

Obtain the latest llama.cpp from GitHub (https://github.com/ggml-org/llama.cpp). You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference. For Apple Mac / Metal devices, set -DGGML_CUDA=OFF then continue as usual. Metal support is on by default.

You can directly pull from Hugging Face. Because the model is so small, we'll be using the unquantized full-precision BF16 variant.

Alternatively, download the model with snapshot_download (after running pip install huggingface_hub hf_transfer). You can choose BF16 or a quantized version, though going lower than 4-bit is not recommended given the small model size.

Then run the model in conversation mode:

You can also run and deploy FunctionGemma on your phone thanks to its small size. We collaborated with PyTorch to create a streamlined workflow that uses quantization-aware training (QAT) to recover 70% of accuracy and then deploys the models directly to edge devices.

Deploy FunctionGemma locally to a Pixel 8 or iPhone 15 Pro and get inference speeds of ~50 tokens/s
Get privacy-first, instant responses and offline capabilities

📱Run LLMs on your Phone

View our iOS and Android Tutorials for deploying on your phone:

🦥 Fine-tuning FunctionGemma

Google noted that FunctionGemma is intended to be fine-tuned for your specific function-calling task, including multi-turn use cases. Unsloth now supports fine-tuning of FunctionGemma. We created 2 fine-tuning notebooks, which show how you can train the model via full fine-tuning or LoRA for free in a Colab notebook:

In the Reason before Tool Calling Fine-tuning notebook, we fine-tune the model to "think/reason" before calling a function. Chain-of-thought reasoning is becoming increasingly important for improving tool-use capabilities.

FunctionGemma is a small model specialized for function calling. It utilizes its own distinct chat template. When provided with tool definitions and a user prompt, it generates a structured output. We can then parse this output to execute the tool, retrieve the results, and use them to generate the final answer.

<start_of_turn>developer
You can do function calling with the following functions:
<start_function_declaration>declaration:get_weather{
description: "Get weather for city",
parameters: { city: STRING }
}
<end_function_declaration>
<end_of_turn>
<start_of_turn>user
What is the weather like in Paris?
<end_of_turn>
<start_of_turn>model
<start_function_call>call:get_weather{
city: "paris"
}
<end_function_call>
<start_function_response>response:get_weather{temperature:26}
<end_function_response>
The weather in Paris is 26 degrees Celsius.
<end_of_turn>
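To make the response step of the exchange above concrete, here is a minimal sketch that serializes a tool result into a function response block. The helper names format_value and format_function_response are hypothetical, and wrapping string values in <escape> markers is an assumption based on the rendered prompts elsewhere in this guide. In practice the tokenizer's chat template does this for you when you append a tool message, so treat this only as an illustration of the layout.

```python
def format_value(value) -> str:
    """Serialize one value; strings are wrapped in <escape> markers (assumed convention)."""
    if isinstance(value, str):
        return f"<escape>{value}<escape>"
    return str(value)

def format_function_response(name: str, result: dict) -> str:
    """Build a function response block for the tool called `name`."""
    fields = ",".join(f"{key}:{format_value(val)}" for key, val in result.items())
    return (
        f"<start_function_response>response:{name}{{{fields}}}<end_function_response>"
    )

print(format_function_response("get_weather", {"temperature": 26}))
```

Nested objects and lists are not handled here; they would need their own serialization rules.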

Here, we implement a simplified version using a single thinking block (rather than interleaved reasoning) via <think></think>. Consequently, our model interaction looks like this:

<start_of_turn>model
<think>
The user wants weather for Paris. I have the get_weather tool. I should call it with the city argument.
</think>
<start_function_call>call:get_weather{
city: "paris"
}
<end_function_call>
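When consuming output in this shape, it can help to separate the reasoning from the call before parsing. The sketch below is one possible way to do that with a regular expression. The function name split_reasoning is hypothetical, and it assumes the <think> tags survive decoding as plain text and that there is at most one thinking block.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, remainder); reasoning is empty if there is no think block."""
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Drop the think block and keep whatever surrounds it
    remainder = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, remainder

sample = (
    "<think>\nThe user wants weather for Paris. I have the get_weather tool.\n</think>\n"
    '<start_function_call>call:get_weather{city: "paris"}<end_function_call>'
)
reasoning, call = split_reasoning(sample)
print(reasoning)
print(call)
```

Keeping the reasoning separate lets you log or display it without feeding it to the tool-call parser.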

🪗Fine-tuning FunctionGemma for Mobile Actions

We also created a notebook to show how you can make FunctionGemma perform mobile actions. In the Mobile Actions Fine-tuning notebook, we also enabled evaluation, and we show that fine-tuning the model for on-device actions works well, as seen in the evaluation loss going down.

For example, given the prompt: Please set a reminder for a "Team Sync Meeting" this Friday, June 6th, 2025, at 2 PM.

We fine-tuned the model to be able to output:

🏃‍♂️Multi Turn Tool Calling with FunctionGemma

We also created a notebook to show how you can make FunctionGemma do multi-turn tool calls. In the Multi Turn tool calling notebook, we show how FunctionGemma is capable of calling tools across a long message chain.

You first have to specify your tools like below:

We then create a mapping for all the tools:

We also need some tool invocation and parsing code:

And now we can call the model!

Try the 3 notebooks we made for FunctionGemma:

<bos><start_of_turn>developer\nYou are a model that can do function calling with the following functions<start_function_declaration>declaration:get_today_date{description:<escape>Gets today's date<escape>,parameters:{type:<escape>OBJECT<escape>}}<end_function_declaration><end_of_turn>\n<start_of_turn>user\nwhat is today's date?<end_of_turn>\n<start_of_turn>model\n

apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp

./llama.cpp/llama-cli \
    -hf unsloth/functiongemma-270m-it-GGUF:BF16 \
    --jinja -ngl 99 --ctx-size 32768 \
    --top-k 64 --top-p 0.95 --temp 1.0

# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id = "unsloth/functiongemma-270m-it-GGUF",
    local_dir = "unsloth/functiongemma-270m-it-GGUF",
    allow_patterns = ["*BF16*"],
)

./llama.cpp/llama-cli \
    --model unsloth/functiongemma-270m-it-GGUF/functiongemma-270m-it-BF16.gguf \
    --ctx-size 32768 \
    --n-gpu-layers 99 \
    --seed 3407 \
    --prio 2 \
    --top-k 64 \
    --top-p 0.95 \
    --temp 1.0 \
    --jinja

[{'role': 'developer',
  'content': 'Current date and time given in YYYY-MM-DDTHH:MM:SS format: 2025-06-04T15:29:23\nDay of week is Wednesday\nYou are a model that can do function calling with the following functions\n',
  'tool_calls': None},
 {'role': 'user',
  'content': 'Please set a reminder for a "Team Sync Meeting" this Friday, June 6th, 2025, at 2 PM.',
  'tool_calls': None}]

<start_of_turn>user
Please set a reminder for a "Team Sync Meeting" this Friday, June 6th, 2025, at 2 PM.<end_of_turn>
<start_of_turn>model
<start_function_call>call:create_calendar_event{body:None,datetime:2025-06-06 14:00:00,email:None,first_name:None,last_name:None,phone_number:None,query:None,subject:None,title:<escape>Team Sync Meeting<escape>,to:None}<end_function_call><start_function_response>
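The fine-tuned output above can be turned into something an app can act on. The sketch below parses the call, drops the fields the model filled with None, and converts the datetime string. The function name parse_call is hypothetical, the regular expressions mirror the parsing code used later in this guide, and treating the literal text None as "not provided" is an assumption (a real title of "None" would be dropped too).

```python
import re
from datetime import datetime

raw = (
    "<start_function_call>call:create_calendar_event{body:None,"
    "datetime:2025-06-06 14:00:00,email:None,first_name:None,last_name:None,"
    "phone_number:None,query:None,subject:None,"
    "title:<escape>Team Sync Meeting<escape>,to:None}<end_function_call>"
)

CALL_RE = re.compile(r"<start_function_call>call:(\w+)\{(.*?)\}<end_function_call>", re.DOTALL)
ARG_RE = re.compile(r"(\w+):(?:<escape>(.*?)<escape>|([^,}]*))")

def parse_call(text: str) -> tuple[str, dict]:
    """Extract the first call and keep only arguments that are not the literal None."""
    match = CALL_RE.search(text)
    if match is None:
        raise ValueError("no function call found")
    name, arg_text = match.groups()
    args = {}
    for key, escaped, plain in ARG_RE.findall(arg_text):
        value = (escaped or plain).strip()
        if value != "None":
            args[key] = value
    return name, args

name, args = parse_call(raw)
# Convert the timestamp string into a datetime object for downstream use
args["datetime"] = datetime.strptime(args["datetime"], "%Y-%m-%d %H:%M:%S")
print(name, args)
```

Only the title and datetime fields remain after filtering, which is what a calendar API would typically need.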

def get_today_date():
    """
    Gets today's date

    Returns:
        today_date: Today's date in format 18 December 2025
    """
    from datetime import datetime
    today_date = datetime.today().strftime("%d %B %Y")
    return {"today_date": today_date}

def get_current_weather(location: str, unit: str = "celsius"):
    """
    Gets the current weather in a given location.

    Args:
        location: The city and state, e.g. "San Francisco, CA, USA" or "Sydney, Australia"
        unit: The unit to return the temperature in. (choices: ["celsius", "fahrenheit"])

    Returns:
        temperature: The current temperature in the given location
        weather: The current weather in the given location
    """
    if "San Francisco" in location.title():
        return {"temperature": 15, "weather": "sunny"}
    elif "Sydney" in location.title():
        return {"temperature": 25, "weather": "cloudy"}
    else:
        return {"temperature": 30, "weather": "rainy"}

def add_numbers(x: float | str, y: float | str):
    """
    Adds 2 numbers together

    Args:
        x: First number
        y: Second number

    Returns:
        result: x + y
    """
    return {"result" : float(x) + float(y)}

def multiply_numbers(x: float | str, y: float | str):
    """
    Multiplies 2 numbers together

    Args:
        x: First number
        y: Second number

    Returns:
        result: x * y
    """
    return {"result" : float(x) * float(y)}

FUNCTION_MAPPING = {
    "get_today_date" : get_today_date,
    "get_current_weather" : get_current_weather,
    "add_numbers": add_numbers,
    "multiply_numbers": multiply_numbers,
}
TOOLS = list(FUNCTION_MAPPING.values())
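A small model can occasionally emit a tool name that is not in the mapping, or arguments the function does not accept. Looking tools up with FUNCTION_MAPPING[name] directly would raise in those cases. The sketch below shows one defensive alternative: call_tool is a hypothetical helper, the two tools are stand-ins with simplified bodies, and returning an error dictionary is a design choice made here for illustration.

```python
def get_today_date():
    """Stand-in tool with a fixed return value so the example is deterministic."""
    return {"today_date": "18 December 2025"}

def add_numbers(x, y):
    """Stand-in tool that adds two numbers given as numbers or numeric strings."""
    return {"result": float(x) + float(y)}

FUNCTION_MAPPING = {"get_today_date": get_today_date, "add_numbers": add_numbers}

def call_tool(name: str, arguments: dict) -> dict:
    """Look up and run a tool, returning an error payload instead of raising."""
    tool = FUNCTION_MAPPING.get(name)
    if tool is None:
        return {"error": f"unknown tool: {name}"}
    try:
        return tool(**arguments)
    except (TypeError, ValueError) as exc:  # wrong argument names or uncastable values
        return {"error": str(exc)}

print(call_tool("add_numbers", {"x": "2", "y": 3}))
print(call_tool("delete_everything", {}))
```

Passing the error payload back as the tool response gives the model a chance to recover on the next turn instead of crashing the loop.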

#@title FunctionGemma parsing code (expandable)
import re
def extract_tool_calls(text):
    def cast(v):
        # Try int, then float, then booleans; otherwise return the unquoted string
        try: return int(v)
        except ValueError:
            try: return float(v)
            except ValueError: return {'true': True, 'false': False}.get(v.lower(), v.strip("'\""))

    return [{
        "name": name,
        "arguments": {
            k: cast((v1 or v2).strip())
            for k, v1, v2 in re.findall(r"(\w+):(?:<escape>(.*?)<escape>|([^,}]*))", args)
        }
    } for name, args in re.findall(r"<start_function_call>call:(\w+)\{(.*?)\}<end_function_call>", text, re.DOTALL)]

def process_tool_calls(output, messages):
    calls = extract_tool_calls(output)
    if not calls: return messages
    messages.append({
        "role": "assistant",
        "tool_calls": [{"type": "function", "function": call} for call in calls]
    })
    results = [
        {"name": c['name'], "response": FUNCTION_MAPPING[c['name']](**c['arguments'])}
        for c in calls
    ]
    messages.append({ "role": "tool", "content": results })
    return messages

def _do_inference(model, messages, max_new_tokens = 128):
    # Render the chat (including the tool declarations) and tokenize it
    inputs = tokenizer.apply_chat_template(
        messages, tools = TOOLS, add_generation_prompt = True, return_dict = True, return_tensors = "pt",
    )
    # Generate with the recommended sampling settings
    out = model.generate(**inputs.to(model.device), max_new_tokens = max_new_tokens,
                         top_p = 0.95, top_k = 64, temperature = 1.0,)
    # Keep only the newly generated tokens, dropping the prompt
    generated_tokens = out[0][len(inputs["input_ids"][0]):]
    return tokenizer.decode(generated_tokens, skip_special_tokens = True)
    
def do_inference(model, messages, print_assistant = True, max_new_tokens = 128):
    output = _do_inference(model, messages, max_new_tokens = max_new_tokens)
    messages = process_tool_calls(output, messages)
    if messages[-1]["role"] == "tool":
        output = _do_inference(model, messages, max_new_tokens = max_new_tokens)
    messages.append({"role": "assistant", "content": output})
    if print_assistant: print(output)
    return messages

from unsloth import FastLanguageModel
import torch
max_seq_length = 4096 # Can choose any sequence length!
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/functiongemma-270m-it",
    max_seq_length = max_seq_length, # Choose any for long context!
    load_in_4bit = False,  # 4 bit quantization to reduce memory
    load_in_8bit = False, # [NEW!] A bit more accurate, uses 2x memory
    load_in_16bit = True, # [NEW!] Enables 16bit LoRA
    full_finetuning = False, # [NEW!] We have full finetuning now!
    # token = "hf_...", # use one if using gated models
)

messages = []
messages.append({"role": "user", "content": "What's today's date?"})
messages = do_inference(model, messages, max_new_tokens = 128)