Tool calling with ipex-llm on Intel Arc: a practical guide

13.11.2025 13:37 | by jpic | python dev ai

This guide shows how to launch an Intel ipex-llm Docker container, start a vLLM API server configured for the Qwen3-8B model with tool calling enabled, query the server with curl, and interpret a sample response. We will then build a simple tool-calling agent loop with the excellent litellm library.

This setup runs large language models (LLMs) on Intel XPUs with features such as automatic tool choice and reasoning parsing. All commands assume a Linux environment with Docker installed and access to Intel hardware, for example through /dev/dri.

1. Download the model on the host

Before entering the container, download the Qwen3-8B model into the Hugging Face cache directory on your host with huggingface-cli. The model is then already available when the container mounts the cache volume, which speeds up server startup.

huggingface-cli download Qwen/Qwen3-8B

This command requires the huggingface-hub package on the host; install it with pip install huggingface-hub if needed.

The model files are saved under ~/.cache/huggingface/hub/, or its equivalent, which is mounted into the container in the next step.

If you run into authentication problems, log in with huggingface-cli login using a Hugging Face token.

Once the download has finished, which can take a while depending on your connection, move on to the next step.
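
To confirm the download landed where the container will look for it, here is a minimal sketch that checks the cache from Python. It assumes the default cache location (no HF_HOME override) and the hub's "models--<org>--<name>" folder layout; the helper names hub_cache_dir and is_cached are made up for this example.

```python
from pathlib import Path
from typing import Optional


def hub_cache_dir(repo_id: str, cache_root: Optional[Path] = None) -> Path:
    """Return the folder the Hugging Face hub cache uses for repo_id."""
    # Assumption: default cache root, i.e. HF_HOME is not set.
    root = cache_root or Path.home() / ".cache" / "huggingface" / "hub"
    # The hub cache names model folders "models--<org>--<name>".
    return root / ("models--" + repo_id.replace("/", "--"))


def is_cached(repo_id: str, cache_root: Optional[Path] = None) -> bool:
    """True if at least one snapshot of the model exists in the cache."""
    snapshots = hub_cache_dir(repo_id, cache_root) / "snapshots"
    return snapshots.is_dir() and any(snapshots.iterdir())


print(hub_cache_dir("Qwen/Qwen3-8B"))
print("cached:", is_cached("Qwen/Qwen3-8B"))
```

This only checks that a snapshot folder exists, not that every weight file finished downloading.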

2. Enter the ipex-llm container

First, start an interactive Docker container from the intelanalytics/ipex-llm-serving-xpu:latest image. The command mounts the directories needed for the model cache and for data persistence, sets the memory and shared memory limits, and uses the host network for simplicity.

docker run -it \
  -v ~/llm_root:/root \
  -v ~/.cache/huggingface/:/root/.cache/huggingface \
  --rm \
  --net=host \
  --device=/dev/dri \
  --name ipex \
  --memory=40g \
  --shm-size 16g \
  --entrypoint bash \
  intelanalytics/ipex-llm-serving-xpu:latest

Key flags explained:

-v: mounts the Hugging Face cache, to reuse downloaded models, and a custom llm_root directory as /root, to keep a .bash_history.
--rm: automatically removes the container on exit.
--net=host: uses the host network stack, so the server is reachable at localhost:8000.
--device=/dev/dri: gives access to the Intel GPU devices.
--memory=40g and --shm-size 16g: cap the container's memory at 40 GB and raise shared memory to 16 GB for large models.
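
Since --device=/dev/dri only works if the host actually exposes GPU nodes there, a quick preflight check can save a confusing container start. The sketch below lists the directory from Python; list_dri_nodes is a hypothetical helper, and the "renderD" prefix is the usual Linux naming for render nodes.

```python
import os


def list_dri_nodes(dri_dir="/dev/dri"):
    """Return the sorted entry names under dri_dir, or [] if it is unavailable."""
    try:
        return sorted(os.listdir(dri_dir))
    except OSError:
        # Missing directory or no permission: treat as "no devices visible".
        return []


nodes = list_dri_nodes()
render_nodes = [name for name in nodes if name.startswith("renderD")]
print("DRI nodes:", nodes)
print("render nodes:", render_nodes or "none found, the container will not see a GPU")
```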

Once inside the container, move on to the next step.

3. Start the vLLM API server with Qwen3 and tool calling

Inside the container, start vLLM's OpenAI-compatible API server through the ipex-llm module. This loads the Qwen/Qwen3-8B model, enables tool calling with Hermes-style parsing, and configures inference on the XPU, that is, the Intel GPU.

python3 -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --model Qwen/Qwen3-8B \
  --served-model-name coder \
  --gpu-memory-utilization 0.4 \
  --device xpu \
  --dtype float16 \
  --max-model-len 4096 \
  --trust-remote-code \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --reasoning-parser deepseek_r1

Key arguments explained:

--model: the Hugging Face model ID, downloaded if it is not already cached.
--served-model-name: the alias used for the model in API requests, here “coder”.
--gpu-memory-utilization 0.4: reserves 40% of the XPU memory for the server (model weights and KV cache).
--device xpu: targets Intel XPU hardware.
--dtype float16: uses half precision for efficiency.
--max-model-len 4096: limits the context length to 4096 tokens.
--enable-auto-tool-choice: lets the model decide when to call tools.
--tool-call-parser hermes and --reasoning-parser deepseek_r1: dedicated parsers for structured tool calls and for reasoning, for example <think> tags.

The server will start on http://localhost:8000 thanks to host networking.
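
Loading the model can take a while, so a script that talks to the server may want to wait until it answers. Here is a minimal sketch that polls the OpenAI-compatible /v1/models endpoint using only the standard library; wait_for_server and its default timeout values are assumptions made for this example.

```python
import http.client
import time
import urllib.error
import urllib.request


def wait_for_server(api_base="http://localhost:8000/v1", timeout=300.0, interval=2.0):
    """Poll GET <api_base>/models until it returns 200 or timeout seconds pass."""
    url = api_base.rstrip("/") + "/models"
    deadline = time.monotonic() + timeout
    while True:
        try:
            with urllib.request.urlopen(url, timeout=interval) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError, http.client.HTTPException):
            pass  # not listening yet, or still loading the model
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Usage: block until the server from this step is ready.
# if not wait_for_server():
#     raise SystemExit("vLLM server did not come up in time")
```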

4. Query the server from another terminal

In a separate terminal, outside the container, send a chat completion request with curl. This example asks the model to use a file_read tool to read a file named proc.py. The tool's schema is defined in the request payload.

curl http://localhost:8000/v1/chat/completions \
  -H 'content-type: application/json' \
  --data-raw '{
    "model": "coder",
    "messages": [
      {
        "type": "plain",
        "role": "user",
        "content": "use the file_read tool to read proc.py"
      }
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "file_read",
          "parameters": {
            "type": "object",
            "properties": {
              "path": {
                "type": "string",
                "description": "path of the file to read"
              }
            }
          }
        }
      }
    ]
  }'

Request breakdown:

Endpoint: /v1/chat/completions, OpenAI-compatible.
model: refers to the served model alias, “coder”.
messages: the user prompt asking for a tool to be used.
tools: defines the available functions, here file_read with a path parameter.
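
The same request can be sent from Python with only the standard library, which is handy for checking the payload shape before bringing in a client library. This is a sketch: build_chat_request and post_chat are hypothetical helper names, the "required" entry is an addition not present in the curl payload, and the non-standard "type": "plain" message field is left out.

```python
import json
import urllib.request


def build_chat_request(prompt, model="coder"):
    """Build a chat completion payload offering a single file_read tool."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "file_read",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "path": {
                                "type": "string",
                                "description": "path of the file to read",
                            }
                        },
                        # Addition for this sketch: mark path as mandatory.
                        "required": ["path"],
                    },
                },
            }
        ],
    }


def post_chat(payload, api_base="http://localhost:8000/v1"):
    """POST the payload to /chat/completions and return the decoded JSON."""
    request = urllib.request.Request(
        api_base.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)


# Usage, with the server from step 3 running:
# print(post_chat(build_chat_request("use the file_read tool to read proc.py")))
```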

5. Sample response

The server replies with a JSON object showing that the model decided to call the tool. It includes the model's reasoning inside <think> tags and a tool_calls array with the function invocation.

{
  "id": "chatcmpl-20d2b9b36c8a48769afa3826b89cbdf6",
  "object": "chat.completion",
  "created": 1763052703,
  "model": "coder",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": null,
        "content": "<think>\nOkay, the user wants me to use the file_read tool to read the file proc.py. Let me check the tools provided. The available tool is file_read, which takes a path parameter. So I need to call file_read with the path set to \"proc.py\". I should make sure the arguments are correctly formatted as a JSON object. Alright, that's straightforward. Just specify the name as \"file_read\" and the arguments with \"path\": \"proc.py\".\n</think>\n\n",
        "tool_calls": [
          {
            "id": "chatcmpl-tool-f35bcf52b63344078076df301cd8514b",
            "type": "function",
            "function": {
              "name": "file_read",
              "arguments": "{\"path\": \"proc.py\"}"
            }
          }
        ]
      },
      "logprobs": null,
      "finish_reason": "tool_calls",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 154,
    "total_tokens": 274,
    "completion_tokens": 120,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null
}

Key points of the response:

content: contains the internal reasoning, the <think> block, for transparency. In this sample reasoning_content is null, so the reasoning stays inline in content.
tool_calls: array of function calls; here it invokes file_read with {"path": "proc.py"} as JSON-serialized arguments.
finish_reason: “tool_calls” means the response ended because of a tool invocation, with no final text yet.
usage: token counts, useful for monitoring efficiency.
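
To make those fields concrete, here is a small sketch that pulls the reasoning and the tool calls out of a response shaped like the one above. The helper names split_reasoning and extract_tool_calls are invented for this example, and the regex assumes at most one <think> block, as in the sample.

```python
import json
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(content):
    """Return (reasoning, answer) from content that may hold a <think> block."""
    content = content or ""
    match = THINK_RE.search(content)
    if not match:
        return None, content.strip()
    reasoning = match.group(1).strip()
    answer = (content[:match.start()] + content[match.end():]).strip()
    return reasoning, answer


def extract_tool_calls(response):
    """Return a list of (call_id, function_name, parsed_arguments) tuples."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        # arguments arrive as a JSON string, not as an object
        calls.append((call["id"], function["name"], json.loads(function["arguments"])))
    return calls
```

Note that arguments is a string containing JSON, so it has to be decoded before the values can be passed to a function.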

This setup enables iterative tool use: your application would then run the tool, for example read the file, and send the result back in a follow-up message so the model can continue. In production, consider adding authentication and error handling.
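
The follow-up step can be sketched as a pure function: append the assistant message as it was received, then one "tool" message per call, linked back through tool_call_id. The name tool_followup and the results mapping are assumptions made for this illustration.

```python
def tool_followup(messages, assistant_message, results):
    """Return a new message list with the assistant turn and its tool outputs.

    results maps each tool call id to the output produced by running that tool.
    """
    followup = list(messages) + [assistant_message]
    for call in assistant_message.get("tool_calls") or []:
        followup.append({
            "role": "tool",
            # ties this output to the call the model made
            "tool_call_id": call["id"],
            "content": str(results[call["id"]]),
        })
    return followup
```

Sending the returned list as the messages of the next request lets the model see both its own call and the output.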

6. Example LiteLLM Python script

First, install litellm and cli2 with pip, then run this script:

import json
import litellm

# feel free to enable this to have debug output (requests/responses) from litellm
# litellm._turn_on_debug()

# Defining a couple of functions for the LLM
def file_read(path):
    try:
        with open(path, 'r') as f:
            return f.read()
    except FileNotFoundError as exc:
        return str(exc)


def file_write(path, content):
    with open(path, 'w') as f:
        f.write(content)


# Map function name strings to actual function callbacks, we need this because
# the LLM will reply with function names
tool_callbacks = {
    'file_read': file_read,
    'file_write': file_write,
}

# OpenAI Tools description protocol
tools = [
  {
    "type": "function",
    "function": {
      "name": "file_read",
      "description": "Reads the entire contents of a file at the given path and returns it as a string. Use for loading text, configs, or data files.",
      "parameters": {
        "type": "object",
        "properties": {
          "path": {
            "type": "string",
            "description": "path of the file to read"
          },
        }
      }
    }
  },
  {
    "type": "function",
    "function": {
      "name": "file_write",
      "description": "Writes the provided content to a file at the given path, overwriting if it exists. Use for saving outputs, configs, or generated data.",
      "parameters": {
        "type": "object",
        "properties": {
          "path": {
            "type": "string",
            "description": "path of the file to write"
          },
          "content": {
            "type": "string",
            "description": "content to write to the file"
          },
        }
      }
    }
  },
]

# Context we're starting off with
messages = [
  {
    "type": "plain",
    "role": "system",
    "content": """
    You are called by a command line agent, you will be asked to read and write
    files, which you must do by tool calling.
    """.strip(),
  },
  {
    "type": "plain",
    "role": "user",
    "content": """
    Read contents of /etc/hosts, once you have the contents of /etc/hosts, add
    a new line to those contents with `1.1.1.1 example`.
    Once you have the new contents, write the new contents into /tmp/newhosts
    """.strip(),
  }
]

# we'll keep going until the llm has no function to call
while True:
    print({'querying with messages': messages})

    # call the llm with messages and tools
    response = litellm.completion(
        # litellm provider_name/model_name
        model='hosted_vllm/coder',
        # your docker endpoint
        api_base='http://localhost:8000/v1',
        messages=messages,
        tools=tools,
        temperature=0.1,
    )

    # extract message from the response
    message = response.choices[0].message

    print({'LLM replied with message': message.to_dict()})

    # add the llm's response message to the context
    messages.append(message.to_dict())

    if not message.tool_calls:
        print('We are finished working')
        break

    for tool_call in message.tool_calls:
        # get the actual python function from the tool name using our map
        tool_callback = tool_callbacks[tool_call.function.name]

        # call the callback with the parsed arguments
        tool_result = tool_callback(**json.loads(tool_call.function.arguments))

        # add the tool result to the context
        messages.append({
            'role': 'tool',
            'tool_call_id': tool_call.id,
            'content': tool_result,
        })

Afterwards, you will find that /tmp/newhosts contains the contents of your original /etc/hosts plus the new line 1.1.1.1 example.

And it will output:

querying with messages:
-   content: "You are called by a command line agent, you will be asked to read and write
             files, which you must do by tool calling."
    role: system
    type: plain
-   content: "Read contents of /etc/hosts, once you have the contents of /etc/hosts, add
              a new line to those contents with `1.1.1.1 example`.
              Once you have the new contents, write the new contents into /tmp/newhosts"
    role: user
    type: plain

OK, so far we have sent our initial messages. The LLM's reply:

LLM replied with message:
    content: '<think>
        Okay, let''s see. The user wants me to read the /etc/hosts file, add a new line with "1.1.1.1 example", and then write the modified content to /tmp/newhosts.
        First, I need to use the file_read function to get the current contents of /etc/hosts. Once I have that, I can append the new line. Wait, but how exactly? The user says "add a new line to those contents", so I should take the existing content, add the new line at the end, and then write it to the new file.
        So step by step: Read /etc/hosts, then modify the content by adding "1.1.1.1 example" as a new line. Then write that modified content to /tmp/newhosts using file_write.
        I need to make sure that when I read the file, I get the entire content as a string. Then, I can append the new line. Let me check if there are any specific requirements for the format. The user didn''t mention any, so just adding the line at the end should be fine.
        So the first tool call is file_read with path /etc/hosts. Then, take the output, add the new line, and call file_write with path /tmp/newhosts and the content being the original content plus the new line.
        </think>'
    function_call: null
    role: assistant
    tool_calls:
    -   function:
            arguments: '{"path": "/etc/hosts"}'
            name: file_read
        id: chatcmpl-tool-5c9c2023ccea41b882dd376cd9957ccc
        type: function

See how it calls the file_read function with path=/etc/hosts?! Our code calls that function and appends its output to the messages, so in the next iteration of the loop we query the LLM again with these messages:

querying with messages:
-   content: "You are called by a command line agent, you will be asked to read and write
             files, which you must do by tool calling."
    role: system
    type: plain
-   content: "Read contents of /etc/hosts, once you have the contents of /etc/hosts, add
              a new line to those contents with `1.1.1.1 example`.
              Once you have the new contents, write the new contents into /tmp/newhosts"
    role: user
    type: plain
-   content: '<think>
        [...snipped...]
        </think>
        '
    function_call: null
    role: assistant
    tool_calls:
    -   function:
            arguments: '{"path": "/etc/hosts"}'
            name: file_read
        id: chatcmpl-tool-5c9c2023ccea41b882dd376cd9957ccc
        type: function
-   content: "# Static table lookup for hostnames.\n# See hosts(5) for details.\n127.0.0.1 localhost ethlocal tzlocal tzkt-local\n104.26.10.87 www.ygg.re \n188.114.97.2 yggtorrent.top\n162.159.137.6 thepiratebay.org\n"
    role: tool
    tool_call_id: chatcmpl-tool-5c9c2023ccea41b882dd376cd9957ccc

Basically, the new request shows the LLM that it has already called the file_read tool and received its output. Let's see what it replies now:

LLM replied with message:
    content: "<think>
        Okay, I need to take the contents provided from the /etc/hosts file and add a new line with \"1.1.1.1 example\". Then, write the modified content to /tmp/newhosts.

        First, let me check the original content. The user provided the following lines:

        127.0.0.1 localhost ethlocal tzlocal tzkt-local
        104.26.10.87 www.ygg.re
        188.114.97.2 yggtorrent.top
        162.159.137.6 thepiratebay.org

        I need to append \"1.1.1.1 example\" as a new line. So I'll add that line at the end. Then, use the file_write function to save this to /tmp/newhosts.

        Wait, but how exactly should the new line be formatted? The original lines have varying numbers of spaces. For example, the first line has \"127.0.0.1 localhost ethlocal tzlocal tzkt-local\" with multiple entries. The new line should probably follow the same style, but the user specified \"1.1.1.1 example\" as the new line. So I'll add that as a new line at the end of the content.
        So combining the original content with the new line, then writing that to /tmp/newhosts. Let me make sure I don't miss any steps. The user wants the new content written to /tmp/newhosts, so after reading the original, I modify it by adding the new line, then write the entire modified content to the new file path.
        </think>"
    function_call: null
    role: assistant
    tool_calls:
    -   function:
            arguments: '{"path": "/tmp/newhosts", "content": "127.0.0.1 localhost ethlocal tzlocal tzkt-local\n104.26.10.87 www.ygg.re \n188.114.97.2 yggtorrent.top\n162.159.137.6 thepiratebay.org\n1.1.1.1 example"}'
            name: file_write
        id: chatcmpl-tool-400d7b6895f74a67871be31e0dbf258a
        type: function

As you can see, it took the contents of /etc/hosts it had previously obtained from the file_read function, added the line 1.1.1.1 example, and calls file_write with this new content for /tmp/newhosts.

So we are essentially done, but the loop continues, querying again with the full set of messages:

querying with messages:
-   content: "You are called by a command line agent, you will be asked to read and write
             files, which you must do by tool calling."
    role: system
    type: plain
-   content: "Read contents of /etc/hosts, once you have the contents of /etc/hosts, add
              a new line to those contents with `1.1.1.1 example`.
              Once you have the new contents, write the new contents into /tmp/newhosts"
    role: user
    type: plain
-   content: '<think>[...]</think>'
    function_call: null
    role: assistant
    tool_calls:
    -   function:
            arguments: '{"path": "/etc/hosts"}'
            name: file_read
        id: chatcmpl-tool-5c9c2023ccea41b882dd376cd9957ccc
        type: function

-   content: "# Static table lookup for hostnames.
              # See hosts(5) for details.
              127.0.0.1 localhost ethlocal tzlocal tzkt-local
              104.26.10.87 www.ygg.re
              188.114.97.2 yggtorrent.top
              162.159.137.6 thepiratebay.org"
    role: tool
    tool_call_id: chatcmpl-tool-5c9c2023ccea41b882dd376cd9957ccc

-   content: "<think>[...]</think>"
    function_call: null
    role: assistant
    tool_calls:
    -   function:
            arguments: '{"path": "/tmp/newhosts", "content": "127.0.0.1 localhost ethlocal tzlocal tzkt-local\n104.26.10.87 www.ygg.re \n188.114.97.2 yggtorrent.top\n162.159.137.6 thepiratebay.org\n1.1.1.1 example"}'
            name: file_write
        id: chatcmpl-tool-400d7b6895f74a67871be31e0dbf258a
        type: function
-   content: null
    role: tool
    tool_call_id: chatcmpl-tool-400d7b6895f74a67871be31e0dbf258a

Now the LLM replies with:

LLM replied with message:
    content: '<think>
        Okay, let me see. The user asked to read /etc/hosts, add a line "1.1.1.1 example", and write to /tmp/newhosts. I first called file_read to get the hosts content. Then I appended the new line and used file_write to save it to /tmp/newhosts. The tool response for file_write was a success, so now I need to confirm the action.
        Wait, the user might expect a confirmation that the file was written. But since the task is done, I should just indicate that the new hosts file has been created. Let me check if there''s anything else needed. No, the steps are complete. The final answer should state that the new hosts file is written to `/tmp/newhosts`.
        </think>
        The new hosts file with the added entry has been written to `/tmp/newhosts`.'
    function_call: null
    role: assistant
    tool_calls: null

As you can see, this is a normal response with:

message content: The new hosts file with the added entry has been written to /tmp/newhosts
no tool call

Since there is no function call in this response, the program prints We are finished working and exits.
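
The loop in the script trusts the model completely: an unknown tool name, malformed JSON arguments, or a failing callback would crash it. One possible hardening, sketched below, is to route every call through a dispatcher that always returns a string the model can read. run_tool is a hypothetical helper, not part of litellm, and returning "ok" for callbacks that return None is a choice made for this example (the run above sent null instead).

```python
import json


def run_tool(tool_callbacks, name, arguments_json):
    """Run one tool call and always return a string for the 'tool' message."""
    callback = tool_callbacks.get(name)
    if callback is None:
        return f"error: unknown tool {name!r}"
    try:
        arguments = json.loads(arguments_json)
    except json.JSONDecodeError as exc:
        return f"error: arguments are not valid JSON: {exc}"
    if not isinstance(arguments, dict):
        return "error: arguments must be a JSON object"
    try:
        result = callback(**arguments)
    except Exception as exc:
        # Report the failure to the model instead of crashing the agent loop.
        return f"error: {type(exc).__name__}: {exc}"
    return "ok" if result is None else str(result)
```

In the script, the body of the for loop would then call run_tool(tool_callbacks, tool_call.function.name, tool_call.function.arguments). Replacing while True with a bounded for loop would also stop a model that never stops calling tools.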

Tool calling without tool calling

LiteLLM provides a technique to support tool calling when the API does not support it natively.

It is not as good, but it works well enough.
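
If I read its documentation correctly, LiteLLM enables this with the litellm.add_function_to_prompt = True setting, which puts the function definitions into the prompt. The sketch below is not LiteLLM's implementation; it only illustrates the general idea with the standard library: describe the tools in plain text, then try to parse the reply as a JSON call. The prompt wording and both helper names are assumptions made for this example.

```python
import json


def tools_to_prompt(tools):
    """Render OpenAI-style tool definitions as plain-text instructions."""
    lines = [
        "You can call these functions. To call one, reply with only a JSON",
        'object of the form {"name": "<function>", "arguments": {...}}.',
        "",
    ]
    for tool in tools:
        function = tool["function"]
        lines.append(f"- {function['name']}: {function.get('description', '')}")
        lines.append(f"  parameters: {json.dumps(function.get('parameters', {}))}")
    return "\n".join(lines)


def parse_tool_reply(text):
    """Return (name, arguments) if text is a JSON tool call, else None."""
    try:
        data = json.loads(text.strip())
    except json.JSONDecodeError:
        return None  # plain-text answer, not a tool call
    if not isinstance(data, dict) or "name" not in data:
        return None
    arguments = data.get("arguments", {})
    if not isinstance(arguments, dict):
        return None
    return data["name"], arguments
```

The weakness is visible in parse_tool_reply: nothing forces the model to reply with clean JSON, so any extra text around the object makes the call go unnoticed.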

Conquering the world

Left as an exercise for the reader: connect a 3D printer, expose tools so the LLM can drive it, and ask the LLM to print weapons and robots to go conquer the world, or something like that. Let's hope for friendly robots and time travel to stop us from doing that.

Hasta la vista, baby!

(If you don't get the joke, go watch Terminator right now!!)