I ran a quick experiment examining how DeepSeek-R1 performs on agentic tasks, despite not natively supporting tool use, and I was rather impressed by the preliminary results. The experiment runs DeepSeek-R1 in a single-agent setup, where the model not only plans the actions but also formulates them as executable Python code. On a subset1 of the GAIA validation split, DeepSeek-R1 outperforms Claude 3.5 Sonnet by 12.5% absolute, from 53.1% to 65.6% correct, and other models by an even larger margin.
The experiment followed the model usage recommendations from the DeepSeek-R1 paper and the model card: don't use few-shot examples, avoid adding a system prompt, and set the temperature to 0.5 - 0.7 (0.6 was used). You can find more evaluation details here.
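Below is a minimal sketch of how these settings can be applied when calling the model through an OpenAI-compatible chat API. The endpoint and model identifier are assumptions, not details from the original experiment; adjust them to your deployment.

```python
# Minimal sketch following the usage recommendations above: no system prompt,
# no few-shot examples, temperature within the recommended 0.5 - 0.7 range.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepseek.com",  # assumed OpenAI-compatible endpoint
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="deepseek-reasoner",  # assumed model identifier for DeepSeek-R1
    # A single user message only: no system prompt, no few-shot examples.
    messages=[{"role": "user", "content": "Answer the following GAIA task: ..."}],
    temperature=0.6,
)
print(response.choices[0].message.content)
```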
Approach
DeepSeek-R1's strong coding capabilities enable it to act as an agent without being explicitly trained for tool use. By allowing the model to generate actions as Python code, it can flexibly interact with environments through code execution.
Tools are implemented as Python code that is included directly in the prompt. This can be a simple function definition or a module from a larger package - any valid Python code. The model then generates code actions that call these tools.
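For illustration, here is a hypothetical tool definition as it might appear in the prompt, together with a code action calling it. The function name and signature are placeholders, not the tools used in the experiment.

```python
# Hypothetical tool definition, included verbatim in the prompt. Any valid
# Python code could serve as a tool; this stub only illustrates the shape.
def web_search(query: str, max_results: int = 5) -> list[str]:
    """Return result snippets for a search query.

    A real implementation would call a search engine API; this placeholder
    just returns a dummy result so the example is runnable.
    """
    return [f"placeholder result for: {query}"][:max_results]

# A code action generated by the model is simply Python code that calls the tool:
results = web_search("GAIA benchmark validation split")
print(results)
```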
Results from executing these actions are fed back to the model as follow-up messages, driving subsequent steps until a final answer is reached. The agent framework is a simple iterative coding loop that mediates the conversation between the model and its environment.
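A minimal sketch of such an iterative coding loop is shown below. It assumes an OpenAI-compatible client for DeepSeek-R1 (as in the earlier snippet); the code-block extraction and the exec-based executor are simplified placeholders, not the actual implementation used in the experiment.

```python
import contextlib
import io
import re


def extract_code(reply: str) -> str | None:
    """Return the first fenced Python block in the model reply, if any."""
    match = re.search(r"```python\n(.*?)```", reply, re.DOTALL)
    return match.group(1) if match else None


def run_python(code: str) -> str:
    """Execute a code action and capture its printed output (no sandboxing here)."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})
    return buffer.getvalue()


def agent_loop(client, task: str, max_steps: int = 10) -> str:
    """Iterative coding loop mediating between the model and its environment."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = client.chat.completions.create(
            model="deepseek-reasoner",  # assumed model identifier
            messages=messages,
            temperature=0.6,
        ).choices[0].message.content
        messages.append({"role": "assistant", "content": reply})

        code = extract_code(reply)
        if code is None:  # no code action -> treat the reply as the final answer
            return reply

        result = run_python(code)
        # Feed the execution result back as a follow-up message, driving the next step.
        messages.append({"role": "user", "content": f"Execution result:\n{result}"})
    return "No final answer within the step limit."
```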
Conversations
DeepSeek-R1 is used as a chat model in my experiment, where the model autonomously pulls additional context from its environment by using tools, e.g. by using a search engine or fetching data from websites. This drives a conversation with the environment that continues until a final answer is reached.
In contrast, o1 models are known to perform poorly when used as chat models, i.e. they don't attempt to pull context during a conversation. According to the linked article, o1 models perform best when they have the full context available, with clear instructions on what to do with it.
Initially, I also tried a full-context-in-a-single-prompt approach at each step (with results from previous steps included), but this led to significantly lower scores on the GAIA subset. Switching to the conversational approach described above, I was able to reach the reported 65.6% accuracy.
This raises an interesting question about the claim that o1 isn't a chat model - perhaps this observation applied more to older o1 models that lacked tool use capabilities? After all, isn't tool use an essential mechanism for enabling models to pull additional context from their environment? This conversational approach certainly seems effective for DeepSeek-R1, though I still need to run comparable experiments with o1 models.
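For contrast with the conversational loop sketched earlier, here is a hedged sketch of the single-prompt variant: at each step, one fresh prompt is built containing the task plus all previous actions and their results, instead of growing a conversation. The client and model identifier are the same assumptions as before.

```python
# Sketch of the initially tried single-prompt approach (which scored lower on
# the GAIA subset): the full context is rebuilt into one prompt at every step.
def single_prompt_step(client, task: str, history: list[tuple[str, str]]) -> str:
    context = "\n\n".join(
        f"Action:\n{code}\nResult:\n{result}" for code, result in history
    )
    prompt = (
        f"{task}\n\n"
        f"Previous actions and results:\n{context}\n\n"
        f"Provide the next code action or the final answer."
    )
    return client.chat.completions.create(
        model="deepseek-reasoner",  # assumed model identifier
        messages=[{"role": "user", "content": prompt}],
        temperature=0.6,
    ).choices[0].message.content
```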
Generalization
Although DeepSeek-R1 was mainly trained with RL on math and coding tasks, it is remarkable that generalization to agentic tasks with tool use via code actions works so well. This ability to generalize to agentic tasks is reminiscent of recent research by DeepMind showing that RL generalizes whereas SFT memorizes, although generalization to tool use wasn't investigated in that work.
Despite its ability to generalize to tool use, DeepSeek-R1 often produces very long reasoning traces at each step, compared to other models in my experiments, limiting the efficiency of this model in a single-agent setup. Even simpler tasks sometimes take a long time to complete. Further RL on agentic tool use, whether via code actions or not, could be one option to improve efficiency.
Underthinking
I also observed the underthinking phenomenon with DeepSeek-R1. This is when a reasoning model frequently switches between different reasoning thoughts without sufficiently exploring promising paths to reach a correct solution. It was a major contributor to the overly long reasoning traces produced by DeepSeek-R1, and can be seen in the recorded traces that are available for download.
Future experiments
Another common application of reasoning models is to use them for planning only, while other models generate the code actions. This could be a potential new feature of freeact, if this separation of roles proves beneficial for more complex tasks.
I'm also curious how reasoning models that already support tool use (like o1, o3, ...) perform in a single-agent setup, with and without generating code actions. Recent developments like OpenAI's Deep Research or Hugging Face's open-source Deep Research, which also uses code actions, look interesting.
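One possible shape of this role separation is sketched below, assuming two models behind an OpenAI-compatible API: a reasoning model used only for planning, and a second model that turns the plan into a code action. Both model identifiers are placeholders, not an existing freeact feature.

```python
# Hedged sketch of a planner/coder split: the reasoning model plans, a separate
# model produces the code action. Model names are hypothetical placeholders.
def plan_then_act(client, task: str) -> str:
    plan = client.chat.completions.create(
        model="deepseek-reasoner",  # reasoning model used for planning only
        messages=[{"role": "user", "content": f"Outline a step-by-step plan for: {task}"}],
        temperature=0.6,
    ).choices[0].message.content

    return client.chat.completions.create(
        model="code-action-model",  # placeholder for a separate code-generating model
        messages=[{
            "role": "user",
            "content": f"Write a Python code action for the first step of this plan:\n{plan}",
        }],
    ).choices[0].message.content
```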