I ran a quick experiment examining how DeepSeek-R1 performs on agentic tasks, despite not supporting tool use natively, and I was quite impressed by the initial results. The experiment runs DeepSeek-R1 in a single-agent setup, where the model not only plans the actions but also formulates them as executable Python code. On a subset¹ of the GAIA validation split, DeepSeek-R1 outperforms Claude 3.5 Sonnet by 12.5% absolute, from 53.1% to 65.6% correct, and other models by an even larger margin:
The experiment followed model usage guidelines from the DeepSeek-R1 paper and the model card: don't use few-shot examples, avoid adding a system prompt, and set the temperature to 0.5 - 0.7 (0.6 was used). You can find more evaluation details here.
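As a minimal sketch, those settings might look like the following; the helper name and message layout are illustrative, not the actual experiment code:

```python
# Sampling settings following the DeepSeek-R1 guidelines quoted above:
# temperature in the recommended 0.5 - 0.7 range (0.6 was used).
GENERATION_KWARGS = {"temperature": 0.6}

def build_messages(user_prompt: str) -> list:
    # Deliberately no system message and no few-shot examples,
    # per the model card guidance.
    return [{"role": "user", "content": user_prompt}]
```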
## Approach
DeepSeek-R1's strong coding abilities enable it to act as an agent without being explicitly trained for tool use. By letting the model generate actions as Python code, it can flexibly interact with environments through code execution.
Tools are implemented as Python code that is included directly in the prompt. This can be a simple function definition or a module of a larger package - any valid Python code. The model then generates code actions that call these tools.
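To illustrate, here is a minimal sketch of this idea with a stubbed `web_search` tool; the tool name and prompt wording are assumptions, not the exact setup used in the experiment:

```python
import inspect

def web_search(query: str) -> str:
    """Hypothetical tool: return search results for a query (stubbed here)."""
    return f"results for: {query}"

# The tool's source code is included directly in the prompt, so the model
# can generate code actions that call it.
prompt = (
    "You can call the following Python tools from your code actions:\n\n"
    f"{inspect.getsource(web_search)}\n"
    "Task: find the release date of DeepSeek-R1."
)
```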
Results from executing these actions feed back to the model as follow-up messages, driving the next steps until a final answer is reached. The agent framework is a simple iterative coding loop that mediates the conversation between the model and its environment.
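The loop described above can be sketched as follows; `model`, `extract_code`, and the `result` convention are illustrative placeholders, not the actual freeact implementation:

```python
import re

def extract_code(text: str):
    """Pull the first fenced Python code block from a model reply, if any."""
    match = re.search(r"```python\n(.*?)```", text, re.DOTALL)
    return match.group(1) if match else None

def run_agent(model, task: str, max_steps: int = 10) -> str:
    """Iterative coding loop mediating between the model and its environment."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = model(messages)      # model plans and emits a code action
        code = extract_code(reply)
        if code is None:             # no code action -> treat reply as final answer
            return reply
        namespace = {}
        exec(code, namespace)        # execute the code action
        messages.append({"role": "assistant", "content": reply})
        messages.append(
            {"role": "user", "content": f"Execution result:\n{namespace.get('result')}"}
        )
    return "max steps reached"
```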
## Conversations
DeepSeek-R1 is used as a chat model in my experiment, where the model autonomously pulls additional context from its environment by using tools, e.g. by querying a search engine or fetching data from web pages. This drives the conversation with the environment until a final answer is reached.
In contrast, o1 models are known to perform poorly when used as chat models, i.e. they don't attempt to pull context during a conversation. According to the linked article, o1 models perform best when they have the full context available, with clear instructions on what to do with it.
Initially, I also tried a full-context-in-a-single-prompt approach at each step (with results from previous steps included), but this led to significantly lower scores on the GAIA subset. Switching to the conversational approach described above, I was able to reach the reported 65.6% performance.
This raises an interesting question about the claim that o1 isn't a chat model - perhaps this observation applied more to older o1 models that lacked tool usage capabilities? After all, isn't tool use an essential mechanism for enabling models to pull additional context from their environment? This conversational approach certainly seems effective for DeepSeek-R1, though I still need to run comparable experiments with o1 models.
## Generalization
Although DeepSeek-R1 was mainly trained with RL on math and coding tasks, it is remarkable that generalization to agentic tasks with tool use via code actions works so well. This ability to generalize to agentic tasks is reminiscent of recent research by DeepMind showing that RL generalizes whereas SFT memorizes, although generalization to tool use wasn't investigated in that work.
Despite its ability to generalize to tool use, DeepSeek-R1 often produces very long reasoning traces at each step, compared to other models in my experiments, limiting the usefulness of this model in a single-agent setup. Even simpler tasks sometimes take a long time to complete. Further RL on agentic tool use, whether through code actions or not, could be one option to improve efficiency.
## Underthinking
I also observed the underthinking phenomenon with DeepSeek-R1. This is when a reasoning model frequently switches between different reasoning thoughts without sufficiently exploring promising paths to reach a correct solution. It was a major reason for the overly long reasoning traces produced by DeepSeek-R1, and can be seen in the recorded traces that are available for download.
## Future experiments
Another common application of reasoning models is to use them for planning only, while using other models for generating code actions. This could be a potential new feature of freeact, if this separation of roles proves useful for more complex tasks.
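A hypothetical sketch of such a separation of roles; the function and parameter names are illustrative, not the freeact API:

```python
def plan_then_code(planner, coder, task: str) -> str:
    """Use a reasoning model for planning and a second model for code actions.

    `planner` and `coder` are stand-ins for calls to two different models.
    """
    plan = planner(f"Outline the steps to solve: {task}")
    return coder(f"Write a Python code action implementing this plan:\n{plan}")
```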
I'm also curious about how reasoning models that already support tool use (like o1, o3, ...) perform in a single-agent setup, with and without generating code actions. Recent developments like OpenAI's Deep Research or Hugging Face's open-source Deep Research, which also uses code actions, look interesting.