DeepSeek-R1 is an open-source language model built on DeepSeek-V3-Base that has been making waves in the AI community. Not only does it match, or even surpass, OpenAI's o1 model in many benchmarks, but it also comes with fully MIT-licensed weights. This marks it as the first non-OpenAI/Google model to deliver strong reasoning capabilities in an open and accessible manner.

What makes DeepSeek-R1 particularly exciting is its transparency. Unlike the less-open approaches of some industry leaders, DeepSeek has published a detailed training methodology in their paper.
The model is also remarkably cheap, with input tokens costing just $0.14-0.55 per million (vs o1's $15) and output tokens at $2.19 per million (vs o1's $60).

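To put those prices into perspective, here's a quick back-of-the-envelope comparison for a hypothetical workload of one million input and one million output tokens, using the upper end of R1's quoted input price (the workload itself is made up for illustration):

```python
# Rough cost comparison for a hypothetical 1M-input / 1M-output token workload.
# Prices are USD per million tokens as quoted above (R1 input at the $0.55 upper end).
r1_cost = 0.55 * 1 + 2.19 * 1     # ~= $2.74
o1_cost = 15.00 * 1 + 60.00 * 1   # ~= $75.00
print(f"R1: ${r1_cost:.2f}, o1: ${o1_cost:.2f}, ratio: {o1_cost / r1_cost:.0f}x")
```
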
Until roughly GPT-4, the conventional wisdom was that better models required more data and compute. While that's still true, models like o1 and R1 demonstrate an alternative: inference-time scaling through reasoning.

The Essentials

The DeepSeek-R1 paper presented multiple models, but the main ones among them were R1 and R1-Zero. Following these are a series of distilled models that, while interesting, I won't discuss here.

DeepSeek-R1 uses two major ideas:

1. A multi-stage pipeline where a small set of cold-start data kickstarts the model, followed by large-scale RL.
2. Group Relative Policy Optimization (GRPO), a reinforcement learning method that relies on comparing multiple model outputs per prompt to avoid the need for a separate critic.

R1 and R1-Zero are both reasoning models. This essentially means they do Chain-of-Thought before answering. For the R1 series of models, this takes the form of thinking within a `<think>` tag before answering with a final summary.

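To make that concrete, here's a minimal sketch of how you could split such a response into its reasoning and its final answer, assuming the `<think>...</think>` tag format the R1 models use:

```python
def split_r1_response(text: str) -> tuple[str, str]:
    """Split an R1-style response into (reasoning, answer).

    Assumes the model wraps its chain-of-thought in <think>...</think>
    before giving the final summary.
    """
    if "</think>" in text:
        thinking, answer = text.split("</think>", 1)
        return thinking.replace("<think>", "").strip(), answer.strip()
    return "", text.strip()  # no thinking block found

example = "<think>The user asks for 2+2. That is 4.</think>The answer is 4."
reasoning, answer = split_r1_response(example)
print(reasoning)  # -> The user asks for 2+2. That is 4.
print(answer)     # -> The answer is 4.
```
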
R1-Zero vs R1

R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base with no supervised fine-tuning (SFT). RL is used to optimize the model's policy to maximize reward.
R1-Zero achieves excellent accuracy but sometimes produces confusing outputs, such as mixing multiple languages in a single response. R1 fixes that by incorporating limited supervised fine-tuning and several RL passes, which improves both correctness and readability.

It is interesting how some languages may express certain ideas better, which leads the model to pick the most expressive language for the task.

Training Pipeline

The training pipeline that DeepSeek published in the R1 paper is immensely interesting. It showcases how they developed such strong reasoning models, and what you can expect from each stage. This includes the problems that the resulting models from each stage have, and how they solved them in the next stage.

It's interesting that their training pipeline differs from the usual one:

The usual training approach: pretraining on a large dataset (train to predict the next word) to get the base model → supervised fine-tuning → preference tuning via RLHF
R1-Zero: Pretrained → RL
R1: Pretrained → Multistage training pipeline with multiple SFT and RL stages

Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) samples to ensure the RL process has a decent starting point. This gives a good model to start RL from.
First RL Stage: Apply GRPO with rule-based rewards to improve reasoning correctness and formatting (such as forcing the chain-of-thought into thinking tags). When they were near convergence in the RL process, they moved to the next step. The result of this step is a strong reasoning model but with weak general capabilities, e.g., poor formatting and language mixing.
Rejection Sampling + general data: Create new SFT data through rejection sampling on the RL checkpoint (from step 2), combined with supervised data from the DeepSeek-V3-Base model. They collected around 600k high-quality reasoning samples.
Second Fine-Tuning: Fine-tune DeepSeek-V3-Base again on 800k total samples (600k reasoning + 200k general tasks) for broader capabilities. This step resulted in a strong reasoning model with general abilities.
Second RL Stage: Add more reward signals (helpfulness, harmlessness) to refine the final model, in addition to the reasoning rewards. The result is DeepSeek-R1 (a rough sketch of the full pipeline follows below).

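Here is that rough sketch of how the stages chain together. It's pseudocode dressed up as Python; every helper function is a stub I invented to make the flow readable, not DeepSeek's actual code:

```python
# Purely illustrative outline of the R1 training pipeline described above.

def sft(model, data):            # supervised fine-tuning (stub)
    return f"{model} -> SFT on {len(data)} samples"

def grpo_train(model, rewards):  # RL with GRPO (stub)
    return f"{model} -> GRPO with rewards {rewards}"

def rejection_sample(model, n):  # keep only high-quality reasoning traces (stub)
    return [f"reasoning_sample_{i}" for i in range(n)]

v3_base = "DeepSeek-V3-Base"

# 1. Cold start: fine-tune the base model on a few thousand CoT samples.
model = sft(v3_base, ["cot_sample"] * 4000)

# 2. First RL stage: rule-based rewards for reasoning accuracy and formatting.
model = grpo_train(model, rewards=["accuracy", "format"])

# 3. Rejection sampling on the RL checkpoint plus general supervised data
#    (~600k reasoning + ~200k general samples in the paper; scaled down here).
combined = rejection_sample(model, 600) + ["general_sample"] * 200

# 4. Second fine-tuning: train DeepSeek-V3-Base again on the combined data.
model = sft(v3_base, combined)

# 5. Second RL stage: add helpfulness/harmlessness rewards on top of the reasoning rewards.
deepseek_r1 = grpo_train(model, rewards=["accuracy", "format", "helpfulness", "harmlessness"])
print(deepseek_r1)
```
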
They also did model distillation for several Qwen and Llama models on the reasoning traces to get distilled-R1 models.

Model distillation is a technique where you use a teacher model to improve a student model by generating training data for the student model.
The teacher is typically a larger model than the student.

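Conceptually, that boils down to: have the teacher generate reasoning traces and fine-tune the student on them with plain SFT. A minimal sketch, with helpers I made up for illustration:

```python
# Sketch of reasoning distillation: the teacher (R1) generates reasoning traces,
# and the student (e.g., a Qwen or Llama model) is fine-tuned on them.
# Both helpers below are illustrative stubs.

def teacher_generate(prompt: str) -> str:
    # In practice: sample from DeepSeek-R1 and keep the full <think>... trace.
    return f"<think>reasoning about {prompt}</think>final answer for {prompt}"

def sft(student: str, dataset: list[tuple[str, str]]) -> str:
    # In practice: standard supervised fine-tuning on (prompt, completion) pairs.
    return f"{student} fine-tuned on {len(dataset)} distilled samples"

prompts = ["problem 1", "problem 2", "problem 3"]
distill_data = [(p, teacher_generate(p)) for p in prompts]
print(sft("Qwen-7B", distill_data))
```
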
Group Relative Policy Optimization (GRPO)

The basic idea behind using reinforcement learning for LLMs is to fine-tune the model's policy so that it naturally produces more accurate and useful answers.
They used a reward system that checks not only for correctness but also for proper formatting and language consistency, so the model gradually learns to favor responses that meet these quality criteria.

In this paper, they encourage the R1 model to generate chain-of-thought reasoning through RL training with GRPO.
Rather than adding a separate module at inference time, the training process itself nudges the model to produce detailed, step-by-step outputs, making the chain-of-thought an emergent behavior of the optimized policy.

What makes their approach particularly interesting is its reliance on straightforward, rule-based reward functions.
Instead of depending on expensive external models or human-graded examples as in standard RLHF, the RL used for R1 relies on simple criteria: it may give a higher reward if the answer is correct, if it follows the expected formatting, and if the language of the answer matches that of the prompt.
Not relying on a reward model also means you don't have to spend time and effort training it, and it doesn't take memory and compute away from your main model.

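To make that concrete, here's a toy version of what such a rule-based reward could look like. This is my own illustration, not DeepSeek's actual reward code:

```python
import re

def rule_based_reward(prompt: str, response: str, reference_answer: str) -> float:
    """Toy rule-based reward: correctness + format + language consistency."""
    reward = 0.0

    # 1. Correctness: compare the final answer (after </think>) to a reference.
    answer = response.split("</think>")[-1].strip()
    if answer == reference_answer.strip():
        reward += 1.0

    # 2. Format: reward responses that wrap their reasoning in <think>...</think>.
    if re.search(r"<think>.*</think>", response, flags=re.DOTALL):
        reward += 0.5

    # 3. Language consistency: crude check that the response script matches the prompt
    #    (here: both contain CJK characters, or neither does).
    def has_cjk(text: str) -> bool:
        return re.search(r"[\u4e00-\u9fff]", text) is not None

    if has_cjk(prompt) == has_cjk(response):
        reward += 0.25

    return reward

print(rule_based_reward("What is 2+2?", "<think>2+2=4</think>4", "4"))  # -> 1.75
```
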
GRPO was introduced in the DeepSeekMath paper. Here's how GRPO works:

1. For each input prompt, the model generates multiple responses.
2. Each response gets a scalar reward based on factors like accuracy, formatting, and language consistency.
3. Rewards are adjusted relative to the group's performance, essentially measuring how much better each response is compared to the others.
4. The model updates its policy slightly to favor responses with higher relative rewards. It only makes minor adjustments, using techniques like clipping and a KL penalty, to ensure the policy doesn't drift too far from its original behavior.

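Here's a small numeric sketch of steps 2-4: group-relative advantages are computed by normalizing rewards within the group, and a PPO-style clipped objective keeps the update small. The numbers are made up, and the KL penalty against a reference policy is only mentioned in a comment:

```python
import numpy as np

# One prompt, a group of 4 sampled responses, and their scalar rewards (step 2).
rewards = np.array([1.75, 0.5, 1.0, 0.25])

# Step 3: group-relative advantages - how much better each response is than the group.
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Step 4 (simplified): PPO-style clipped surrogate on the probability ratios
# between the new and old policy for each response. Full GRPO also subtracts
# a KL penalty against a reference policy; the ratios here are made up.
ratios = np.array([1.3, 0.9, 1.05, 0.7])   # pi_new(response) / pi_old(response)
eps = 0.2                                  # clipping range
surrogate = np.minimum(ratios * advantages,
                       np.clip(ratios, 1 - eps, 1 + eps) * advantages)
objective = surrogate.mean()               # maximized w.r.t. the policy parameters

print(advantages.round(3), round(float(objective), 3))
```
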
A cool aspect of GRPO is its flexibility. You can use simple rule-based reward functions, for instance awarding a bonus when the model correctly uses the expected syntax, to guide the training.

While DeepSeek used GRPO, you could use alternative methods instead (PPO or PRIME).

For those looking to dive deeper, Will Brown has written quite a nice implementation of training an LLM with RL using GRPO. GRPO has also already been added to the Transformer Reinforcement Learning (TRL) library, which is another good resource.
Finally, Yannic Kilcher has a great video explaining GRPO by going through the DeepSeekMath paper.

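If you want to experiment with this yourself, the sketch below shows roughly what a GRPO run with TRL could look like. I'm writing it from memory of the TRL docs, so treat the model name, dataset, and argument names as assumptions and check the current documentation before running it:

```python
# Rough sketch of GRPO fine-tuning with TRL (argument names may differ
# slightly between TRL versions; verify against the TRL docs).
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def format_reward(completions, **kwargs):
    # Simple rule-based reward: 1.0 if the completion contains a <think> block.
    # Assumes plain-text completions (as with a standard prompt-only dataset).
    return [1.0 if "<think>" in c and "</think>" in c else 0.0 for c in completions]

dataset = load_dataset("trl-lib/tldr", split="train")  # any prompt dataset works

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",   # small model, just for experimentation
    reward_funcs=format_reward,
    args=GRPOConfig(output_dir="grpo-test", num_generations=4),
    train_dataset=dataset,
)
trainer.train()
```
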
Is RL on LLMs the path to AGI?

As a final note on explaining DeepSeek-R1 and the methodologies they've presented in their paper, I want to highlight a passage from the DeepSeekMath paper, based on a point Yannic Kilcher made in his video.

These findings indicate that RL enhances the model's overall performance by rendering the output distribution more robust; in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities.

In other words, RL fine-tuning tends to shape the output distribution so that the highest-probability outputs are more likely to be correct, even though the overall capability (as measured by the diversity of correct answers) is largely present in the pretrained model.

This suggests that reinforcement learning on LLMs is more about refining and "shaping" the existing distribution of responses rather than endowing the model with entirely new capabilities.
Consequently, while RL techniques such as PPO and GRPO can produce substantial performance gains, there seems to be an inherent ceiling determined by the underlying model's pretrained knowledge.

It is unclear to me how far RL will take us. Perhaps it will be the stepping stone to the next big milestone. I'm excited to see how it unfolds!

Running DeepSeek-R1

I've used DeepSeek-R1 via the official chat interface for various problems, which it seems to solve well enough. The additional search functionality makes it even nicer to use.

Interestingly, o3-mini(-high) was released as I was writing this post. From my initial testing, R1 seems stronger at math than o3-mini.

I also rented a single H100 via Lambda Labs for $2/h (26 CPU cores, 214.7 GB RAM, 1.1 TB SSD) to run some experiments.
The main goal was to see how the model would perform when deployed on a single H100 GPU, not to extensively test the model's capabilities.

671B via llama.cpp

DeepSeek-R1 1.58-bit (UD-IQ1_S) quantized model by Unsloth, with a 4-bit quantized KV cache and partial GPU offloading (29 layers running on the GPU), running via llama.cpp:

29 layers seemed to be the sweet spot given this configuration.

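For reference, here's roughly how such a setup can be driven from Python through the llama-cpp-python bindings. The model path is a placeholder for the Unsloth GGUF files, I'm only showing the options I'm reasonably sure about, and the 4-bit KV-cache quantization is configured through additional cache-type options that vary by version and are omitted here:

```python
# Minimal sketch: a GGUF-quantized DeepSeek-R1 with partial GPU offloading
# via llama-cpp-python. The model path is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf",  # placeholder path
    n_gpu_layers=29,   # offload 29 layers to the GPU, the rest stays on CPU
    n_ctx=8192,        # context window
    n_threads=26,      # match the available CPU cores
)

out = llm("<|User|>What is 7 * 13?<|Assistant|>", max_tokens=512)
print(out["choices"][0]["text"])
```
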
Performance:

A r/localllama user reported that they were able to get over 2 tok/sec with DeepSeek R1 671B, without using their GPU, on their local gaming setup.
Digital Spaceport wrote a full guide on how to run DeepSeek R1 671B fully locally on a $2000 EPYC server, on which you can get ~4.25 to 3.5 tokens per second.

As you can see, the tokens/s isn't quite bearable for any serious work, but it's fun to run these large models on accessible hardware.

What matters most to me is a combination of usefulness and time-to-usefulness in these models. Since reasoning models need to think before answering, their time-to-usefulness is usually higher than for other models, but their usefulness is also usually higher.
We need to both maximize usefulness and minimize time-to-usefulness.

70B via Ollama

70.6B params, 4-bit KM quantized DeepSeek-R1 running via Ollama:

GPU utilization shoots up here, as expected when compared to the mostly CPU-powered run of 671B that I showcased above.

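Once the model is pulled (e.g., `ollama pull deepseek-r1:70b`, assuming that tag), you can query it through Ollama's local REST API. A minimal sketch:

```python
# Query a locally running Ollama server (default port 11434) for the 70B model.
# Assumes the model has already been pulled, e.g. with `ollama pull deepseek-r1:70b`.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "deepseek-r1:70b",
        "prompt": "How many r's are in the word strawberry?",
        "stream": False,   # return one JSON object instead of a stream
    },
    timeout=600,
)
print(resp.json()["response"])  # includes the <think>...</think> reasoning block
```
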
Resources

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
[2402.03300] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
DeepSeek R1 - Notion (Building a fully local "deep researcher" with DeepSeek-R1 - YouTube).
DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs.
The Illustrated DeepSeek-R1 - by Jay Alammar.
Explainer: What's R1 & Everything Else? - Tim Kellogg.
DeepSeek R1 Explained to your grandma - YouTube

DeepSeek

- Try R1 at chat.deepseek.com.
GitHub - deepseek-ai/DeepSeek-R1.
deepseek-ai/Janus-Pro-7B · Hugging Face (January 2025): Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It can both understand and generate images.
DeepSeek-R1: Incentivizing Reasoning Capability in Large Language Models via Reinforcement Learning (January 2025) This paper introduces DeepSeek-R1, an open-source reasoning model that rivals the performance of OpenAI's o1. It provides a detailed methodology for training such models using large-scale reinforcement learning techniques.
DeepSeek-V3 Technical Report (December 2024) This report discusses the implementation of an FP8 mixed precision training framework validated on an extremely large-scale model, achieving both accelerated training and reduced GPU memory usage.
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (January 2024) This paper explores scaling laws and presents findings that facilitate the scaling of large-scale models in open-source configurations. It introduces the DeepSeek LLM project, dedicated to advancing open-source language models with a long-term perspective.
DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence (January 2024) This research introduces the DeepSeek-Coder series, a range of open-source code models trained from scratch on 2 trillion tokens. The models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task to enhance code generation and infilling.
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (May 2024) This paper presents DeepSeek-V2, a Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference.
DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (June 2024) This research introduces DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT-4 Turbo in code-specific tasks.

Interesting events

- Hong Kong University replicates R1 results (Jan 25, '25).
- Huggingface announces huggingface/open-r1: Fully open reproduction of DeepSeek-R1 to replicate R1, fully open source (Jan 25, '25).
- OpenAI researcher confirms the DeepSeek team independently found and used some core ideas the OpenAI team used on the way to o1

Liked this post? Join the newsletter.