
Conversation

@JamesMBartlett
Collaborator

No description provided.


@MadcowD MadcowD left a comment


Looks good! Just want some more detail.


\section{Introduction}
\todo[inline]{Introduction to DDPG and recent advances in deep RL. }
[INSERT OPENING SENTENCE HERE] The current state of the art in deep reinforcement learning is the Deep Deterministic Policy Gradient (DDPG) algorithm [\cite{lillicrap2015ddpg}], which extends the deterministic policy gradient (DPG) algorithm [\cite{silver2014dpg}] to continuous, high-dimensional action spaces with considerable success. DDPG is an actor-critic method built on DPG: the critic $Q(s, a)$ is learned with model-free updates as in the Deep Q-Network (DQN) [\cite{mnih2013dqn}], and the actor $\mu(s)$ is updated by sampling the deterministic policy gradient of [\cite{silver2014dpg}]. On many physical control problems, DDPG achieves performance comparable to planning-based solvers.
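
For reference, the actor update described here is usually written as the sampled deterministic policy gradient; a sketch in the notation of \cite{lillicrap2015ddpg} (the parameter symbols $\theta^{\mu}$, $\theta^{Q}$ and the behaviour state distribution $\rho^{\beta}$ are assumed here, not yet defined in this draft, and amsmath/amssymb are assumed to be loaded) could read:

\begin{equation}
\nabla_{\theta^{\mu}} J \approx \mathbb{E}_{s \sim \rho^{\beta}}\!\left[ \left.\nabla_{a} Q(s, a \mid \theta^{Q})\right|_{a = \mu(s \mid \theta^{\mu})} \, \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu}) \right]
\end{equation}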

This looks good, but I would then add a second paragraph describing the downsides of this algorithm. *We need to motivate the rest of the paper!* It could cover:

1. Divergence.
2. Hyperparameter instability (some $\gamma$s work and others do not; in practice the method requires a lot of tuning, and obviously you need to cite evidence for this argument).
3. The replay buffer is hacky; try to deconstruct the reasons why its use is essential for the DDPG algorithm (a minimal sketch follows below).
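
On point 3, as a starting point for that discussion, here is a minimal sketch (hypothetical Python, not from our codebase) of what the replay buffer provides: transitions are stored in a fixed-size ring buffer and sampled uniformly at random, so critic minibatches are decorrelated from the agent's current trajectory and old experience is reused.

```python
import random


class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity=1_000_000):
        self.capacity = capacity
        self.buffer = []
        self.position = 0  # index of the next slot to overwrite

    def add(self, transition):
        # Overwrite the oldest transition once the buffer is full (ring buffer).
        if len(self.buffer) < self.capacity:
            self.buffer.append(transition)
        else:
            self.buffer[self.position] = transition
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size=64):
        # Uniform sampling decorrelates the minibatch from the current
        # trajectory, which is what stabilises the critic's Q-learning update.
        return random.sample(self.buffer, batch_size)
```

Deconstructing why uniform sampling and large capacity matter (i.i.d.-like minibatches, sample reuse) would naturally motivate the rest of the paper.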

@MadcowD

MadcowD commented Oct 8, 2016

@JamesMBartlett Any updates on this?
