zazo2002 committed
Commit 227b6b0 · 1 Parent(s): 2e972a5

first commit

Files changed (4)
  1. .DS_Store +0 -0
  2. .gitattributes +3 -0
  3. app.py +466 -0
  4. requirements.txt +9 -0
.DS_Store ADDED
Binary file (6.15 kB).
 
.gitattributes CHANGED
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ *.png filter=lfs diff=lfs merge=lfs -text
+ *.jpg filter=lfs diff=lfs merge=lfs -text
+ *.mov filter=lfs diff=lfs merge=lfs -text
app.py ADDED
@@ -0,0 +1,466 @@
import gradio as gr
import os

# Paper links and descriptions for each algorithm
# Implementation notes for algorithms on various environments
implementation_info = {
    "CartPole-v1_PPO": """
### 🛠️ Implementation Challenges for PPO on CartPole

1. **Discrete Actions Handling**: CartPole uses discrete actions (left/right), so we had to use a `Categorical` distribution instead of `Normal`. This changes how actions are sampled and log-probabilities are computed.
2. **Shared Network**: We used a single neural network with shared layers for both actor and critic, which helps with parameter efficiency but can lead to interference if not tuned well.
3. **Advantage Estimation**: We calculated advantages using the simple difference `returns - values`, and normalized them to stabilize training.
4. **Multiple Epoch Updates**: PPO requires updating the same batch several times. We had to carefully manage log probabilities and ratios to ensure stable learning.
5. **Gym Compatibility**: Recent changes in the Gym API required handling tuples when resetting or stepping the environment.
6. **Video Recording**: Gym's rendering had to be accessed using `render(mode='rgb_array')`, and OpenCV needed proper BGR conversion and resizing.

Despite being simpler than continuous control, PPO on CartPole still demanded precision in batching, advantage computation, and log-prob tracking.
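For reference, a minimal sketch of the clipped update with a `Categorical` policy, in the spirit of the points above (function and tensor names are illustrative, not the exact variables in our training code):

```python
import torch
from torch.distributions import Categorical

def ppo_clip_loss(policy, states, actions, old_log_probs, advantages, eps_clip=0.2):
    # Re-evaluate the stored batch under the current policy.
    dist = Categorical(logits=policy(states))
    new_log_probs = dist.log_prob(actions)

    # Probability ratio r_t(theta) between the new and old policies.
    ratios = torch.exp(new_log_probs - old_log_probs)

    # Normalizing advantages keeps the update scale stable across batches.
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

    # Clipped surrogate objective, negated because optimizers minimize.
    surr1 = ratios * advantages
    surr2 = torch.clamp(ratios, 1 - eps_clip, 1 + eps_clip) * advantages
    return -torch.min(surr1, surr2).mean()
```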

### 📊 Hyperparameter Impact Analysis

*Note: Detailed hyperparameter experiments were conducted on this environment, with insights applicable to other discrete control tasks.*

Our hyperparameter tuning experiments revealed several key insights:

1. **Learning Rate (LR)**:
   - Higher learning rate (0.01) led to significantly faster convergence
   - Lower learning rate (0.0005) struggled to reach the solving threshold

2. **Discount Factor (GAMMA)**:
   - Higher discount (0.999) had more variance but eventually solved
   - Lower discount (0.90) learned quickly initially but had stability issues

3. **Clipping Range (EPS_CLIP)**:
   - Both values (0.1 and 0.3) solved successfully
   - Higher clipping (0.3) had slightly better early performance

4. **Update Epochs (K_EPOCH)**:
   - Lower value (1) struggled with learning speed
   - Higher value (10) solved very quickly, showing more updates help
""",
    "MountainCarContinuous-v0_PPO": """
### 🛠️ Implementation Challenges for PPO on MountainCarContinuous

1. **Continuous Action Sampling**: We had to use a `Normal` distribution instead of `Categorical`, introducing the need to manage `action_std` and diagonal covariance matrices.
2. **Action Standard Deviation Decay**: To reduce exploration over time, we decayed `action_std` every 200 episodes to help the agent converge.
3. **Generalized Advantage Estimation (GAE)**: We implemented GAE to reduce variance in advantage estimates using a lambda-weighted future reward structure.
4. **Separate Actor/Critic Networks**: Continuous actions benefited from separate actor and critic networks for better learning stability.
5. **Entropy Regularization**: We added an entropy bonus to encourage exploration, which was essential in early episodes where rewards are sparse.
6. **Gym Compatibility + Video Capture**: Gym's new step API required checking `terminated` and `truncated`, and video recording had to handle raw RGB frames with OpenCV.

MountainCarContinuous was trickier than CartPole due to continuous actions and sparse rewards — we had to introduce action variance decay and GAE to learn successfully.
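A minimal sketch of the GAE recursion from point 3 above (variable names and default constants are illustrative; `values` is assumed to carry one extra bootstrap entry for the state after the last step):

```python
import torch

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    # rewards, dones: tensors of length T; values: tensor of length T + 1.
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        # Lambda-weighted accumulation of future residuals.
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    returns = advantages + values[:T]
    return advantages, returns
```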

### 📊 Hyperparameter Impact Analysis

*Note: Detailed hyperparameter experiments were conducted on this environment, with insights applicable to other continuous control tasks.*

Our hyperparameter tuning experiments revealed several key insights:

1. **Action Standard Deviation**:
   - Higher value (0.80) led to faster convergence by enabling greater exploration
   - Lower value (0.40) resulted in much slower learning due to limited exploration

2. **Clipping Parameter (EPS_CLIP)**:
   - Lower value (0.10) enabled faster learning and quicker convergence
   - Higher value (0.30) still solved the environment but took longer

3. **Training Epochs**:
   - Higher value (20) dramatically improved learning speed, solving in ~300 episodes
   - Lower value (5) struggled to make progress, highlighting the importance of sufficient updates

4. **GAE Lambda**:
   - Lower value (0.80) significantly improved learning speed, solving in ~400 episodes
   - Higher value (0.99) resulted in slower, more stable but less efficient learning

5. **Discount Factor (GAMMA)**:
   - Higher value (0.999) led to faster convergence by focusing on long-term returns
   - Lower value (0.90) resulted in slower learning due to shortsighted optimization

6. **Actor Learning Rate**:
   - Higher value (0.001) enabled faster policy updates and quicker convergence
   - Lower value (0.0001) resulted in slower but more stable learning
""",
    "MountainCarContinuous-v0_SAC": """
### 🛠️ Implementation Challenges for SAC on MountainCarContinuous

1. **Entropy Maximization**: Implementing the entropy term required careful balancing to ensure enough exploration without sacrificing performance.
2. **Twin Critics**: We needed two separate Q-networks to mitigate overestimation bias, requiring careful management of target networks.
3. **Automatic Temperature Tuning**: To automatically adjust the entropy coefficient, we had to implement a separate optimization process.
4. **Replay Buffer Management**: Efficient experience replay was crucial for off-policy learning in this sparse reward environment.
5. **Reward Scaling**: The large +100 reward for reaching the goal needed proper scaling to stabilize training.
6. **Action Squashing**: Ensuring actions fell within the environment limits using tanh and proper log probability calculations.
7. **Reward Shaping**: Unlike the standard SAC implementation, reward shaping was necessary to guide exploration in this sparse reward environment.

SAC's entropy maximization helped solve the exploration challenges in MountainCarContinuous where traditional methods struggle.
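A minimal sketch of the soft Bellman target that ties together points 1, 2, and 6 above, assuming an `actor.sample(state)` helper that returns a tanh-squashed action and its log-probability (names are illustrative, not the exact code):

```python
import torch

@torch.no_grad()
def sac_target(reward, next_state, done, actor, q1_target, q2_target, alpha, gamma=0.99):
    # Sample the next action from the current policy along with log pi(a|s).
    next_action, next_log_prob = actor.sample(next_state)

    # Twin critics: take the minimum to reduce overestimation bias.
    q_next = torch.min(q1_target(next_state, next_action),
                       q2_target(next_state, next_action))

    # Soft value: subtract the temperature-weighted log-probability (entropy bonus).
    soft_value = q_next - alpha * next_log_prob

    # Bootstrapped target, cut off at terminal transitions.
    return reward + gamma * (1.0 - done) * soft_value
```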

### 📊 Hyperparameter Impact Analysis

*Note: Detailed hyperparameter experiments were conducted on this environment, with insights applicable to other continuous control tasks.*

Our comprehensive hyperparameter study revealed critical insights:

1. **Target Update Rate (τ)**:
   - Lower values (0.005) provided excellent stability and fastest convergence around episode 20
   - Medium values (0.01) showed good performance but slightly slower convergence
   - Higher values (0.02) led to more volatile learning and delayed convergence

2. **Learning Rate**:
   - Higher learning rate (0.001) achieved fastest convergence and most stable performance
   - Medium rate (0.0006) showed good but slower convergence
   - Lower rate (0.0003) struggled significantly, taking much longer to reach optimal performance

3. **Temperature Parameter (α)**:
   - Lower values (0.1) led to fastest and most stable convergence, reaching ~95 reward consistently
   - Medium values (0.5) showed competitive performance but with more variability
   - Higher values (0.9) resulted in significantly slower learning and lower asymptotic performance

4. **Discount Factor (γ)**:
   - Higher values (0.995) demonstrated fastest convergence and excellent stability
   - Medium values (0.99) showed good performance but slower initial learning
   - Lower values (0.95) struggled with long-term planning, achieving lower final performance

**Key Finding**: SAC showed remarkable sensitivity to hyperparameter choices, with τ=0.005, LR=0.001, α=0.1, and γ=0.995 providing optimal performance.
""",
    "Pendulum-v1_SAC": """
### 🛠️ Implementation Challenges for SAC on Pendulum

1. **Continuous Torque Control**: Managing the continuous action space (-2 to 2) required proper scaling and action bounds.
2. **Negative Rewards**: Pendulum's negative rewards required careful Q-value initialization to avoid pessimistic starts.
3. **Dense Reward Function**: Unlike sparse reward environments, we needed to tune hyperparameters to handle the frequent feedback.
4. **Temperature Parameter Tuning**: Finding the right entropy coefficient was critical for balancing exploration and exploitation.
5. **Neural Network Architecture**: The relatively simple state space allowed for smaller networks, but required tuning layer sizes.
6. **Target Network Updates**: We used soft updates with Polyak averaging to ensure stable learning.

SAC's ability to balance exploration and exploitation made it well-suited for the Pendulum's continuous control problem with its dense reward feedback.
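Point 6 refers to Polyak averaging; a minimal sketch of the soft target update, where τ is the target update rate analyzed below and the two networks are any matching pair of `torch.nn.Module`s:

```python
import torch

@torch.no_grad()
def soft_update(target_net, online_net, tau=0.005):
    # theta_target <- tau * theta_online + (1 - tau) * theta_target
    for target_param, param in zip(target_net.parameters(), online_net.parameters()):
        target_param.mul_(1.0 - tau)
        target_param.add_(tau * param)
```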

### 📊 Hyperparameter Impact Analysis

*Note: Hyperparameter analysis was conducted on MountainCarContinuous-v0 and the insights apply to both environments due to similar continuous control characteristics.*

The hyperparameter insights from MountainCarContinuous transfer well to Pendulum:

1. **Target Update Rate (τ)**: Lower values (0.005) provide better stability for continuous control
2. **Learning Rates**: Higher learning rates (0.001) enable faster convergence in both environments
3. **Temperature Parameter (α)**: Lower values (0.1) balance exploration-exploitation effectively
4. **Discount Factor (γ)**: Higher values (0.995) support better long-term planning in both tasks
""",
    "MountainCarContinuous-v0_TD3": """
### 🛠️ Implementation Challenges for TD3 on MountainCarContinuous

1. **Twin Delayed Critics**: Implementing two critic networks and delaying policy updates required careful synchronization.
2. **Target Policy Smoothing**: Adding noise to target actions helped prevent exploitation of Q-function errors.
3. **Delayed Policy Updates**: Updating the policy less frequently than the critics required tracking update steps.
4. **Sparse Rewards**: The sparse reward structure of MountainCar required extended exploration periods.
5. **Action Bounds**: Ensuring actions stayed within [-1, 1] while calculating proper gradients needed special handling.
6. **Initialization Strategies**: Proper weight initialization was critical for stable learning in this environment.

TD3's conservative policy updates and overestimation bias mitigation proved effective for the challenging MountainCarContinuous task.
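A minimal sketch of target policy smoothing (point 2 above), assuming actions are already scaled to [-1, 1]; the noise constants are illustrative defaults, not necessarily the tuned values:

```python
import torch

@torch.no_grad()
def smoothed_target_action(actor_target, next_state,
                           policy_noise=0.2, noise_clip=0.5, max_action=1.0):
    # Deterministic action proposed by the target actor.
    next_action = actor_target(next_state)

    # Clipped Gaussian noise smooths the critic target over nearby actions,
    # so the policy cannot exploit sharp errors in the Q-function.
    noise = (torch.randn_like(next_action) * policy_noise).clamp(-noise_clip, noise_clip)

    # Keep the perturbed action inside the valid action range.
    return (next_action + noise).clamp(-max_action, max_action)
```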

### 📊 Hyperparameter Impact Analysis

*Note: Hyperparameter analysis was conducted on Pendulum-v1 and the insights apply to both environments due to similar continuous control characteristics.*

TD3 required careful tuning for continuous control environments:

1. **Policy Noise**: Higher noise values improved exploration in sparse reward environments
2. **Target Update Frequency**: Delayed policy updates (every 2 critic updates) provided stability
3. **Learning Rates**: Balanced actor and critic learning rates were crucial for convergence
4. **Exploration Strategy**: For MountainCarContinuous, reward shaping was necessary to guide initial exploration
""",
    "Pendulum-v1_TD3": """
### 🛠️ Implementation Challenges for TD3 on Pendulum

1. **Exploration Strategy**: Balancing exploration noise magnitude was crucial for the pendulum's sensitive control.
2. **Clipped Double Q-learning**: Implementing the minimum of two critics required careful tensor operations.
3. **Target Networks**: Managing four separate networks (two critics, two targets) required organized code structure.
4. **Delayed Policy Updates**: Synchronizing updates at the right frequency was important for stability.
5. **Reward Scaling**: Pendulum's large negative rewards needed normalization to prevent value function saturation.
6. **Network Sizes**: Finding the right network capacity for both actor and critics affected learning speed.

TD3's focus on stable learning made it effective for Pendulum, where small action differences can lead to very different outcomes.
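A minimal sketch of the delayed-update schedule from point 4 above, assuming an `agent` object with `update_critics`, `update_actor`, and `soft_update_targets` helpers (an illustrative structure, not the exact training loop):

```python
def td3_update_step(agent, batch, total_updates, policy_delay=2):
    # The critics are trained on every gradient step.
    agent.update_critics(batch)

    # The actor and all target networks are only updated every
    # `policy_delay` critic updates, which keeps the policy from
    # chasing a still-noisy value estimate.
    if total_updates % policy_delay == 0:
        agent.update_actor(batch)
        agent.soft_update_targets()
```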

### 📊 Hyperparameter Impact Analysis

*Note: Detailed hyperparameter experiments were conducted on this environment, with insights applicable to other continuous control tasks.*

Our hyperparameter experiments on Pendulum revealed:

1. **Policy Noise (σ)**:
   - Lower noise accelerated convergence and increased stability
   - Higher noise provided more exploration but slower convergence

2. **Target Update Rate (τ)**:
   - Lower τ values led to slower but more stable convergence
   - Higher τ values enabled faster learning with acceptable stability

3. **Actor Learning Rate**:
   - Standard rate provided balanced learning speed and stability
   - Higher rates led to instability, lower rates slowed convergence

4. **Critic Learning Rate**:
   - Similar patterns to actor learning rate
   - Twin critics benefited from synchronized learning rates
""",
    "MountainCar-v0_DQN": """
### 🛠️ Implementation Challenges for DQN on MountainCar (Discrete)

1. **Discrete Action Space**: Working with the limited discrete actions (left, neutral, right) required effective exploration.
2. **Sparse Rewards**: The sparse reward structure meant the agent received almost no feedback until reaching the goal.
3. **Experience Replay**: Implementing a replay buffer to break correlations in the observation sequence was crucial.
4. **Target Network Updates**: Hard updates to the target network required careful timing to balance stability and learning speed.
5. **Epsilon Decay**: Finding the right exploration schedule was essential for the agent to discover the momentum-building strategy.
6. **Double DQN**: We implemented Double DQN to reduce overestimation bias, which was important for stable learning.

DQN required careful tuning to overcome the exploration challenges in MountainCar's sparse reward setting.
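A minimal sketch of the Double DQN target from point 6 above (the online network selects the greedy action, the target network evaluates it); tensor names are illustrative:

```python
import torch

@torch.no_grad()
def double_dqn_target(reward, next_state, done, online_net, target_net, gamma=0.99):
    # Online network picks the greedy next action...
    next_action = online_net(next_state).argmax(dim=1, keepdim=True)

    # ...but the target network evaluates it, which reduces overestimation bias.
    next_q = target_net(next_state).gather(1, next_action).squeeze(1)

    # Bootstrapped target, truncated at terminal transitions.
    return reward + gamma * (1.0 - done) * next_q
```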

### 📊 Hyperparameter Impact Analysis

*Note: Hyperparameter analysis was conducted on CartPole-v1 and the insights apply to both discrete environments due to similar DQN architecture requirements.*

DQN's performance was highly sensitive to hyperparameter choices:

1. **Learning Rate**: Balanced rates provided steady convergence without instability
2. **Epsilon Decay**: Gradual decay from 1.0 to 0.01 over episodes enabled sufficient exploration
3. **Replay Buffer Size**: Large buffer helped provide diverse experiences for breaking correlations
4. **Target Network Update**: Regular updates balanced stability with learning speed
""",
    "CartPole-v1_DQN": """
### 🛠️ Implementation Challenges for DQN on CartPole

1. **Binary Action Selection**: Implementing efficient Q-value calculation for the two discrete actions (left/right).
2. **Reward Discount Tuning**: Finding the right gamma value for this task with potentially long episodes.
3. **Network Architecture**: Balancing network capacity with training stability for this relatively simple task.
4. **Epsilon Annealing**: Creating an effective exploration schedule that transitions from exploration to exploitation.
5. **Replay Buffer Size**: Tuning the memory size to balance between recent and diverse experiences.
6. **Update Frequency**: Determining how often to update the target network to maintain stability.

DQN's ability to learn value functions directly made it effective for CartPole, though careful exploration strategy was still necessary.
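A minimal sketch of the epsilon-greedy annealing from point 4 above (the decay constants here are illustrative; the range we actually used is summarized below):

```python
import random

def select_action(q_values, episode, eps_start=1.0, eps_end=0.01, decay=0.995):
    # Exponentially anneal epsilon from eps_start toward eps_end per episode.
    epsilon = max(eps_end, eps_start * (decay ** episode))

    # Explore with probability epsilon, otherwise act greedily on the Q-values.
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```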

### 📊 Hyperparameter Impact Analysis

*Note: Detailed hyperparameter experiments were conducted on this environment, with insights applicable to other discrete control tasks.*

DQN demonstrated robust performance on CartPole across different hyperparameter settings:

1. **Learning Rate**: Higher rates led to faster convergence, lower rates were more stable
2. **Batch Size**: Medium batch sizes (64) provided good balance of gradient quality and computational efficiency
3. **Network Architecture**: Two hidden layers with 128 units each proved sufficient for this task
4. **Replay Buffer**: 100,000 transitions provided adequate experience diversity
"""
}

algo_info = {
    "PPO": {
        "description": "Proximal Policy Optimization (PPO) is a policy gradient method that uses a clipped surrogate objective to ensure stable and efficient updates.",
        "paper": "https://arxiv.org/abs/1707.06347",
        "equation": "L^{CLIP}(\\theta) = \\hat{\\mathbb{E}}_t [ \\min(r_t(\\theta)\\hat{A}_t, \\text{clip}(r_t(\\theta), 1 - \\epsilon, 1 + \\epsilon)\\hat{A}_t) ]"
    },
    "DQN": {
        "description": "Deep Q-Network (DQN) uses deep neural networks to approximate the Q-value function in reinforcement learning.",
        "paper": "https://arxiv.org/abs/1312.5602",
        "equation": "L_i(\\theta_i) = \\mathbb{E}_{s,a,r,s'}[(r + \\gamma \\max_{a'} Q(s',a'; \\theta_i^-) - Q(s,a;\\theta_i))^2]"
    },
    "SAC": {
        "description": "Soft Actor-Critic (SAC) is an off-policy actor-critic algorithm that maximizes a trade-off between expected return and entropy.",
        "paper": "https://arxiv.org/abs/1812.05905",
        "equation": "J(\\pi) = \\sum_t \\mathbb{E}_{(s_t, a_t) \\sim \\rho_\\pi} [r(s_t, a_t) + \\alpha \\mathcal{H}(\\pi(\\cdot|s_t))]"
    },
    "TD3": {
        "description": "Twin Delayed DDPG (TD3) addresses overestimation bias in actor-critic methods by using two critics and target policy smoothing.",
        "paper": "https://arxiv.org/abs/1802.09477",
        "equation": "L(\\theta) = \\mathbb{E}[(r + \\gamma \\min_{i=1,2} Q_i(s', \\pi(s')) - Q(s,a))^2]"
    }
}

# Environment descriptions
env_info = {
    "CartPole-v1": "**CartPole-v1**\n\n- Goal: Keep the pole balanced upright on a moving cart.\n- Observation: Cart position/velocity, pole angle/angular velocity (4D).\n- Action Space: Discrete (left or right).\n- Reward: +1 per time step the pole is upright.\n- Termination: Pole falls or cart moves out of bounds.\n- Challenge: Requires rapid corrections; sensitive to delayed actions.",

    "MountainCarContinuous-v0": "**MountainCarContinuous-v0**\n\n- Goal: Drive the car up the right hill to reach the flag.\n- Observation: Position and velocity (2D).\n- Action Space: Continuous (thrust left/right).\n- Reward: +100 for reaching goal, small negative each step.\n- Termination: 200 steps or reaching the goal.\n- Challenge: Sparse reward, needs exploration to gain momentum.",

    "MountainCar-v0": "**MountainCar-v0**\n\n- Goal: Drive the car up the right hill to reach the flag.\n- Observation: Position and velocity (2D).\n- Action Space: Discrete (left, neutral, right).\n- Reward: -1 per step, 0 upon reaching goal.\n- Termination: 200 steps or reaching the goal.\n- Challenge: Very sparse reward, requires building momentum through oscillations.",

    "Pendulum-v1": "**Pendulum-v1**\n\n- Goal: Swing a pendulum to upright position and balance it.\n- Observation: Sine/cosine of angle, angular velocity (3D).\n- Action Space: Continuous (torque).\n- Reward: Negative cost based on angle from vertical and energy use.\n- Termination: After 200 steps (no early termination).\n- Challenge: Requires energy-efficient control and dealing with momentum."
}

# Mapping of algorithms to supported environments
algo_to_env = {
    "PPO": ["CartPole-v1", "MountainCarContinuous-v0"],
    "SAC": ["MountainCarContinuous-v0", "Pendulum-v1"],
    "TD3": ["MountainCarContinuous-v0", "Pendulum-v1"],
    "DQN": ["MountainCar-v0", "CartPole-v1"]
}

# Interface
with gr.Blocks() as demo:
    gr.Markdown("""
# Reinforcement Learning Algorithm Explorer
Select an algorithm to learn more, then run it on a supported environment.

**Environment**: A simulation where an agent takes actions to maximize rewards. Each interaction loop consists of: observation → action → reward → new state. The agent learns to optimize future rewards.
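
For orientation, this loop looks roughly like the following in Gymnasium (a sketch with a random agent, not the training code behind the results shown here):

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
observation, info = env.reset()
done = False
while not done:
    action = env.action_space.sample()   # a trained agent would choose here
    observation, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
env.close()
```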
    """)

    algo_dropdown = gr.Dropdown(["PPO", "DQN", "SAC", "TD3"], label="Algorithm")
    algo_description = gr.Markdown()
    algo_equation = gr.Markdown()
    algo_link = gr.Markdown()

    env_dropdown = gr.Dropdown(label="Environment")
    env_description = gr.Markdown()

    run_button = gr.Button("Run")
    plot_output = gr.Image(label="Reward Curve")
    video_output = gr.Video(label="Agent Behavior Video")

    # Hyperparameter plot outputs
    hyperparams_accordion = gr.Accordion("Hyperparameter Analysis", open=False, visible=False)
    with hyperparams_accordion:
        gr.Markdown("### Hyperparameter Sensitivity Analysis")
        with gr.Row():
            hyperparam_img1 = gr.Image(label="", show_label=False, visible=False, height=400)
            hyperparam_img2 = gr.Image(label="", show_label=False, visible=False, height=400)
        with gr.Row():
            hyperparam_img3 = gr.Image(label="", show_label=False, visible=False, height=400)
            hyperparam_img4 = gr.Image(label="", show_label=False, visible=False, height=400)
        with gr.Row():
            hyperparam_img5 = gr.Image(label="", show_label=False, visible=False, height=400)
            hyperparam_img6 = gr.Image(label="", show_label=False, visible=False, height=400)

    # Implementation details
    implementation_output = gr.Markdown(label="Implementation Details")

    def update_algo_info(algo):
        info = algo_info.get(algo, {})
        return (
            info.get("description", ""),
            f"**Equation**: $${info.get('equation', '')}$$",
            f"[Read the paper]({info.get('paper', '#')})"
        )

    def update_env_info(env):
        return env_info.get(env, "")

    def filter_envs(algo):
        return gr.update(choices=algo_to_env.get(algo, []), value=algo_to_env.get(algo, [])[0] if algo_to_env.get(algo, []) else None)

    def serve_model(env_name, algorithm):
        combo_key = f"{env_name}_{algorithm}"

        # Show/hide hyperparameter accordion based on selection
        # Only show hyperparams for combinations that were actually tested
        show_hyperparams = combo_key in ["CartPole-v1_PPO", "MountainCarContinuous-v0_PPO", "Pendulum-v1_TD3", "CartPole-v1_DQN", "MountainCarContinuous-v0_SAC"]

        # Map each algorithm-environment combination to its plot and video paths
        # Paths based on the repository structure
        paths = {
            "CartPole-v1_PPO": ("src/Results/PPO_cartpole_smoothed_rewards.png", "src/Videos/PPO_cartpole_seed0.mp4"),
            "MountainCarContinuous-v0_PPO": ("src/Results/PPO_mountaincar_smoothed_rewards.png", "src/Videos/PPO_mountaincar_seed0-episode-0.mp4"),
            "MountainCarContinuous-v0_SAC": ("src/Results/SAC_mountaincar_smoothed_rewards.png", "src/Videos/SAC_MountainCarContinuous.mp4"),
            "Pendulum-v1_SAC": ("src/Results/SAC_pendulum_smoothed_rewards.png", "src/Videos/SAC_Pendulum.mp4"),
            "MountainCarContinuous-v0_TD3": ("src/Results/TD3_pendulum_smoothed_rewards.png", "src/Videos/TD3_MountainCarContinuous.mp4"),
            "Pendulum-v1_TD3": ("src/Results/TD3_pendulum_smoothed_rewards.png", "src/Videos/TD3_Pendulum.mp4"),
            "MountainCar-v0_DQN": ("src/Results/DQN_mountaincar_smoothed_rewards.png", "src/Videos/DQN_mountaincar_best.mp4"),
            "CartPole-v1_DQN": ("src/Results/cartpole_comparison_smoothed_rewards.png", "src/Videos/DQN_cartpole_best.mp4")
        }

        # Hyperparameter paths for different environments
        # Only include combinations that were actually tested
        hyperparameter_paths = {
            "CartPole-v1_PPO": [
                "src/Results/Hyperparameters/PPO_GAMMA_comparison.png",
                "src/Results/Hyperparameters/PPO_EPS_comparison.png",
                "src/Results/Hyperparameters/PPO_LR_comparison.png",
                "src/Results/Hyperparameters/PPO_K_comparison.png"
            ],
            "MountainCarContinuous-v0_PPO": [
                "src/Results/Hyperparameters/PPO_MountainCar_GAMMA_comparison.png",
                "src/Results/Hyperparameters/PPO_MountainCar_CLIP_EPSILON_comparison.png",
                "src/Results/Hyperparameters/PPO_MountainCar_EPOCHS_comparison.png",
                "src/Results/Hyperparameters/PPO_MountainCar_GAE_LAMBDA_comparison.png",
                "src/Results/PPO_MountainCar_ACTION_STD_comparison.png",
                "src/Results/Hyperparameters/PPO_MountainCar_LR_ACTOR_comparison.png"
            ],
            "Pendulum-v1_TD3": [
                "src/Results/Hyperparameters/td3_hyperparam.png"
            ],
            "CartPole-v1_DQN": [
                "src/Results/Hyperparameters/DQN_Hyperparameters.jpg"
            ],
            "MountainCarContinuous-v0_SAC": [
                "src/Results/Hyperparameters/SAC_tau.jpg",
                "src/Results/Hyperparameters/SAC_lr.jpg",
                "src/Results/Hyperparameters/SAC_alpha.jpg",
                "src/Results/Hyperparameters/SAC_Gamma.jpg"
            ]
        }

        if combo_key in paths:
            plot_path, video_path = paths[combo_key]

            # Check if the files exist
            plot_exists = os.path.exists(plot_path)
            video_exists = os.path.exists(video_path)

            if not plot_exists:
                print(f"Warning: Plot file {plot_path} not found.")

            if not video_exists:
                print(f"Warning: Video file {video_path} not found.")

            # Get implementation details
            implementation_details = implementation_info.get(combo_key, "Implementation details not available.")

            # Initialize all hyperparameter images as None
            img1 = img2 = img3 = img4 = img5 = img6 = None
            vis1 = vis2 = vis3 = vis4 = vis5 = vis6 = False

            # Get hyperparameter plots if applicable
            if combo_key in hyperparameter_paths:
                hyperparam_files = []
                for h_path in hyperparameter_paths[combo_key]:
                    if os.path.exists(h_path):
                        hyperparam_files.append(h_path)
                    else:
                        print(f"Warning: Hyperparameter plot {h_path} not found.")

                # Assign images to slots
                if len(hyperparam_files) >= 1:
                    img1, vis1 = hyperparam_files[0], True
                if len(hyperparam_files) >= 2:
                    img2, vis2 = hyperparam_files[1], True
                if len(hyperparam_files) >= 3:
                    img3, vis3 = hyperparam_files[2], True
                if len(hyperparam_files) >= 4:
                    img4, vis4 = hyperparam_files[3], True
                if len(hyperparam_files) >= 5:
                    img5, vis5 = hyperparam_files[4], True
                if len(hyperparam_files) >= 6:
                    img6, vis6 = hyperparam_files[5], True

                # Return all data including visibility update for accordion and individual images
                return (plot_path, video_path, implementation_details, gr.update(visible=show_hyperparams),
                        gr.update(value=img1, visible=vis1), gr.update(value=img2, visible=vis2),
                        gr.update(value=img3, visible=vis3), gr.update(value=img4, visible=vis4),
                        gr.update(value=img5, visible=vis5), gr.update(value=img6, visible=vis6))
            else:
                # Return without hyperparameter plots for other combinations
                return (plot_path, video_path, implementation_details, gr.update(visible=show_hyperparams),
                        gr.update(value=None, visible=False), gr.update(value=None, visible=False),
                        gr.update(value=None, visible=False), gr.update(value=None, visible=False),
                        gr.update(value=None, visible=False), gr.update(value=None, visible=False))
        else:
            # Unsupported combination: clear the media outputs and explain in the details panel
            return (None, None, "This combination is not supported yet.", gr.update(visible=False),
                    gr.update(value=None, visible=False), gr.update(value=None, visible=False),
                    gr.update(value=None, visible=False), gr.update(value=None, visible=False),
                    gr.update(value=None, visible=False), gr.update(value=None, visible=False))

    algo_dropdown.change(fn=update_algo_info, inputs=algo_dropdown, outputs=[algo_description, algo_equation, algo_link])
    algo_dropdown.change(fn=filter_envs, inputs=algo_dropdown, outputs=env_dropdown)
    env_dropdown.change(fn=update_env_info, inputs=env_dropdown, outputs=env_description)
    run_button.click(fn=serve_model, inputs=[env_dropdown, algo_dropdown],
                     outputs=[plot_output, video_output, implementation_output, hyperparams_accordion,
                              hyperparam_img1, hyperparam_img2, hyperparam_img3,
                              hyperparam_img4, hyperparam_img5, hyperparam_img6])

if __name__ == "__main__":
    demo.launch()
requirements.txt ADDED
@@ -0,0 +1,9 @@
gradio==5.31.0
gym==0.25.2
gymnasium==1.1.1
matplotlib==3.7.2
numpy==2.2.6
opencv_python==4.11.0.86
pandas==2.2.3
seaborn==0.13.2
torch==2.6.0