|
---
tags:
- LunarLander-v2
- ppo
- deep-reinforcement-learning
- reinforcement-learning
- custom-implementation
model-index:
- name: PPO
  results:
  - task:
      type: reinforcement-learning
      name: reinforcement-learning
    dataset:
      name: LunarLander-v2
      type: LunarLander-v2
    metrics:
    - type: mean_reward
      value: -113.57 +/- 74.63
      name: mean_reward
      verified: false
---
|
# PPO Agent for LunarLander-v2 |
|
|
|
## Model Description |
|
|
|
This is a Proximal Policy Optimization (PPO) agent trained on the LunarLander-v2 environment from OpenAI Gym. The model was trained using a custom PyTorch implementation of the PPO algorithm.
|
|
|
## Model Details |
|
|
|
- **Model Type**: Reinforcement Learning Agent (PPO) |
|
- **Architecture**: Actor-Critic Neural Network |
|
- **Framework**: PyTorch |
|
- **Environment**: LunarLander-v2 (OpenAI Gym) |
|
- **Algorithm**: Proximal Policy Optimization (PPO) |
|
- **Training Library**: Custom PyTorch implementation |
|
|
|
## Training Details |
|
|
|
### Hyperparameters |
|
|
|
| Parameter | Value |
|-----------|-------|
| Total Timesteps | 50,000 |
| Learning Rate | 0.00025 |
| Number of Environments | 4 |
| Steps per Environment | 128 |
| Batch Size | 512 |
| Minibatch Size | 128 |
| Number of Minibatches | 4 |
| Update Epochs | 4 |
| Discount Factor (γ) | 0.99 |
| GAE Lambda (λ) | 0.95 |
| Clip Coefficient | 0.2 |
| Value Function Coefficient | 0.5 |
| Entropy Coefficient | 0.01 |
| Max Gradient Norm | 0.5 |
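
For reference, these values can be collected into a single configuration object. The sketch below simply mirrors the table; the field names and the dataclass itself are illustrative and are not taken from the original training script.

```python
from dataclasses import dataclass

@dataclass
class PPOConfig:
    # Values copied from the hyperparameter table above;
    # field names are illustrative, not the original script's.
    total_timesteps: int = 50_000
    learning_rate: float = 2.5e-4
    num_envs: int = 4
    num_steps: int = 128          # rollout steps per environment
    num_minibatches: int = 4
    update_epochs: int = 4
    gamma: float = 0.99           # discount factor
    gae_lambda: float = 0.95
    clip_coef: float = 0.2
    vf_coef: float = 0.5
    ent_coef: float = 0.01
    max_grad_norm: float = 0.5

    @property
    def batch_size(self) -> int:
        # 4 envs * 128 steps = 512, matching the table
        return self.num_envs * self.num_steps

    @property
    def minibatch_size(self) -> int:
        # 512 / 4 = 128, matching the table
        return self.batch_size // self.num_minibatches
```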
|
|
|
### Training Configuration |
|
|
|
- **Seed**: 1 (for reproducibility) |
|
- **Device**: CUDA enabled |
|
- **Learning Rate Annealing**: Enabled (see the sketch after this list)
|
- **Generalized Advantage Estimation (GAE)**: Enabled |
|
- **Advantage Normalization**: Enabled |
|
- **Value Loss Clipping**: Enabled |
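
Learning-rate annealing in PPO setups like this one is most commonly a linear decay toward zero over the total number of policy updates. The helper below is a minimal sketch of that schedule, assuming a standard `torch.optim` optimizer; the exact schedule used to train this checkpoint is not documented here.

```python
import torch

def anneal_lr(update: int, num_updates: int, optimizer: torch.optim.Optimizer,
              base_lr: float = 2.5e-4) -> None:
    """Linearly decay the learning rate from base_lr to 0 over training.

    `update` is 1-indexed. This mirrors a common PPO schedule and is an
    assumption, not a description of the original training code.
    """
    frac = 1.0 - (update - 1.0) / num_updates
    for param_group in optimizer.param_groups:
        param_group["lr"] = frac * base_lr
```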
|
|
|
## Performance |
|
|
|
### Evaluation Results |
|
|
|
- **Environment**: LunarLander-v2 |
|
- **Mean Reward**: -113.57 ± 74.63 |
|
|
|
The agent achieves a mean reward of -113.57 with a standard deviation of 74.63 over evaluation episodes. |
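
A statistic of this form can be reproduced with a simple evaluation loop such as the one below. It assumes a `policy(obs)` callable that returns a discrete action and the Gym >= 0.26 reset/step API; both are assumptions, since the original evaluation script is not included in this card.

```python
import gym
import numpy as np

def evaluate(policy, n_episodes: int = 10, seed: int = 1):
    """Return mean and std of the episodic reward for a given policy."""
    env = gym.make("LunarLander-v2")
    returns = []
    for ep in range(n_episodes):
        obs, _ = env.reset(seed=seed + ep)
        done, ep_return = False, 0.0
        while not done:
            action = policy(obs)  # assumed callable: obs -> discrete action
            obs, reward, terminated, truncated, _ = env.step(action)
            ep_return += reward
            done = terminated or truncated
        returns.append(ep_return)
    env.close()
    return float(np.mean(returns)), float(np.std(returns))
```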
|
|
|
## Usage |
|
|
|
This model can be used for the following purposes; a minimal loading sketch follows the list:
|
- Reinforcement learning research and experimentation |
|
- Educational purposes to understand PPO implementation |
|
- Baseline comparison for LunarLander-v2 experiments |
|
- Fine-tuning starting point for similar control tasks |
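
Concretely, loading and querying the agent might look like the sketch below. The checkpoint filename (`ppo_lunarlander.pt`) is a hypothetical placeholder, and `Agent` refers to the representative module sketched in the Technical Implementation section; check the repository files for the actual artifact names and architecture.

```python
import torch

# Hypothetical usage sketch: the checkpoint name is a placeholder and the
# Agent class is the representative sketch from the architecture section.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
agent = Agent(obs_dim=8, n_actions=4).to(device)
agent.load_state_dict(torch.load("ppo_lunarlander.pt", map_location=device))
agent.eval()

def policy(obs):
    """Greedy action selection from the trained actor head."""
    with torch.no_grad():
        obs_t = torch.as_tensor(obs, dtype=torch.float32, device=device)
        logits = agent.actor(obs_t)
        return int(torch.argmax(logits).item())
```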
|
|
|
## Technical Implementation |
|
|
|
### Architecture Details |
|
|
|
The model uses an Actor-Critic architecture implemented in PyTorch (a representative sketch follows the list):
|
- **Actor Network**: Outputs action probabilities for the discrete action space |
|
- **Critic Network**: Estimates state values for advantage computation |
|
- **Shared Features**: Common feature extraction layers (if applicable) |
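
A representative module for this setup is sketched below. The layer widths (two hidden layers of 64 units), tanh activations, orthogonal initialization, and the use of separate rather than shared actor and critic networks follow common PPO implementations for LunarLander-v2; they are assumptions about, not a specification of, this checkpoint.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

def layer_init(layer: nn.Linear, std: float = 2 ** 0.5, bias_const: float = 0.0):
    # Orthogonal weight init is a common choice in PPO implementations.
    nn.init.orthogonal_(layer.weight, std)
    nn.init.constant_(layer.bias, bias_const)
    return layer

class Agent(nn.Module):
    """Actor and critic MLPs over the 8-dimensional LunarLander observation."""

    def __init__(self, obs_dim: int = 8, n_actions: int = 4):
        super().__init__()
        self.critic = nn.Sequential(
            layer_init(nn.Linear(obs_dim, 64)), nn.Tanh(),
            layer_init(nn.Linear(64, 64)), nn.Tanh(),
            layer_init(nn.Linear(64, 1), std=1.0),
        )
        self.actor = nn.Sequential(
            layer_init(nn.Linear(obs_dim, 64)), nn.Tanh(),
            layer_init(nn.Linear(64, 64)), nn.Tanh(),
            layer_init(nn.Linear(64, n_actions), std=0.01),
        )

    def get_value(self, x: torch.Tensor) -> torch.Tensor:
        return self.critic(x)

    def get_action_and_value(self, x: torch.Tensor, action=None):
        # Sample from the categorical policy (or evaluate a given action).
        dist = Categorical(logits=self.actor(x))
        if action is None:
            action = dist.sample()
        return action, dist.log_prob(action), dist.entropy(), self.critic(x)
```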
|
|
|
### PPO Algorithm Features |
|
|
|
- **Clipped Surrogate Objective**: Prevents large policy updates (see the loss sketch after this list)
|
- **Value Function Clipping**: Stabilizes value function learning |
|
- **Generalized Advantage Estimation**: Reduces variance in advantage estimates |
|
- **Multiple Epochs**: Updates policy multiple times per batch of experience |
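
These features come together in the per-minibatch update. The sketch below computes the clipped surrogate policy loss, a value loss clipped around the old value estimate, and an entropy bonus, weighted by the coefficients from the hyperparameter table; variable names are illustrative and the original implementation may differ in detail.

```python
import torch

def ppo_loss(new_logprob, old_logprob, advantages, new_value, old_value,
             returns, entropy, clip_coef=0.2, vf_coef=0.5, ent_coef=0.01):
    """Clipped-surrogate PPO loss for one minibatch (illustrative sketch)."""
    # Advantage normalization reduces the scale sensitivity of the update.
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

    # Probability ratio between the new and old policy.
    ratio = (new_logprob - old_logprob).exp()

    # Clipped surrogate objective (maximized, so we minimize its negative).
    pg_loss1 = -advantages * ratio
    pg_loss2 = -advantages * torch.clamp(ratio, 1 - clip_coef, 1 + clip_coef)
    pg_loss = torch.max(pg_loss1, pg_loss2).mean()

    # Value loss, clipped around the old value estimate for stability.
    v_unclipped = (new_value - returns) ** 2
    v_clipped = old_value + torch.clamp(new_value - old_value, -clip_coef, clip_coef)
    v_loss = 0.5 * torch.max(v_unclipped, (v_clipped - returns) ** 2).mean()

    # Entropy bonus encourages exploration.
    return pg_loss + vf_coef * v_loss - ent_coef * entropy.mean()
```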
|
|
|
## Environment Information |
|
|
|
**LunarLander-v2** is a Box2D-based control task from OpenAI Gym in which an agent must learn to:
|
- Land a lunar lander safely on a landing pad |
|
- Control thrust and rotation to manage descent |
|
- Balance fuel efficiency with landing accuracy |
|
- Handle continuous state space and discrete action space |
|
|
|
**Action Space**: Discrete(4) |
|
- 0: Do nothing |
|
- 1: Fire left orientation engine |
|
- 2: Fire main engine |
|
- 3: Fire right orientation engine |
|
|
|
**Observation Space**: Box(8) containing: |
|
- Position (x, y) |
|
- Velocity (x, y) |
|
- Angle and angular velocity |
|
- Left and right leg ground contact |
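
Both spaces can be confirmed directly from the environment. The snippet below assumes the Gym >= 0.26 API (older versions return only the observation from `reset()` and a 4-tuple from `step()`):

```python
import gym

env = gym.make("LunarLander-v2")
print(env.action_space)             # Discrete(4)
print(env.observation_space.shape)  # (8,) -- the state variables listed above

obs, info = env.reset(seed=1)
action = env.action_space.sample()
obs, reward, terminated, truncated, info = env.step(action)
env.close()
```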
|
|
|
## Training Environment |
|
|
|
- **Framework**: Custom PyTorch PPO implementation |
|
- **Parallel Environments**: 4 concurrent environments for data collection (see the sketch after this list)
|
- **Total Training Budget**: 50,000 environment timesteps, summed across all parallel environments
|
- **Experience Collection**: On-policy learning with trajectory batches |
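
Collecting rollouts from 4 environments at once is typically done with Gym's vectorized API. The sketch below shows one common setup (a synchronous vector env plus an episode-statistics wrapper, using the Gym >= 0.26 API); it is an assumption about how data collection was structured, not a copy of the original script.

```python
import gym

NUM_ENVS = 4  # matches the "Number of Environments" hyperparameter

def make_env():
    def thunk():
        env = gym.make("LunarLander-v2")
        env = gym.wrappers.RecordEpisodeStatistics(env)  # track episodic returns
        return env
    return thunk

envs = gym.vector.SyncVectorEnv([make_env() for _ in range(NUM_ENVS)])
obs, info = envs.reset(seed=1)
print(obs.shape)  # (4, 8): one 8-dim observation per parallel environment
```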
|
|
|
## Limitations and Considerations |
|
|
|
- The agent's performance is weak: a mean reward of -113.57 is well below the score of roughly 200 at which LunarLander-v2 is considered solved, and episode-to-episode variance is high
|
- Training was limited to 50,000 timesteps, which may be insufficient for optimal performance |
|
- Performance may vary significantly across different episodes due to the stochastic nature of the environment |
|
- The model has not been tested on variations of the LunarLander environment |
|
|
|
## Citation |
|
|
|
If you use this model in your research, please cite: |
|
|
|
```bibtex
@misc{cllv2_ppo_lunarlander,
  author = {Adilbai},
  title = {PPO Agent for LunarLander-v2},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/Adilbai/cLLv2}
}
```
|
|
|
## License |
|
|
|
Please refer to the repository license for usage terms and conditions. |
|
|
|
## Contact |
|
|
|
For questions or issues regarding this model, please open an issue in the model repository or contact the model author. |