---
tags:
- LunarLander-v2
- ppo
- deep-reinforcement-learning
- reinforcement-learning
- custom-implementation
model-index:
- name: PPO
  results:
  - task:
      type: reinforcement-learning
      name: reinforcement-learning
    dataset:
      name: LunarLander-v2
      type: LunarLander-v2
    metrics:
    - type: mean_reward
      value: '-113.57 +/- 74.63'
      name: mean_reward
      verified: false
---
# PPO Agent for LunarLander-v2

## Model Description
This is a Proximal Policy Optimization (PPO) agent trained to play the LunarLander-v2 environment from OpenAI Gym. The model was trained using a custom PyTorch implementation of the PPO algorithm.

## Model Details
- Model Type: Reinforcement Learning Agent (PPO)
- Architecture: Actor-Critic Neural Network
- Framework: PyTorch
- Environment: LunarLander-v2 (OpenAI Gym)
- Algorithm: Proximal Policy Optimization (PPO)
- Training Library: Custom PyTorch implementation

## Training Details

### Hyperparameters
| Parameter | Value |
|---|---|
| Total Timesteps | 50,000 |
| Learning Rate | 0.00025 |
| Number of Environments | 4 |
| Steps per Environment | 128 |
| Batch Size | 512 |
| Minibatch Size | 128 |
| Number of Minibatches | 4 |
| Update Epochs | 4 |
| Discount Factor (γ) | 0.99 |
| GAE Lambda (λ) | 0.95 |
| Clip Coefficient | 0.2 |
| Value Function Coefficient | 0.5 |
| Entropy Coefficient | 0.01 |
| Max Gradient Norm | 0.5 |

### Training Configuration
- Seed: 1 (for reproducibility)
- Device: CUDA enabled
- Learning Rate Annealing: Enabled
- Generalized Advantage Estimation (GAE): Enabled
- Advantage Normalization: Enabled
- Value Loss Clipping: Enabled
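
For orientation, the hyperparameters and flags above can be collected into a single configuration object. This is a minimal sketch with illustrative field names, not the author's actual training script; the derived `batch_size` and `minibatch_size` properties show how the values in the table relate (4 envs × 128 steps = 512, and 512 / 4 minibatches = 128).

```python
from dataclasses import dataclass

@dataclass
class PPOConfig:
    # Hyperparameters from the table above
    total_timesteps: int = 50_000
    learning_rate: float = 2.5e-4
    num_envs: int = 4
    num_steps: int = 128          # rollout length per environment
    num_minibatches: int = 4
    update_epochs: int = 4
    gamma: float = 0.99           # discount factor
    gae_lambda: float = 0.95
    clip_coef: float = 0.2
    vf_coef: float = 0.5
    ent_coef: float = 0.01
    max_grad_norm: float = 0.5
    # Training-configuration flags listed above
    seed: int = 1
    cuda: bool = True
    anneal_lr: bool = True
    gae: bool = True
    norm_adv: bool = True
    clip_vloss: bool = True

    @property
    def batch_size(self) -> int:      # 4 envs x 128 steps = 512
        return self.num_envs * self.num_steps

    @property
    def minibatch_size(self) -> int:  # 512 / 4 minibatches = 128
        return self.batch_size // self.num_minibatches
```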

## Performance

### Evaluation Results
- Environment: LunarLander-v2
- Mean Reward: -113.57 ± 74.63
The agent achieves a mean reward of -113.57 with a standard deviation of 74.63 over evaluation episodes.
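
The reported figure is simply the mean and standard deviation of per-episode returns. A small illustration of that computation (the numbers passed in are made up, only to show the reporting format):

```python
import numpy as np

def summarize_returns(episodic_returns):
    """Mean +/- std of total rewards collected over evaluation episodes."""
    returns = np.asarray(episodic_returns, dtype=np.float64)
    return returns.mean(), returns.std()

# Illustrative values only, not actual evaluation data
mean_r, std_r = summarize_returns([-30.0, -120.5, -190.2])
print(f"mean_reward = {mean_r:.2f} +/- {std_r:.2f}")
```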

## Usage
This model can be used for:
- Reinforcement learning research and experimentation
- Educational purposes to understand PPO implementation
- Baseline comparison for LunarLander-v2 experiments
- Fine-tuning starting point for similar control tasks
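
The repository's file layout is not described in this card, so the snippet below is only a hedged sketch of how the checkpoint might be fetched and inspected. The filename `ppo_lunarlander.pt` and the `Agent` class referenced in the comments are assumptions (see the architecture sketch under Technical Implementation below).

```python
import torch
from huggingface_hub import hf_hub_download

# Assumed filename; check the repository's file list for the real checkpoint name.
checkpoint_path = hf_hub_download(repo_id="Adilbai/cLLv2", filename="ppo_lunarlander.pt")
state_dict = torch.load(checkpoint_path, map_location="cpu")

# The weights must be loaded into a network with the same layout as the training
# network, e.g. the (assumed) Agent class sketched under "Architecture Details" below:
#   agent = Agent(obs_dim=8, n_actions=4)
#   agent.load_state_dict(state_dict)
#   action, _, _, _ = agent.get_action_and_value(torch.zeros(1, 8))
print(sorted(state_dict.keys()))
```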

## Technical Implementation

### Architecture Details
The model uses an Actor-Critic architecture implemented in PyTorch:
- Actor Network: Outputs action probabilities for the discrete action space
- Critic Network: Estimates state values for advantage computation
- Shared Features: Common feature extraction layers (if applicable)
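
A minimal PyTorch sketch of such an actor-critic is shown below. The hidden sizes (two layers of 64 units with Tanh activations) and the orthogonal initialization are common defaults for PPO on LunarLander, not confirmed details of this checkpoint.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

def layer_init(layer, std=2**0.5, bias_const=0.0):
    # Orthogonal weight init, a common default in PPO implementations
    nn.init.orthogonal_(layer.weight, std)
    nn.init.constant_(layer.bias, bias_const)
    return layer

class Agent(nn.Module):
    """Actor-critic with separate policy and value heads (hidden sizes are assumed)."""
    def __init__(self, obs_dim: int = 8, n_actions: int = 4):
        super().__init__()
        self.critic = nn.Sequential(
            layer_init(nn.Linear(obs_dim, 64)), nn.Tanh(),
            layer_init(nn.Linear(64, 64)), nn.Tanh(),
            layer_init(nn.Linear(64, 1), std=1.0),
        )
        self.actor = nn.Sequential(
            layer_init(nn.Linear(obs_dim, 64)), nn.Tanh(),
            layer_init(nn.Linear(64, 64)), nn.Tanh(),
            layer_init(nn.Linear(64, n_actions), std=0.01),
        )

    def get_value(self, x):
        return self.critic(x)

    def get_action_and_value(self, x, action=None):
        logits = self.actor(x)
        dist = Categorical(logits=logits)
        if action is None:
            action = dist.sample()
        return action, dist.log_prob(action), dist.entropy(), self.critic(x)
```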

### PPO Algorithm Features
- Clipped Surrogate Objective: Prevents large policy updates
- Value Function Clipping: Stabilizes value function learning
- Generalized Advantage Estimation: Reduces variance in advantage estimates
- Multiple Epochs: Updates policy multiple times per batch of experience
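
The update these features describe can be sketched roughly as follows. Variable names and shapes are illustrative, `agent` is the actor-critic sketched above, and the coefficients come from the hyperparameter table.

```python
import torch

def compute_gae(rewards, values, dones, next_value, gamma=0.99, gae_lambda=0.95):
    """Generalized Advantage Estimation over a rollout of shape (num_steps, num_envs).

    Convention: dones[t] == 1.0 means the episode ended at step t, so the value of
    the following state must not be bootstrapped.
    """
    num_steps = rewards.shape[0]
    advantages = torch.zeros_like(rewards)
    last_gae = torch.zeros_like(next_value)
    for t in reversed(range(num_steps)):
        next_v = next_value if t == num_steps - 1 else values[t + 1]
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_v * not_done - values[t]
        last_gae = delta + gamma * gae_lambda * not_done * last_gae
        advantages[t] = last_gae
    returns = advantages + values  # bootstrapped return targets for the value function
    return advantages, returns

def ppo_loss(agent, obs, actions, old_logprobs, old_values, advantages, returns,
             clip_coef=0.2, vf_coef=0.5, ent_coef=0.01):
    """Clipped surrogate objective + clipped value loss + entropy bonus (one minibatch)."""
    _, new_logprob, entropy, new_value = agent.get_action_and_value(obs, actions)
    ratio = (new_logprob - old_logprobs).exp()

    # Advantage normalization (enabled in the training configuration above)
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

    # Clipped surrogate objective: keep the more pessimistic of the two terms
    pg_loss = torch.max(
        -advantages * ratio,
        -advantages * ratio.clamp(1 - clip_coef, 1 + clip_coef),
    ).mean()

    # Value loss clipping, mirroring the policy clipping
    new_value = new_value.view(-1)
    v_clipped = old_values + (new_value - old_values).clamp(-clip_coef, clip_coef)
    v_loss = 0.5 * torch.max((new_value - returns) ** 2, (v_clipped - returns) ** 2).mean()

    # Total loss to minimize
    return pg_loss - ent_coef * entropy.mean() + vf_coef * v_loss
```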

## Environment Information
LunarLander-v2 is a Box2D-based control task where an agent must learn to:
- Land a lunar lander safely on a landing pad
- Control thrust and rotation to manage descent
- Balance fuel efficiency with landing accuracy
- Handle continuous state space and discrete action space
Action Space: Discrete(4)
- 0: Do nothing
- 1: Fire left orientation engine
- 2: Fire main engine
- 3: Fire right orientation engine
Observation Space: Box(8) containing:
- Position (x, y)
- Velocity (x, y)
- Angle and angular velocity
- Left and right leg ground contact
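
These spaces can be inspected directly with Gym (requires the Box2D extra; note that recent releases of the maintained `gymnasium` fork expose the task as `LunarLander-v3`):

```python
import gym  # pip install "gym[box2d]"

env = gym.make("LunarLander-v2")
print(env.action_space)       # Discrete(4)
print(env.observation_space)  # Box with 8 components, as listed above

obs = env.reset()
obs = obs[0] if isinstance(obs, tuple) else obs  # reset() returns (obs, info) in newer gym
action = env.action_space.sample()               # random action, just to illustrate stepping
print(env.step(action)[:2])                      # next observation and reward
env.close()
```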

## Training Environment
- Framework: Custom PyTorch PPO implementation
- Parallel Environments: 4 concurrent environments for data collection
- Total Training Budget: 50,000 timesteps, summed across all parallel environments
- Experience Collection: On-policy learning with trajectory batches
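
A hedged sketch of how four parallel environments could be assembled for on-policy data collection, showing where the 4 × 128 = 512 batch size comes from:

```python
import gym  # pip install "gym[box2d]"
import numpy as np

num_envs, num_steps = 4, 128  # values from the hyperparameter table

def make_env():
    def thunk():
        env = gym.make("LunarLander-v2")
        env = gym.wrappers.RecordEpisodeStatistics(env)  # logs per-episode returns
        return env
    return thunk

envs = gym.vector.SyncVectorEnv([make_env() for _ in range(num_envs)])
envs.reset()

for _ in range(num_steps):  # one rollout: 4 envs x 128 steps = 512 transitions
    # Random actions stand in for the policy; a real rollout would also store
    # observations, log-probabilities, rewards, values, and done flags here.
    actions = np.array([envs.single_action_space.sample() for _ in range(num_envs)])
    envs.step(actions)

envs.close()
```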

## Limitations and Considerations
- The agent's mean reward is negative (-113.57) with high variance, well below the score of 200 at which LunarLander-v2 is conventionally considered solved
- Training was limited to 50,000 timesteps, which may be insufficient for optimal performance
- Performance may vary significantly across different episodes due to the stochastic nature of the environment
- The model has not been tested on variations of the LunarLander environment

## Citation
If you use this model in your research, please cite:
```bibtex
@misc{cllv2_ppo_lunarlander,
  author    = {Adilbai},
  title     = {PPO Agent for LunarLander-v2},
  year      = {2024},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/Adilbai/cLLv2}
}
```

## License
Please refer to the repository license for usage terms and conditions.

## Contact
For questions or issues regarding this model, please open an issue in the model repository or contact the model author.