---
tags:
- LunarLander-v2
- ppo
- deep-reinforcement-learning
- reinforcement-learning
- custom-implementation
model-index:
- name: PPO
  results:
  - task:
      type: reinforcement-learning
      name: reinforcement-learning
    dataset:
      name: LunarLander-v2
      type: LunarLander-v2
    metrics:
    - type: mean_reward
      value: -113.57 +/- 74.63
      name: mean_reward
      verified: false
---
# PPO Agent for LunarLander-v2

## Model Description

This is a Proximal Policy Optimization (PPO) agent trained to play the LunarLander-v2 environment from OpenAI Gym. The model was trained using a custom PyTorch implementation of the PPO algorithm.

## Model Details

- **Model Type**: Reinforcement Learning Agent (PPO)
- **Architecture**: Actor-Critic Neural Network
- **Framework**: PyTorch
- **Environment**: LunarLander-v2 (OpenAI Gym)
- **Algorithm**: Proximal Policy Optimization (PPO)
- **Training Library**: Custom PyTorch implementation

## Training Details

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Total Timesteps | 50,000 |
| Learning Rate | 0.00025 |
| Number of Environments | 4 |
| Steps per Environment | 128 |
| Batch Size | 512 |
| Minibatch Size | 128 |
| Number of Minibatches | 4 |
| Update Epochs | 4 |
| Discount Factor (γ) | 0.99 |
| GAE Lambda (λ) | 0.95 |
| Clip Coefficient | 0.2 |
| Value Function Coefficient | 0.5 |
| Entropy Coefficient | 0.01 |
| Max Gradient Norm | 0.5 |
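
For convenience, the table above can be read as a single configuration object. The sketch below is illustrative only; the field names follow common PPO conventions and are not taken from this repository.

```python
from dataclasses import dataclass

@dataclass
class PPOConfig:
    # Hypothetical container mirroring the hyperparameter table above;
    # field names are illustrative, not taken from the repository.
    total_timesteps: int = 50_000
    learning_rate: float = 2.5e-4
    num_envs: int = 4
    num_steps: int = 128          # steps collected per environment per rollout
    num_minibatches: int = 4
    update_epochs: int = 4
    gamma: float = 0.99           # discount factor
    gae_lambda: float = 0.95
    clip_coef: float = 0.2
    vf_coef: float = 0.5
    ent_coef: float = 0.01
    max_grad_norm: float = 0.5

    @property
    def batch_size(self) -> int:
        # 4 envs * 128 steps = 512 transitions per rollout
        return self.num_envs * self.num_steps

    @property
    def minibatch_size(self) -> int:
        # 512 / 4 minibatches = 128 transitions per gradient step
        return self.batch_size // self.num_minibatches
```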

### Training Configuration

- **Seed**: 1 (for reproducibility)
- **Device**: CUDA enabled
- **Learning Rate Annealing**: Enabled
- **Generalized Advantage Estimation (GAE)**: Enabled
- **Advantage Normalization**: Enabled
- **Value Loss Clipping**: Enabled
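
To make the GAE and advantage-normalization settings above concrete, here is a minimal sketch of the advantage computation over a rollout. The tensor names and shapes are assumptions, not code from this repository.

```python
import torch

def compute_gae(rewards, values, dones, next_value, gamma=0.99, gae_lambda=0.95):
    """Generalized Advantage Estimation over a rollout of shape (num_steps, num_envs).

    Illustrative sketch only; tensor names, shapes, and the done convention
    (dones[t] marks a transition that ended the episode) are assumptions.
    """
    num_steps = rewards.shape[0]
    advantages = torch.zeros_like(rewards)
    last_gae = torch.zeros_like(next_value)
    for t in reversed(range(num_steps)):
        # Bootstrap with the next state's value; mask it out at episode ends.
        next_values = next_value if t == num_steps - 1 else values[t + 1]
        next_nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_values * next_nonterminal - values[t]
        last_gae = delta + gamma * gae_lambda * next_nonterminal * last_gae
        advantages[t] = last_gae
    returns = advantages + values
    return advantages, returns

# Advantage normalization (typically applied per minibatch during the update):
# advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
```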

## Performance

### Evaluation Results

- **Environment**: LunarLander-v2
- **Mean Reward**: -113.57 ± 74.63

The agent achieves a mean reward of -113.57 with a standard deviation of 74.63 over evaluation episodes.
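
A minimal evaluation loop of the following shape could reproduce a statistic like this, assuming a trained `agent` exposing a `get_action_and_value` method (as sketched under "Architecture Details" below); it is not the exact evaluation script used for the reported numbers.

```python
import gym
import numpy as np
import torch

def evaluate(agent, n_episodes: int = 10, seed: int = 1):
    # Illustrative evaluation loop; `agent` is assumed to expose the
    # get_action_and_value() interface sketched later in this card.
    env = gym.make("LunarLander-v2")
    returns = []
    for ep in range(n_episodes):
        obs, info = env.reset(seed=seed + ep)  # Gym >= 0.26 API
        done, total = False, 0.0
        while not done:
            with torch.no_grad():
                action, _, _, _ = agent.get_action_and_value(
                    torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0))
            obs, reward, terminated, truncated, info = env.step(action.item())
            total += reward
            done = terminated or truncated
        returns.append(total)
    env.close()
    return float(np.mean(returns)), float(np.std(returns))
```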

## Usage

This model can be used for:
- Reinforcement learning research and experimentation
- Educational purposes to understand PPO implementation
- Baseline comparison for LunarLander-v2 experiments
- Fine-tuning starting point for similar control tasks
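
If the weights are exported as a standard PyTorch state dict, loading them could look like the following. The checkpoint filename and the `ActorCritic` class (sketched under "Architecture Details" below) are assumptions, not artifacts confirmed to ship with this repository.

```python
import torch

# Hypothetical checkpoint name; the repository's actual file layout may differ.
agent = ActorCritic()                     # class sketched under "Architecture Details"
state_dict = torch.load("ppo_lunarlander.pt", map_location="cpu")
agent.load_state_dict(state_dict)
agent.eval()

obs = torch.zeros(1, 8)                   # placeholder observation batch
with torch.no_grad():
    action, log_prob, entropy, value = agent.get_action_and_value(obs)
print(action.item(), value.item())
```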

## Technical Implementation

### Architecture Details

The model uses an Actor-Critic architecture implemented in PyTorch:
- **Actor Network**: Outputs action probabilities for the discrete action space
- **Critic Network**: Estimates state values for advantage computation
- **Shared Features**: Depending on the implementation, the actor and critic may share common feature-extraction layers (the sketch below uses separate trunks)
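
A minimal actor-critic of this kind might look as follows; the layer widths, activations, and the choice of separate actor/critic trunks are illustrative assumptions rather than confirmed details of the trained network.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class ActorCritic(nn.Module):
    """Illustrative actor-critic for LunarLander-v2 (8-dim observations, 4 discrete actions)."""

    def __init__(self, obs_dim: int = 8, n_actions: int = 4, hidden: int = 64):
        super().__init__()
        self.critic = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),            # state-value estimate V(s)
        )
        self.actor = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),    # action logits
        )

    def get_value(self, obs):
        return self.critic(obs)

    def get_action_and_value(self, obs, action=None):
        dist = Categorical(logits=self.actor(obs))
        if action is None:
            action = dist.sample()
        return action, dist.log_prob(action), dist.entropy(), self.critic(obs)
```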

### PPO Algorithm Features

- **Clipped Surrogate Objective**: Prevents large policy updates
- **Value Function Clipping**: Stabilizes value function learning
- **Generalized Advantage Estimation**: Reduces variance in advantage estimates
- **Multiple Epochs**: Updates policy multiple times per batch of experience
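
The sketch below shows how these pieces typically combine into a single loss, using the clip, value, and entropy coefficients from the hyperparameter table. Argument names are assumptions, and the exact implementation in this repository may differ.

```python
import torch

def ppo_losses(new_logprob, old_logprob, entropy, new_value, old_value,
               returns, advantages, clip_coef=0.2, vf_coef=0.5, ent_coef=0.01):
    """Clipped surrogate policy loss, clipped value loss, and entropy bonus (generic sketch)."""
    # Probability ratio between the updated policy and the data-collecting policy.
    ratio = (new_logprob - old_logprob).exp()

    # Clipped surrogate objective: take the pessimistic (maximum-loss) term,
    # which bounds how far the policy can move in one update.
    pg_loss1 = -advantages * ratio
    pg_loss2 = -advantages * torch.clamp(ratio, 1 - clip_coef, 1 + clip_coef)
    pg_loss = torch.max(pg_loss1, pg_loss2).mean()

    # Value loss with clipping, analogous to the policy clip, to limit how far
    # the value estimate can change per update.
    v_unclipped = (new_value - returns) ** 2
    v_clipped_pred = old_value + torch.clamp(new_value - old_value, -clip_coef, clip_coef)
    v_clipped = (v_clipped_pred - returns) ** 2
    v_loss = 0.5 * torch.max(v_unclipped, v_clipped).mean()

    # Entropy bonus encourages exploration.
    entropy_loss = entropy.mean()

    total_loss = pg_loss + vf_coef * v_loss - ent_coef * entropy_loss
    return total_loss, pg_loss, v_loss, entropy_loss
```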

## Environment Information

**LunarLander-v2** is a classic rocket-landing control task from Gym's Box2D suite in which the agent must learn to:
- Land a lunar lander safely on a landing pad
- Control thrust and rotation to manage descent
- Balance fuel efficiency with landing accuracy
- Handle continuous state space and discrete action space

**Action Space**: Discrete(4)
- 0: Do nothing
- 1: Fire left orientation engine
- 2: Fire main engine
- 3: Fire right orientation engine

**Observation Space**: Box(8) containing:
- Position (x, y)
- Velocity (x, y)
- Angle and angular velocity
- Left and right leg ground contact
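
The environment can be instantiated and inspected as follows. This sketch assumes the Gym >= 0.26 step/reset API and the Box2D extra; older Gym versions return values in a slightly different form.

```python
import gym  # Box2D extra required: pip install "gym[box2d]"

env = gym.make("LunarLander-v2")
print(env.action_space)       # Discrete(4)
print(env.observation_space)  # Box with shape (8,): position, velocity, angle, leg contacts

obs, info = env.reset(seed=1)  # Gym >= 0.26 API; older versions return only obs
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
env.close()
```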

## Training Environment

- **Framework**: Custom PyTorch PPO implementation
- **Parallel Environments**: 4 concurrent environments for data collection
- **Total Training Budget**: 50,000 environment steps collected across all parallel environments
- **Experience Collection**: On-policy learning with trajectory batches
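
Parallel data collection of this kind is commonly set up with Gym's synchronous vector environment, as in the sketch below. It assumes Gym >= 0.26 and is not taken from the repository.

```python
import gym

def make_env(env_id: str, seed: int):
    def thunk():
        env = gym.make(env_id)
        env = gym.wrappers.RecordEpisodeStatistics(env)  # track episodic return/length
        env.reset(seed=seed)
        return env
    return thunk

# Four synchronous copies of the environment, matching the hyperparameter table.
envs = gym.vector.SyncVectorEnv([make_env("LunarLander-v2", seed=1 + i) for i in range(4)])
obs, info = envs.reset()
```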

## Limitations and Considerations

- The agent's mean reward (-113.57) is well below the commonly used "solved" threshold of about 200 for LunarLander-v2, and episode returns vary widely (standard deviation of 74.63)
- Training was limited to 50,000 timesteps, which may be insufficient for optimal performance
- Performance may vary significantly across different episodes due to the stochastic nature of the environment
- The model has not been tested on variations of the LunarLander environment

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{cllv2_ppo_lunarlander,
  author = {Adilbai},
  title = {PPO Agent for LunarLander-v2},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/Adilbai/cLLv2}
}
```

## License

Please refer to the repository license for usage terms and conditions.

## Contact

For questions or issues regarding this model, please open an issue in the model repository or contact the model author.