---
tags:
- LunarLander-v2
- ppo
- deep-reinforcement-learning
- reinforcement-learning
- custom-implementation
model-index:
- name: PPO
  results:
  - task:
      type: reinforcement-learning
      name: reinforcement-learning
    dataset:
      name: LunarLander-v2
      type: LunarLander-v2
    metrics:
    - type: mean_reward
      value: -113.57 +/- 74.63
      name: mean_reward
      verified: false
---
# PPO Agent for LunarLander-v2

## Model Description

This is a Proximal Policy Optimization (PPO) agent trained to play the LunarLander-v2 environment from OpenAI Gym. The model was trained using a custom PyTorch implementation of the PPO algorithm.

## Model Details

- **Model Type**: Reinforcement Learning Agent (PPO)
- **Architecture**: Actor-Critic Neural Network
- **Framework**: PyTorch
- **Environment**: LunarLander-v2 (OpenAI Gym)
- **Algorithm**: Proximal Policy Optimization (PPO)
- **Training Library**: Custom PyTorch implementation

## Training Details

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Total Timesteps | 50,000 |
| Learning Rate | 0.00025 |
| Number of Environments | 4 |
| Steps per Environment | 128 |
| Batch Size | 512 |
| Minibatch Size | 128 |
| Number of Minibatches | 4 |
| Update Epochs | 4 |
| Discount Factor (γ) | 0.99 |
| GAE Lambda (λ) | 0.95 |
| Clip Coefficient | 0.2 |
| Value Function Coefficient | 0.5 |
| Entropy Coefficient | 0.01 |
| Max Gradient Norm | 0.5 |
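
For convenience, the table above can be read as a single configuration object. The sketch below is illustrative only; the field names follow common PPO conventions and are not taken from this repository.

```python
from dataclasses import dataclass

@dataclass
class PPOConfig:
    # Hypothetical container mirroring the hyperparameter table above;
    # field names are illustrative, not taken from the repository.
    total_timesteps: int = 50_000
    learning_rate: float = 2.5e-4
    num_envs: int = 4
    num_steps: int = 128          # steps collected per environment per rollout
    num_minibatches: int = 4
    update_epochs: int = 4
    gamma: float = 0.99           # discount factor
    gae_lambda: float = 0.95
    clip_coef: float = 0.2
    vf_coef: float = 0.5
    ent_coef: float = 0.01
    max_grad_norm: float = 0.5

    @property
    def batch_size(self) -> int:
        # 4 envs * 128 steps = 512 transitions per rollout
        return self.num_envs * self.num_steps

    @property
    def minibatch_size(self) -> int:
        # 512 / 4 minibatches = 128 transitions per gradient step
        return self.batch_size // self.num_minibatches
```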

### Training Configuration

- **Seed**: 1 (for reproducibility)
- **Device**: CUDA enabled
- **Learning Rate Annealing**: Enabled
- **Generalized Advantage Estimation (GAE)**: Enabled
- **Advantage Normalization**: Enabled
- **Value Loss Clipping**: Enabled
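
To make the GAE and advantage-normalization settings above concrete, here is a minimal sketch of the advantage computation over a rollout. The tensor names and shapes are assumptions, not code from this repository.

```python
import torch

def compute_gae(rewards, values, dones, next_value, gamma=0.99, gae_lambda=0.95):
    """Generalized Advantage Estimation over a rollout of shape (num_steps, num_envs).

    Illustrative sketch only; tensor names, shapes, and the done convention
    (dones[t] marks a transition that ended the episode) are assumptions.
    """
    num_steps = rewards.shape[0]
    advantages = torch.zeros_like(rewards)
    last_gae = torch.zeros_like(next_value)
    for t in reversed(range(num_steps)):
        # Bootstrap with the next state's value; mask it out at episode ends.
        next_values = next_value if t == num_steps - 1 else values[t + 1]
        next_nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_values * next_nonterminal - values[t]
        last_gae = delta + gamma * gae_lambda * next_nonterminal * last_gae
        advantages[t] = last_gae
    returns = advantages + values
    return advantages, returns

# Advantage normalization (typically applied per minibatch during the update):
# advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
```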

## Performance

### Evaluation Results

- **Environment**: LunarLander-v2
- **Mean Reward**: -113.57 ± 74.63

The agent achieves a mean reward of -113.57 with a standard deviation of 74.63 over evaluation episodes.
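
A minimal evaluation loop of the following shape could reproduce a statistic like this, assuming a trained `agent` exposing a `get_action_and_value` method (as sketched under "Architecture Details" below); it is not the exact evaluation script used for the reported numbers.

```python
import gym
import numpy as np
import torch

def evaluate(agent, n_episodes: int = 10, seed: int = 1):
    # Illustrative evaluation loop; `agent` is assumed to expose the
    # get_action_and_value() interface sketched later in this card.
    env = gym.make("LunarLander-v2")
    returns = []
    for ep in range(n_episodes):
        obs, info = env.reset(seed=seed + ep)  # Gym >= 0.26 API
        done, total = False, 0.0
        while not done:
            with torch.no_grad():
                action, _, _, _ = agent.get_action_and_value(
                    torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0))
            obs, reward, terminated, truncated, info = env.step(action.item())
            total += reward
            done = terminated or truncated
        returns.append(total)
    env.close()
    return float(np.mean(returns)), float(np.std(returns))
```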

## Usage

This model can be used for:
- Reinforcement learning research and experimentation
- Educational purposes to understand PPO implementation
- Baseline comparison for LunarLander-v2 experiments
- Fine-tuning starting point for similar control tasks
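
If the weights are exported as a standard PyTorch state dict, loading them could look like the following. The checkpoint filename and the `ActorCritic` class (sketched under "Architecture Details" below) are assumptions, not artifacts confirmed to ship with this repository.

```python
import torch

# Hypothetical checkpoint name; the repository's actual file layout may differ.
agent = ActorCritic()                     # class sketched under "Architecture Details"
state_dict = torch.load("ppo_lunarlander.pt", map_location="cpu")
agent.load_state_dict(state_dict)
agent.eval()

obs = torch.zeros(1, 8)                   # placeholder observation batch
with torch.no_grad():
    action, log_prob, entropy, value = agent.get_action_and_value(obs)
print(action.item(), value.item())
```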

## Technical Implementation

### Architecture Details

The model uses an Actor-Critic architecture implemented in PyTorch:
- **Actor Network**: Outputs action probabilities for the discrete action space
- **Critic Network**: Estimates state values for advantage computation
- **Shared Features**: Depending on the implementation, the actor and critic may share common feature-extraction layers (the sketch below uses separate trunks)
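
A minimal actor-critic of this kind might look as follows; the layer widths, activations, and the choice of separate actor/critic trunks are illustrative assumptions rather than confirmed details of the trained network.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class ActorCritic(nn.Module):
    """Illustrative actor-critic for LunarLander-v2 (8-dim observations, 4 discrete actions)."""

    def __init__(self, obs_dim: int = 8, n_actions: int = 4, hidden: int = 64):
        super().__init__()
        self.critic = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),            # state-value estimate V(s)
        )
        self.actor = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),    # action logits
        )

    def get_value(self, obs):
        return self.critic(obs)

    def get_action_and_value(self, obs, action=None):
        dist = Categorical(logits=self.actor(obs))
        if action is None:
            action = dist.sample()
        return action, dist.log_prob(action), dist.entropy(), self.critic(obs)
```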

### PPO Algorithm Features

- **Clipped Surrogate Objective**: Prevents large policy updates
- **Value Function Clipping**: Stabilizes value function learning
- **Generalized Advantage Estimation**: Reduces variance in advantage estimates
- **Multiple Epochs**: Updates policy multiple times per batch of experience
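
The sketch below shows how these pieces typically combine into a single loss, using the clip, value, and entropy coefficients from the hyperparameter table. Argument names are assumptions, and the exact implementation in this repository may differ.

```python
import torch

def ppo_losses(new_logprob, old_logprob, entropy, new_value, old_value,
               returns, advantages, clip_coef=0.2, vf_coef=0.5, ent_coef=0.01):
    """Clipped surrogate policy loss, clipped value loss, and entropy bonus (generic sketch)."""
    # Probability ratio between the updated policy and the data-collecting policy.
    ratio = (new_logprob - old_logprob).exp()

    # Clipped surrogate objective: take the pessimistic (maximum-loss) term,
    # which bounds how far the policy can move in one update.
    pg_loss1 = -advantages * ratio
    pg_loss2 = -advantages * torch.clamp(ratio, 1 - clip_coef, 1 + clip_coef)
    pg_loss = torch.max(pg_loss1, pg_loss2).mean()

    # Value loss with clipping, analogous to the policy clip, to limit how far
    # the value estimate can change per update.
    v_unclipped = (new_value - returns) ** 2
    v_clipped_pred = old_value + torch.clamp(new_value - old_value, -clip_coef, clip_coef)
    v_clipped = (v_clipped_pred - returns) ** 2
    v_loss = 0.5 * torch.max(v_unclipped, v_clipped).mean()

    # Entropy bonus encourages exploration.
    entropy_loss = entropy.mean()

    total_loss = pg_loss + vf_coef * v_loss - ent_coef * entropy_loss
    return total_loss, pg_loss, v_loss, entropy_loss
```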

## Environment Information

**LunarLander-v2** is a classic rocket-landing control task from Gym's Box2D suite in which the agent must learn to:
- Land a lunar lander safely on a landing pad
- Control thrust and rotation to manage descent
- Balance fuel efficiency with landing accuracy
- Handle continuous state space and discrete action space

**Action Space**: Discrete(4)
- 0: Do nothing
- 1: Fire left orientation engine
- 2: Fire main engine
- 3: Fire right orientation engine

**Observation Space**: Box(8) containing:
- Position (x, y)
- Velocity (x, y)
- Angle and angular velocity
- Left and right leg ground contact
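
The environment can be instantiated and inspected as follows. This sketch assumes the Gym >= 0.26 step/reset API and the Box2D extra; older Gym versions return values in a slightly different form.

```python
import gym  # Box2D extra required: pip install "gym[box2d]"

env = gym.make("LunarLander-v2")
print(env.action_space)       # Discrete(4)
print(env.observation_space)  # Box with shape (8,): position, velocity, angle, leg contacts

obs, info = env.reset(seed=1)  # Gym >= 0.26 API; older versions return only obs
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
env.close()
```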

## Training Environment

- **Framework**: Custom PyTorch PPO implementation
- **Parallel Environments**: 4 concurrent environments for data collection
- **Total Training Budget**: 50,000 environment steps collected across all parallel environments
- **Experience Collection**: On-policy learning with trajectory batches
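
Parallel data collection of this kind is commonly set up with Gym's synchronous vector environment, as in the sketch below. It assumes Gym >= 0.26 and is not taken from the repository.

```python
import gym

def make_env(env_id: str, seed: int):
    def thunk():
        env = gym.make(env_id)
        env = gym.wrappers.RecordEpisodeStatistics(env)  # track episodic return/length
        env.reset(seed=seed)
        return env
    return thunk

# Four synchronous copies of the environment, matching the hyperparameter table.
envs = gym.vector.SyncVectorEnv([make_env("LunarLander-v2", seed=1 + i) for i in range(4)])
obs, info = envs.reset()
```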

## Limitations and Considerations

- The agent's mean reward (-113.57) is well below the commonly used "solved" threshold of about 200 for LunarLander-v2, and episode returns vary widely (standard deviation of 74.63)
- Training was limited to 50,000 timesteps, which may be insufficient for optimal performance
- Performance may vary significantly across different episodes due to the stochastic nature of the environment
- The model has not been tested on variations of the LunarLander environment

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{cllv2_ppo_lunarlander,
  author = {Adilbai},
  title = {PPO Agent for LunarLander-v2},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/Adilbai/cLLv2}
}
```

## License

Please refer to the repository license for usage terms and conditions.

## Contact

For questions or issues regarding this model, please open an issue in the model repository or contact the model author.