|
---
tags:
- LunarLander-v2
- ppo
- deep-reinforcement-learning
- reinforcement-learning
- custom-implementation
model-index:
- name: PPO
  results:
  - task:
      type: reinforcement-learning
      name: reinforcement-learning
    dataset:
      name: LunarLander-v2
      type: LunarLander-v2
    metrics:
    - type: mean_reward
      value: -113.57 +/- 74.63
      name: mean_reward
      verified: false
---
|
# PPO Agent for LunarLander-v2 |
|
|
|
## Model Description |
|
|
|
This is a Proximal Policy Optimization (PPO) agent trained on the LunarLander-v2 environment from OpenAI Gym. The model was trained using a custom PyTorch implementation of the PPO algorithm.
|
|
|
## Model Details |
|
|
|
- **Model Type**: Reinforcement Learning Agent (PPO) |
|
- **Architecture**: Actor-Critic Neural Network |
|
- **Framework**: PyTorch |
|
- **Environment**: LunarLander-v2 (OpenAI Gym) |
|
- **Algorithm**: Proximal Policy Optimization (PPO) |
|
- **Training Library**: Custom PyTorch implementation |
|
|
|
## Training Details |
|
|
|
### Hyperparameters |
|
|
|
| Parameter | Value |
|-----------|-------|
| Total Timesteps | 50,000 |
| Learning Rate | 0.00025 |
| Number of Environments | 4 |
| Steps per Environment | 128 |
| Batch Size | 512 |
| Minibatch Size | 128 |
| Number of Minibatches | 4 |
| Update Epochs | 4 |
| Discount Factor (γ) | 0.99 |
| GAE Lambda (λ) | 0.95 |
| Clip Coefficient | 0.2 |
| Value Function Coefficient | 0.5 |
| Entropy Coefficient | 0.01 |
| Max Gradient Norm | 0.5 |
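
For reference, these values can be collected into a single configuration object. The sketch below simply mirrors the table; the field names and the dataclass itself are illustrative and are not taken from the original training script.

```python
from dataclasses import dataclass

@dataclass
class PPOConfig:
    # Values copied from the hyperparameter table above;
    # field names are illustrative, not the original script's.
    total_timesteps: int = 50_000
    learning_rate: float = 2.5e-4
    num_envs: int = 4
    num_steps: int = 128          # rollout steps per environment
    num_minibatches: int = 4
    update_epochs: int = 4
    gamma: float = 0.99           # discount factor
    gae_lambda: float = 0.95
    clip_coef: float = 0.2
    vf_coef: float = 0.5
    ent_coef: float = 0.01
    max_grad_norm: float = 0.5

    @property
    def batch_size(self) -> int:
        # 4 envs * 128 steps = 512, matching the table
        return self.num_envs * self.num_steps

    @property
    def minibatch_size(self) -> int:
        # 512 / 4 = 128, matching the table
        return self.batch_size // self.num_minibatches
```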
|
|
|
### Training Configuration |
|
|
|
- **Seed**: 1 (for reproducibility) |
|
- **Device**: CUDA enabled |
|
- **Learning Rate Annealing**: Enabled (see the sketch after this list)
|
- **Generalized Advantage Estimation (GAE)**: Enabled |
|
- **Advantage Normalization**: Enabled |
|
- **Value Loss Clipping**: Enabled |
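
Learning-rate annealing in PPO setups like this one is most commonly a linear decay toward zero over the total number of policy updates. The helper below is a minimal sketch of that schedule, assuming a standard `torch.optim` optimizer; the exact schedule used to train this checkpoint is not documented here.

```python
import torch

def anneal_lr(update: int, num_updates: int, optimizer: torch.optim.Optimizer,
              base_lr: float = 2.5e-4) -> None:
    """Linearly decay the learning rate from base_lr to 0 over training.

    `update` is 1-indexed. This mirrors a common PPO schedule and is an
    assumption, not a description of the original training code.
    """
    frac = 1.0 - (update - 1.0) / num_updates
    for param_group in optimizer.param_groups:
        param_group["lr"] = frac * base_lr
```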
|
|
|
## Performance |
|
|
|
### Evaluation Results |
|
|
|
- **Environment**: LunarLander-v2 |
|
- **Mean Reward**: -113.57 ± 74.63 |
|
|
|
The agent achieves a mean reward of -113.57 with a standard deviation of 74.63 over evaluation episodes. |
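
A statistic of this form can be reproduced with a simple evaluation loop such as the one below. It assumes a `policy(obs)` callable that returns a discrete action and the Gym >= 0.26 reset/step API; both are assumptions, since the original evaluation script is not included in this card.

```python
import gym
import numpy as np

def evaluate(policy, n_episodes: int = 10, seed: int = 1):
    """Return mean and std of the episodic reward for a given policy."""
    env = gym.make("LunarLander-v2")
    returns = []
    for ep in range(n_episodes):
        obs, _ = env.reset(seed=seed + ep)
        done, ep_return = False, 0.0
        while not done:
            action = policy(obs)  # assumed callable: obs -> discrete action
            obs, reward, terminated, truncated, _ = env.step(action)
            ep_return += reward
            done = terminated or truncated
        returns.append(ep_return)
    env.close()
    return float(np.mean(returns)), float(np.std(returns))
```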
|
|
|
## Usage |
|
|
|
This model can be used for the following purposes; a minimal loading sketch follows the list:
|
- Reinforcement learning research and experimentation |
|
- Educational purposes to understand PPO implementation |
|
- Baseline comparison for LunarLander-v2 experiments |
|
- Fine-tuning starting point for similar control tasks |
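
Concretely, loading and querying the agent might look like the sketch below. The checkpoint filename (`ppo_lunarlander.pt`) is a hypothetical placeholder, and `Agent` refers to the representative module sketched in the Technical Implementation section; check the repository files for the actual artifact names and architecture.

```python
import torch

# Hypothetical usage sketch: the checkpoint name is a placeholder and the
# Agent class is the representative sketch from the architecture section.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
agent = Agent(obs_dim=8, n_actions=4).to(device)
agent.load_state_dict(torch.load("ppo_lunarlander.pt", map_location=device))
agent.eval()

def policy(obs):
    """Greedy action selection from the trained actor head."""
    with torch.no_grad():
        obs_t = torch.as_tensor(obs, dtype=torch.float32, device=device)
        logits = agent.actor(obs_t)
        return int(torch.argmax(logits).item())
```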
|
|
|
## Technical Implementation |
|
|
|
### Architecture Details |
|
|
|
The model uses an Actor-Critic architecture implemented in PyTorch (a representative sketch follows the list):
|
- **Actor Network**: Outputs action probabilities for the discrete action space |
|
- **Critic Network**: Estimates state values for advantage computation |
|
- **Shared Features**: Common feature extraction layers (if applicable) |
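
A representative module for this setup is sketched below. The layer widths (two hidden layers of 64 units), tanh activations, orthogonal initialization, and the use of separate rather than shared actor and critic networks follow common PPO implementations for LunarLander-v2; they are assumptions about, not a specification of, this checkpoint.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

def layer_init(layer: nn.Linear, std: float = 2 ** 0.5, bias_const: float = 0.0):
    # Orthogonal weight init is a common choice in PPO implementations.
    nn.init.orthogonal_(layer.weight, std)
    nn.init.constant_(layer.bias, bias_const)
    return layer

class Agent(nn.Module):
    """Actor and critic MLPs over the 8-dimensional LunarLander observation."""

    def __init__(self, obs_dim: int = 8, n_actions: int = 4):
        super().__init__()
        self.critic = nn.Sequential(
            layer_init(nn.Linear(obs_dim, 64)), nn.Tanh(),
            layer_init(nn.Linear(64, 64)), nn.Tanh(),
            layer_init(nn.Linear(64, 1), std=1.0),
        )
        self.actor = nn.Sequential(
            layer_init(nn.Linear(obs_dim, 64)), nn.Tanh(),
            layer_init(nn.Linear(64, 64)), nn.Tanh(),
            layer_init(nn.Linear(64, n_actions), std=0.01),
        )

    def get_value(self, x: torch.Tensor) -> torch.Tensor:
        return self.critic(x)

    def get_action_and_value(self, x: torch.Tensor, action=None):
        # Sample from the categorical policy (or evaluate a given action).
        dist = Categorical(logits=self.actor(x))
        if action is None:
            action = dist.sample()
        return action, dist.log_prob(action), dist.entropy(), self.critic(x)
```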
|
|
|
### PPO Algorithm Features |
|
|
|
- **Clipped Surrogate Objective**: Prevents large policy updates (see the loss sketch after this list)
|
- **Value Function Clipping**: Stabilizes value function learning |
|
- **Generalized Advantage Estimation**: Reduces variance in advantage estimates |
|
- **Multiple Epochs**: Updates policy multiple times per batch of experience |
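
These features come together in the per-minibatch update. The sketch below computes the clipped surrogate policy loss, a value loss clipped around the old value estimate, and an entropy bonus, weighted by the coefficients from the hyperparameter table; variable names are illustrative and the original implementation may differ in detail.

```python
import torch

def ppo_loss(new_logprob, old_logprob, advantages, new_value, old_value,
             returns, entropy, clip_coef=0.2, vf_coef=0.5, ent_coef=0.01):
    """Clipped-surrogate PPO loss for one minibatch (illustrative sketch)."""
    # Advantage normalization reduces the scale sensitivity of the update.
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

    # Probability ratio between the new and old policy.
    ratio = (new_logprob - old_logprob).exp()

    # Clipped surrogate objective (maximized, so we minimize its negative).
    pg_loss1 = -advantages * ratio
    pg_loss2 = -advantages * torch.clamp(ratio, 1 - clip_coef, 1 + clip_coef)
    pg_loss = torch.max(pg_loss1, pg_loss2).mean()

    # Value loss, clipped around the old value estimate for stability.
    v_unclipped = (new_value - returns) ** 2
    v_clipped = old_value + torch.clamp(new_value - old_value, -clip_coef, clip_coef)
    v_loss = 0.5 * torch.max(v_unclipped, (v_clipped - returns) ** 2).mean()

    # Entropy bonus encourages exploration.
    return pg_loss + vf_coef * v_loss - ent_coef * entropy.mean()
```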
|
|
|
## Environment Information |
|
|
|
**LunarLander-v2** is a Box2D-based control task from OpenAI Gym in which an agent must learn to:
|
- Land a lunar lander safely on a landing pad |
|
- Control thrust and rotation to manage descent |
|
- Balance fuel efficiency with landing accuracy |
|
- Handle continuous state space and discrete action space |
|
|
|
**Action Space**: Discrete(4) |
|
- 0: Do nothing |
|
- 1: Fire left orientation engine |
|
- 2: Fire main engine |
|
- 3: Fire right orientation engine |
|
|
|
**Observation Space**: Box(8) containing: |
|
- Position (x, y) |
|
- Velocity (x, y) |
|
- Angle and angular velocity |
|
- Left and right leg ground contact |
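
Both spaces can be confirmed directly from the environment. The snippet below assumes the Gym >= 0.26 API (older versions return only the observation from `reset()` and a 4-tuple from `step()`):

```python
import gym

env = gym.make("LunarLander-v2")
print(env.action_space)             # Discrete(4)
print(env.observation_space.shape)  # (8,) -- the state variables listed above

obs, info = env.reset(seed=1)
action = env.action_space.sample()
obs, reward, terminated, truncated, info = env.step(action)
env.close()
```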
|
|
|
## Training Environment |
|
|
|
- **Framework**: Custom PyTorch PPO implementation |
|
- **Parallel Environments**: 4 concurrent environments for data collection (see the sketch after this list)
|
- **Total Training Budget**: 50,000 environment timesteps, summed across all parallel environments
|
- **Experience Collection**: On-policy learning with trajectory batches |
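
Collecting rollouts from 4 environments at once is typically done with Gym's vectorized API. The sketch below shows one common setup (a synchronous vector env plus an episode-statistics wrapper, using the Gym >= 0.26 API); it is an assumption about how data collection was structured, not a copy of the original script.

```python
import gym

NUM_ENVS = 4  # matches the "Number of Environments" hyperparameter

def make_env():
    def thunk():
        env = gym.make("LunarLander-v2")
        env = gym.wrappers.RecordEpisodeStatistics(env)  # track episodic returns
        return env
    return thunk

envs = gym.vector.SyncVectorEnv([make_env() for _ in range(NUM_ENVS)])
obs, info = envs.reset(seed=1)
print(obs.shape)  # (4, 8): one 8-dim observation per parallel environment
```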
|
|
|
## Limitations and Considerations |
|
|
|
- The agent's performance is weak: a mean reward of -113.57 is well below the score of roughly 200 at which LunarLander-v2 is considered solved, and episode-to-episode variance is high
|
- Training was limited to 50,000 timesteps, which may be insufficient for optimal performance |
|
- Performance may vary significantly across different episodes due to the stochastic nature of the environment |
|
- The model has not been tested on variations of the LunarLander environment |
|
|
|
## Citation |
|
|
|
If you use this model in your research, please cite: |
|
|
|
```bibtex
@misc{cllv2_ppo_lunarlander,
  author = {Adilbai},
  title = {PPO Agent for LunarLander-v2},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/Adilbai/cLLv2}
}
```
|
|
|
## License |
|
|
|
Please refer to the repository license for usage terms and conditions. |
|
|
|
## Contact |
|
|
|
For questions or issues regarding this model, please open an issue in the model repository or contact the model author. |