---
language:
- en
license:
- Apache 2.0
base_model: mistralai/Mistral-7B-Instruct-v0.1
---

# GeoEmbedding

The GeoEmbedding model is a geoscience-specific text embedding model built upon a high-performance large language model and fine-tuned on both general-purpose and in-domain geoscientific datasets. It produces accurate, context-aware vector representations of geoscientific texts, forming the backbone of vector-based retrieval in the RAG pipeline.

> 👉 For full documentation, see: https://github.com/GeoGPT-Research-Project/GeoGPT-RAG

## Quickstart

To load the GeoEmbedding model with the sentence-transformers library, use the following snippet:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

task_description = 'Given a web search query, retrieve relevant passages that answer the query'

def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery: {query}'

model_name_or_path = 'GeoGPT/GeoEmbedding'
model = SentenceTransformer(model_name_or_path, device="cuda", trust_remote_code=True)

queries = [
    "What is the main cause of earthquakes?",
    "How do sedimentary rocks form?",
]
passages = [
    "Earthquakes occur due to the sudden release of energy in the Earth's crust, often caused by tectonic plate movements along fault lines.",
    "Sedimentary rocks form through the deposition and compaction of mineral and organic particles over time, typically in water bodies.",
]

# Queries are prefixed with the task instruction; passages are encoded as-is
queries = [get_detailed_instruct(task_description, query) for query in queries]

q_vecs = model.encode(queries, normalize_embeddings=True)
p_vecs = model.encode(passages, normalize_embeddings=True)

print(np.dot(q_vecs, p_vecs.T))
#[[0.6369 0.2092]
# [0.2499 0.8411]]
```

## License and Uses

GeoEmbedding is licensed under [Apache License 2.0](https://github.com/GeoGPT-Research-Project/GeoGPT-RAG/blob/master/Apache_LICENSE).
GeoEmbedding is trained on the foundation of [Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1), which is also licensed under the [Apache License 2.0](https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/apache-2.0.md). It is your responsibility to ensure that your use of GeoEmbedding adheres to the terms of both the GeoEmbedding model and its upstream dependency, [Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1).

The model is not intended for use in any manner that violates applicable laws or regulations, nor for any activities prohibited by the license agreement. Additionally, it should not be used in languages other than those explicitly supported, as outlined in this model card.

## Limitations

GeoEmbedding is trained on English datasets, and performance may be suboptimal for other languages.
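Because the quickstart encodes queries and passages with `normalize_embeddings=True`, the dot product of the resulting matrices equals cosine similarity, and retrieval reduces to picking the highest-scoring passage per query. A minimal sketch of that ranking step, using small hand-made normalized vectors in place of real `model.encode(...)` output (the `top_k_passages` helper is illustrative, not part of the GeoGPT API):

```python
import numpy as np

def top_k_passages(q_vecs: np.ndarray, p_vecs: np.ndarray, k: int = 1) -> np.ndarray:
    """Rank passages for each query by cosine similarity.

    Assumes both embedding matrices are already L2-normalized,
    so the dot product equals cosine similarity.
    """
    scores = q_vecs @ p_vecs.T
    # Sort scores descending and keep the k best passage indices per query
    return np.argsort(-scores, axis=1)[:, :k]

# Toy unit vectors standing in for model.encode(..., normalize_embeddings=True)
q_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])
p_vecs = np.array([[0.8, 0.6], [0.6, 0.8]])
print(top_k_passages(q_vecs, p_vecs, k=1))  # best passage index per query
# [[0]
#  [1]]
```

In a full RAG pipeline the top-k passages selected this way would be handed to the generator as retrieved context.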