GeoEmbedding

The GeoEmbedding model is a geoscience-specific text embedding model built upon a high-performance large language model and fine-tuned on both general-purpose and in-domain geoscientific datasets. It produces accurate, context-aware vector representations of geoscientific texts, forming the backbone of vector-based retrieval in the RAG pipeline.

👉 For full documentation, see: https://github.com/GeoGPT-Research-Project/GeoGPT-RAG

Quickstart

To load the GeoEmbedding model with Sentence Transformers, use the following snippet:

import numpy as np
from sentence_transformers import SentenceTransformer

task_description = 'Given a web search query, retrieve relevant passages that answer the query'
def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery: {query}'

model_name_or_path = 'GeoGPT/GeoEmbedding'

model = SentenceTransformer(model_name_or_path, device="cuda", trust_remote_code=True)

queries = [
    "What is the main cause of earthquakes?",
    "How do sedimentary rocks form?",
]

passages = [
    "Earthquakes occur due to the sudden release of energy in the Earth's crust, often caused by tectonic plate movements along fault lines.",
    "Sedimentary rocks form through the deposition and compaction of mineral and organic particles over time, typically in water bodies.",
]

queries = [get_detailed_instruct(task_description, query) for query in queries]

q_vecs = model.encode(queries, normalize_embeddings=True)
p_vecs = model.encode(passages, normalize_embeddings=True)

# Because the embeddings are L2-normalized, the dot product equals cosine similarity.
print(np.dot(q_vecs, p_vecs.T))
# [[0.6369 0.2092]
#  [0.2499 0.8411]]
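As a sketch of how these similarity scores can drive retrieval: for each query, the passage with the highest score is the best match. The snippet below is illustrative only; it hardcodes the example score matrix from the output above rather than re-running the model.

import numpy as np

# Illustrative score matrix with the same shape as np.dot(q_vecs, p_vecs.T):
# rows are queries, columns are passages.
scores = np.array([[0.6369, 0.2092],
                   [0.2499, 0.8411]])

# Index of the top-scoring passage for each query.
best = scores.argmax(axis=1)
print(best)  # [0 1]: query 0 matches passage 0, query 1 matches passage 1

In the RAG pipeline, the top-ranked passages per query would then be passed to the generator as retrieved context.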

License and Uses

GeoEmbedding is licensed under the Apache License 2.0. GeoEmbedding is trained on the foundation of Mistral-7B-Instruct-v0.1, which is also licensed under the Apache License 2.0. It is your responsibility to ensure that your use of GeoEmbedding adheres to the terms of both the GeoEmbedding model and its upstream dependency, Mistral-7B-Instruct-v0.1.

The model is not intended for use in any manner that violates applicable laws or regulations, nor for any activities prohibited by the license agreement. Additionally, it should not be used in languages other than those explicitly supported, as outlined in this model card.

Limitations

GeoEmbedding is trained on English datasets, and performance may be suboptimal for other languages.
