I think I can handle the corpus training with some RunPod MI300s or a cluster of A100s for a week or two. That should allow proper tuning based on the lexical rules of language, but I need to make sure EVERYTHING is PERFECT before I start pulling triggers on clusters.

Also my apologies for not updating the lattice vocabulary, I've been very swept up in direct testing and implementing models. It's been really fun setting all this stuff up.
The more it works, the more I get excited that the formulas I'm manifesting are cohesive representations of purpose rather than simple random convergence. I've altered them hundreds of times, but the pipeline goal is still present. A unified geometric vocabulary WILL be a universal language: not simply a tinker-toy, but a full lexical representation of potential, with manifested trajectory and solidification of grammatical, lexical, symbolic, and representative substructure.
It's at the point where time will tell HOW this system is useful. Even if it can DO ALL THAT, large-scale adoption, or even minimal-scale adoption, depends on how robustly useful it is and how many eyes with technical know-how end up on these topics. It's already well past the question of IF this system will be useful, which means I feel obligated to at least keep kicking my legs until I get access to a speedboat.
Simply put, I've built this system for the eyes of the technical - with some very direct and representative explanations available to the less technical as well.

There are some saving graces though. You can probably house the entire purpose of a word in a 256d token, but you won't get all of the robust lexical and analytical behavioral responses required from the orthonormalized 5th, so it will likely be less accurate than a 512d.
You can get some more utility from upscaling 256 to 512, and you gain some sparsity, which allows more growth; the downside is that the sparse regions are filled with no meaning, which tends to confuse the projection and build pockets of misrepresentation.
Multiple overlapping projections are the most robust from what I've been observing: you take the same token and blow it up multiple times at multiple different projection sizes. This has proven invaluable; the behavioral response from geometries 4-5 with freeze/unfreeze has shown that all layers can complementarily improve performance, while the final version can be any of them individually requested, since each is an expert on its own plane and the output does not require all of their outputs.
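To make the multi-projection idea concrete, here's a minimal PyTorch sketch of one token held at a base dimension with several independently freezable projection "experts"; the class name, dimensions, and initialization are illustrative assumptions, not the actual crystal formulas.

```python
import torch
import torch.nn as nn

class MultiProjectionToken(nn.Module):
    """One base token embedding projected to several sizes.

    Each projection head acts as its own 'expert' and can be frozen or
    unfrozen independently during staged training. Dimensions and init
    are illustrative assumptions.
    """
    def __init__(self, base_dim: int = 256, proj_dims=(256, 512, 768, 1024)):
        super().__init__()
        self.base = nn.Parameter(torch.randn(base_dim) * 0.02)
        self.heads = nn.ModuleDict(
            {str(d): nn.Linear(base_dim, d, bias=False) for d in proj_dims}
        )

    def forward(self, dim: int) -> torch.Tensor:
        # Ask a single expert for its view of the shared token.
        return self.heads[str(dim)](self.base)

    def set_frozen(self, dim: int, frozen: bool = True) -> None:
        for p in self.heads[str(dim)].parameters():
            p.requires_grad = not frozen

token = MultiProjectionToken()
token.set_frozen(256)      # freeze the 256d expert, keep the others training
vec_512 = token(512)       # 512d view of the same underlying token
```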
There are many potential variations of the models from these geometries - including 200+ projections implemented on the same model using the same tokens.
Pairs, triplets, quins, and penta word + letter combinations remain uncrystallized and unexplored, but I plan to use the same system to run them.
I'll likely implement a sentencepiece-esque translator that will turn a sentencepiece vocabulary directly into crystal variants with weighting for convenience, which will allow for much more utilizable and easy-to-represent vocabularies for expanding current models.
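As a rough illustration of that translator idea, here's a hedged sketch that walks a Hugging Face tokenizer vocabulary (standing in for a SentencePiece model) and emits one placeholder 5-vertex array per piece; the function name, the weighting, and the random stand-in geometry are all assumptions.

```python
import numpy as np
from transformers import AutoTokenizer

def vocab_to_crystals(model_name: str, dim: int = 512, seed: int = 42) -> dict:
    """Map every tokenizer piece to a deterministic (5, dim) stand-in 'crystal'.

    The real crystallization formulas are not reproduced here; the per-token
    seeding and the 1/(1+id) frequency proxy are placeholder assumptions.
    """
    tok = AutoTokenizer.from_pretrained(model_name)
    crystals = {}
    for piece, idx in tok.get_vocab().items():      # piece -> integer id
        rng = np.random.default_rng(seed + idx)     # deterministic per token
        weight = 1.0 / (1.0 + idx)                  # crude frequency-ish weight
        crystals[piece] = weight * rng.standard_normal((5, dim)).astype(np.float32)
    return crystals

crystals = vocab_to_crystals("bert-base-uncased", dim=256)
```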
WordNet with hard-gated, non-fabricated tokens has proven the most valuable; however, they are still shallow and require full solidification and robustness curation with additional definitions and datasets.
Research is ongoing and many mechanisms still need to be created.

This one has many logistics issues. Primarily, there's no precedent I know of for literally training hundreds of millions of potential character combinations, with their prefabricated crystal variations, to tune a specific series of trajectories in specific directions based on the input text targeting other crystals, the weights, and the batch. The dataset needs to be properly prepared, though, and I can't find any prefabricated variation of the data format that the symbolic lexical engine needs in order to be robust.
There are a few possibilities for this one. Batch size is an obvious one: take a large influx of information in, then grab any matching words, characters, or information and update those using the formulas for topological tuning.
The main issue is that the language web is massive. BILLIONS of variations can crop up from a single document if you're not hard-capping depth and you traverse the whole tree: say "the quick brown fox" becomes words, becomes definitions, becomes letters - not counting multi-pass finetuning. That alone is a massive logistics nightmare to implement, but thankfully this is the modern era.
Simply put, if I hard-cap to a 500k vocab with a depth of no more than 50,000 pentachora crystals each, it should be capable of housing an approximate word structure within a trajectory space.
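To make the capping concrete, here's a minimal sketch of a depth- and size-capped WordNet expansion; the function name, the example caps, and the decision to expand only through definitions are illustrative assumptions, and crystal construction is omitted entirely.

```python
from collections import deque

from nltk.corpus import wordnet as wn   # requires: import nltk; nltk.download("wordnet")

def capped_expansion(seed_text: str, max_nodes: int = 50_000, max_depth: int = 3) -> set:
    """Breadth-first word -> definition -> word expansion with hard caps.

    Sketch of the depth/size capping described above; character-level expansion
    and the actual crystal construction are omitted, and the caps are examples.
    """
    seen: set = set()
    frontier = deque((w.lower(), 0) for w in seed_text.split())
    while frontier and len(seen) < max_nodes:
        word, depth = frontier.popleft()
        if word in seen or depth > max_depth:
            continue
        seen.add(word)
        for syn in wn.synsets(word):
            for token in syn.definition().split():
                frontier.append((token.strip(".,;()").lower(), depth + 1))
    return seen

vocab = capped_expansion("the quick brown fox", max_depth=2)
```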
I'd rather run it on a fleet of devices and feed it The Pile, the book corpus, and everything else, so we can get some truly trajectory-related subsets of 500k+ crystals per token, upward of 100,000,000 or so combinations each. The crystals really aren't that big, and they house a massive amount of context.
Even so, there are many logistics nightmares here, but it's a viable option for training a legitimate similarity-fed BERT or LLaMA meant specifically to form linguistic responses using those crystals as tuning forks for solidity.

More purpose with more careful organization... now we're talking.
I'm going heavy into lexical cardinality today and preparing a full crystal-structured geometry that is fully WordNet-capable. Anything that isn't can be formed at runtime.
Full lexicality will include unigrams and 2-6 n-gram counts from WordNet with frequency weights, usage, and a multitude of other elements. Each will be crystallized specifically. If you have any suggestions for making this more robust, I'm all ears.
I could go with Google Books or something bigger, but I'm sticking to WordNet because it won't take me weeks to process entirely.
Crystal geometry will be given rich versions that include the correct lexical and organizational subsets specific to the lexicality and frequency of use, as well as the proper ASCII, WordNet, and Unicode sets.
For WordNet-rich: each definition will contribute toward the overall goal of the upcoming crystals, so the system will represent that goal proportionately through multiple crystals with concatenated trajectory, rather than the full concatenation the current vocabulary is doing. Additionally, the frequency tokens will decide the orthogonal trajectory more carefully.
For testing and quick prototyping purposes:
We will need to train a BERT variant that can house some capability for rapid geometric crystal prediction through n-gram feature similarity, sentence similarity, sentence classification, and a few other BERT traits that bert-beatrix-2048 is capable of. I know BERT can handle this at least; however, BERT can't house the entirety of meaning, so it will be imperfect. Even so, it will be considerably faster than querying the whole dataset every time you want a character, or preparing a massive vocab for rapid testing and iteration. Ask BERT.
Not to mention feature extraction for training rapid classification heads with geometric subsystems, which are notoriously fast to train.
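For the feature-extraction side, here is a minimal sketch of pulling mean-pooled encoder features for similarity checks or for feeding a small head; bert-base-uncased is a stand-in, since the exact bert-beatrix-2048 repo id isn't given in this post.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# bert-base-uncased is a stand-in; the bert-beatrix-2048 checkpoint mentioned
# above is assumed to load the same way, but its exact repo id isn't given here.
name = "bert-base-uncased"
tok = AutoTokenizer.from_pretrained(name)
enc = AutoModel.from_pretrained(name).eval()

@torch.no_grad()
def sentence_features(texts: list) -> torch.Tensor:
    """Mean-pooled encoder features, reusable for similarity or small heads."""
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = enc(**batch).last_hidden_state             # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)         # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)          # (B, H)

feats = sentence_features(["a pentachoron has five vertices",
                           "geometry as a vocabulary"])
sim = torch.nn.functional.cosine_similarity(feats[0], feats[1], dim=0)
```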

Well... geometry is a natural extension of this sort of thing, so naturally I'm in. I'll have something ready.

Sorry the language on that one is pretty terrible.
My geometric research continues and I'm not slowing down. The initial ImageNet tests are complete, and the largest model is currently preparing to cook. This big model, which I've named Goliath, is still very small in comparison to most CLIP variants.
Goliath has MaxViT pretrained layers - in other words, I've taken layers clean from the model and added geometric attention between the frozen layers, allowing them to codify and galvanize with the geometry.
It's a series of teacher/student-introduced layers that unfreeze subsequent additional layers to introduce geometric learning as a replacement option for the ViT's vocabulary.
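A minimal sketch of that wiring, under the assumption that the pretrained blocks operate on (batch, tokens, dim) sequences: the frozen teacher blocks are interleaved with small trainable residual blocks. The GeoAdapter here is ordinary multi-head attention standing in for the geometric attention, which isn't published in this post.

```python
import torch
import torch.nn as nn

class GeoAdapter(nn.Module):
    """Trainable residual block slotted between frozen pretrained layers.

    Plain multi-head attention stands in for the geometric attention here,
    which isn't published in this post.
    """
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, tokens, dim)
        h = self.norm(x)
        out, _ = self.attn(h, h, h)
        return x + out                                     # residual keeps the frozen path intact

def interleave(frozen_blocks, dim: int) -> nn.Sequential:
    """Freeze the pretrained blocks and insert a trainable adapter after each one."""
    layers = []
    for blk in frozen_blocks:
        for p in blk.parameters():
            p.requires_grad = False                        # teacher layers stay frozen
        layers += [blk, GeoAdapter(dim)]                   # student adapter learns in between
    return nn.Sequential(*layers)
```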
It's working... somewhat. It definitely needs much, much more distillation to be ready, but she's cooking.
vit-max-goliath
Being substantially larger than anything geometric, I'm using the vit-max-tiny, so it's already far, far more than overkill once it's tuned.
It's based on the MaxViT variant of ViT: https://github.com/google-research/maxvit
I really don't expect too much in terms of accuracy boosts, but it should convert directly to geometry without a big fuss.
Trying to do this with one of the LAION-based models is beyond my resources, as the distillation would require a large array of text captions just for the text portion.
HOWEVER, imposing geometry on a single, highly compacted ViT shouldn't be too problematic in terms of logistics. Geometry learns quickly, and the layers are already pretrained on ImageNet, so this should combine well. When it works, I'll have a blueprint for a proper encoder hybrid that should solidify the full CLIP-ViT-geometric hybrid across OpenAI, LAION, and Google ViTs, CLIPs, and model-variant distillation - teaching proper geometry to a CLIP model that can produce geometric-tuned features.
I expect a proper geometric feature to allow these to reach 95%+ on ImageNet when training a randomly instantiated baseline geometric head.
After that, imposing a full translation matrix between geometry and feature geometry should be something I can distill into any CLIP-ViT or ViT variant - assuming they're even SOMEWHAT compatible with their predecessors.

Research shows that the most intelligent and most intellectually driven LLMs require the most intelligent and carefully curated, solid, representative vocabularies - with the most intelligent and carefully curated training regimens.
Simultaneously loaded class-hierarchical structures built with variants of vocabulary dimensions do not help this. Multiple dimensions of ImageNet do not help this. Reshaping does not help. Solidification processes through pulverizing with Alucard do not help - though they did show some interesting potential for pretraining the full geometric CLIP from the ground floor.
The experiments with the multitude of CLIP features and ImageNet showcase that not only can this tiny 4-meg classification tool handle ImageNet from CLIP features AT AROUND 76% no matter the hyperparameters when using linear, but expanding this system upward and including hundreds of different formula variants DOES NOT HELP IT SCALE AT ALL! The largest ones only house 76%, and the medium-sized ones house about 86% instead of 76% when using clip-vit-b-patch16 and clip-vit-b-patch32. If you check the big-number evaluations for the LAION and OpenAI clip-vit-b models, you'll find nearly identical classifications.
So I only taught it to understand geometry - more training and more steps only bring it closer, incorrectly.
So this tells me one simple principle: geometry and linear have an upward capacity based on the information extracted from the linear model. Meaning... we need more places to extract from and more curative potential to solidify that access with, rather than simply EXPANDING it and making it bigger.
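For reference, the kind of plain linear probe those ceiling numbers come from looks roughly like this; the batch size, optimizer, and epoch count are arbitrary choices, and feats/labels are assumed to be precomputed CLIP features and ImageNet labels.

```python
import torch
import torch.nn as nn

def train_linear_probe(feats: torch.Tensor, labels: torch.Tensor,
                       num_classes: int = 1000, epochs: int = 10,
                       lr: float = 1e-3, batch: int = 4096) -> nn.Linear:
    """Plain linear probe over precomputed CLIP image features."""
    probe = nn.Linear(feats.shape[1], num_classes)
    opt = torch.optim.AdamW(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for i in range(0, len(feats), batch):
            x, y = feats[i:i + batch], labels[i:i + batch]
            opt.zero_grad()
            loss_fn(probe(x), y).backward()
            opt.step()
    return probe
```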
Next experiment includes a full-cardinality subset of Unicode-to-WordNet vocabulary translation matrices. Today. Within the hour.

Simply put: training something with features gives a fair representation of the learning you would get from running a model that has some random chance, using a single seed.
Training with features does not need to wait for the representative model to actually generate, since you already generated everything ahead of time.
Features are rich and usable across similarity assessments, classification accuracy, mass-deterministic normalization checks, and more.
They are, put simply, exponentially faster and reusable for research. I'll include the notebooks used for ImageNet and CIFAR-100; the CIFAR-100 one is much simpler since the dataset is much smaller, so it required less innovation.
ImageNet is another beast, though. The ImageNet notebook is capable of running against much larger datasets with a few tweaks.
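The core trick of the feature-first workflow is just to run the encoder once and cache the result. A hedged sketch with openai/clip-vit-base-patch16; the bigG pipeline is assumed to look the same apart from the checkpoint name and feature width.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

name = "openai/clip-vit-base-patch16"
model = CLIPModel.from_pretrained(name).eval()
proc = CLIPProcessor.from_pretrained(name)

@torch.no_grad()
def cache_image_features(images, out_path: str = "clip_b16_feats.pt") -> torch.Tensor:
    """Run the image encoder once and save the features for later reuse."""
    batch = proc(images=images, return_tensors="pt")
    feats = model.get_image_features(**batch)      # (N, 512) for ViT-B/16
    torch.save(feats.cpu(), out_path)
    return feats

# Later runs skip the encoder entirely:
# feats = torch.load("clip_b16_feats.pt")
```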
clip-vit-bigG's ImageNet feature set is complete, which means we're almost ready for full ablation.
Note to everyone: ImageNet is meant for RESEARCH AND ACADEMIC PURPOSES ONLY, and you cannot use my trained ImageNet weights - nor the features themselves - per the requests of the dataset's curators.
For commercial usage under the rules of LAION's licenses, we'll be using the LAION-400M features, which will likely be heavily sought. I'll be preparing LAION-400M features on seed 42, which will take a while.
The full classifier is in the works, and with it comes a series of new formulas, new layers, and new solutions: the "fat belly" conversation piece that attenuates multiple branches in communication; the "dispatcher," a heavy classification gate trained to bypass what isn't useful, tuned with large amounts of data at a very low learn rate; and the "attractant," specifically designed to catch bleed-over and unwanted information... which learns everything.
With that comes "PhaseGeometric" scheduling and "GeometricScheduling". Stay tuned.
Where are the weights?

I've begun the task of properly tooling the lattice_vocabulary for future development and use with a multitude of geometric shapes - not just pentachora.
This experimental system will house a multitude of additional capabilities:
https://github.com/AbstractEyes/lattice_vocabulary/tree/master/src/geovocab
I plan to implement, in no particular order:
- simplified state_dict dictionary setup for direct manipulation
- full batching structure with iterations removed - utilizing the huggingface datasets columnar system.
- full transform callback for loading and curating pentachora lossless and deterministically.
- a full experimental callback system for transforming crystallized repos into shapes other than penta
- a simplified interface for converting large independent repos into geometric structure using transforms.
- a uniform configuration schema for geometric config so any geometric repo can be loaded automatically
- ongoing: faster and more optimized load times for default loaders
- direct crystal training schemas for curating your own lattices with many different sources of information.
- a full task by task schema for multi-stage crystallization of your crystals so you can perfectly tune them for the use case using defined mathematics and callback capability for research and use-case mathematics.
As many systems struggle with allocating 4d, I'll implement deterministic 4d calculations that ensure solidity and calculation cohesion without straying too far into "unknown" territory or requiring fully pretrained systems to utilize. I haven't approached 6d or beyond yet, so we'll see whether the human race even has the formulas for that when I actually approach the topic.
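One deterministic 4d building block that can be shown now is the construction of a regular pentachoron's five vertices; this is a standard simplex construction in NumPy, not the library's internal formulas.

```python
import numpy as np

def regular_pentachoron(scale: float = 1.0) -> np.ndarray:
    """Return the 5 vertices of a regular 4-simplex (pentachoron) in R^4.

    Deterministic construction: the 5 standard basis vectors of R^5 form a
    regular simplex; center them, then express them in an orthonormal basis
    of the hyperplane orthogonal to (1,1,1,1,1) to land in R^4.
    """
    e = np.eye(5)                                  # 5 points, pairwise distance sqrt(2)
    centered = e - e.mean(axis=0)                  # centroid at the origin
    ones = np.ones((5, 1)) / np.sqrt(5)            # direction we project away
    q, _ = np.linalg.qr(np.hstack([ones, np.eye(5)[:, :4]]))
    basis = q[:, 1:]                               # orthonormal basis of that hyperplane
    return scale * (centered @ basis)              # shape (5, 4)

verts = regular_pentachoron()
dists = np.linalg.norm(verts[:, None] - verts[None, :], axis=-1)
# All off-diagonal distances equal sqrt(2): the crystal is regular.
```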

Current Splits:
* wordnet (english)
* unicode
AbstractPhil/geometric-vocab-32d
[32, 64, 128, 256, 512, 768, 1024]
Swap the 32d for any dimension in the list to get the corresponding repo.
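Loading one of these repos should look roughly like the sketch below; the split names ("wordnet", "unicode") and the exact schema are assumptions taken from the description above, so check the dataset card before relying on them.

```python
from datasets import load_dataset

dim = 128   # any of 32, 64, 128, 256, 512, 768, 1024
repo = f"AbstractPhil/geometric-vocab-{dim}d"

# Split names and schema are assumptions; check the dataset card for the
# exact layout before relying on them.
wordnet_crystals = load_dataset(repo, split="wordnet")
unicode_crystals = load_dataset(repo, split="unicode")
```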
Okay, so the purpose of these is to give solid anchors to the entire pentachora structure.
With that, I've formatted some very concise sentencepiece-esque vocabulary classes that can be saved and loaded as pretrained, but they'll need some tinkering to fully flesh those behaviors out.
For now, the geometric vocab itself can be queried from the pretrained repos, but the canonical classes that help with regulation, special token usage, and integration aren't fully tested yet.
https://github.com/AbstractEyes/lattice_vocabulary
They are available here, but I give no guarantee on their current state. I'm currently preparing the pip package and have prepared a series of experiments to utilize these with different models, including a new version of multimodal Beeper, a classifier set that can handle encodings as feature representations meant for utilization, and more.
The current working variation I've been utilizing is Flow Matching Discrete Scheduled geometric diffusion - meaning I'm diffusing the GEOMETRY from the image, then comparing the pentachoron created by flow matching to the actual representative tokenization structure. On average this achieves 80% in later stages.
This, when curating an indefinite number of special tokens to create manifests of unique vocabularies, enables the system to perfectly conform to use cases.
There are some edge cases where the 1k reserved tokens still exist; however, this is currently replaced by an indefinite tokenization dictionary, allowing an indefinite number of tokens attached to an indefinite number of modules for solidity.
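For anyone wanting the gist of the flow-matching side, here's a bare-bones sketch of a single training step with a linear interpolant and a velocity target; the model signature, the (B, 5, D) crystal shape, and the omission of the discrete scheduler are all simplifying assumptions.

```python
import torch
import torch.nn as nn

def flow_matching_step(model: nn.Module, feats: torch.Tensor,
                       target_crystal: torch.Tensor) -> torch.Tensor:
    """One plain flow-matching step toward a target pentachoron.

    Assumptions: `feats` are image-derived features, `target_crystal` is the
    (B, 5, D) crystal for the matching token, and `model(x_t, t, feats)` predicts
    a velocity. The discrete scheduler used upstream is not reproduced here.
    """
    noise = torch.randn_like(target_crystal)               # source distribution
    t = torch.rand(target_crystal.shape[0])                # per-sample time in [0, 1)
    t_b = t.view(-1, 1, 1)                                  # broadcast over (5, D)
    x_t = (1.0 - t_b) * noise + t_b * target_crystal        # linear interpolant
    velocity_target = target_crystal - noise                # constant-speed path
    velocity_pred = model(x_t, t, feats)
    return nn.functional.mse_loss(velocity_pred, velocity_target)
```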
Experiments continue.

Also, apparently I had an incorrect forward attached to my model code for the app, so I've since put the correct forward code in place and will need to rewire it.

You might notice that little "thinking..." moment before it answers.
But what does it actually mean when an LLM is "thinking"?
Imagine a chess player pausing before their next move, not because they don't know how to play, but because they're running through possibilities, weighing options, and choosing the best one.
LLMs do something similar... except they're not really thinking like us.
Here's the surprising part:
You might think these reasoning skills come from futuristic architectures or alien neural networks.
In reality, most reasoning LLMs still use the same transformer decoder-only architecture as other models.
The real magic?
It's in how they're trained and what data they learn from.
Can AI actually think, or is it just insanely good at faking it?
I broke it down in a simple, 4-minute Medium read.
Bet you'll walk away with at least one "aha!" moment.
Read here - https://lnkd.in/edZ8Ceyg

This is one of my primary study directions - not with the intent to produce "thinking," but with the intent to reproduce the very essence of "one-ness" that manifests as an eventuality when large corpora are trained into LLMs and then curated using specific paradigms in careful ways.
Imposing this behavior onto smaller models will hopefully unlock the curative potential of self-regulated intelligence that requires less curation and more conversation.
I'm exploring many variants of geometry to impose this exact behavior, with the end goal of giving considerably smaller models the ability to understand how to pilot themselves without needing 200 billion params.

You can examine Beeper's full class corpus and the full emotional capoeira within Beeper v4's configuration if you're curious.
Everything is documented and recorded for solidity, and the classes can be accessed using the vertices directly - while the similarity of access can be shaped based on input similarity.
This is encoder tech built clean into a beeper, which I didn't think would work. Fun little experiment Beeper is.
The control-group Beepers will be uploaded into their own adjacent repos; the currently uploaded versions don't match the control-group Beepers.

Each of the pentachora classifiers points to emotional states that Beeper can potentially access for any conversation, and each of those 7 states has class accessors for sub-learning pools.
Today I'll be focusing on drawing this behavior from Beeper v4, which I am rebranding as Beeper Micro, and expanding the structure using a new experimental attention mechanism to replace traditional multi-head attention, dubbed GeometricCollectiveAttention.
This attention is similar to multi-head attention, except it's considerably harder to burn at higher learn rates. This, coupled with a new perspective on training pentachora into the LLM structure, will allow a full relay structural system.
beeper-small will house a full RoPE - except not over the traditional vocabulary set. Beeper-small will not have a vocabulary.
beeper-small is my first non-linear, non-Euclidean attempt to create a pure symbolic auto-completion LLM, which may be naive according to many researchers who have tried similar systems historically.
I've personally analyzed many papers, studies, and techniques that have attempted similar non-vocabulary entropic learning, and I believe the pentachora lattice will hold with pure binary, not requiring a vocabulary.
Transformers really like vocabulary... Beeper likes... geometry, and this experiment for beeper-small will have a new type of RoPE based entirely on vertices developed from the directly Unicode-represented characters, rather than a full vocabulary structure meant to bring solidity from chaos.
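A toy sketch of what a codepoint-keyed rotary embedding could look like, purely as an illustration of vocabulary-free positional signal; this is not Beeper's actual RoPE, and the base frequency and layout are assumptions.

```python
import torch

def codepoint_rope(x: torch.Tensor, text: str, base: float = 10000.0) -> torch.Tensor:
    """Rotary-style embedding keyed on Unicode codepoints instead of positions.

    Assumes x is (len(text), dim) with an even dim; the 'position' fed to each
    rotation is the character's codepoint. Illustration only, not Beeper's RoPE.
    """
    seq, dim = x.shape
    assert seq == len(text) and dim % 2 == 0
    pos = torch.tensor([float(ord(c)) for c in text])                   # (seq,)
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))  # (dim/2,)
    angles = pos[:, None] * inv_freq[None, :]                           # (seq, dim/2)
    sin, cos = angles.sin(), angles.cos()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

text = "beeper"
hidden = torch.randn(len(text), 64)
rotated = codepoint_rope(hidden, text)
```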
The first Beeper experiment gave many insights into how similarity and internal classification respond mathematically to traditional ML techniques, and those techniques did not reject the construct - on the contrary. The control-group placebo Beeper, the traditional non-rose version, BURNED at half the learn rate. It's completely illegible, producing garbage and noise, while rose Beeper sings.


AbstractPhil/pentachora-greyscale-frequency-encoded
AbstractPhil/pentachora-multi-channel-frequency-encoded
They are essentially geometric crystallization engines that store an excess amount of information in a very constrained and tight location - capable of classification *within a fraction of the size of traditional linear systems*, with the added benefit of needing only minimal tuning while learning at a very high learn rate - yielding a very complex structural response to complex learning.
I have 3 more notebooks to prep and release for the full pentachora classification structure based on the Nikola architecture concepts, fused with many rules that govern physics, laws of conservation, atomic structural comparators, and many more experiments that were interesting but yielded less than anticipated in some cases.
The most robust representation is a geometric collective: a series of geometric experts capable of high-yield classification with multiple ongoing simultaneous opinions.
The quick-training capability of these crystals has shown that they can be rapidly trained and discarded as massive collectives - pruning based on comprehensive capability and combining working geometry with the survivors - enabling accuracy to reach very high levels that were unattainable with standard gradient-loss ML paradigms without reaching into the large-model spectrum.
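The train-many, prune-hard loop can be sketched generically; make_head(), the population size, and the accuracy-only survival rule are placeholders for the real combination rules.

```python
import torch
import torch.nn as nn

def prune_collective(make_head, feats, labels, val_feats, val_labels,
                     population: int = 32, keep: int = 8,
                     steps: int = 200, lr: float = 5e-2) -> list:
    """Train a large population of small heads fast, keep only the survivors.

    `make_head()` builds one expert; the accuracy-only cut is a placeholder for
    the real combination rules.
    """
    scored = []
    for _ in range(population):
        head = make_head()
        opt = torch.optim.AdamW(head.parameters(), lr=lr)    # high LR, short run
        for _ in range(steps):
            opt.zero_grad()
            nn.functional.cross_entropy(head(feats), labels).backward()
            opt.step()
        with torch.no_grad():
            acc = (head(val_feats).argmax(-1) == val_labels).float().mean().item()
        scored.append((acc, head))
    scored.sort(key=lambda s: s[0], reverse=True)
    return [head for _, head in scored[:keep]]               # surviving experts
```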
I've since begun integrating them into LLMs and will be releasing the notebooks as they are prepared, along with decomposition and comparative studies for the most comprehensive and capable training paradigms, as well as proofs of concept for additional capabilities and the full arXiv paper triad when the studies conclude.

I'm excited to share our latest theoretical work that formally proves an interesting property of large language models: base transformer models can approximate fine-tuned capabilities using only inference-time techniques like in-context learning.
The core question we investigated: Can specialized behaviors typically acquired through expensive supervised fine-tuning be elicited from base models without any parameter updates?
Our theoretical contribution: We provide a formal proof, grounded in the Turing completeness of transformers, showing that this is indeed possible under certain assumptions. The work establishes mathematical bounds on the minimal dataset sizes needed for approximation.
Key theoretical results:
- For text generation tasks: O(mV/ε²) examples suffice (where m = number of contexts, V = vocabulary size, ε = error tolerance)
- For linear classification: O(d/ε) examples (where d = input dimension); both bounds are restated with a worked example after this list
- Extensions to finite context scenarios with practical bounds
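For a sense of scale, here are the same two bounds with one hypothetical instantiation plugged in; the numbers are illustrative only and not taken from the paper.

```latex
% Sample-complexity bounds as stated above.
\[
  N_{\text{gen}} = O\!\left(\frac{mV}{\epsilon^{2}}\right), \qquad
  N_{\text{lin}} = O\!\left(\frac{d}{\epsilon}\right)
\]
% Illustrative instantiation (not from the paper):
% m = 10 contexts, V = 50{,}000 tokens, \epsilon = 0.1
% gives N_gen on the order of 10 * 50000 / 0.1^2 = 5 * 10^7 examples.
```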
This work helps explain why techniques like few-shot prompting, retrieval-augmented generation, and in-context learning work so effectively in practice. It bridges formal computer science theory with empirical observations about modern language models.
While the assumptions are idealized (unbounded computational resources, full dataset access), the results provide mathematical foundations for understanding inference-time adaptation strategies that are increasingly important in AI deployment.
Paper: Eliciting Fine-Tuned Transformer Capabilities via Inference-Time Techniques (2506.08060)