AbstractPhil
Here are the first three surge-trained experts. They should encompass almost any need if used correctly.
The image line.
The image-trained SVAE structures are dubbed:
- SVAE-Fresnel
tiny - 64x64
small - 128x128
base - 256x256 <- cooking, current MSE = 0.000181 -> operating CV: 0.3769
large - 512x512 <- upcoming
xl - 1024x1024 <- upcoming
xxl - 2048x2048 <- upcoming
giant - 4096x4096 <- upcoming
The initial Fresnel shows the model can reconstruct images far out of scope, at entirely different sizes; entirely unseen images are fully reconstructed within the same MSE range as the trained images.
Tests show:
- The Fresnel models can piece images back together at higher accuracy and a lower error rate than running the full model. Tested up to 1024x1024 with near-perfect reconstruction: ~0.0000029 MSE.
- Fresnel CANNOT reconstruct noise directly: ~1.0 MSE.
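The piecemeal test can be sketched roughly like this. Everything here is a hypothetical illustration: `reconstruct_piecemeal` and the stand-in codec are mine, not the Fresnel weights or API; the real SVAE encoder/decoder would replace the lambda.

```python
import numpy as np

def reconstruct_piecemeal(image, codec, tile=128):
    """Split an image into tiles, run each tile through the codec,
    and reassemble the output in place."""
    out = np.zeros_like(image)
    h, w = image.shape[:2]
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            out[y:y+tile, x:x+tile] = codec(image[y:y+tile, x:x+tile])
    return out

def mse(a, b):
    return float(np.mean((a - b) ** 2))

# Stand-in codec: identity plus tiny noise, emulating a near-perfect SVAE.
rng = np.random.default_rng(0)
codec = lambda p: p + rng.normal(0.0, 1e-3, p.shape)
img = rng.random((1024, 1024))
recon = reconstruct_piecemeal(img, codec, tile=128)
err = mse(img, recon)  # roughly the codec's noise variance
```

Running a 128x128 model over a 1024x1024 image this way is what the bullet above describes: the per-tile error stays in the trained regime, so the assembled MSE does too.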
The 256x256 variant is cooking right now. The MSE is dropping rapidly, and it is already nearly as accurate as its 128x128 counterpart with only partial cooking.
The noise line. The noise-trained SVAE structures:
- SVAE-Johanna
This model learns and reconstructs noise; it will train a noise compressor that can deconstruct/reconstruct any noise automatically.
tiny - 64x64 <- first train faulted; tried 16 types of noise out of the gate, will restart with curriculum training.
small - 128x128 <- gaussian prototype ready, 0.012 MSE <- back in the oven with 16-spectrum noise
small - 128x128 - 16 noise <- MSE=0.053170, CV=0.4450 -> learning 16 noise types
base - 256x256 <- upcoming
large - 512x512 <- upcoming
xl - 1024x1024 <- upcoming, POSSIBLE if large works
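The curriculum restart planned for the tiny variant suggests staging noise families one at a time instead of all 16 at once. A tiny illustrative generator (the family names, parameters, and schedule are mine, not Johanna's actual curriculum):

```python
import numpy as np

def make_noise(kind, shape, rng):
    """Generate one family of synthetic noise for curriculum training."""
    if kind == "gaussian":
        return rng.normal(0.5, 0.15, shape)
    if kind == "uniform":
        return rng.uniform(0.0, 1.0, shape)
    if kind == "salt_pepper":
        field = np.full(shape, 0.5)
        mask = rng.random(shape)
        field[mask < 0.05] = 0.0   # pepper
        field[mask > 0.95] = 1.0   # salt
        return field
    raise ValueError(kind)

# Curriculum: introduce one family per stage rather than all at once.
rng = np.random.default_rng(0)
stages = ["gaussian", "uniform", "salt_pepper"]
batches = {k: make_noise(k, (64, 64), rng) for k in stages}
```

The point of the staging is exactly the faulted first train above: sixteen families out of the gate gives the compressor nothing stable to latch onto, while one family per stage does.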
Johanna is being trained on 12 types of noise. The MSE is dropping as expected, and the noise types are in fact being learned and represented well enough to replicate.
The text line follows the same pattern as the others.
- SVAE-Alexandria
Alexandria is meant to encode/decode text with perfect or near-perfect reconstruction.
AbstractPhil/geolip-SVAE
Epoch 1: test recon error 0.0064
Epoch 2: 0.0022
Epoch 8: 0.000294
Epoch 12: 0.000206
Epoch 14: 0.000190
Epoch 18: 0.000187
Epoch 24: 0.000117
Epoch 30 (landmark): 0.000099
There are NO EXPERTS HERE. This is pure self-learning. Within one epoch the model learns the entire behavioral set well enough to reconstruct ImageNet's test set to a useful state. By epoch 12 a recon error of roughly 0.0002 is measured. That means roughly 99.99% accuracy at RECONSTRUCTING the test set through the bottleneck, while simultaneously leaving a trail of centerwise extraction as rich or richer.
ONE epoch. Just one.
Took about 10 minutes to train to an already-converged epoch, and I set it up for 200 epochs. This model will not need 200 epochs. I'd be surprised if it needs 3.
What you're looking at here is the emergence of surge resonance: the power of a single epoch when the geometric CV alignment hits the tuning fork of absolute resonant perfection, counterpointed with the concerto's dissonant harmonic response.
I give you, surge resonance.
The metrics will be ready by morning and I'll begin building utilities to figure out what went right and what went wrong.
This model is rewarded when it stays within the geometric spectrum and doubly punished when it leaves. There is no benefit to straying, and the benefit of staying keeps the model inside the validated CV band.
This allows the model to exist perfectly within the tuning fork resonance structure.
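The reward-inside / punish-outside scheme can be sketched as a band penalty on the coefficient of variation. This is my illustrative reading of it, not the actual training loss; the band edges and the double weight are stand-in numbers:

```python
import numpy as np

def cv(x):
    """Coefficient of variation: std / |mean|."""
    return float(np.std(x) / (np.abs(np.mean(x)) + 1e-8))

def band_penalty(x, lo=0.291, hi=0.292, weight=2.0):
    """Zero inside the validated CV band; a doubly-weighted linear
    penalty outside it, so there is never a benefit to straying."""
    c = cv(x)
    if lo <= c <= hi:
        return 0.0
    return weight * min(abs(c - lo), abs(c - hi))
```

With a flat zero inside the band and a slope everywhere outside, the only stable place for the optimizer to sit is the band itself, which is the "tuning fork" behavior described above.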
The model CONTINUES to refine even after the CV has begun to drift away from home. The model has left home and is now seeking new proximity.
Upcoming training will be the 256x256, 512x512, 1024x1024, and larger if the model holds. Each will be named.
I see the answer. The behavioral sweep shows a CV of 0.29154; values between 0.291 and 0.292 sit within a very special band of variations.
1024v, 24d - the entire operating spectrum of the T5-series embeddings when alignment is differentiated by the configuration. This is effectively a threshold for what works operationally; going beyond it causes degraded behavioral response without attenuated compensation.
So I've finally managed to ask the right questions and discover the connection between the fly in the ointment that kept returning and the structural systems responsible for curating the behavior around it.
Finding: geometrically controlled structures do not require a CV loss if D is within the expected band. To compensate for the dimensional difference in the measured CV, the CV loss must be adjusted to the distillation target.
The vocabulary, once established as geometrically valid, stays valid throughout its lifecycle. The CV loss is only attuned and useful when running distillation paradigms; the current CV loss has no impact on the measured CV capacity of the embeddings, consistent or pretrained.
This effectively allows compartmentalization to any vectorized locality as accumulated throughout a structure, allowing direct
I'll make this brief and to the point.
GEOLIP is an observer system at its core. It watches, triangulates, and assists with correct answers.
Many experiments worked very well; many fell down and turned into a pile of broken circuits. The recent geometric-transformer, one of my biggest fumbles, still taught me many things about what I'm TRULY trying to accomplish here.
**Save money and lives**. Less hardware use for less need at inference. Train more calculations into a more reusable and accurate structure for near instant zero-shot or sequential inference.
In the process, v8 unlocked a missing puzzle piece: EMA trajectory alignment compensation. I'm doing my best to build something that works.
The geolip distillation system is very powerful but requires much experimentation still.
* Genetic experiments were successful
* Data transfer experiments were successful
* Analysis experiments were successful - and they expand large-model accuracy
* Many distillation experiments were successful
* The largest successes were the kernels, the distillation tools, and the geometric analysis systems
With the good comes the bad: the faulty ViTs, the simultaneous trains that fault, the internalized confusion that happens occasionally.
*** The observer NEEDS something to OBSERVE. If the observer observes the progressive development of point cloud structures, it learns how to observe THAT LEARNING PROCESS - drifting fault assessment.
*** In the process it DOES NOT learn how to improve the CE relations by embedding and compensating with anchored triangulation opinions.
BIGGEST CONCLUSION. Staged curriculum training.
These components must be DECOUPLED. One must be a compounding structural-awareness beacon; the other must be an informationally aligned composition in a utilizable fashion.
This means stage-by-stage freeze/unfreeze processing. Independent task-oriented structural alignment.
As of right now I don't know how to reduce to fp16 without a massive dip. I'm thinking it's possible to utilize integers directly instead of high-accuracy fp64 or fp32 floats. I'll do some exploration.
Reducing this to fp16 or bf16 capacity would greatly improve performance, and if the output values stay close enough despite the mantissa cross-contamination, it could be worth it for the semi-accurate speed alone.
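Before committing to the dip, a straight round-trip cast gives a quick read on what fp16 actually costs at a given weight scale. A minimal sketch; the 0.02 weight std is an assumption, not measured from these models:

```python
import numpy as np

# Measure round-trip error of a straight fp16 cast on a typical weight scale.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, 10_000).astype(np.float32)   # assumed weight std
w16 = w.astype(np.float16).astype(np.float32)

abs_err = float(np.max(np.abs(w - w16)))
# fp16 keeps ~11 significand bits, so values near 0.02 round to within ~1e-5;
# the real hazard is tiny values falling into fp16's subnormal range (<~6e-5).
```

If `abs_err` is already comparable to the model's operating MSE, the cast is a nonstarter without rescaling; if it's orders of magnitude below, the speed may be worth it.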
Per-instance allocation for max_n, max_batch (B):
WORKING STORAGE:
A_work : [B, max_n, max_n] # working copy (destroyed)
V_accum : [B, max_n, max_n] # eigenvector accumulator
householder : [max_n-2, B, max_n] # stored reflectors (padded)
d : [B, max_n] # tridiagonal diagonal
e : [B, max_n-1] # tridiagonal off-diagonal
Subtotal: ~3 × max_n² × B floats
D&C TREE (depth = ⌈log₂(max_n)⌉ levels):
FOR each level l (0 to depth-1):
num_sub = 2^l
sub_size = max_n // 2^l (padded up to power of 2)
delta : [B, num_sub, sub_size] # merged eigenvalues
z_vec : [B, num_sub, sub_size] # merge vectors
rho : [B, num_sub] # coupling strengths
mask : [B, num_sub, sub_size] # valid element mask
# Newton state (per root):
lam : [B, num_sub, sub_size] # current root estimates
lo : [B, num_sub, sub_size] # bracket lower
hi : [B, num_sub, sub_size] # bracket upper
f_val : [B, num_sub, sub_size] # secular function value
converge: [B, num_sub, sub_size] # convergence mask
# Eigenvector fragments:
V_frag : [B, num_sub, sub_size, sub_size]
Subtotal per level: ~(9 × sub_size + sub_size²) × num_sub × B
Total across levels: since num_sub × sub_size = max_n at every level,
≈ (9 × max_n + max_n²) × depth × B
≈ max_n² × depth × B (the V_frags dominate)
CONCRETE NUMBERS (fp32, 4 bytes each):
max_n=8, B=4096: ~8² × 8 × 3 × 4096 × 4 ≈ 24 MB
max_n=32, B=1024: ~32² × 5 × 3 × 1024 × 4 ≈ 60 MB
max_n=64, B=512: ~64² × 6 × 3 × 512 × 4 ≈ 144 MB
max_n=128, B=256: ~128² × 7 × 3 × 256 × 4 ≈ 352 MB
max_n=256, B=128: ~256² × 8 × 3 × 128 × 4 ≈ 768 MB
max_n=6, B=8192: ~6² × 3 × 3 × 8192 × 4 ≈ 6 MB ← your CM case
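The working-storage subtotal above is easy to sanity-check with a tiny helper. Note this covers only the ~3 × max_n² × B working buffers (A_work, V_accum, reflector stack), not the D&C tree levels:

```python
def working_storage_mb(max_n, batch, bytes_per_float=4):
    """MiB for A_work + V_accum + the reflector stack: ~3 * max_n^2 * B floats."""
    return 3 * max_n * max_n * batch * bytes_per_float / 2**20

# e.g. max_n=8, B=4096 -> 3 MiB of working storage before the tree is added
```

Anything the tree adds on top scales roughly with max_n² × depth × B, per the estimate above, so the working buffers are the floor, not the total.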
Alignment in these systems is NOT a series of opinions, nor is it some sort of structural behavior, nor is it whether the model is inherently "good" or "bad".
Alignment is specifically a geometric process that enables direct resonant oscillation, and with that resonance perfectly aligned, the substructure learns internal alignment to that behavior. The curves look like jagged broken waveform lines, yet the model comes out forged in steel.
More opinions simultaneously will yield more experimental waveform potentials. I will find the most ideal conditions for self learning and then the findings will be published in many languages, with hundreds of citations, countless experiments leading from A to B, and a massive series of optimizations required to reach this point from where I began.
A trained omega predictor will allow heavy task-refined LLM protections of the geometric lookup tables.
This will include multiple curriculum operations for finetunes such as medical processes, law practices, multilingual shared vocabulary learning, multistructural lookups for cross-tool comparison and utility, and many other useful rapid learning processes that can be directly compartmentalized, snapped on, snapped off, and so on - similar to the methodology of a lora.
Except this is... this is no LoRA. This runs far deeper, and when perfected it will train far faster, as shown by the Bertenstein, ViT x3, ViT x34, CLIP-L and CLIP-G ctx extensions, and the CaptionBert models. They converge rapidly and retain their cohesion. This system will allow those very models to stand on their own without the experts present, while simultaneously learning rapid-alignment R@1 recall capacity within the trained model itself.
They not only converged with R@1 at 100% recall capacity; multimodal variations such as Bertenstein showed you can deviate those using standard tokenization techniques with embeddings and encodings.
The mid-level experiments show:
- student models DID require teachers to CONTINUE TRAINING,
- BUT the students DID NOT require teachers to INFERENCE at full capacity.
The InfoNCE memory bank aligned through geometric distillation alignment processing allowed the students to not only stand - but stand on their own without the soups or teachers used to teach them.
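The InfoNCE-with-memory-bank objective mentioned above looks roughly like this. A NumPy sketch for illustration only: batch size, bank size, and the temperature are my stand-in choices, and the real pipeline presumably runs in PyTorch with a rolling bank.

```python
import numpy as np

def info_nce(student, teacher, bank, tau=0.07):
    """InfoNCE with a memory bank: positives are the matched teacher rows,
    negatives come from the bank. All rows are L2-normalized first."""
    def norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    s, t, m = norm(student), norm(teacher), norm(bank)
    pos = np.sum(s * t, axis=1, keepdims=True) / tau   # (B, 1)
    neg = (s @ m.T) / tau                              # (B, K)
    logits = np.concatenate([pos, neg], axis=1)
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    return float(-np.mean(np.log(p[:, 0])))            # positive sits at index 0
```

The "stand on their own" claim corresponds to the bank being needed only during training: at inference the student embeds alone, and the bank and teacher are gone.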
This CaptionBert distillation is not a toy; it has genuine pragmatic use. By the time these experiments conclude, CaptionBert and the entire chain of models trained will be able to train without experts and learn from a MASSIVE number of sources, SPECIFICALLY meant to RETAIN that data for utility without catastrophic forgetting. It will have its own transformer structure hoisting the models up hand-in-hand with current-scale transformers and models as a cooperative companion.
These are purely cooperative collectives, not competition nor adversarial trainings at their core. Adversarial destroys the very subtlety of the instruction set, so it must be cooperative.
Omega is a very touchy formula; without very specific measures protected by very specific structural boundaries, the omega structure will not predict correctly.
Omega must be computed in fp64, though the computation is minuscule compared to the full structure that sets it up. Everything must be orderly, and everything orderly must be sterile.
Most of the CONTEXT elemental systems can be represented in FP8, while the majority of the geometric side still requires at minimum FP32 due to the way eigs and SVD are calculated. Scatterpoint can reduce this, but it will have performance dips without eigs and SVD matching.
I'm currently working on an eig/eigh kernel meant to operate with a high degree of optimization for these specific use cases. It will evolve over time. Paired with the SVD kernel, it will provide massive performance boosts for the direct use case without impacting the overarching linear-algebraic structure required for full solidity.
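Until such a kernel lands, the usual workaround for the FP32 floor is to upcast only the decomposition itself and keep the rest of the pipeline low-precision. A minimal sketch of that pattern (NumPy here; the same shape applies in PyTorch):

```python
import numpy as np

def svd_mixed(a_low):
    """Run the SVD itself in fp32, return the factors in the input dtype,
    so only the numerically fragile step pays the precision cost."""
    u, s, vt = np.linalg.svd(a_low.astype(np.float32), full_matrices=False)
    cast = a_low.dtype
    return u.astype(cast), s.astype(cast), vt.astype(cast)
```

The factors round back to the low dtype on the way out, so the error budget is one cast per factor rather than an entire iterative decomposition drifting in fp16.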
The WideRouter will enable multiple core new features; the predominant two for our next experiment are as follows.
1. Directly integrated multi-opinion constellation structures. This will enable dynamic compiled expansions internally within the structure for huge performance gains.
2. Controllable stage-by-stage compilation. Each stage can be compiled or not. SVD is notoriously compiler-unfriendly due to linalg.eig, and I will be addressing that particular function DIRECTLY soon. There will be no quarter for graph breaks.
If the WideRouter causes any major bugs or breaks with your code - bad calculations, incorrectly deviated gradients, twisted or contorted dtype outputs, or any major compilation errors - please don't hesitate to open a pull request. Claude and I will promptly solve any major issues.
Once everything is perfectly in-line and the graph matches, the transformer will have massive geometric performance boosts for huge structural basins with multiple layers of depth.
I will be addressing linalg.eig and linalg.eigh directly, in conjunction with multiple argsort functions that are causing huge performance dips, as well as every single use of .item() that can appear in the compiler's path.
After this, the ensemble topological transformer will be a-go. Which will enable quaternion, FlowMagnitude, FlowAlignment, FlowVelocity, FlowVelocityQuaternion, FlowVelocityOrbital, FlowVelocityPentachoron, and multiple other flow matching systems that will improve performance by dominating amounts inline with minimal overhead cost due to the precomputed geometric structure.
The ensembles will feature multiple simultaneous batched and segmented forms of learning meant to train the oscillation omega predictor "Beatrix".
Self-distillation has shown improvement. I think most importantly I've discovered a core component that can be utilized as a geometric attention, the quaternion MHA. The constellation produces all the necessary information to allow the quaternion MHA to benefit from the information in a directly utilizable fashion.
The quaternion MHA is quite the vessel. It's bulky, has multiple MHA structures, and is shockingly effective in the process. I'll be refining this head in the coming days as a composite Procrustes alignment tool.
Geometric structure has a very high amount of informational accumulation potential, so a multi-series of MHA can capture a great amount of informational processing from those elements, if the elements are curated correctly and within the specifications.
I've taken the model's benchmarks from 50% to 86-93% Spearman utilizing a quaternion-oriented attention head.
This is getting dangerously close to 99.9% mutation detection accuracy, with a model deemed 50% accurate - all by extracting geometric features from the constellation and training the ensemble head with the correct rules.
These are Spearman result logits, and they are in fact detecting the results.
This is the power of what I'm doing. From 50% to 90% in 48 hours with a single GPU.
Training your own alignment only requires a piece of the dataset you wish to run and about 8 hours or so. Run it, fall asleep, check on it in the morning. It'll be ready. Extract features, train your head in minutes. The spearman will be nearly perfect.
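A Spearman check you can run on extracted features against labels fits in a few lines. This is the standard rank-of-rank formulation, written from scratch here for illustration; it omits the tie correction that scipy's `spearmanr` applies:

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation (no tie handling): Pearson on ranks."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float(ra @ rb / (np.linalg.norm(ra) * np.linalg.norm(rb)))
```

Feed it your head's scores and the ground-truth ordering; the "nearly perfect" claim above corresponds to this value approaching 1.0.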
I'm currently preparing what I consider to be the final head that will need to be created: the quaternion head, which will be specifically predictive based on an ensemble of four divergent-methodology heads, each specifically tasked with solving the SVD in conjunction with the features. This system should extract any little bit of differentiation that exists. The imaginary head is the most crucial; explaining it requires an entire paper of its own.
I call this imaginary head the "Cletus" head, as it has inherently lower accuracy than the others. However, without it the combination does not coalesce correctly. Without Cletus, the model does not reach full cohesion. This head is the most crucial, because it has the hardest job. It's actually the one who returned from the battlefield with the blueprint to describe everything it saw.
I expect the sheer geometric alignment alone to yield a new form of Adam tuning specific to introspective analytical alignment and with that a new format of optimizer dedicated to geometric preservation in conjunction with informational data accumulation. I also expect a new methodology for larger-buffer data movement kernel-wise, a structural boundary for SVD limitations within full spectrum, a substructure measured collapse state of SVD when projected, and multiple other models that will have hiccups and growing pains.
These tools are all building to the end-state format, which will express everything simultaneously in order to combine the necessary data from many many forms of models together, without requiring direct tooling to each model simultaneously.
Such finalized tools will include a reusable pretrained geometric patchwork that exhibits all the necessary traits of a geometric structure in its frozen state, capable of being finetuned quickly into any other state, or simply utilized as a lookup beacon with the correct geometric transformer alignment.
The geometric transformer, which is specifically a revamped format for the transformer intentionally designed with the structural preservation of the overarching structure in mind, rather than falling directly to the naturalistic entropy of immediate solution over larger-scale contribution. This system will not replace rope, it will contribute to the concept of long-concept preservation and work hand-in-hand with systems like rope, attention, and original transformers simultaneously. ROPE based models will benefit most from this structure, as they are already trained intrinsically with alignment and rotation at their cores.
The geometric transformer by design takes n inputs as variant states, and those are transformed internally. Utilizing it in its default state will yield by design, but it will require tuning and curation for specific use cases no matter the case. This is conceptually familiar to those who use transformers, and simultaneously intimidating to those who understand what I'm describing, I'd think. I myself am a little intimidated that I'm this close as-is.
There are multiple other prototypes at work all leading to the geometric transformer, which will be both an empirically superior utility to any of the utilities I currently use, and embody the very essence of the geometric structure that I'm currently working with as a full trainable data mutation operation - meant to directly attenuate the structure of the observation, to the expectation of the autograd and gradients.
Getting pretty close to a few pieces, but not there yet.
AbstractPhil/geolip-esm2_t33_650M_UR50D
This model is based on Facebook's esm2 t33 650M, assessed with specific benchmarks to be around 50% accurate or so. I'll be improving those numbers via the self-distillation spectrum. The models never see the validation data while unfrozen. The full spectrum of training tools is visible.
This is the first self-distillation observer prototype, and it works. Not as rapidly as I had hoped, but it most definitely works. The SVD was the missing piece of geometric solidity required to preserve full rotational behavioral control. The kernel made this possible for rapid iteration, and the first results are coming in.
This inherits much of the functionality from the CLIP_L and CLIP_G memory banks, while benefitting from the advanced research I performed while extracting CaptionBert 5x bert pooled captions for target points.
The primary driving point here is the sheer data size - and the important contributions of that data size to a full construct of geometric aligned data. There is a massive amount of very specific information, all curated, perfectly labeled, and organized in a way that can be... well not so easily accessed, but I did find a few ways in.
This data is highly accurate and forged through life for billions of years. This is what is there, this is what is expected, and I have the tooling - stage by stage, to not only develop a solution for the problem, but to fully contribute to an improved version with minimal hardware requirement for training.
This is real expectation and the results are pouring in hourly, this can improve models beyond a reasonable baseline while preserving the baseline's correctness.
I've spent the better part of the day refactoring the geolip-core GitHub code so it is better in line with the actual findings, and I'm currently having Claude build the models using the geofractal router system.
With that I'll enable a dummy-clause structure, so the component code files can be snipped out and work standalone as necessary. Due to the geofractal router's rigidity as a structure, geovocab's problematic multi-layer formatting for formulas that often have strange hardware quirks, and a few of the more reusable systems requiring modularity - I'll just build it like this to allow for just... USE MINE INSTEAD mentality.
You'll be able to just snip them out and use them, like many of the representations within the "models" experiment folders. They are simply standalone that may or may not snap onto pieces of the larger wholes.
geolip-core will be built specifically with the geofractal router structure in mind, inheriting its strengths, weaknesses, and hardware control - while simultaneously having a wrapper that simply says: use standard pytorch instead.
Using standard pytorch will disable much of the functionality, but the components in their standalone forms WILL WORK. The pipelines are another story.
Keep my attribution and naming in the comments please, this is a testament to a very long series of research that resulted in solutions to problems rather than trying to introduce more problems. Attribution is all I wish, and you can make your fortune from that. I believe many of the great minds of the past would agree; Nikola Tesla I believe would agree, accreditation is all I want, but not for my name - for them and my humble contribution. The greats who put the pieces together and solved the biggest problems.
As we grow, the shadows of giants are cast upon the surfaces of life and stone. Work hard, progress steadily, expand your mind, build your skills, and by the time you have any time to look down... you will be casting a shadow of your own. As we grow old and our shadows grow, others are born and see the cast shadows. Encouraging them through the same process is all I know how to do.
Both unattuned scatterpoint2d and triton-aligned SVD are a cut above the rest by a large margin.
https://github.com/kymatio/kymatio
https://huggingface.co/blog/AbstractPhil/svd-triton-kernel-optimization
AbstractPhil/svd-triton
AbstractPhil/geolip-hypersphere-experiments
Most kymatio tests were done on standard PyTorch models, which yielded higher accuracy than simple conv or transformer baselines before overfitting, though not in every instance. The most common low-count CIFAR-10 and CIFAR-100 tests yielded more for less. Those are in the hypersphere-experiments notebooks and are viewable via Hugging Face TensorBoard metrics.
The accuracy, retention, agreement, disagreement, and sheer capacity of the refined SVD kernel shows that full Procrustes alignment is not just crucial to distillation, but also entirely representable within encoders themselves as students.
This structure can re-impose itself representationally layer-by-layer, which is what I tested, and this capture system can behave as a global regularization system, a selector, a behavioral adjudication structure, an encoding solidification unit, a trajectory accumulator, an anchored differentiation unit, and - as about 30 other tests show - all of the above simultaneously.
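The "full Procrustes alignment" the refined SVD kernel enables reduces, in its orthogonal form, to a single SVD of a cross-covariance. A minimal sketch of that core step (the layer-by-layer usage around it is the part specific to this work and is not shown):

```python
import numpy as np

def procrustes_rotation(src, dst):
    """Orthogonal Procrustes: the orthogonal R minimizing ||src @ R - dst||_F,
    obtained from the SVD of the cross-covariance src.T @ dst."""
    u, _, vt = np.linalg.svd(src.T @ dst)
    return u @ vt
```

In a distillation setting, `src` would be a batch of student features and `dst` the matched teacher features; applying `R` rotates the student's frame onto the teacher's before any loss is taken.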
The preliminary rapid-iteration capable kernel shows that not only can these behaviorally represent utility, but the noise-drift can be directly accounted for using systems like GELU, drop path, dropout, and other elements to learn to ignore that very noise that accumulates.
Attention is now officially deemed valid when utilized based on the tests and examples allowing preserved geometric structure after attention selection.
This encoding structure is substantially more durable than I can give it credit for.
Surge is coming, exactly as predicted. Late I admit.
Wrote a Triton kernel that approximates SVD at around a 15000x speedup on Blackwell architecture, while standard torch.linalg.svd basically sits in a swamp of slow.
It only tackles small matrix sizes for now - 3x3, the current experiment's encoder paradigm. Standard SVD causes death by a thousand cuts on small matrices, yet smaller matrices provide much more robust access to certain elemental linkages across many spectra.
The formula isn't perfect. It's absolutely lightning quick though. The svd_triton.py file has a profiler.
https://huggingface.co/AbstractPhil/geolip-core/blob/main/svd_triton.py
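For intuition, here is one way a batched small-matrix SVD can be approximated in a single fused pass with no per-matrix loop: eigendecompose AᵀA. This NumPy sketch illustrates the general approach such a kernel can take; it is not the kernel's actual formula, and accuracy degrades on ill-conditioned inputs because the conditioning gets squared.

```python
import numpy as np

def svd3x3_batched(mats):
    """Approximate batched 3x3 SVD via eigh of A^T A.
    mats: (B, 3, 3). Returns u, s, vt with A ~= (u * s) @ vt."""
    ata = mats.transpose(0, 2, 1) @ mats
    w, v = np.linalg.eigh(ata)                    # eigenvalues ascending
    s = np.sqrt(np.clip(w[:, ::-1], 0.0, None))   # singular values, descending
    V = v[:, :, ::-1]                             # reorder columns to match
    u = (mats @ V) / np.maximum(s[:, None, :], 1e-12)
    return u, s, V.transpose(0, 2, 1)
```

Everything here is batched matmul, `eigh`, and elementwise ops, which is exactly the shape of workload that maps well onto one GPU kernel, and why tiny fixed-size matrices can beat a general `torch.linalg.svd` call by orders of magnitude.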
Incoming the geofractal router structure.
It will provide the necessary hardware and software underpinnings to create much larger structures and to curate the code with far more effective hardware control than simply leaving it up to random chance or AI.
As I expand the system I will be heavily testing to include systems like Ulysses, accelerate, and more to encompass a larger array of learning. This will be crucial to building a proper bert trainer as well.
