
Squash data files by merging repeated sublists#38

Open
ChrisJefferson wants to merge 2 commits into gap-packages:master from ChrisJefferson:squish-data

Conversation

@ChrisJefferson
Member

This is an alternative to #23, which starts by looking for repeated sublists.

The basic idea is that we first gather every distinct innermost list and output each of them once as a local variable; the main list P then just refers to those variables. For example, Endom16_2-8_5.txt becomes:

local P,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10;
A1:=[1,4,2,1,4,7,4,1,2,7,4,7,2,1,7,2];
A2:=[1,2,16,4,11,10,7,14,13,6,5,15,9,8,12,3];
A3:=[1,1,4,1,1,4,1,1,4,4,1,4,4,1,4,4];
A4:=[1,1,5,1,1,5,1,1,5,5,1,5,5,1,5,5];
A5:=[1,4,8,1,4,14,4,1,8,14,4,14,8,1,14,8];
A6:=[1,2,13,4,11,15,7,14,16,12,5,10,3,8,6,9];
A7:=[1,7,6,4,11,3,2,8,12,16,5,9,15,14,13,10];
A8:=[1,1,14,1,4,14,1,4,14,8,4,14,8,4,8,8];
A9:=[1,4,11,1,1,5,4,4,11,11,1,5,5,4,11,5];
A10:=[1,1,7,1,4,7,1,4,7,2,4,7,2,4,2,2];
P:=[
[A1,A2,A3,A4,A5,A6,A7],
[A1,A2,A3,A4,A8,A6,A7],
[A1,A2,A3,A9,A5,A6,A7],
[A1,A2,A3,A9,A8,A6,A7],
[A10,A2,A3,A4,A5,A6,A7],
[A10,A2,A3,A9,A5,A6,A7]
];
return P;

For smaller instances gzip already does a fairly good job of detecting these repeats, but for larger ones the squashing helps a lot -- it brings Endom down to 8.3 MB.

I made these using a little GAP function, squish.g (in the attached zip file). It takes an input file name and an output file name and writes a 'squished' output file. As part of the function I read the output back and check that the value of P is unchanged, but we should of course double-check this carefully before merging.

squish.zip
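
Roughly, the squishing step looks like this. This is a sketch of my understanding, not the attached squish.g (SquishFile is a made-up name, and the real script's variable ordering, formatting, and verification step may differ); it assumes the input file returns a list of lists of innermost lists, as in the example above:

SquishFile := function(infile, outfile)
  local P, rows, names, out, i;
  P := ReadAsFunction(infile)();          # original value of P
  rows := Set(Concatenation(P));          # every distinct innermost list
  names := List([1 .. Length(rows)], i -> Concatenation("A", String(i)));
  out := OutputTextFile(outfile, false);
  SetPrintFormattingStatus(out, false);
  # declare the shared lists as locals, then define each one once
  AppendTo(out, "local P,", JoinStringsWithSeparator(names, ","), ";\n");
  for i in [1 .. Length(rows)] do
    AppendTo(out, names[i], ":=", rows[i], ";\n");
  od;
  # write P as tuples of references to the shared lists
  AppendTo(out, "P:=[\n");
  AppendTo(out, JoinStringsWithSeparator(List(P, t ->
    Concatenation("[", JoinStringsWithSeparator(
      List(t, r -> names[Position(rows, r)]), ","), "]")), ",\n"));
  AppendTo(out, "\n];\nreturn P;\n");
  CloseStream(out);
end;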

@codecov

codecov Bot commented Apr 30, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 97.81%. Comparing base (c6a04d2) to head (56a69a6).
⚠️ Report is 3 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master      #38      +/-   ##
==========================================
- Coverage   99.85%   97.81%   -2.05%     
==========================================
  Files           5        5              
  Lines         709      733      +24     
==========================================
+ Hits          708      717       +9     
- Misses          1       16      +15     

see 2 files with indirect coverage changes


@fingolfin
Member

Nice! With this, Endom is just 92 MB for me without compression (instead of over 2 GB), which means it could be stored uncompressed in the repo and only compressed for releases.

I think further savings are possible. The largest file is Endom/32/Endom32_37-16_11.txt. The content is highly structured. For example, lines 190-16573 start with A22; they later repeat, just with the first entry changed to A61.

There are more patterns of this kind. It seems plausible to me that, by exploiting this, the file could be compressed quite a lot more (and likewise several of the other largest files).

@ChrisJefferson
Member Author

I had a quick look at that; I could push it harder, but some quick attempts ended up bigger when gzipped.

This helps the data sets linked on the front page even more: for example, Endom128 goes from 520 MB to 12 MB, and Endom243 from 251 MB to 2.1 MB (I'm currently running it on all of them).

@ChrisJefferson
Member Author

ChrisJefferson commented Apr 30, 2026

The script should work on everything, but I can't run the biggest Endom32 files: they are bigger than 3 GB uncompressed and I'm currently on a 16 GB laptop, so I can't even load the file into GAP, never mind do anything with it :)

@ChrisJefferson
Member Author

These figures are with all files still gzipped:

  ┌──────────┬───────┬──────────┬──────────┬───────┐
  │ Database │ Files │ Original │ Squished │ Ratio │
  ├──────────┼───────┼──────────┼──────────┼───────┤
  │ Endom128 │ 332   │ 520 MB   │ 12 MB    │ 2.2%  │
  ├──────────┼───────┼──────────┼──────────┼───────┤
  │ Endom243 │ 24    │ 251 MB   │ 2.1 MB   │ 0.8%  │
  ├──────────┼───────┼──────────┼──────────┼───────┤
  │ Endom32  │ 8     │ 228 MB   │ 33 MB    │ 14.4% │
  ├──────────┼───────┼──────────┼──────────┼───────┤
  │ Endom625 │ 47    │ 149 MB   │ 1.2 MB   │ 0.8%  │
  ├──────────┼───────┼──────────┼──────────┼───────┤
  │ Endom64  │ 10    │ 29 MB    │ 2.0 MB   │ 6.9%  │
  ├──────────┼───────┼──────────┼──────────┼───────┤
  │ Total    │ 421   │ 1.2 GB   │ 50 MB    │ 4.17% │
  └──────────┴───────┴──────────┴──────────┴───────┘

@ChrisJefferson
Member Author

ChrisJefferson commented Apr 30, 2026

Based on @fingolfin's suggestion, I added some simple run-length encoding; an illustration of the format follows the table:

  ┌──────────┬───────┬──────────┬──────────┬───────┬─────────┬───────┐
  │ Database │ Files │ Original │ Squished │ Ratio │  Delta  │ Ratio │
  ├──────────┼───────┼──────────┼──────────┼───────┼─────────┼───────┤
  │ Endom128 │ 332   │ 520 MB   │ 12 MB    │ 2.2%  │ 2.8 MB  │ 0.54% │
  ├──────────┼───────┼──────────┼──────────┼───────┼─────────┼───────┤
  │ Endom243 │ 24    │ 251 MB   │ 2.1 MB   │ 0.8%  │ 386 KB  │ 0.15% │
  ├──────────┼───────┼──────────┼──────────┼───────┼─────────┼───────┤
  │ Endom32  │ 8     │ 228 MB   │ 33 MB    │ 14.4% │ 2.5 MB  │ 1.10% │
  ├──────────┼───────┼──────────┼──────────┼───────┼─────────┼───────┤
  │ Endom625 │ 47    │ 149 MB   │ 1.2 MB   │ 0.8%  │ 798 KB  │ 0.52% │
  ├──────────┼───────┼──────────┼──────────┼───────┼─────────┼───────┤
  │ Endom64  │ 10    │ 29 MB    │ 2.0 MB   │ 6.9%  │ 916 KB  │ 3.08% │
  ├──────────┼───────┼──────────┼──────────┼───────┼─────────┼───────┤
  │ Total    │ 421   │ 1.2 GB   │ 50 MB    │ 4.17% │ 7.3 MB  │ 0.62% │
  └──────────┴───────┴──────────┴──────────┴───────┴─────────┴───────┘
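
For illustration, here is roughly what the delta encoding of the first three rows of the earlier example might look like (my guess at the format, not copied from a real file): a row is either given in full or as position/value pairs relative to the previous row, and a small helper _R in each file (quoted further down in this conversation) expands it back.

P:=_R([
[A1,A2,A3,A4,A5,A6,A7],   # a full row, given literally
[5,A8],                   # previous row, but position 5 becomes A8
[4,A9,5,A5]               # previous row, with positions 4 and 5 changed
]);
return P;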

@ChrisJefferson
Member Author

Now 2.1 MB (compressed). I added a function ProduceSHAs, which reads a directory of files loadable with ReadAsFunction and calls HexSHA256 on each of their outputs; I used this to check I'm generating exactly the same files. (The file squish.g should probably be stored somewhere, like this repo, but I'm not sure it's worth making visible to users; probably not.)

squish.g.gz
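
The check described above presumably works along these lines; this is a sketch under my assumptions (the real ProduceSHAs in squish.g may differ, e.g. in how the value is turned into a string before hashing):

ProduceSHAs := function(dirname)
  local dir, shas, name;
  dir := Directory(dirname);
  shas := rec();
  for name in SortedList(DirectoryContents(dirname)) do
    if not name in [ ".", ".." ] then
      # load the file, evaluate it, and hash a printed form of its value
      shas.(name) := HexSHA256(String(ReadAsFunction(Filename(dir, name))()));
    fi;
  od;
  return shas;
end;

Running this on the original directory and on the squished one, and comparing the two records, should give identical hashes for every file.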

@fingolfin
Member

Awesome!

@fingolfin
Member

This also means that this package could now just ship all the data files, and it would still be smaller than version 1.0.4.

(The result should perhaps then be 1.1.0 and not 1.0.5 ...)

@olexandr-konovalov
Member

This is impressive! I wonder if one could make A a single list, so that P contains positions in A instead of the variables A1, A2, etc. You'd still have to write n pairs of [...] if A has length n, but if P is sufficiently long you save by not needing to write "A" each time. But then instead of return P you would have

return List(P, t -> List(t, i -> A[i]));
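
For concreteness, a tiny hypothetical file in that scheme (values made up purely for illustration):

local A,P;
A := [ [1,4,2,1], [1,2,16,4], [1,1,4,1] ];   # all distinct innermost lists
P := [ [1,2,3], [1,3,2] ];                   # tuples of positions into A
return List(P, t -> List(t, i -> A[i]));     # expand back into the nested lists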

Also, I opened two files for different orders, and both have the same line

_R:=function(rows) local r,p,e,i; r:=[]; p:=[]; for e in rows do if Length(e)>0 and IsInt(e[1]) then p:=ShallowCopy(p); for i in [1,3..Length(e)-1] do p[e[i]]:=e[i+1]; od; else p:=e; fi; Add(r,p); od; return r; end;
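
For readability, here is the same function laid out with whitespace and comments (my annotation of what it appears to do; in the data files it stays on a single line): each entry of rows is either a full row, or a flat list of position/value pairs describing how the row differs from the previous one.

_R := function(rows)
  local r, p, e, i;
  r := [];    # decoded rows
  p := [];    # most recently decoded row
  for e in rows do
    if Length(e) > 0 and IsInt(e[1]) then
      # delta entry: copy the previous row, then overwrite the listed positions
      p := ShallowCopy(p);
      for i in [1, 3 .. Length(e) - 1] do
        p[e[i]] := e[i + 1];
      od;
    else
      # a full row, given literally
      p := e;
    fi;
    Add(r, p);
  od;
  return r;
end;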

Is there a way to eliminate this duplication (I guess there is more)?

@ChrisJefferson
Member Author

Just as an update: I'm trying this, and a couple more things. I managed to hit a limit of GAP's parser, which is impressive. I did wonder if I should convert these to JSON instead, but let's keep the same general file format for now.

@ChrisJefferson
Member Author

Endom is now 12 MB and includes all files. I want to review a few of these: I made an issue noting that a couple of them are broken. Also, I can't load the very largest ones as I don't have enough memory, so while I'm fairly sure they are right, I can't check, and I'd like someone to verify they hash the same as the original files.

@olexandr-konovalov
Member

@ChrisJefferson many thanks - I have noticed the following in the file:

_R:=function(d) local r,p,i,k,j; r:=[]; p:=[]; i:=1; while i<=Length(d) do k:=d[i]; if k<0 then p:=List([1..-k],j->A[d[i+j]]); i:=i+1-k; else p:=ShallowCopy(p); for j in [1..k] do p[d[i+2*j-1]]:=A[d[i+2*j]]; od; i:=i+1+2*k; fi; Add(r,p); od; return r; end;

Is this snippet the same in each file? Could it be added to the package then?
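
For reference, my reading of this version, laid out with comments (it stays on one line in the data files): d is a flat list of integers and A is the list of shared rows defined earlier in the same file.

_R := function(d)
  local r, p, i, k, j;
  r := [];    # decoded rows
  p := [];    # most recently decoded row
  i := 1;
  while i <= Length(d) do
    k := d[i];
    if k < 0 then
      # a full row: the next -k entries are positions in A
      p := List([1 .. -k], j -> A[d[i + j]]);
      i := i + 1 - k;
    else
      # a delta: k (position, A-index) pairs overwrite entries of a copy
      # of the previous row
      p := ShallowCopy(p);
      for j in [1 .. k] do
        p[d[i + 2*j - 1]] := A[d[i + 2*j]];
      od;
      i := i + 1 + 2*k;
    fi;
    Add(r, p);
  od;
  return r;
end;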

@fingolfin
Member

Please don't add a function called _R to your package. I like that it is currently isolated to the data files. I also note that it is just 256 bytes (with some potential to make it shorter by removing several of the spaces); that also means the saving potential is not great. I also like that right now the data file "compression" is self-contained and standalone, so the compact files could even be used with an older LocalNR version. This would be lost if you start moving code like that into the package.

So if it were up to me, I'd leave this snippet in.

But of course in the end: you do what you deem best :-)

@olexandr-konovalov
Member

Thanks @fingolfin! Clearly, if it were added, it would have some other name and not _R, but the interoperability is the key selling point for me here - so be it.
