
Squash data files by merging repeated sublists#38

Open
ChrisJefferson wants to merge 2 commits into gap-packages:master from ChrisJefferson:squish-data

Conversation

@ChrisJefferson
Member

This is an alternative to #23, which starts by looking for repeated sublists.

The basic idea is that we first gather every distinct innermost list and output each of them once as a local variable; the main list P then just refers to those variables. For example, Endom16_2-8_5.txt becomes:

local P,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10;
A1:=[1,4,2,1,4,7,4,1,2,7,4,7,2,1,7,2];
A2:=[1,2,16,4,11,10,7,14,13,6,5,15,9,8,12,3];
A3:=[1,1,4,1,1,4,1,1,4,4,1,4,4,1,4,4];
A4:=[1,1,5,1,1,5,1,1,5,5,1,5,5,1,5,5];
A5:=[1,4,8,1,4,14,4,1,8,14,4,14,8,1,14,8];
A6:=[1,2,13,4,11,15,7,14,16,12,5,10,3,8,6,9];
A7:=[1,7,6,4,11,3,2,8,12,16,5,9,15,14,13,10];
A8:=[1,1,14,1,4,14,1,4,14,8,4,14,8,4,8,8];
A9:=[1,4,11,1,1,5,4,4,11,11,1,5,5,4,11,5];
A10:=[1,1,7,1,4,7,1,4,7,2,4,7,2,4,2,2];
P:=[
[A1,A2,A3,A4,A5,A6,A7],
[A1,A2,A3,A4,A8,A6,A7],
[A1,A2,A3,A9,A5,A6,A7],
[A1,A2,A3,A9,A8,A6,A7],
[A10,A2,A3,A4,A5,A6,A7],
[A10,A2,A3,A9,A5,A6,A7]
];
return P;

For smaller instances gzip already does a fairly good job of detecting these repeats, but for larger ones the squashing helps a lot -- it brings Endom down to 8.3 MB.

I made these using a little GAP function, squish.g (in the attached zip file). It takes an input file name and an output file name and writes a 'squished' output file. As part of the function I read the output back and check that the value of P is unchanged, but we should of course double-check this carefully before merging.

squish.zip
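
Roughly, the squishing step looks like this. This is a sketch of my understanding, not the attached squish.g (SquishFile is a made-up name, and the real script's variable ordering, formatting, and verification step may differ); it assumes the input file returns a list of lists of innermost lists, as in the example above:

SquishFile := function(infile, outfile)
  local P, rows, names, out, i;
  P := ReadAsFunction(infile)();          # original value of P
  rows := Set(Concatenation(P));          # every distinct innermost list
  names := List([1 .. Length(rows)], i -> Concatenation("A", String(i)));
  out := OutputTextFile(outfile, false);
  SetPrintFormattingStatus(out, false);
  # declare the shared lists as locals, then define each one once
  AppendTo(out, "local P,", JoinStringsWithSeparator(names, ","), ";\n");
  for i in [1 .. Length(rows)] do
    AppendTo(out, names[i], ":=", rows[i], ";\n");
  od;
  # write P as tuples of references to the shared lists
  AppendTo(out, "P:=[\n");
  AppendTo(out, JoinStringsWithSeparator(List(P, t ->
    Concatenation("[", JoinStringsWithSeparator(
      List(t, r -> names[Position(rows, r)]), ","), "]")), ",\n"));
  AppendTo(out, "\n];\nreturn P;\n");
  CloseStream(out);
end;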

@codecov

codecov Bot commented Apr 30, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 97.81%. Comparing base (c6a04d2) to head (56a69a6).
⚠️ Report is 3 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master      #38      +/-   ##
==========================================
- Coverage   99.85%   97.81%   -2.05%     
==========================================
  Files           5        5              
  Lines         709      733      +24     
==========================================
+ Hits          708      717       +9     
- Misses          1       16      +15     

see 2 files with indirect coverage changes


@fingolfin
Member

Nice! With this, Endom is just 92 MB for me without compression (instead of over 2 GB), which means it could be stored uncompressed in the repo and only compressed for releases.

I think further savings are possible. The largest file is Endom/32/Endom32_37-16_11.txt. The content is highly structured. For example, lines 190-16573 start with A22; they later repeat, just with the first entry changed to A61.

There are more patterns of this kind. It seems plausible to me that, by exploiting this, the file could be compressed quite a lot more (and likewise several of the other largest files).

@ChrisJefferson
Member Author

I had a quick look at that; I could push it harder, but some quick attempts ended up bigger when gzipped.

This helps the data sets linked on the front page even more: for example, Endom128 goes from 520 MB to 12 MB, and Endom243 from 251 MB to 2.1 MB (I'm currently running it on all of them).

@ChrisJefferson
Member Author

ChrisJefferson commented Apr 30, 2026

The script should work on everything, but I can't run the biggest Endom32 files: they are bigger than 3 GB uncompressed and I'm currently on a 16 GB laptop, so I can't even load the file into GAP, never mind do anything with it :)

@ChrisJefferson
Member Author

These figures are with all files still gzipped:

  ┌──────────┬───────┬──────────┬──────────┬───────┐
  │ Database │ Files │ Original │ Squished │ Ratio │
  ├──────────┼───────┼──────────┼──────────┼───────┤
  │ Endom128 │ 332   │ 520 MB   │ 12 MB    │ 2.2%  │
  ├──────────┼───────┼──────────┼──────────┼───────┤
  │ Endom243 │ 24    │ 251 MB   │ 2.1 MB   │ 0.8%  │
  ├──────────┼───────┼──────────┼──────────┼───────┤
  │ Endom32  │ 8     │ 228 MB   │ 33 MB    │ 14.4% │
  ├──────────┼───────┼──────────┼──────────┼───────┤
  │ Endom625 │ 47    │ 149 MB   │ 1.2 MB   │ 0.8%  │
  ├──────────┼───────┼──────────┼──────────┼───────┤
  │ Endom64  │ 10    │ 29 MB    │ 2.0 MB   │ 6.9%  │
  ├──────────┼───────┼──────────┼──────────┼───────┤
  │ Total    │ 421   │ 1.2 GB   │ 50 MB    │ 4.17% │
  └──────────┴───────┴──────────┴──────────┴───────┘

@ChrisJefferson
Member Author

ChrisJefferson commented Apr 30, 2026

Based on @fingolfin's suggestion, I added some simple run-length encoding; an illustration of the format follows the table:

  ┌──────────┬───────┬──────────┬──────────┬───────┬─────────┬───────┐
  │ Database │ Files │ Original │ Squished │ Ratio │  Delta  │ Ratio │
  ├──────────┼───────┼──────────┼──────────┼───────┼─────────┼───────┤
  │ Endom128 │ 332   │ 520 MB   │ 12 MB    │ 2.2%  │ 2.8 MB  │ 0.54% │
  ├──────────┼───────┼──────────┼──────────┼───────┼─────────┼───────┤
  │ Endom243 │ 24    │ 251 MB   │ 2.1 MB   │ 0.8%  │ 386 KB  │ 0.15% │
  ├──────────┼───────┼──────────┼──────────┼───────┼─────────┼───────┤
  │ Endom32  │ 8     │ 228 MB   │ 33 MB    │ 14.4% │ 2.5 MB  │ 1.10% │
  ├──────────┼───────┼──────────┼──────────┼───────┼─────────┼───────┤
  │ Endom625 │ 47    │ 149 MB   │ 1.2 MB   │ 0.8%  │ 798 KB  │ 0.52% │
  ├──────────┼───────┼──────────┼──────────┼───────┼─────────┼───────┤
  │ Endom64  │ 10    │ 29 MB    │ 2.0 MB   │ 6.9%  │ 916 KB  │ 3.08% │
  ├──────────┼───────┼──────────┼──────────┼───────┼─────────┼───────┤
  │ Total    │ 421   │ 1.2 GB   │ 50 MB    │ 4.17% │ 7.3 MB  │ 0.62% │
  └──────────┴───────┴──────────┴──────────┴───────┴─────────┴───────┘
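
For illustration, here is roughly what the delta encoding of the first three rows of the earlier example might look like (my guess at the format, not copied from a real file): a row is either given in full or as position/value pairs relative to the previous row, and a small helper _R in each file (quoted further down in this conversation) expands it back.

P:=_R([
[A1,A2,A3,A4,A5,A6,A7],   # a full row, given literally
[5,A8],                   # previous row, but position 5 becomes A8
[4,A9,5,A5]               # previous row, with positions 4 and 5 changed
]);
return P;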

@ChrisJefferson
Member Author

Now 2.1 MB (compressed). I added a function ProduceSHAs, which reads a directory of files loadable with ReadAsFunction and calls HexSHA256 on each of their outputs; I used this to check I'm generating exactly the same files. (The file squish.g should probably be stored somewhere, like this repo, but I'm not sure it's worth making visible to users; probably not.)

squish.g.gz
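
The check described above presumably works along these lines; this is a sketch under my assumptions (the real ProduceSHAs in squish.g may differ, e.g. in how the value is turned into a string before hashing):

ProduceSHAs := function(dirname)
  local dir, shas, name;
  dir := Directory(dirname);
  shas := rec();
  for name in SortedList(DirectoryContents(dirname)) do
    if not name in [ ".", ".." ] then
      # load the file, evaluate it, and hash a printed form of its value
      shas.(name) := HexSHA256(String(ReadAsFunction(Filename(dir, name))()));
    fi;
  od;
  return shas;
end;

Running this on the original directory and on the squished one, and comparing the two records, should give identical hashes for every file.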

@fingolfin
Member

Awesome!

@fingolfin
Member

This also means that this package could now just ship all the data files, and it would still be smaller than version 1.0.4.

(The result should perhaps then be 1.1.0 and not 1.0.5 ...)

@olexandr-konovalov
Member

This is impressive! I wonder if one could make A a single list, so that P contains positions in A instead of the variables A1, A2, etc. You'd still have to write n pairs of [...] if A has length n, but if P is sufficiently long you save by not needing to write "A" each time. But then instead of return P you would have

return List(P, t -> List(t, i -> A[i]));
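
For concreteness, a tiny hypothetical file in that scheme (values made up purely for illustration):

local A,P;
A := [ [1,4,2,1], [1,2,16,4], [1,1,4,1] ];   # all distinct innermost lists
P := [ [1,2,3], [1,3,2] ];                   # tuples of positions into A
return List(P, t -> List(t, i -> A[i]));     # expand back into the nested lists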

Also, I opened two files for different orders, and both have the same line

_R:=function(rows) local r,p,e,i; r:=[]; p:=[]; for e in rows do if Length(e)>0 and IsInt(e[1]) then p:=ShallowCopy(p); for i in [1,3..Length(e)-1] do p[e[i]]:=e[i+1]; od; else p:=e; fi; Add(r,p); od; return r; end;
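
For readability, here is the same function laid out with whitespace and comments (my annotation of what it appears to do; in the data files it stays on a single line): each entry of rows is either a full row, or a flat list of position/value pairs describing how the row differs from the previous one.

_R := function(rows)
  local r, p, e, i;
  r := [];    # decoded rows
  p := [];    # most recently decoded row
  for e in rows do
    if Length(e) > 0 and IsInt(e[1]) then
      # delta entry: copy the previous row, then overwrite the listed positions
      p := ShallowCopy(p);
      for i in [1, 3 .. Length(e) - 1] do
        p[e[i]] := e[i + 1];
      od;
    else
      # a full row, given literally
      p := e;
    fi;
    Add(r, p);
  od;
  return r;
end;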

Is there a way to eliminate this duplication (I guess there is more)?

@ChrisJefferson
Member Author

Just as an update: I'm trying this, and a couple more things. I managed to hit a limit of GAP's parser, which is impressive. I did wonder if I should convert these to JSON instead, but let's keep the same general file format for now.

@ChrisJefferson
Member Author

Endom is now 12 MB and includes all files. I want to review a few of these: I made an issue noting that a couple of them are broken. Also, I can't load the very largest ones as I don't have enough memory, so while I'm fairly sure they are right, I can't check, and I'd like someone to verify they hash the same as the original files.

@olexandr-konovalov
Member

@ChrisJefferson many thanks - I have noticed the following in the file:

_R:=function(d) local r,p,i,k,j; r:=[]; p:=[]; i:=1; while i<=Length(d) do k:=d[i]; if k<0 then p:=List([1..-k],j->A[d[i+j]]); i:=i+1-k; else p:=ShallowCopy(p); for j in [1..k] do p[d[i+2*j-1]]:=A[d[i+2*j]]; od; i:=i+1+2*k; fi; Add(r,p); od; return r; end;

Is this snippet the same in each file? Could it be added to the package then?
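
For reference, my reading of this version, laid out with comments (it stays on one line in the data files): d is a flat list of integers and A is the list of shared rows defined earlier in the same file.

_R := function(d)
  local r, p, i, k, j;
  r := [];    # decoded rows
  p := [];    # most recently decoded row
  i := 1;
  while i <= Length(d) do
    k := d[i];
    if k < 0 then
      # a full row: the next -k entries are positions in A
      p := List([1 .. -k], j -> A[d[i + j]]);
      i := i + 1 - k;
    else
      # a delta: k (position, A-index) pairs overwrite entries of a copy
      # of the previous row
      p := ShallowCopy(p);
      for j in [1 .. k] do
        p[d[i + 2*j - 1]] := A[d[i + 2*j]];
      od;
      i := i + 1 + 2*k;
    fi;
    Add(r, p);
  od;
  return r;
end;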

@fingolfin
Member

Please don't add a function called _R to your package. I like that it is currently isolated to the data files. I also note that it is just 256 bytes (with some potential to make it shorter by removing several of the spaces); that also means the saving potential is not great. I also like that right now the data file "compression" is self-contained and standalone, so the compact files could even be used with an older LocalNR version. This would be lost if you start moving code like that into the package.

So if it were up to me, I'd leave this snippet in.

But of course in the end: you do what you deem best :-)

@olexandr-konovalov
Member

Thanks @fingolfin! Clearly, if it were added, it would have some other name and not _R, but the interoperability is the key selling point for me here - so be it.
