Squash data files by merging repeated sublists #38
ChrisJefferson wants to merge 2 commits into gap-packages:master
Conversation
Codecov Report ✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
##           master      #38      +/-   ##
==========================================
- Coverage   99.85%   97.81%   -2.05%
==========================================
  Files           5        5
  Lines         709      733      +24
==========================================
+ Hits          708      717       +9
- Misses          1       16      +15
Nice! With this, I think further savings are possible. The largest file is [...]. There are more patterns of this kind. It seems plausible to me that, using this, the file could be compressed quite a lot more (and likewise several of the other largest files).
I had a quick look at that. I could push it harder, but some quick attempts ended up bigger when gzipped. This helps the data sets linked on the front page even more: for example, Endom128 goes from 520MB to 12MB and Endom243 goes from 251MB to 2.1MB (I'm currently running it on all of them).
The script should work on everything, but I can't run the biggest Endom32 files, as they are bigger than 3GB uncompressed and I'm currently on a 16GB laptop: I can't even load the file into GAP, never mind do anything with it :)
These are all the files, still gzipped:
Based on @fingolfin's comment, I added some simple run-length encoding:
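A hedged sketch of what such a run-length encoding could look like in GAP; the [ value, count ] pair representation and the names RunLengthEncode / RunLengthExpand are assumptions for illustration, not the encoding actually used in this PR.

```gap
# Illustrative only: encode maximal runs of equal entries as [ value, count ]
# pairs, and expand them back again.
RunLengthEncode := function(list)
  local result, i, count;
  result := [];
  i := 1;
  while i <= Length(list) do
    count := 1;
    while i + count <= Length(list) and list[i + count] = list[i] do
      count := count + 1;
    od;
    Add(result, [ list[i], count ]);
    i := i + count;
  od;
  return result;
end;

RunLengthExpand := function(encoded)
  local result, pair;
  result := [];
  for pair in encoded do
    Append(result, ListWithIdenticalEntries(pair[2], pair[1]));
  od;
  return result;
end;

# Round trip on a toy example:
#   RunLengthEncode([ 0, 0, 0, 1, 1, 2 ]);   # [ [ 0, 3 ], [ 1, 2 ], [ 2, 1 ] ]
#   RunLengthExpand([ [ 0, 3 ], [ 1, 2 ], [ 2, 1 ] ]);   # [ 0, 0, 0, 1, 1, 2 ]
```

Long runs of a repeated entry then collapse to a single pair, which is where the extra savings on top of the sublist sharing would come from.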
Force-pushed from 07476fa to a57629f
Now 2.1MB (compressed). I added a function [...]
Awesome!
This also means that this package could now just ship all the data files, and it would still be smaller than version 1.0.4. (The result should perhaps then be 1.1.0 and not 1.0.5 ...)
This is impressive! I wonder if one could make [...]. Also, I opened two files for different orders, and both have the same line [...]. Is there a way to eliminate this duplication (I guess there is more)?
Just as an update: I'm trying this, and a couple more things. I managed to hit a limit of GAP's parser, which is impressive. I did wonder if I should convert these to JSON instead, but let's keep the same general file format for now.
Endom is now 12MB and includes all files. I want to review a few of these (I made an issue which noted a couple of them are broken -- also, I can't load the very largest ones as I don't have enough memory, so I'm fairly sure they are right but I can't check, and I'd like someone to check that they are the same as the original files when hashed).
@ChrisJefferson many thanks - I have noticed the following in the file [...]: is this snippet the same in each file? Could it be added to the package then?
Please don't add a function called [...]. So if it were up to me, I'd leave this snippet in. But of course, in the end, you do what you deem best :-)
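(An illustrative aside, not the snippet under discussion, which isn't reproduced in this thread: a data file can stay self-contained by defining its own small expansion helper, assuming the file is loaded with ReadAsFunction; all names and data below are made up.)

```gap
# Hypothetical self-contained data file: the expansion helper lives in the
# file itself, so the file remains readable even if the package code changes.
local expand, s1;
expand := function(runs)
  local out, r;
  out := [];
  for r in runs do
    Append(out, ListWithIdenticalEntries(r[2], r[1]));
  od;
  return out;
end;
s1 := expand([ [ 0, 5 ], [ 1, 3 ] ]);   # [ 0, 0, 0, 0, 0, 1, 1, 1 ]
return [ s1, s1 ];
```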
Thanks @fingolfin! Clearly, if added, it would have some other name and not [...]
This is an alternative to #23; this one starts by looking for repeated sublists.
The basic idea is that we first make a list of every list of length one, then output them. For example, Endom16_2-8_5.txt becomes: [...]

Now, for smaller instances gzip does a fairly good job of detecting these repeats, but for larger ones it's very useful -- this squashes Endom down to 8.3M.
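To make the idea concrete, here is a toy before/after sketch (the real contents of Endom16_2-8_5.txt are not reproduced above, so the shapes, the variable names, and the use of ReadAsFunction below are assumptions):

```gap
# Toy before/after (not the real file contents).
#
# Before squashing, every occurrence of a repeated sublist is spelled out:
#   return [ [ 1, 2, 4, 8 ], [ 3, 9, 27 ], [ 1, 2, 4, 8 ], [ 1, 2, 4, 8 ] ];
#
# After squashing, the repeated sublist is bound to a name once and referenced:
local s1;
s1 := [ 1, 2, 4, 8 ];
return [ s1, [ 3, 9, 27 ], s1, s1 ];
```

Reading either version back, e.g. with P := ReadAsFunction("file.g")();, yields the same value. This is also why the manual squashing matters most for the big files: gzip only finds repeats within its (roughly 32 KB) DEFLATE window, so distant repeats in large files are invisible to it until they are factored out like this.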
I made these using a little GAP function, squish.g (in the attached zip file); it takes an input file name and an output file name, and makes a 'squished' output file. As part of the function I read the file and check that the value of P is the same, but we should of course double-check this carefully before merging.
squish.zip
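The attached squish.zip is not reproduced here, so the following is only a rough sketch of what a squish-style converter could look like. SquishFile is a made-up name, and the sketch assumes the data file ends with return <data>; (so ReadAsFunction can load it), whereas the real files may instead bind a variable such as P. The check at the end mirrors the "value of P is the same" verification mentioned above.

```gap
# Rough sketch only, not the attached squish.g.  Reads a data file, writes a
# "squished" copy in which repeated sublists are shared, then verifies that
# the new file evaluates to the same data.
SquishFile := function(infile, outfile)
  local P, shared, names, i, pos;
  P := ReadAsFunction(infile)();

  # Sublists occurring more than once are written out exactly once.
  shared := List(Filtered(Collected(P), c -> c[2] > 1), c -> c[1]);
  names  := List([1 .. Length(shared)], i -> Concatenation("s", String(i)));

  PrintTo(outfile, "");                       # truncate the output file
  if Length(shared) > 0 then
    AppendTo(outfile, "local ",
             JoinStringsWithSeparator(names, ", "), ";\n");
  fi;
  for i in [1 .. Length(shared)] do
    AppendTo(outfile, names[i], " := ", shared[i], ";\n");
  od;

  # Emit the data, replacing each shared sublist by its name.
  AppendTo(outfile, "return [\n");
  for i in [1 .. Length(P)] do
    pos := Position(shared, P[i]);
    if pos <> fail then
      AppendTo(outfile, names[pos]);
    else
      AppendTo(outfile, P[i]);
    fi;
    if i < Length(P) then
      AppendTo(outfile, ",\n");
    else
      AppendTo(outfile, "\n");
    fi;
  od;
  AppendTo(outfile, "];\n");

  # Verify the squished file reproduces the original data.
  if ReadAsFunction(outfile)() <> P then
    Error("squished file does not reproduce the original data");
  fi;
end;
```

This is quadratic in the worst case because of the Position lookup; the real script may be cleverer, but the overall shape (read, share, write, verify) matches the description above.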