Problem 2241

Summary: Support for zlib compressed RealSurface data
Product: Geant4 Reporter: Wouter Deconinck <wdconinc>
Component: processes/optical Assignee: Daren Sawkey <daren.sawkey>
Status: RESOLVED FIXED    
Severity: enhancement    
Priority: P4    
Version: 10.6   
Hardware: All   
OS: All   
Attachments: Patch with G4OpticalSurface::ReadLUTFileToStream

Description Wouter Deconinck 2020-04-30 16:22:37 CEST
Created attachment 619
Patch with G4OpticalSurface::ReadLUTFileToStream

The largest uncompressed data set in geant4 is currently RealSurface2.1.1 at 783 MB. It can be compressed to 128 MB using the zlib support that is already included (and used for G4NDL4.6). This is a relevant saving in disk space when distributing containers which include geant4 data.

The attached patch takes the approach from G4NDL in G4ParticleHPManager::GetDataStream and applies it to G4OpticalSurface. The new function G4OpticalSurface::ReadLUTFileToStream reads the specified filename to an istringstream. It attempts to read filename.z first, falls back to filename, and returns a bad istringstream if both fail. The difference in initialization time during reading is minimal and I was unable to assess it in OpNovice (which uses one of the LUTs).
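
The lookup order described above can be sketched as follows. This is only an illustration in Python, not the attached C++ patch: the function name `read_lut_to_stream` is hypothetical, `None` stands in for the "bad istringstream" returned when both files are missing, and the data is assumed to be plain ASCII in zlib format.

```python
import io
import zlib
from pathlib import Path


def read_lut_to_stream(filename):
    """Return a text stream for a LUT file, preferring the zlib-compressed copy.

    Hypothetical stand-in for the logic described for
    G4OpticalSurface::ReadLUTFileToStream; not the actual Geant4 code.
    """
    compressed = Path(str(filename) + ".z")
    plain = Path(filename)

    # 1. Try filename.z first and inflate it in memory.
    if compressed.is_file():
        raw = zlib.decompress(compressed.read_bytes())
        return io.StringIO(raw.decode("ascii"))

    # 2. Fall back to the uncompressed file.
    if plain.is_file():
        return io.StringIO(plain.read_text(encoding="ascii"))

    # 3. Neither file exists: signal failure (the patch returns a bad stream).
    return None
```

A caller would check the result before parsing, for example `stream = read_lut_to_stream("Rough_LUT.dat")` followed by a test for `None`.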

A centralized approach to reading .z files could be desirable as well, but at least G4ParticleHPManager does some additional things in GetDataStream and includes a separate function to determine if an appropriate file exists (filename.z or filename).

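To compress the RealSurface LUT files with zlib (not gzip), run `pigz -z "$G4REALSURFACEDATA"/* && mmv "$G4REALSURFACEDATA/*.zz" "$G4REALSURFACEDATA/#1.z"`. The rename is needed because pigz automatically adds .zz as the extension, not .z. The mmv patterns are double-quoted so that the shell expands $G4REALSURFACEDATA but leaves the wildcard to mmv.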
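
Where pigz and mmv are not available, the same zlib-format output can be produced with Python's standard `zlib` module. This is a minimal sketch under assumptions: the helper name `compress_to_z` is hypothetical, the whole file is read into memory, and unlike `pigz -z` the original file is left in place.

```python
import zlib
from pathlib import Path


def compress_to_z(path, level=9):
    """Write a zlib-format (not gzip) copy of `path` next to it with a .z suffix."""
    src = Path(path)
    dst = src.with_name(src.name + ".z")
    # zlib.compress emits the zlib container (RFC 1950), like pigz -z.
    dst.write_bytes(zlib.compress(src.read_bytes(), level))
    return dst
```

Looping over the files in the data directory and calling `compress_to_z` on each would then give `Rough_LUT.dat.z` and so on directly, with no rename step.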
Comment 1 Daren Sawkey 2020-05-13 06:39:08 CEST
Thanks for suggesting this. I agree it is not efficient to store the data files as ASCII text. I wonder, have you investigated the speed vs. size trade-off for storing the data as binary vs. zipped ASCII?
Comment 2 Wouter Deconinck 2020-05-14 23:35:59 CEST
I have not investigated whether binary would be even 'better' than zipped ASCII. The advantage of zipped ASCII is that it is easy for users to figure out what is in the data files. Binary data does not seem to be how the data files are stored for other processes in geant4, so I didn't think it would be accepted.

As an example, Rough_LUT.dat holds 7.280M float values (4 bytes each in binary), each written as 12 ASCII characters on average (at 1 byte each; this includes the exponent E and EOL characters), for 27MB as pure binary and 85MB as ASCII. Now, compression handles the many (1.7M) lines of just 0.000000 in that file very well, so the zipped ASCII is only 17MB.
-> zipped ASCII is more efficient in terms of stored file size (somewhat counter-intuitively, and dependent on the data file)

At 17MB vs 27MB, an anecdotal deflate decompression speed of 323 MB/s (https://cran.r-project.org/web/packages/brotli/vignettes/brotli-2015-09-22.pdf), and an anecdotal SSD (HDD) read speed of 550 MB/s (125 MB/s), this means an uncached time of 0.08s (0.19s) for zipped ASCII and 0.05s (0.22s) for pure binary.
-> pure binary is faster on SSD but slower on HDD
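
The estimate above can be reproduced with a few lines of arithmetic. This sketch assumes, as the quoted figures imply, that the decompression rate is applied to the compressed size and that the time to parse ASCII into floats is ignored; the function and constant names are only illustrative.

```python
DEFLATE_MB_S = 323.0                       # anecdotal deflate decompression rate
READ_MB_S = {"SSD": 550.0, "HDD": 125.0}   # anecdotal uncached read rates
ZIPPED_MB = 17.0                           # Rough_LUT.dat as zlib-compressed ASCII
BINARY_MB = 27.0                           # Rough_LUT.dat as 4-byte floats


def load_time(size_mb, read_mb_s, decompress_mb_s=None):
    """Seconds to read a file from disk, plus optional decompression time."""
    seconds = size_mb / read_mb_s
    if decompress_mb_s is not None:
        seconds += size_mb / decompress_mb_s
    return seconds


for disk, speed in READ_MB_S.items():
    zipped = load_time(ZIPPED_MB, speed, DEFLATE_MB_S)
    binary = load_time(BINARY_MB, speed)
    print(f"{disk}: zipped ASCII {zipped:.2f} s, pure binary {binary:.2f} s")
```

The crossover depends only on the ratio of read speed to decompression speed, so a faster disk favours binary while a slower one favours the smaller compressed file.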
Comment 3 Wouter Deconinck 2020-05-14 23:39:23 CEST
I do have to add that since posting this, I have found that the patch can introduce a race condition in multithreaded runs (it hangs on a futex call). I'm still trying to determine where exactly it is coming from, and how to fix it, but it is reproducible and clearly attributable to this patch.
Comment 4 Daren Sawkey 2020-10-22 04:17:43 CEST
The RealSurface data files will be zlib-compressed in the upcoming release (10.7). Thanks for your suggestion.