Problem 1390

Summary: Tracking not deterministic, differences up to 9 mm over a path of 1 meter
Product: Geant4
Reporter: Tom Roberts <tjrob>
Component: processes/electromagnetic
Assignee: Vladimir.Ivantchenko
Status: RESOLVED INVALID    
Severity: critical    
Priority: P5    
Version: 9.5   
Hardware: All   
OS: All   

Description Tom Roberts 2012-11-15 03:42:19 CET
I have spent several days tracking down a bug that causes tracking differences as large as 9 mm over a path of 1 meter in liquid hydrogen, using 200 MeV/c mu+. I'm using QGSP_BERT with geant4-09-05-patch-01, comparing results from two different runs of the same executable on the same hardware. One run is ordinary and the other uses MPI to run on multiple cores. I have established that MPI is not the problem: this executable can run without MPI on one core, or with MPI using multiple instances on multiple cores (this is NOT multi-threading).

I found it: G4WentzelOKandVIxSection.cc does this on line 149:
  if(Z != targetZ || tkin != etag) {
Both tkin and etag are type G4double, and that test for inequality screws up on a few percent of the tracks. Initially the differences are only a few parts per million, but they quickly expand to as much as 9 mm over 1 meter!

Replacing that line with this solves the problem:
  if(Z != targetZ || fabs(tkin-etag) > 1E-6*MeV) {

This code is essentially caching the results of a computation, so it is not obvious why this fails so badly. Nor do I know why the presence or absence of MPI affects the results. But I do know that with that one-line fix the results are equal to the accuracy printed (typically 6-8 digits for track positions and momenta). 
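
To make the difference between the two conditions quoted above concrete, here is a minimal sketch of a cache guarded either way. It is not Geant4 code: `KinematicsCache`, its members, and the tolerance argument are hypothetical names chosen for illustration, and only the shape of the two tests follows the lines quoted from G4WentzelOKandVIxSection.cc.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for the cached setup; NOT the Geant4 class.
struct KinematicsCache {
    int    targetZ = -1;        // element the cache was last filled for
    double etag    = -1.0;      // kinetic energy the cache was last filled for
    int    recomputations = 0;  // how often the cache was refilled

    // Exact comparison, same shape as the original test on line 149.
    bool needsUpdateExact(int Z, double tkin) const {
        return Z != targetZ || tkin != etag;
    }

    // Tolerant comparison, same shape as the proposed replacement.
    bool needsUpdateTolerant(int Z, double tkin, double tol) const {
        assert(tol >= 0.0);
        return Z != targetZ || std::fabs(tkin - etag) > tol;
    }

    // Refill the cache for a new (Z, tkin) pair.
    void fill(int Z, double tkin) {
        targetZ = Z;
        etag = tkin;
        ++recomputations;
    }
};
```

With the exact test, an energy that differs from `etag` in the last bit forces a refill; with the tolerant test the previously cached values are reused for that slightly different energy. Which of the two behaviours is the desired one is exactly the open question in this report.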

Note this may well be system dependent; my testing was primarily on a Mac Pro running Mac OS X 10.7.5 (Lion). But it also fails in a similar manner on hopper.nersc.gov (runs some recent version of Cray Linux); the same code change fixed it. Both are 64-bit builds.

NOTE: I do not know which run (before the fix) is correct; perhaps neither is.

It is well known that testing for equality of doubles is a VERY bad idea. I had not realized how bad it is. During my testing I found that a simple "a = b;" for doubles is not always 64-bit clean.
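
A generic illustration of why equality tests on doubles are fragile (unrelated to the Geant4 sources; the function names are hypothetical): summing the same three numbers in a different order can give results that differ in the last bit, so `==` fails even though the values agree to any sensible tolerance. The sketch assumes strict IEEE-754 double arithmetic, i.e. no `-ffast-math` or similar reassociation.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// The same three numbers, summed in two different orders.
double sumLeftToRight(double a, double b, double c) { return (a + b) + c; }
double sumRightToLeft(double a, double b, double c) { return a + (b + c); }

// Hypothetical helper: relative comparison with a caller-chosen tolerance.
bool nearlyEqual(double x, double y, double relTol) {
    assert(relTol >= 0.0);
    const double scale = std::max(std::fabs(x), std::fabs(y));
    return std::fabs(x - y) <= relTol * scale;
}
```

For example, `sumLeftToRight(0.1, 0.2, 0.3)` and `sumRightToLeft(0.1, 0.2, 0.3)` compare unequal with `!=`, while `nearlyEqual` with a relative tolerance of `1e-12` treats them as the same value.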

Perhaps it would be a good idea to search the entire Geant4 codebase for such equality tests of doubles and floats.
Comment 1 Tom Roberts 2012-11-15 03:48:20 CET
I forgot to mention that I am of course seeding the random-number generator identically for all runs. As part of my testing I replaced G4UniformRandom with a version that printed the value to 16 digits; they were identical in all runs.
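
A side note on this kind of check, as a sketch rather than a description of what was actually run: 16 significant decimal digits do not always identify a double uniquely, whereas 17 (`std::numeric_limits<double>::max_digits10`) do, and comparing the raw bit pattern avoids decimal formatting altogether. The helper names below are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <limits>
#include <string>

// Format x with the requested number of significant decimal digits.
std::string formatSignificant(double x, int digits) {
    assert(digits > 0 && digits <= std::numeric_limits<double>::max_digits10);
    char buf[64];
    std::snprintf(buf, sizeof buf, "%.*g", digits, x);
    return std::string(buf);
}

// Raw 64-bit pattern of a double, for an exact comparison between runs.
std::uint64_t bitPattern(double x) {
    static_assert(sizeof(std::uint64_t) == sizeof(double),
                  "this sketch assumes a 64-bit double");
    std::uint64_t bits = 0;
    std::memcpy(&bits, &x, sizeof bits);
    return bits;
}
```

For instance, `1.0` and `1.0 + std::numeric_limits<double>::epsilon()` are different doubles, yet `formatSignificant` returns the same string for both at 16 digits and different strings at 17.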
Comment 2 Vladimir.Ivantchenko 2012-11-16 22:53:47 CET
Hello,

If your observation is correct then you face a major non-reproducibility of Geant4. Many tests have been done inside the G4 collaboration, and reproducibility was always confirmed. So what you are saying is exceptional and very difficult to believe. Please rebuild your application from scratch without any trace of MPI and double-check that in an ordinary sequential application you cannot reproduce the result when you run on the same PC with the same initial random seed.

If you still believe in your observation then, please, send a tar file of the application.

Also note that your proposal is not acceptable: it is known to be bad practice to add "tolerances" in caches. It is better to simply remove such an "if" statement and recompute everything. Comparison of doubles, on the other hand, is a normal approach; this cannot be a problem at all.

VI
Comment 3 Vladimir.Ivantchenko 2012-12-06 20:08:34 CET
Because no new information has been received, the bug report should change its status. If the problem is confirmed, a new bug report should be opened.

VI