Problem 1718 - Coding error causes irreproducible results
Summary: Coding error causes irreproducible results
Status: RESOLVED WONTFIX
Alias: None
Product: Geant4
Classification: Unclassified
Component: processes/electromagnetic (show other problems)
Version: 10.1
Hardware: All All
: P5 major
Assignee: Vladimir.Ivantchenko
URL:
Depends on:
Blocks:
 
Reported: 2015-02-28 18:31 CET by Tom Roberts
Modified: 2015-03-03 18:09 CET (History)
2 users (show)

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this problem.
Description Tom Roberts 2015-02-28 18:31:01 CET
It is well known that comparing doubles for equality is dangerous. Unfortunately the Geant4 code does this, and it causes irreproducible results.

This particular one is in G4WentzelOKandVIxSection.cc at line 168.
OLD: if(Z != targetZ || tkin != etag) {
NEW: if(Z != targetZ || fabs(tkin-etag) > 1E-9) {

(Note that my choice of 1E-9 may not be optimal -- someone familiar with the code should do it right.)

The symptom was that running in sequential mode gave different results from MPI mode (using my MPI code, not the Geant4 MPI code). For my case, in MPI mode tkin and etag were the same to 14 significant digits, but not exactly equal; in sequential mode they were exactly equal.

This is definitely a coding error. And this is not the first time I have reported this particular coding error (but in different files).

Please educate your programmers to not do this. Even if they THINK it is OK because the code is just caching the results of a calculation, DON'T DO IT! 

It would be good to search the codebase for other instances, as I am fairly sure there are more that just happened not to fail my regression tests.
Comment 1 Andrea Dotti 2015-03-03 02:19:53 CET
Dear Tom,
    we have been evaluating your bug report carefully. We have decided to mark it as "won't fix" because we believe this is not the source of the irreproducibility you describe (in addition the change you propose will introduce more non-reproducibility according to our definition, some details at: https://twiki.cern.ch/twiki/bin/view/Geant4/MultiThreadingTaskForce#Reproducibility).

I understand you are using G4 10.1 without multi-threading. This configuration is checked periodically for reproducibility, and as I mentioned, the main production physics lists (with the exclusion of Radioactive Decay and NeutronHP, on both of which we are working) are reproducible: we check both strong and weak reproducibility (e.g. starting again job with the same random number and reproducing a specific event in the middle of a job) and we do not see problems.

I also understand that you are using your own MPI wrapper and the reproducibility issue you report is between an MPI-enabled and an MPI-disabled application (i.e. comparing two jobs running in two different conditions and with different execution paths). I would suggest to first verify and exclude that the problem is in the specific application code: if you still consider the problem to be on our side you should provide a receipt that shows a G4 stand-alone example with breaking of either strong or weak reproducibility.

Regards,
Andrea
Comment 2 Tom Roberts 2015-03-03 18:09:54 CET
I disagree with this "resolution". STRONGLY.

There is a reason that elementary textbooks on programming tell students to NEVER use equality comparisons for doubles. The reason is that programmers, especially physicist programmers, apply mathematics to their problems. There can be several different equations that could be used to calculate a given value, and MATHEMATICALLY they all yield the same value, but ON A DIGITAL COMPUTER they do not -- roundoff matters, and can easily vary.

That is the situation here -- it seems that different parts of the code use mathematically equal equations to calculate the values being compared, but digitally THEY ARE DIFFERENT. The values agreed to 14 significant digits, but did not compare equal.

You are burying your head in the sand. Do you seriously expect all of your users, and all of your own programmers, to carefully use only a specified equation to compute kinetic energy? Which equation do you require? (I'll bet you don't know.)

Yes, this particular error is in multiple scattering, which is stochastic. The problem is that it does not generate exactly the same result, making regression tests difficult. The reason I found it was a failure of one of my regression tests -- this minuscule difference blossomed into a difference of several millimeters in track positions; my test required positions to be equal within 0.2 mm; in the past they were equal to within a micron.

And THAT illustrates the problem: I'll bet that no changes were made that were thought to change anything, and tests confirmed no change, but yet this did change. Because physicist programmers use mathematics and rarely analyze the roundoff of the equations they use.

Do you really want code in Geant4 that would be marked INCORRECT in CS101?
And which does cause problems for some of your users? (G4beamline represents >400 users.)