Problem 2368 - Issue with Reproducibility
Summary: Issue with Reproducibility
Status: ASSIGNED
Alias: None
Product: Geant4
Classification: Unclassified
Component: processes/electromagnetic (show other problems)
Version: 10.7
Hardware: All All
: P4 normal
Assignee: Vladimir.Ivantchenko
URL:
Depends on:
Blocks:
 
Reported: 2021-05-14 14:53 CEST by Simon Spannagel
Modified: 2022-02-01 02:32 CET (History)
4 users (show)

See Also:


Attachments
G4EmParameters::Instance()->Dump() (4.55 KB, text/plain)
2021-05-17 20:18 CEST, Simon Spannagel
Details

Note You need to log in before you can comment on or make changes to this problem.
Description Simon Spannagel 2021-05-14 14:53:52 CEST
Dear Geant4 developers,

I have been asked to report this issue here from a e-mail conversation with John Apostolakis and Gabriele Cosmo.

A preface:
I am the main developer of a detector simulation framework called Allpix Squared that uses Geant4 as the very first step in a detailed simulation of the response of silicon
detectors to incident radiation [1]. Geant4 accounts for < 10% of the computing time spent for simulating a single event. We are currently about to roll out a multi-threaded version of our framework, and since Geant4 is only one option of several to simulate the deposition of energy in the sensor, we cannot rely on the Geant4-internal multi-threading. Instead, we have derived our own MTRunManager and WorkerRunManager classes [2] from the G4MTRunManager and G4WorkerRunmanager to use the threads already spawned by our own framework for processing events. What we call an "event" corresponds to one Geant4 call to BeamOn.

In order to provide strong reproducibility, we have event-based seeding with every event receiving a sequential seed from a central PRNG. This event seed is in turn used to initialize a thread-local PRNG which serves for processing this particular event on a given thread.

For Geant4, our MTRunManager sets two seeds provided by the event's PRNG to the seedQueue of the WorkerRunManager and start the run:

void MTRunManager::Run(G4int n_event, uint64_t seed1, uint64_t seed2) {

    // Seed the worker run manager for this event:
    worker_run_manager_->seedsQueue.push(static_cast<long>(seed1 % LONG_MAX));
    worker_run_manager_->seedsQueue.push(static_cast<long>(seed2 % LONG_MAX));

    // redirect the call to the correct manager responsible for this thread
    worker_run_manager_->BeamOn(n_event);
}

The worker run manager is a simplified version of Geant4's version, with our own method GenerateEvent() which you can find at [3].

This seems to be working just fine and we get reproducible outcomes of the event - in most cases. During extensive testing of the reproducibility of our framework we realized that when we however re-run the same simulation (same starting seed and configuration) with several hundred thousand events, there are always a few events which differ in a subtle way despite having received the same seed as their counterpart in the initial simulation. The difference seems to be mostly in low-energetic delta rays produced by Geant4, which we obtain in our ProcessHits action from the individual steps [4] and which in turn of course completely change the result and chain or random numbers used in this event throughout the framework. The events are isolated via the seeding, so the other events don't seem to be affected, but we are puzzled where this could come from.

It is interesting to note that this does not seem to happen in sequential (single-threaded) mode as far as we can tell. Also here, we use our own RunManager derived from G4RunManager and set the event seeds via

void RunManager::Run(G4int n_event, uint64_t seed1, uint64_t seed2) {

    // Set the event seeds - with a zero-terminated list:
    std::array<long, 3> seeds{static_cast<long>(seed1 % LONG_MAX), static_cast<long>(seed2 % LONG_MAX), 0};
    G4Random::setTheSeeds(&seeds[0], -1);

    // Call the RunManager's BeamOn
    G4RunManager::BeamOn(n_event);
}

in order to have the same interface. Events produced with the single-threaded and the multi-threaded version match - apart from the occasional difference in the MT version.

Some additional information:

* We are using the latest, full patched version Geant4 10.7.p01 but this also occurs with earlier versions.
* This occurs with any of the standard physics lists we tested, e.g. FTFP_BERT_EMZ
* We have reproduced this on multiple machines with different processors (e.g. Intel Xeon E5-2640 and AMD EPYC 7402)

In order to obtain a trace of these diverging events, we have followed the suggestion provided by John and repeated the experiment many times, each time starting two simulations with the same seed (but different seeds per iteration), simulating 2000 events and logging the full "verbosity 6" level.

After doing this a few times and comparing the result for each event I was able to single out one individual event where things break down. Under the link below [5] you can find two diverging traces for this event (#1767) for the two simulation runs. The first line is our input seed for the event (unfortunately not the event seeds for Geant4 but I can probably reproduce them if necessary), the rest is the logging from Geant4.

run_0_evt1767.txt
run_1_evt1767.txt

To cross-check, I also looked at the event right before the problematic one (#1766), from the same runs, and diff'ing their two files prints the different G4Track pointer addresses as only difference. I also placed these traces under the link below.

run_0_evt1766.txt
run_1_evt1766.txt

Do you have any idea where this problem in reproducibility could come from?

All the best and thank you very much,
Simon


[1] This is the DepositionGeant4 module:
https://gitlab.cern.ch/allpix-squared/allpix-squared/-/tree/mt_g4_seeding/src/modules/DepositionGeant4
[2] Our manager classes:
https://gitlab.cern.ch/allpix-squared/allpix-squared/-/tree/mt_g4_seeding/src/tools/geant4
[3] WorkerRunManager:
https://gitlab.cern.ch/allpix-squared/allpix-squared/-/blob/mt_g4_seeding/src/tools/geant4/WorkerRunManager.cpp#L112
[4] ProcessHits:
https://gitlab.cern.ch/allpix-squared/allpix-squared/-/blob/mt_g4_seeding/src/modules/DepositionGeant4/SensitiveDetectorActionG4.cpp#L50
[5] Event traces:
https://sas.desy.de/index.php/s/ni9ogS9e6FmC3KY
Comment 1 John Apostolakis 2021-05-17 16:54:00 CEST
There seems to be one 'trigering' difference in event 1767 as seen in the fully verbose (/tracking/verbose 6).

Recreating the 'verbose 1' output from the relevant track, you can see that in run 0 the process does not generate an energy deposit (but creates an extra secondary): 

* G4Track Information:   Particle = gamma,   Track ID = 12,   Parent ID = 11
*****************************************************************************
Step#  X(mm)  Y(mm)  Z(mm) KinE(MeV)  dE(MeV) StepLeng TrackLeng  NextVolume ProcName
 0   0.0399   0.0556   302   0.00118        0        0         0 support_telescope5_phys_0 initStep
 1   0.0483   0.0587   302         0        0   0.0131    0.0131 support_telescope5_phys_0 phot
 
whereas in run 1 no energy deposit is generated but 3 secondaries are generated.

 * G4Track Information:   Particle = gamma,   Track ID = 12,   Parent ID = 11
 *****************************************************************************
 Step#    X(mm)    Y(mm)    Z(mm) KinE(MeV)  dE(MeV) StepLeng TrackLeng  NextVolume ProcName
     0   0.0399   0.0556      302   0.00118        0        0         0 support_telescope5_phys_0 initStep
     1   0.0483   0.0587      302         0 4.34e-05   0.0131    0.0131 support_telescope5_phys_0 phot

In run0 the fully verbose output (level 6) output reports:

 >>PostStepDoIt (process by process):    Process Name = phot
 
     ++G4Step Information 
       Step Length (mm)      : 0.01312547223377038

    ++G4ParticleChange Information 
       -----------------------------------------------
         G4ParticleChange Information  
       -----------------------------------------------
         # of 2ndaries       :                    3
       -----------------------------------------------
         (Showed only 1st one)
       -----------------------------------------------
         Energy Deposit (MeV):                    0


whereas in run1 the same process gives 2 differences mentioned above:
       -----------------------------------------------
         # of 2ndaries       :                    2
       -----------------------------------------------
         (Showed only 1st one)
       -----------------------------------------------
         Energy Deposit (MeV):             4.34e-05

It is likely that this is due to some memory in the relevant process (or maybe a flag/value which is not correctly updated on some threads.)
Comment 2 Vladimir.Ivantchenko 2021-05-17 17:51:16 CEST
Hello Simon,

likely you touch a real problem of Geant4 10.7p01, however, analysis of it and the fix seems to be not easy. Can I ask simple questions: 

- do you see the problem with the same frequency using FTFP_BERT and FTFP_BERT_EMZ?

- Are there in your application some extra physics/cuts/parameters on top of these reference Physics Lists?

- what is the material, in which this happens in this published event?

- what is the production threshold (cut in range and corresponding threshold) for gamma and electron in this material?

VI
Comment 3 Simon Spannagel 2021-05-17 18:14:39 CEST
Hi Vladimir,


- do you see the problem with the same frequency using FTFP_BERT and FTFP_BERT_EMZ?

Because of its low rate that's really hard to say. I will run some simulations but cannot promise there will be anything useful coming out...



- Are there in your application some extra physics/cuts/parameters on top of these reference Physics Lists?

There are a few additional settings we have, which are:
- A range cut via "/run/setCut 0.011"
- G4UserLimits for our active material with (0.01, DBL_MAX, 2.21e11)
- G4UserLimits for the world volume with (DBL_MAX, DBL_MAX, 2.21e11) 



- what is the material, in which this happens in this published event?

The material in which this happened in the event posted is G10 (PCB), defined as:

auto* Epoxy = new G4Material("Epoxy", 1.3 * CLHEP::g / CLHEP::cm3, 3);
Epoxy->AddElement(H, 44);
Epoxy->AddElement(C, 15);
Epoxy->AddElement(O, 7);

auto* GTen = new G4Material("G10", 1.7 * CLHEP::g / CLHEP::cm3, 3);
GTen->AddMaterial(nistman->FindOrBuildMaterial("G4_SILICON_DIOXIDE"), 0.773);
GTen->AddMaterial(Epoxy, 0.147);
GTen->AddElement(Cl, 0.08);


- what is the production threshold (cut in range and corresponding threshold) for gamma and electron in this material?

That is a good question - we don't configure anything in particular. The above range cut is only applied to our (silicon sensor) active bits of the setup. How do I find out the values set by Geant4?

Best regards,
Simon
Comment 4 Vladimir.Ivantchenko 2021-05-17 19:27:21 CEST
Hello Simon,

may be enough to say if you see the problem in FTFP_BERT too, not only in FTFP_BERT_EMZ?

Concerning cuts: in the default run manager we enable dump of production thresholds by the UI command "/run/verbose 1" but in your case of custom RunManager this should be done differently. Does not matter now, because you already describe the material. It is enough to know range cut 0.011 mm.

Can you recheck in after initialisation but before the event loop, that EM parameters are the same in each thread? For that you may call:

G4EmParameters:Instance()->Dump()

or

G4EmParameters:Instance()->StreamInfo(std::ostream& os)
if you are using custom output streams. 

It is important to see if the dump is the same in all threads and parameter values are as desired. The dump is short, so it may be posted here. 

Vladimir
Comment 5 Simon Spannagel 2021-05-17 20:18:14 CEST
Created attachment 694 [details]
G4EmParameters::Instance()->Dump()
Comment 6 Simon Spannagel 2021-05-17 20:18:28 CEST
Hi Vladimir,

I added the parameter dump after all threads and worker run managers have been initialized, and I see the exact same configuration for all of them. I have attached the dump for reference.

Best regards,
Simon
Comment 7 Simon Spannagel 2021-06-08 09:58:18 CEST
Hi Vladimir,

a brief heads-up - I ran with FTFP_BERT over the past weeks, totaling to sth like 30x2 simulations with 2k events each (as described above) and have not found any difference between events. I'm not sure we can call this "conclusive", I keep running for some more to see if anything pops up.

Is there anything else I can check for you?

/Simon
Comment 8 Vladimir.Ivantchenko 2022-01-05 17:01:29 CET
Hello, Simon,

sorry to return to the issue too late. Few comments:

1) the fact that FTFP_BERT does not show the problem in contrary with FTFP_BERT_EMZ means that the problem is likely in Geant4 and not in your code. There is also a possibility that you face a secondary effect of the real problem.

2) In the event where non-reproducibility happens the process is the photoelectric effect. In one case it is generating final e- + two extra particles likely due to atomic de-excitation, in another - no secondaries which is wrong, because no secondaries may be only in the case if "ApplyCuts" option is enabled, which is not the case according to your dump of EM parameters. The same model of photoeffect is used in both Physics Lists, which makes the explanation even more difficult.

3) Today we have 2 extra patches to 10.7, in these patches nothing changed for photoeffect and for EM physics in general.

4) A problem may be if you construct FTFP_BERT_EMZ yourself and not via G4PhysListFactory or if you are using G4EmProcessOptions class. I hope it is not the case.

So, a practical solution may be to use FTFP_BERT and if needed apply some extra options on top using G4EmParameters interface and/or corresponding UI commands.

Vladimir
Comment 9 Vladimir.Ivantchenko 2022-01-05 17:04:07 CET
Hello, Simon,

sorry to return to the issue too late. Few comments:

1) the fact that FTFP_BERT does not show the problem in contrary with FTFP_BERT_EMZ means that the problem is likely in Geant4 and not in your code. There is also a possibility that you face a secondary effect of the real problem.

2) In the event where non-reproducibility happens the process is the photoelectric effect. In one case it is generating final e- + two extra particles likely due to atomic de-excitation, in another - no secondaries which is wrong, because no secondaries may be only in the case if "ApplyCuts" option is enabled, which is not the case according to your dump of EM parameters. The same model of photoeffect is used in both Physics Lists, which makes the explanation even more difficult.

3) Today we have 2 extra patches to 10.7, in these patches nothing changed for photoeffect and for EM physics in general.

4) A problem may be if you construct FTFP_BERT_EMZ yourself and not via G4PhysListFactory or if you are using G4EmProcessOptions class. I hope it is not the case.

So, a practical solution may be to use FTFP_BERT and if needed apply some extra options on top using G4EmParameters interface and/or corresponding UI commands.

Vladimir
Comment 10 Simon Spannagel 2022-01-10 10:07:30 CET
Dear Vladimir,

thank you for your comments!
We indeed use the physics list factory and are not building the lists ourselves, this is the relevant code (abbreviated for some debug messages):

    G4PhysListFactory physListFactory;
    G4VModularPhysicsList* physicsList = physListFactory.GetReferencePhysList(physics_list);
    if(physicsList == nullptr) {
        throw InvalidValueError();
    } else {
        LOG(INFO) << "Using G4 physics list \"" << physics_list << "\"";
    }
    // Register a step limiter (uses the user limits defined earlier)
    physicsList->RegisterPhysics(new G4StepLimiterPhysics());

    // Register radioactive decay physics lists
    physicsList->RegisterPhysics(new G4RadioactiveDecayPhysics());

    // ...

    run_manager_g4_->SetUserInitialization(physicsList);
    run_manager_g4_->InitializePhysics();


All the best,
Simon