Problem 2188 - Events not independent due to magnetic field + looper thresholds kill primary particle in all subsequent events
Summary: Events not independent due to magnetic field + looper thresholds kill primary...
Status: RESOLVED INVALID
Alias: None
Product: Geant4
Classification: Unclassified
Component: geometry/magneticfield (show other problems)
Version: 10.5
Hardware: Apple Mac OS X
: P4 major
Assignee: John Apostolakis
URL:
Depends on:
Blocks:
 
Reported: 2019-08-16 11:45 CEST by Laurie Nevay
Modified: 2021-12-14 11:22 CET (History)
0 users

See Also:


Attachments

Note You need to log in before you can comment on or make changes to this problem.
Description Laurie Nevay 2019-08-16 11:45:32 CEST
Since Geant4.10.5 (including p01) if a particle gets stuck and an event safely aborted, the navigation can remain in a bad 'state' and every subsequent event has severe problems.  This behaviour can include (shockingly) killing the primary particle on the first step.  This is a combination of problems.  None of these were a problem with the same model in any previous version of Geant4.

1) Related to Issue #2144 - looper thresholds triggered, subsequent primaries killed because below threshold which is by default relatively high. Not exactly the same though as I see it for charged particles too and for particles above this threshold.
2) Tracking is not completely independent between events. The bad navigation state remains causing problems in the next event (code below to combat this).
3) The code required to combat this is rather verbose because of the interface and has to be copied from an example, rather than being part of Geant4. It's also highly order dependent making our application difficult to maintain.

Details:
On one event, I see one secondary particle stuck in a rough field (approximate field description).  It's identified as looping and killed - so far, so good (well, "Proposed Step is 1.4228e-05 but Step Taken is 1.4228e-05").  However, the primary (charged) particle (single particle from gun) is killed on the first step for every subsequent event.

The model is of a small 100m beam line with around 30 magnets, each with unique fields in the vacuum portion of the magnet and an approximate field in the yoke. Our application has been used like this for many years without problem.  However, I see this with a few models.

I see this for event ~#65k out of 100k.  The rate is low, but it's a blocker for the whole run. It's a combination of seed and initial distribution that eventually finds a secondary in a certain place. This will eventually happen even if I split up the problem into smaller chunks of fewer events (I've tried on a cluster).

If I use the same set of seeds but start 1 seed after the problem event, things progress ok.  This means the state of navigation is not cleared between events. This should be ok if we never have any navigation problems.  The geometry has no identifiable overlaps and while the field isn't the smoothest, it is also not discontinuous.  A G4ClassicalRK4 is used as the integrator with a step limit of 110% the volume length.

few setup notes
- many local fields on specific volumes
- no global field
- sequential compilation - no MT

I've now had to introduce the following bits of code to combat this.

In EventAction::BeginOfEventAction(G4Event* evt)
```
  // make further attempts to clear Geant4's tracking history between events to make them
  // truly independent.
  G4Navigator* trackingNavigator = G4TransportationManager::GetTransportationManager()->GetNavigatorForTracking();
  trackingNavigator->ResetStackAndState();
  G4TransportationManager* tm = G4TransportationManager::GetTransportationManager();
  int i = 0;
  for (auto it = tm->GetActiveNavigatorsIterator(); i < (int)tm->GetNoActiveNavigators(); it++)
    {(*it)->ResetStackAndState(); i++;}
  tm->GetPropagatorInField()->ClearPropagatorState(); // <- this one really makes a difference

  // Unfortunately the interfaces to G4TransportationManager aren't great which makes this a bit
  // pedantic. Also, the GetNavigator creates an exception if it doesn't find what it's looking
  // for rather than just return a nullptr
  G4bool samplerWorldExists =  false;
  std::vector<G4VPhysicalVolume*>::iterator worldIterator = tm->GetWorldsIterator();
  for (G4int iw = 0; iw < (G4int)tm->GetNoWorlds(); iw++)
    {
      samplerWorldExists = samplerWorldExists || (*worldIterator)->GetName() == "SamplerWorld_main";
      worldIterator++;
    }
  if (samplerWorldExists)
    {
      auto swtracker = tm->GetNavigator("SamplerWorld_main");
      if (swtracker)
        {swtracker->ResetStackAndState();}
    }
  fpEventManager->GetStackManager()->clear();
```

 tm->GetPropagatorInField()->ClearPropagatorState(); is the most effective in this case.

The following code is called from 3 places - physics process construction for the primary beam particle, PrimaryGeneratorAction in the case of an ion beam as this can't be done before PriamryGeneratorAction without a crash, and in BeginOfRunAction for the e+ and e- (irrespective of the beam).

```
void BDS::FixGeant105ThreshholdsForParticle(const G4ParticleDefinition* particleDef)
{
  // in the case of ions the particle definition isn't available early on so protect
  // against this
  if (!particleDef)
    {return;}
  // taken from the Geant4.10.5 field01 example
  // used to compensate for agressive killing in Geant4.10.5
  G4double warningEnergy   =   1.0 * CLHEP::kiloelectronvolt;  // Arbitrary
  G4double importantEnergy =  10.0 * CLHEP::kiloelectronvolt;  // Arbitrary
  G4double numberOfTrials  =  1500;                            // Arbitrary
  auto transportPair    = BDS::FindTransportation(particleDef);
  auto transport        = transportPair.first;
  auto coupledTransport = transportPair.second;

  if (transport)
    {
      // Change the values of the looping particle parameters of Transportation
      transport->SetThresholdWarningEnergy(warningEnergy);
      transport->SetThresholdImportantEnergy(importantEnergy);
      transport->SetThresholdTrials(numberOfTrials);
    }
  else if(coupledTransport)
    {
      // Change the values for Coupled Transport
      coupledTransport->SetThresholdWarningEnergy(warningEnergy);
      coupledTransport->SetThresholdImportantEnergy(importantEnergy);
      coupledTransport->SetThresholdTrials(numberOfTrials);
    }
}

std::pair<G4Transportation*, G4CoupledTransportation*> BDS::FindTransportation(const G4ParticleDefinition* particleDef)
{
  const auto* partPM = particleDef->GetProcessManager();

  G4VProcess* partTransport = partPM->GetProcess("Transportation");
  auto transport = dynamic_cast<G4Transportation*>(partTransport);

  partTransport = partPM->GetProcess("CoupledTransportation");
  auto coupledTransport = dynamic_cast<G4CoupledTransportation*>(partTransport);

  return std::make_pair(transport, coupledTransport);
}
#endif
```

Additionally, I set low looper thresholds in my main before the physics is constructed (order dependent).  

```
G4PhysicsListHelper::GetPhysicsListHelper()->UseLowLooperThresholds();
```

This must be done before any physics is created as the parameters are copied into the transportation physics process for each particle and it's very hard to sift through and fix all of them afterwards.  I do specifically for my beam particle though.

With verbose stepping 2 for the bad event, I see a few secondaries are tracked and one gets stuck, the warning print out shows the next primary being killed for looping with some confusing output.

"number of propagation trials = 0 of max 10)" but "number of calls of Transport/AlongStepDoIt = 35315"
"149.495 MeV in energy" but "current value of warning threshold is 100MeV"

The same model works without problem and gives meaningful results with Geant4.10.4 and lower.  The above code snippets are excluded for less than Geant4.10.5.

Proposals:
- if not computationally expensive, navigator states be reset throughout on the Geant4 to ensure event independence, no matter how badly the user treats / uses them
- some of the transportation process searching to set looper thresholds be moved from the example to geant4 to avoid code duplication
- the thresholds be a combination of a percentage of the primary energy and a minimum physics scale (obviously we can't go arbitrarily low) - a 100MeV beam is not an unreasonable prospect for Geant4 usage, but requires fairly advance settings to get tracking right
- introduce the ability to turn off the killing of looping particles - whilst very useful in some cases, it would be good to have more high level control over using these features

In the mean time, I'm trying to get to a quicker and simpler example to expose this.  Apologies there are no specific code changes suggested here but the snippets hint at where it's coming from.  Certainly, the solution to 2144 will help but not resolve this issue completely.

Thanks in advance,
Laurie
Comment 1 John Apostolakis 2019-08-16 16:48:34 CEST
Dear Laurie,

Thanks for your extensive report.  You cover a lot in it: the current erroneous behaviour which you observe only in 10.5, your current workaround, issues in creating this workaround, ideas for more comprehensively clearing the state of the full set of navigators and the propagation in field before the start of each event and suggestions for improving the usability of the handling of looping particles.

I will touch a selection of these topics: trying to identify the underlying cause of the current problem, trying to tighten the robustness of event independence via resetting the state of navigators and field classes and a few related items.

I agree that the problem you report is serious, and the robustness of the relevant components must be improved to provide multiple levels of insurance to avoid recurrence.

The cause of the runaway stuck navigation state 
-----------------------------------------------

My first impression is the second of the two issues which I patched in my fix for issue 2144 could be responsible for some or most of the problem which you see.  In your code comments it mentions that clearing the state of the G4PropagatorInField helps a lot -- could you clarify whether the situation becomes less frequent or what the remaining effects are ?

Could you please attach the tracking verbose information of at least the last few (1-2 or more) tracks before the problem occurs, and of a few tracks which are misbehaving ?

If you save or print the state of the RNG at the start of each event, are you able to reproduce the problem by restarting at the start an earlier event ?  Hopefully restarting in exactly the event in which the problem occurs it should recur, or else starting a few events earlier (e.g. 2, 5, 10, 20 or 50 events in case that does not work due to an unknown memoization / memory effect ?

Event independence
------------------

Ensuring the full independence of the execution of different events is a very high priority.  We have attempted to ensure it and anything that is stopping this from occurring is either an oversight or a bug.

The G4EventManager class does reset the state of the main (tracking) navigator at the start of each event.  It must also reset the state of remaining navigators in case other geometries are registered.

In addition the G4PropagatorInField should also be reset at the start of each event.  The overhead of doing it there also (in addition to resetting it at the start of each track) should be negligible.   This may or not be responsible for a substantial part of your current problem(s).

Find the right Transportation process(es) - is there one or more ?
------------------------------------------------------------------
I understand that you are trying to ensure that your chosen 'looper' parameters are always set for the Transportation process(es) used by all particle types (preferably) or at least by the most important ones.

Currently in the 'standard use of G4' using the modular physics lists, only a single Transportation (or Coupled Transportation) object is constructed and it is shared by all particle types.  So setting its parameters can occur any time -- only multithreading applications must act carefully to ensure that the parameters are set correctly for other threads.  

Coping with multi-threading is the reason that I used the complex code that searches for the Transportation code of a particle type, and then sets its parameters.  In a single threaded application, though, this should not be necessary, if the Transportation process is created in the 'standard' way.

From your explanation it seems either that complicated code has caused a misunderstanding -- or else maybe indeed the 'standard' recipe is not followed in your physics lists.

So could you clarify whether your framework / application is creating a separate Transportation process for particular types of particles.  Or was there indeed just due to that misunderstanding ?

I will tackle additional topics in a subsequent comment.

Best regards,
John
Comment 2 John Apostolakis 2019-08-20 18:12:06 CEST
Could you please clarify whether your setup includes multiple geometries ?
If you can explain briefly their utility, it may inform our understanding of
the use case.  From a look at the BDSIM source code it would appear that
likely there are multiple geometries and a navigator related to each one.
Are all navigators and geometries registered with Geant4 ?

In my inspection to date, I have found important state, only in the Navigator
objects for parallel geometries (missed, as mentioned earlier), and
potentially the G4PropagatorInField object (which should now be reset at the
start of each track correctly after the previous patch).

Unless I have missed something, only these state that might affect the
simulation of subsequent events.  I am preparing a revision that would include
such changes in the startup of an event, to better ensure event independence.

By the way, it seems to me that the following code snippet could replace 
the code which you use to find the navigator for a parallel world
( also avoiding an exception in case of a bad name. )

  G4VPhysicalVolume* worldPhys= IsWorldExisting(worldName);
  if( worldPhys ) nav = GetNavigator( worldPhys );

I agree that the code for modifying the parameters of the Transportation
process(es) should be included in the Geant4 source code.  My aim is also 
to simplify it.

Could you clarify, though, your remark that the result is
"highly order dependent" ?  Once the transportation process is constructed,
the same object is assigned to every particle.  So, as long as the
transportation object has been constructed by G4PhysicsListHelper, when its
parameters are changed should not matter - the results should be stable so
long as this is done before the simulation of the first track of the first
event.
Comment 3 John Apostolakis 2020-02-06 13:28:18 CET
Response from Laurie Nevay, 28 August 2019

"Thank you for your prompt and thorough replies.  Apologies for my delayed reply - it was a busy week.

The problem is most reproducible with a user's model - it was reported in our public issue tracker here: 
https://bitbucket.org/jairhul/bdsim/issues/268/primaries-unphysically-lost-in-first
Although this issue is closed from a BDSIM point of view, the problem persists.  I've spent a few days trying to make a simpler way to recreate it without much success so I'll stick to the one I know shows the problem.

In the recent tagged versions of BDSIM, we have several fixes to help with these problems (as described in 2188), so I've made a branch for you that temporarily removes these issues.  You can get BDSIM as follows:

git clone https://bitbucket.org/jairhul/bdsim
git checkout geant4-issue-2188

Apart from Geant4 (I presume you'll use your own build), BDSIM requires:
- compiler with full C++11 support (GCC 4.9 upwards for example)
- ROOT 6 series
- flex
- bison
- Geant4 be compiled against an external CLHEP, not the included one with G4 (we need some extra classes and we need to ensure only one random number generator for strong reproducibility).
It follows the typical cmake setup.  See here for details and just drop me an email if there's any problems. 
http://www.pp.rhul.ac.uk/bdsim/manual/installation.html

I've put the model in a folder on dropbox so you can download it all.   It happens at event 112094.  I spent some time trying to find an easier way to recreate this but haven't found one yet. On the plus side, you can recreate from an event beforehand based on my output file (sadly 500mb root file - sorry about this).

If I start on or before the problem event, the primary is killed in every subsequent event.  If I start after that event, then they proceed normally.  

Our recreation mode loads CLHEP seed states from the output file and we should be strongly reproducible.

In the dropbox folder (link at bottom), please look at "notes.txt" that has the exact commands to run bdsim for the original file (unnecessary for you) and to recreate (for you).  The recreation is therefore very quick as most events are under a second.  There are snippets of relevant terminal output in various text files starting with "terminal-output..." The command you want to show the problem is:

bdsim --file=lattice.gmad --outfile=tu1-r3 --batch --ngenerate=10 --recreate=tu1.root --startFromEvent=112096

BDSIM does not (yet) permit recreating a recreation, so it can only be done w.r.t. the original big output file "tu1.root".

If you want to look at any data in any way, there's an example in "lxplus-analysis-example.txt" including a path to a script on lxplus to get a complete install (v1.3.3 with fixes included) - so only really good for the libraries to read the data, plus our python / root setup.

The differences for the branch genat4-issue-2188 are detailed in "geant4-issue-2188-branch-patch.txt".

Although, the model was posted publicly on our issue tracker, let me just check with the user whether there's any objection to the model being on the Geant4 bug tracker (best to be safe).

Apologies, it's not simpler to reproduce... but it is reproducible with 10.5.p01.  It kills primary e- particles at 105MeV vs apparent warning limit of 100MeV.

I'm also around at CERN if you'd find it useful for me to show you anything in person or discuss.
Dropbox link: [not copied]

Once again, thank you for your help.
Best,
Laurie"
Comment 4 John Apostolakis 2021-09-20 17:54:09 CEST
It is not clear whether this problem persists with more recent versions of Geant4.  

For one the default methods for integration for charged particles in magnetic field have changed in Geant4 10.6 and 10.7.

Information about this could help identify whether this is a problem in Geant4, or the application.
Comment 5 John Apostolakis 2021-12-14 11:22:04 CET
Since there has been no feedback we will close the issue.

If you have updates to provide, please contact us to reopen it.

Best.