Problem 2212

Summary: CPU occupancy strangely decreases over time for long MT run (and memory leaks)
Product: Geant4    Reporter: perl
Component: processes/hadronic/models/de_excitation    Assignee: Vladimir.Ivantchenko
Status: RESOLVED FIXED
Severity: minor    CC: asai, Gunter.Folger, John.Allison, Vladimir.Ivantchenko
Priority: P4
Version: 10.5
Hardware: All
OS: All
Attachments: OSX Activity Monitor Sampling Call Tree
TOPAS Parameter Control File demonstrating the issue

Description perl 2020-01-15 19:39:34 CET
A TOPAS user brought me a case where having more than 1 worker thread results in gradual but eventually severe slowdown of processing, that is, over time he sees the CPU usage going from 100% to 50% or less.
I have been able to consistently reproduce this.
The workers eventually spend almost all their time waiting on some sort of lock.
Here are the details:

Latest TOPAS, so based on Geant4.10.5.p1.
User reports the issue is not present in the TOPAS that was based on Geant4.10.4.
It is something new since Geant4.10.5.

I am unable to run any quick tests in Geant4.10.6, as that would require other work to bring TOPAS up to 10.6.
However, I do not see anything in the 10.6 release notes that seems to involve this issue, so I suspect the issue is still present there.

The setup is very simple.
A proton beam in vacuum hitting a small (cm scale) cylinder of water.
No scoring, no output at all except for reporting event number to console every 10K events.
User initially saw this in clinical protons, in the 100 MeV range, then noted that he could make the problem occur in fewer events by increasing the energy to 1 GeV. So for my tests I've been using 1 GeV.

Physics list is the default modular physics list for TOPAS, which contains modules:
G4EMStandardPhysics_option4
G4HadronPhysicsQGSP_BIC_HP
G4DecayPhysics
G4IonBinaryCascadePhysics
G4HadronElasticPhysicsHP
G4StoppingPhysics

If I run a job with 7 worker threads on my Mac, then at the start, as expected, activity monitor shows the job uses 700%.
Then, over time, the CPUs start spending more and more of their time idle.
At 11 hours total CPU time, I'm seeing only 200% total CPU being used rather than the expected 700% even though there are tons more events still to be processed.

At first I thought maybe some workers were stuck, but from my console logs that report the event number every 10K events, I can see that all workers are still working at equal rates. It just seems like the workers are spending most of their time waiting rather than using CPU.

I did some tests and saw that there is no issue if I only run one worker thread
(note that all of my TOPAS work is using Geant4 in MT mode, even if I only have one worker thread).
But the slowdown gets worse the more events are run (a very linear degradation),
and the slowdown gets worse the more worker threads are used.

Testing with 1, 2 and 4 workers, what I see at around 125M events is:
Job with 1 worker: still running 100%
Job with 2 workers: only getting 185%. So lost 15%/2 threads = 7.5% per thread.
Job with 4 workers: only getting 290%. So lost 110%/4 threads = 27.5% per thread.

Also of note: while one would expect the total CPU time (summed over all threads) to be roughly the same for a 2 worker job and a 4 worker job, the 4 worker job actually took significantly more total CPU time to process the same total number of events.

The issue is resolved if the physics modules are reduced to only G4EMStandardPhysics_option0.
All workers then always stay at 100% CPU.

Similarly, the issue is resolved if the physics list is switched to FTFP_BERT
(this was an additional test requested of me by Makoto).

Issue is reproducible on both Linux and OSX.

I left one of the 4 worker tests running a long time until it got very badly affected, then did some sampling from Activity Monitor to see what the workers were doing.
Call Tree is attached.
If I'm reading this correctly, the majority of time is being spent in a mutex wait in the call chain:
G4BinaryCascade... G4PreCompoundModel... G4ExcitationHandler... G4IonTable

I don't see anything in the Geant4.10.6 release notes to indicate anything like this has already been seen and fixed.

I don't see any bug report that seems related.
Is this something you have seen?
I wonder if it has gone unnoticed by others because this case runs a much higher number of events than one usually does.


On Makoto's request, I ran some memory leakage tests:
Again running in 7 worker threads.
Looked at memory as shown in OSX Activity Monitor, so this is the total including all worker threads and master.
Checked at 10M events, 20M events, 30M, 40M and 50M.
Ignoring the first 10M events, the job that had the CPU slowdown issue leaks roughly 101 MB for each 10M events.
The leakage rate is very steady for each additional 10M events. That is, it is linear.

The EM-only and FTFP_BERT tests leak much smaller amounts, maybe 100 KB for each 10M events (and it is possible that this
leak is in the TOPAS layer, not Geant4).

I apologize for not yet having a simple Geant4 setup for this. I instead attach the TOPAS parameter file.

I also attach the Activity Monitor call graph, taken when the job had slowed to a crawl.
Comment 1 perl 2020-01-15 19:41:39 CET
Created attachment 599 [details]
OSX Activity Monitor Sampling Call Tree

This sample was taken after the job had run for a long time and the CPU slowdown had become very severe. It appears to show the workers spending most of their time in a mutex lock in G4IonTable, though I do not claim to be an expert at interpreting these call trees.
Comment 2 perl 2020-01-15 19:43:58 CET
Created attachment 600 [details]
TOPAS Parameter Control File demonstrating the issue
Comment 3 Gunter.Folger 2020-01-17 10:19:23 CET
From the call chain I read that the problem is either in de-excitation or in the IonTable, so I am updating the component to de-excitation.
Comment 4 perl 2020-01-23 00:29:52 CET
Good news. The problem seems to be fixed in Geant4.10.6.

Comparing Geant4.10.5.p1 versus 10.6,
a job with 7 worker threads, after running 100M events:
Geant4.10.5.p1: CPU use is down to 300% and memory use is 1.10 GB.
Geant4.10.6: CPU use remains solid at 697% and memory use is 145 MB.

So probably no need to address this issue unless there is a future need to create
a further Geant4.10.5 patch.
Comment 5 Gabriele Cosmo 2020-01-23 09:13:18 CET
Joseph, in releasing 10.6 we introduced several fixes for memory leaks here and there (more or less serious), which either originated from new development or were also present in previous releases, particularly in the physics. Most of these fixes are not straightforward to back-port to older releases, as they are rather entangled with the new developments.
I suggest you try to verify with your tests if the situation improves in 10.6, let us know and eventually switch TOPAS to the new release!
Comment 6 perl 2020-01-23 23:12:24 CET
(In reply to Gabriele Cosmo from comment #5)
> Joseph, in releasing 10.6 we introduced several fixes for memory leaks here
> and there (more or less serious), which either originated from new development
> or were also present in previous releases, particularly in the physics. Most
> of these fixes are not straightforward to back-port to older releases, as
> they are rather entangled with the new developments.
> I suggest you try to verify with your tests if the situation improves in
> 10.6, let us know and eventually switch TOPAS to the new release!

Hi Gabriele,

I think you missed my comment before yours.
The problem is indeed gone in 10.6.
And yes, as usual, we will be moving to the latest release.
The port has gone easily and we are just now starting our detailed tests.
Comment 7 Vladimir.Ivantchenko 2020-02-06 19:11:50 CET
Hi all,

I am closing this bug report. The difficulty for a 10.5 patch is that we do not know which exact fix resolved this problem. Probably the best recommendation for users who run high statistics is to migrate to 10.6. By the way, the physics of 10.5 and 10.6 are compatible.

Vladimir