Problem 1337 - composite_calorimeter (unmodified, fresh install) crashes due to a segmentation fault, when running 300 GeV Pi- particles. Active process Decay, SubType 201, originating from track creator process LambdaInelastic, SubType 121.
Status: RESOLVED FIXED
Alias: None
Product: Examples/Advanced
Classification: Unclassified
Component: composite_calorimeter
Version: 9.5
Hardware: Apple Mac OS X
Importance: P5 normal
Assignee: Alberto.Ribon
Reported: 2012-07-16 19:46 CEST by Dylan
Modified: 2015-09-08 09:08 CEST
Attachments
composite_calorimeter output for segmentation fault, geant4.9.5.p01, macosx10.6.8 (35.22 KB, text/rtf)
2012-07-19 00:33 CEST, Dylan
Description Dylan 2012-07-16 19:46:42 CEST
NB: I have tested this bug on five different system setups, including the latest Geant 4.9.5.p01 release. The setups are as follows:

1) Mac OSX 10.6.8 - Geant 4.9.3 - Full installation with ./configure
2) Mac OSX 10.6.8 - Geant 4.9.3.p02 - Full installation with ./configure
3) Mac OSX 10.6.8 - Geant 4.9.3.p02 - Pre-compiled binaries
4) Fermi Linux lts30 - Geant 4.9.4.p02 - Full installation with ./configure
5) Mac OSX 10.6.8 - Geant 4.9.5.p01 - Full installation with cmake system.


Dear Sirs/Madams,

I am working with the example code composite_calorimeter and am currently experiencing a segmentation fault during run time. The fault occurs with completely unmodified example code, as packaged on install, as well as my modified code, and occurs fastest when running 300 GeV Pions (pi-) into the calorimeter.
I am running the example with the visualization off (unset G4VIS_USE_OPENGLX), and using the following UI commands at run-time:

/gun/particle pi-
/gun/energy 300 GeV
/run/beamOn 10000

Depending on the install and system, the number of particles before the crash varies. It's best to use several thousand.
The fault tends to occur most quickly with 300 GeV pions (pi-) and denser absorber materials, such as Uranium or Tungsten. It still occurs with the default copper, but it takes longer (the absorber material can be changed towards the bottom of the file "tbhcal96hcal.geom" in the directory "$G4WORKDIR/composite_calorimeter/geom").

I have tested this with three different versions of Geant 4 on two separate operating systems, with completely unmodified composite_calorimeter code, as packaged with each release. The details are as follows:

I have tested using Geant 4.9.3, Geant 4.9.3.p02 and 4.9.5.p01 on Mac OSX 10.6.8. The CLHEP installed is the recommended version 2.0.4.5 and I have Open Scientist Batch version 16.11 installed with the AIDA interface included. The environment is set up as follows, in a bash shell:


export DYLD_LIBRARY_PATH=/Applications/CLHEP/lib
. /Applications/geant4.9.3.p02/env.sh
export ROOTSYS=/Applications/root
export DYLD_LIBRARY_PATH=$ROOTSYS/lib:$DYLD_LIBRARY_PATH
. /Applications/osc_batch/16.11/aida-setup.sh
. /Applications/root/bin/thisroot.sh

. envExample.sh
make clean
make                ## To compile the example ##


I have run several debugging sessions of my own since discovering the fault, and the crash always occurs when the active process from PostStepPoint is a Decay, SubType 201, and the creator process for the current track is LambdaInelastic, SubType 121. The following is my own debug output from the CCalSteppingAction class at the point of the crash:

"
Process Name from PostStepPoint is Decay
Process Type from PostStepPoint is 6
Process Type Name from PostStepPoint is Decay
Process Subtype from PostStepPoint is 201


Track creator process from 'aStep' is: LambdaInelastic
Process Type from Track creator process is 4
Process Type Name from Track creator process is Hadronic
Process Subtype from Track creator process is 121
"

These conditions occur a handful of times during the run, before the crash, so there must be another factor at play, causing the bug to occur.
The fault is caused by a problem involving the TSliceID variable in CCalSteppingAction::UserSteppingAction (located in CCalSteppingAction.cc).

A global time value is returned via PostStepPoint->GetGlobalTime(), and this value is converted to an integer, which is then used to index the 200-member timeDeposit[] array and add the appropriate energy to the corresponding member of the array.

However, I have found that the segmentation fault occurs because the returned global time suddenly takes an extremely large value (after a random number of steps, usually several thousand, which varies with initial conditions), for example 2.0177806e+12 ns or 2.525698e+11 ns, when it should be taking a value between 0 and 200 nanoseconds.

This somehow causes the integer conversion in the following line to assign a large negative integer to TSliceID. The value is always the same, regardless of the global time returned from PostStepPoint: TSliceID = -2147483648.

TSliceID = static_cast<int>( (PostStepPoint->GetGlobalTime() ) / nanosecond);

The segmentation fault occurs when CCalSteppingAction::UserSteppingAction tries to write to the -2147483648th member of the 200 member timeDeposit[] array, here:

timeDeposit[TSliceID] += aStep->GetTotalEnergyDeposit() / GeV;
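
As an illustration of the failure mode described above, here is a minimal sketch, independent of Geant4, of a range-checked conversion from a time in nanoseconds to a slice index. Converting a double that lies outside the range of int is undefined behaviour in C++, and on x86 hardware it commonly produces -2147483648, which is consistent with the value reported here. The names kNumSlices, timeSliceIndex and addDeposit are hypothetical and do not appear in the example code; only the 200-entry array size is taken from the description.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for the size of the example's timeDeposit[] array.
constexpr std::size_t kNumSlices = 200;

// Returns the slice index for a time given in nanoseconds,
// or -1 if the time does not fall inside [0, kNumSlices).
int timeSliceIndex(double timeNs) {
  // Check the range while the value is still a double, so the cast
  // below is never applied to a value that does not fit in an int.
  // The negated form also rejects NaN, for which both comparisons are false.
  if (!(timeNs >= 0.0 && timeNs < static_cast<double>(kNumSlices))) {
    return -1;
  }
  return static_cast<int>(timeNs);  // truncates towards zero
}

// Adds energyGeV to the slice matching timeNs.
// Returns false, leaving the array untouched, if the time is out of range.
bool addDeposit(std::array<double, kNumSlices>& timeDeposit,
                double timeNs, double energyGeV) {
  const int idx = timeSliceIndex(timeNs);
  if (idx < 0) {
    return false;
  }
  timeDeposit[static_cast<std::size_t>(idx)] += energyGeV;
  return true;
}
```

With this check, a time such as 2.0177806e+12 ns is rejected before the cast, so the array is never indexed out of bounds.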

I am posting here to see if anyone knows why this extremely large value is being returned by PostStepPoint->GetGlobalTime (rather than a value in the range 0 to 200 nanoseconds), or if anyone can suggest a possible cause or solution to this crash.

To confirm this problem was present with a fresh install, I completely removed Geant 4 from OSX and reinstalled it, as per the instructions in the installation guide, to version 4.9.3.p02. The bug occurred again, unchanged, with completely unmodified example code. I then completely removed 4.9.3.p02 and installed 4.9.5.p01 from scratch with the new cmake system.

The problem is also occurring independently on a Linux system (Fermi Linux lts30 INSTALL for FermiGenericDesktopOffsite), with an install of Geant version 4.9.4.p02. The bug and debug data occur exactly the same with this install. The CLHEP version installed here is 2.1.0.1, and version 16.11.5 of OpenScientist batch is installed.

NOTES ON CRASH WITH 4.9.5.p01 INSTALL:

When I run the example on 4.9.5.p01, I am getting a curious message output by the built-in G4 debug statements. Around 40 or 50 lines per particle, as follows. They were not there when I ran the program on 4.9.3 and 4.9.4, so I'm guessing they're due to a new feature, or new debug statements in the 4.9.5 release:

"
G4QCaptureAtRest: wrong product12 imax= 13 Pa226[0.0] 4-mom= (-34.4136,-110.082,31.561;210498)
G4QCaptureAtRest: wrong product16 imax= 17 Pa222[0.0] 4-mom= (-82.737,-166.913,72.433;206768)
G4QCaptureAtRest: wrong product17 imax= 18 Pa221[0.0] 4-mom= (113.399,-185.094,150.325;205835)
G4QCaptureAtRest: wrong product7 imax= 8 Th231[0.0] 4-mom= (-72.03,-103.689,-86.2623;215164)
... (And so on)
"

I don't know whether this output is significant, but I thought it prudent to include it here.

If this crash is resulting from a problem with my install, environment, supporting software, or a build error, I would ask if anyone has any suggestions for a possible solution.

Or indeed, if it turns out to be a bug in the example code, any help would be appreciated.

Best regards,

Dylan
Comment 1 Andrea Dotti 2012-07-17 13:40:18 CEST
Dear Dylan,
To reproduce the error I need some additional information.
Can you please add the output of the application here? In particular, we need the banner at the very beginning of the application, and we need to know which physics list has been used (if you changed the default).
What is the content of the string starting with:
<<< Geant4 Physics List simulation ....

Thank you,
Andrea

Comment 2 Dylan 2012-07-19 00:33:52 CEST
Created attachment 178
composite_calorimeter output for segmentation fault, geant4.9.5.p01, macosx10.6.8
Comment 3 Dylan 2012-07-19 00:34:48 CEST
Hi Andrea,

The string you mentioned reads:

<<< Geant4 Physics List simulation engine: QGSP_BIC_EMY 1.1

I have not changed this from the default. I have also attached the full output of the crash, as occurs when compiling and running the unmodified example code on Mac OSX 10.6.8, with Geant 4.9.5.p01 installed.

The lines which read as follows appear to be related to filling the tuple, though they do not seem to have affected the histo output when I've printed or saved it.

"BatchLab::MemoryTuple::fill(float) : column -1 not found."

The Open Scientist installation, with the AIDA interface included, is "osc_batch-16.11-Darwin-x86_64-gcc_421.zip", the binaries from this page:

http://openscientist.lal.in2p3.fr/download/16.11/

Dylan
Comment 4 Alberto.Ribon 2012-08-07 16:01:43 CEST
Dear Dylan,

I have looked into the problem you reported, and I think there is nothing wrong with Geant4, but there are some protections that need to be added in the composite_calorimeter application.

In fact, neutrons can scatter in a calorimeter for a long time (up to several minutes), so it is physically meaningful to get some very large times, given that they are expressed in nanoseconds. Such large times can be associated with any particle produced by an inelastic neutron interaction, where the neutron has scattered many times in the detector before undergoing that interaction. So, times up to about 10^12 nanoseconds, corresponding to about 1000 seconds, are fine.

However, the composite_calorimeter application assumes implicitly that such long times do not occur, which is wrong, as you have demonstrated.

I have added a few, simple protections and these will be available in the next Geant4 version (G4 9.6, to be released in December).

If you have access to the Geant4 SVN repository, the fix is available as the tag ccal-V09-05-00 (in examples/advanced/composite_calorimeter).

If you don't have access to the Geant4 SVN repository, here is the list of changes that you can apply on top of your current version:

1. method: CCalSteppingAction::UserSteppingAction 

if ( PostStepPoint->GetGlobalTime() / nanosecond > 1.0E9 ) TSliceID = 999999999;
else TSliceID = static_cast<int>( PostStepPoint->GetGlobalTime() / nanosecond );

2. method: CCaloSD::getStepInfo

TSlice = PostStepPoint->GetGlobalTime() / nanosecond;
if ( TSlice > 1.0E9 ) TSliceID = 999999999;
else                  TSliceID = (int) TSlice;

3. method: CCalHit::getTimeSliceID

if ( theTimeSlice > 1.0E9 ) return 999999999;
return (int)theTimeSlice;


(I have chosen "999,999,999" because it is a big number which can still be represented in a 32-bit signed integer).
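
The three changes listed above apply the same clamp, so they can be sketched as one helper. This is only an illustration of the logic, not the code in the ccal-V09-05-00 tag; the names clampTimeSliceID, kMaxTimeNs and kOverflowSliceID are hypothetical.

```cpp
#include <limits>

// Hypothetical names; the threshold and sentinel values are those quoted above.
constexpr double kMaxTimeNs = 1.0E9;         // times above this are clamped
constexpr int kOverflowSliceID = 999999999;  // sentinel for "very late" steps

// The sentinel must be representable in int for the clamp to be meaningful.
static_assert(kOverflowSliceID <= std::numeric_limits<int>::max(),
              "sentinel does not fit in int");

// Converts a time in nanoseconds to a slice ID, returning the sentinel
// for times too large to be converted to int safely.
int clampTimeSliceID(double timeNs) {
  if (timeNs > kMaxTimeNs) {
    return kOverflowSliceID;
  }
  return static_cast<int>(timeNs);  // truncates towards zero
}
```

The sentinel only keeps the conversion itself well defined. Any fixed-size array indexed with the result, such as the 200-member timeDeposit[], would presumably still need its own bounds check.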

Thanks for reporting the problem!

Regards,
         Alberto