Causes of Segmentation Faults when running RDG Data

Problems of this type tend to be caused by one of two scenarios:

A) One part of the interview is trying to do something impossible. Typically there is an error in the program that was NOT caught by the compiler, and Survent blows up when it tries to execute that part of the code. An example could be a large grid in the spec.

B) There is a slow core leak. Some function that runs during the interview causes a small piece of core to be “lost” each time it is executed. Eventually enough core is lost that almost anything Survent does can lead to a core dump. This type is typically harder to find because the problem does not have to be at the point where Survent actually dies.

Guidelines to help narrow down where the actual issue might be

1) Does the RDG run for quite a while and then blow up, or does it only complete a few cases? The more cases it runs, the more likely the problem is type B.

2) Run RDG with a seed (12345 is what we often use). This allows you to produce the same results over and over again. That way, if it blows up on the 29th case, it should always blow up on the 29th case.

3) Run Survent with more core. If it is a type B problem, then when you run with more core, it will either happen later (even with the same seed) or it may not happen at all. If it is a type A problem, this very likely will not make any difference.

For example:

From the command line, type

    survent con con core:5000000

to run with 5 MB of core.

4) Put logging=3 in the study header. This causes Survent to write and close the interviewer log file after every question. In general, you don’t want to run this way because it adds a lot of overhead on the system. In this case, however, the log file should at least show the question right before the one it is blowing up on. If this is a type A problem, the log will likely pinpoint it pretty closely. For instance, if the last thing you see is the set of questions before a large grid, it is very likely that the grid is the issue.
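The reasoning in guidelines 1 through 3 can be sketched as a small helper. This is only an illustration, not part of Survent or RDG: the function names build_survent_command and classify_crash are hypothetical, and the only command syntax used is the "survent con con core:N" form shown above. It assumes you ran RDG twice with the same seed, once with the default core and once with more core, and noted the case number where each run died.

```python
from typing import Optional


def build_survent_command(core_bytes: int) -> str:
    """Return the command line from guideline 3 for a given core size in bytes."""
    return f"survent con con core:{core_bytes}"


def classify_crash(crash_case_default: int,
                   crash_case_more_core: Optional[int]) -> str:
    """Guess the problem type from two RDG runs that used the same seed.

    crash_case_default:   case number where the run with default core died.
    crash_case_more_core: case number where the run with more core died,
                          or None if that run completed without dying.
    """
    if crash_case_more_core is None:
        # More core made the problem go away: consistent with a core leak.
        return "B"
    if crash_case_more_core > crash_case_default:
        # More core only delayed the failure: also consistent with a leak.
        return "B"
    if crash_case_more_core == crash_case_default:
        # Extra core made no difference: likely a specific bad spot in the spec.
        return "A"
    # Dying earlier with more core fits neither pattern described above.
    return "unclear"


if __name__ == "__main__":
    print(build_survent_command(5_000_000))
    print(classify_crash(29, 29))
    print(classify_crash(29, None))
```

A result of "A" suggests going on to the logging=3 step to find the exact question. A result of "B" suggests the location of the crash may not be where the real problem is.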

If a large grid looks like it is the issue, then there are a couple of things to think about.

First, if the grid actually consists of a lot of questions with conditionals on them, is it feasible that that many would be true on a real survey? For instance, you might have a brand list where most people would choose 1 or 2 brands and even the rare outlier might choose 10, but you might be allowing for 100 answers, and that is how many RDG is sometimes going to pick. If this is the case, you might try reducing the maximum to a more realistic value.

However, if the grid is not mostly made up of conditional questions, or it is realistic for someone to actually satisfy most or all of the logic, then you need to run Survent with more core.

If the extra logging and the extra core do not help, you will likely want to package everything up (qpx, raw sample file, and ASCII phone header) and send it to us. We will need to know the exact compile number you are on (shown at the top of any Mentor list file) and the seed number you gave RDG, if you used one. Hopefully we will be able to reproduce the issue here. If so, we can run with some extra diagnostics set that can help us locate the core leak.
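One possible way to gather the items listed above into a single file is sketched below. It is not an official tool: the function name package_for_support, the NOTES.txt file, and the file names in the usage example are all hypothetical placeholders, so substitute the actual paths for your study.

```python
import zipfile
from pathlib import Path
from typing import Iterable, Optional, Union

PathLike = Union[str, Path]


def package_for_support(files: Iterable[PathLike],
                        compile_number: str,
                        seed: Optional[int],
                        out_path: PathLike) -> Path:
    """Zip the study files together with a short note for support.

    files:          the qpx, raw sample file, and ASCII phone header.
    compile_number: the compile number from the top of a Mentor list file.
    seed:           the seed given to RDG, or None if no seed was used.
    """
    out = Path(out_path)
    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as bundle:
        for name in files:
            path = Path(name)
            # Store each file under its base name only.
            bundle.write(path, arcname=path.name)
        seed_text = "not used" if seed is None else str(seed)
        notes = f"compile number: {compile_number}\nRDG seed: {seed_text}\n"
        bundle.writestr("NOTES.txt", notes)
    return out


if __name__ == "__main__":
    # Hypothetical file names and compile number, for illustration only.
    package_for_support(
        ["mystudy.qpx", "mystudy_sample.raw", "mystudy_phone_header.txt"],
        compile_number="(from your list file)",
        seed=12345,
        out_path="mystudy_support.zip",
    )
```

Recording the compile number and seed inside the bundle keeps them with the files, so the run can be repeated with the same inputs.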