qd Explanation
This page is an explanation of the output of the qd program.
It discusses what information is available in the queue.dat file, where
qd gets additional information, and special notes about each field of
the qd printout.
Here is a sample of the output.
For brevity, the data for indices 9, 0, 2, 3, 4, and 5 have been omitted:
qd released 20 May 2005 (fr 030); qd info 20 May 2005 (rph)
Queue version 5.01
Current index: 7
Index 8: finished 49.00 pts (0.353 pt/hr) 2.94 X min speed
server: 171.64.122.117:8080; project: 724, "p724_Abeta21-43-amberGS"
Folding: run 1, clone 19, generation 4; benchmark 797; misc: 500, 300
issue: Mon Aug 9 23:36:42 2004; begin: Mon Aug 9 23:41:10 2004
end: Sun Aug 15 18:19:58 2004; due: Thu Aug 26 23:41:10 2004 (17 days)
core URL: http://www.stanford.edu/~pande/Linux/x86/Core_78.fah (V1.68)
CPU: 1,0 x86; OS: 4,0 Linux
assignment info (le): Mon Aug 9 23:36:38 2004; AC1213B0
CS: 171.67.89.100; upload failures: 1; P limit: 5241856
user: rph_iv; team: 0; ID: 3539220A7A46065D; mach ID: 1
work/wudata_08.dat file size: 1290111; WU type: Folding@Home
...
Index 1: finished 6.07 pts (0.212 pt/hr)
server: 171.64.122.125:80; project: 799, "SH3ligGH2/pdb1gbq.1.spa"
Genome: unit 36476, 7554, 656; benchmark 739; misc: 500, 200
issue: Fri Aug 27 13:06:35 2004; begin: Fri Aug 27 13:06:47 2004
end: Sat Aug 28 17:45:24 2004
core URL: http://www.stanford.edu/~pande/Linux/x86/Core_ca.fah
CPU: 1,0 x86; OS: 4,0 Linux
user: rph_iv; team: 0; ID: 3539220A7A46065D; mach ID: 1
work/wudata_01.dat file size: 63721
...
Index 6: finished 30.10 pts (0.273 pt/hr) 3.04 X min speed
server: 171.64.122.119:80; project: 682, "p682_TZ2_NAT_VISC0MD"
Folding: run 14, clone 48, generation 7; benchmark 768; misc: 500, 200
issue: Sat Sep 18 01:47:03 2004; begin: Sat Sep 18 01:47:11 2004
end: Wed Sep 22 16:14:20 2004; expire: Sat Oct 2 01:47:11 2004 (14 days)
core URL: http://www.stanford.edu/~pande/Linux/x86/Core_65.fah (V2.53)
CPU: 1,0 x86; OS: 4,0 Linux
assignment info (le): Sat Sep 18 01:47:02 2004; AC125A56
DL: Sat Sep 18 01:47:09 2004; upload failures: 1; P limit: 5241856
user: rph_iv; team: 0; ID: 3539220A7A46065D; mach ID: 1
work/wudata_06.dat file size: 100545; WU type: Folding@Home
Index 7: folding now 51.40 pts (0.341 pt/hr) 5.73 X min speed; 11% complete
server: 171.64.122.111:8080; project: 334, "p334_unf_305"
Folding: run 1, clone 88, generation 13; benchmark 770; misc: 500, 300
issue: Wed Sep 22 16:16:20 2004; begin: Wed Sep 22 16:16:56 2004
expect: Tue Sep 28 22:59:03 2004; due: Thu Oct 28 15:16:56 2004 (36 days)
core URL: http://www.stanford.edu/~pande/Linux/x86/Core_78.fah (V1.68)
CPU: 1,0 x86; OS: 4,0 Linux
user: rph_iv; team: 0; ID: 3539220A7A46065D; mach ID: 1
work/wudata_07.dat file size: 197433; WU type: Folding@Home
Average download rate 5.267 KB/s (u=4); upload rate 3.403 KB/s (u=4)
Performance fraction 0.819036 (u=4)
Average pph: 0.329, ppd: 7.89, ppw: 55.2, ppy: 2663
An explanation of each field follows, with the data for that field
shown again in front of its explanation.
Most of the fields are taken from the index 8 entry, with references to
the other indices when appropriate.
Not all data is present for every work unit; when a piece of data is
unavailable, the field that would present it is usually absent from the
printout.
There are also some fields which may not seem to be useful.
They are printed simply because they are present in queue.dat, and qd
prints everything there that can be interpreted.
As qd prints information for each entry, it will try to find
additional information about that unit by looking at other files.
It will look at the core log file work/logfile_**.txt for the core
version number and the project name, and, if it's the current unit, for
its progress.
If it can't find the name of the current unit that way, it will look in
work/current.xyz.
If it can't find the progress, it will look in work/wuinfo_**.dat.
And if it still can't find the name, or it doesn't have information
about the unit's expiration time, it will look for those things in
emprotz.dat or emprotx.dat.
If any of these files can't be opened or read successfully, qd won't
consider it a serious error.
It simply won't have the data it might otherwise be able to print.
If qd is run with the -p flag, it won't look at these other files.
Similarly, the -q flag, which implies qd is looking at an isolated
queue.dat file outside the context of a working folding directory, will
suppress looking at these other files too.
The -c flag produces a more compact format, in which about half the
lines of each queue entry are omitted.
This format might be preferable when using qd simply for looking up
work unit history.
There is also a one-line format produced by the -l or -L flags, which
is suitable for saving in a sorted log file, possibly combining the
folding history from many clients.
There are eight types of date fields which can be printed for work
units.
These dates are normally printed in local time, according to the way
the operating system is configured.
However, if the -t or -z flags are used, they will be printed in the
specified time zone.
When output from qd is used for reporting problems, it will be most
useful if the -z flag is used so the times are printed in UTC.
After the output field descriptions, there follows a shorter section
listing all the status codes qd and the client use for queue entries.
Output fields
qd released 20 May 2005 (fr 030)
The program is released as necessary to keep up with changes and new
features.
Often that is several times a week, as new project numbers are
activated and their scores have to be added to qd's tables.
If there is a significant change, or a bug fix, a "functional revision"
level built into the program is also increased.
The message above shows the program's functional revision level at
"030".
The latest functional revision level available is given in the
qdinfo.dat file as an "fr" field.
Users who update only qdinfo.dat to stay current will be notified with
a warning message here if qd notices an "fr" level greater than its own
functional level.
qd info 20 May 2005 (rph)
If qdinfo.dat is available, it will be read to update the built-in
point table.
The text of a "da" entry will be printed here.
The official distributed qdinfo.dat has the tag "(rph)" at the end.
People maintaining their own versions of the file, for whatever reason,
should use a different tag.
The date the table information was compiled is recorded in qdinfo.dat
as a hex value in a "pg" entry.
If such a value is present and it is earlier than the date of the
built-in tables in qd, then a warning message will be printed here, and
the table entries in qdinfo.dat will not be allowed to overwrite the
built-in values.
Queue version 5.01
The client stores its own queue revision level in queue.dat any time it
creates a new queue.dat or updates an old one to a newer format.
The format does not change every time a new client comes out, so the
version number shown here may be perfectly up to date but still not
show the latest client revision level.
That would not be an error.
If the queue.dat file being interpreted is not from the same type of
system that qd is being run on (strictly speaking, a system different
enough that the file requires format conversion), the system type will
also be shown on this line.
It will be given in parentheses, as "(Linux)", "(Windows)", or "(Mac)".
Current index: 7
When a work unit is downloaded, it is assigned to one of the ten queue
entries.
The data in queue.dat at that index is used by the client to keep track
of that unit's status.
It is especially needed if the client is stopped and restarted, or if
finished work doesn't upload on the first attempt.
This index number is also used in file names in the work subdirectory.
In this example, since the current index is 7, the work file names have
the string "_07" in them.
The indices are assigned sequentially, starting at 1 and cycling back
to 0 after 9.
If work is still active at the next sequential index (most likely
because it has had a lot of trouble uploading), that index will be
skipped over.
An active index won't be reused unless there are no inactive ones left
at all.
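The index-selection rule just described can be sketched as follows.
This is a simplified model of the behavior, not the client's actual
code; `active` is a hypothetical set of indices still holding live
work:

```python
def next_index(current, active):
    """Pick the queue index the client would assign to the next unit.

    Indices cycle 1..9 and then 0; an index still holding active work
    is skipped, and reused only if every index is active.
    """
    candidates = [(current + 1 + k) % 10 for k in range(10)]
    for idx in candidates:
        if idx not in active:
            return idx
    return candidates[0]  # everything active: fall back to sequential reuse

# With the current index at 7 and index 8 still stuck uploading,
# the next unit lands at index 9.
print(next_index(7, {8}))  # 9
```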
Index 8:
When qd formats its output, it starts with the oldest entry and
proceeds forward until the last printed data is for the current index.
finished
The printed status is an interpretation of an integer status code for
each entry, sometimes influenced by other data fields in that entry.
The names given to these status codes by qd are a little different from
the names printed by the client with its -queueinfo flag.
If qd is started with the -h flag, it will print out an explanation of
its status code names.
See below for a summary.
49.00 pts
This is the point value for the unit, as nearly as qd can determine it.
The value isn't stored anywhere at all by the Folding@home client, so
qd looks it up in its internal point table, after augmentation by data
in the info file qdinfo.dat or even from emprotz.dat or emprotx.dat.
The valuation of Genome units is calculated from the protein chain
length, according to the published formula, suitably scaled depending
on whether the team number recorded with this unit is for a Folding or
Genome team.
The score will be unavailable for Genome units if work/logfile_**.txt
isn't present for that index, since the chain length can only be found
in the log, which is often deleted automatically when the unit uploads.
Since the Genome project has now ended, it is likely that very few
Genome scores will be calculable.
If the table lists several values for units of the relevant project, qd
will select the value in effect, as far as it knows, when the unit was
issued.
The table is not authoritative, so if the value was changed at a time
close to when the unit was issued, the score shown here might not
correspond to what the work server actually awards.
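The "value in effect when the unit was issued" lookup can be
illustrated like this. The real table format used by qd and qdinfo.dat
is different; `points_history` here is a hypothetical list of
(effective-from, points) pairs sorted by date:

```python
from datetime import datetime

def value_at(points_history, issue_time):
    """Return the most recent point value whose effective date is not
    after the unit's issue time, or None if none applies."""
    value = None
    for effective_from, points in points_history:
        if effective_from <= issue_time:
            value = points
    return value

# Hypothetical history: the project was revalued on 1 Sep 2004.
history = [(datetime(2004, 1, 1), 52.0), (datetime(2004, 9, 1), 49.0)]
print(value_at(history, datetime(2004, 8, 9)))   # 52.0, issued before the change
print(value_at(history, datetime(2004, 10, 1)))  # 49.0, issued after it
```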
(0.353 pt/hr)
If points and progress can both be determined for this work unit, then the
score rate can be calculated from the beginning time and ending time, or
for the current unit, the modification time of the core log file.
In this sense, completed units are taken to have made 100% progress.
Points are determined as described in the paragraph above.
Progress is considered valid only if the total elapsed time is between fifteen
minutes and fifty days, or for the current unit, at least thirty seconds.
See the note below explaining how the fraction complete, and thus also the
score rate, may be inaccurate before the unit is 2% done.
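Using the index 8 data from the sample, the score rate works out as
described: points divided by elapsed hours, with completed units
counting as 100% done (a reconstruction of the calculation from the
text, not qd's exact code):

```python
from datetime import datetime

def points_per_hour(points, begin, end, fraction_done=1.0):
    """Score rate: points earned so far per elapsed hour. For the
    current unit, `end` would be the modification time of the core log
    file and `fraction_done` its progress."""
    hours = (end - begin).total_seconds() / 3600.0
    return points * fraction_done / hours

begin = datetime(2004, 8, 9, 23, 41, 10)
end = datetime(2004, 8, 15, 18, 19, 58)
print(round(points_per_hour(49.00, begin, end), 3))  # 0.353, as in the sample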
2.94 X min speed
If progress and expiration time can be determined for this work unit, then
the rate of progress can be calculated as a factor times the minimum speed
necessary to barely make the deadline.
If this number is less than 1.00, it means that if the unit continues to
process at the effective rate it has gone since it was downloaded, it
won't finish in time.
If the machine has been running FAH exclusively 24/7, then it is hopelessly
too slow.
The current unit might as well be deleted, and the client should be
reconfigured to process only deadlineless units.
Before the V5.00 client, that was done by selecting a preference for Genome
units.
See the note below explaining how the fraction complete, and thus also the
min speed factor, may be inaccurate before the unit is 2% done.
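The factor can be reconstructed from the sample's index 8 entry: the
effective rate of progress divided by the minimum rate needed to just
reach the deadline (again a reading of the description, not qd's exact
code):

```python
from datetime import datetime

def min_speed_factor(fraction_done, begin, end, due):
    """How many times faster than barely-making-the-deadline the unit
    is progressing. Below 1.00 means it won't finish in time at the
    effective rate it has shown so far."""
    elapsed = (end - begin).total_seconds()
    allowed = (due - begin).total_seconds()
    return fraction_done * allowed / elapsed

begin = datetime(2004, 8, 9, 23, 41, 10)
end = datetime(2004, 8, 15, 18, 19, 58)
due = datetime(2004, 8, 26, 23, 41, 10)
print(round(min_speed_factor(1.0, begin, end, due), 2))  # 2.94, as in the sample
```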
11% complete
[see index 7]
The current unit will show the fraction complete, if it can be determined.
NOTE that with regard to these last three items, until the unit is several
percent finished, the displayed values may not be very accurate.
Especially with Gromacs units, where there are checkpoints and log entries
made between frames, qd can't establish the
calculation rate until the unit is at least 2% done.
This will cause the percent complete to show only 0% or 1%, and the point/hour
rate and min speed numbers to be significantly lower than their proper
values.
Further, with Genome units, there is quite a bit of time spent in
initialization as the rotamer library is calculated.
At the beginning of the unit, this time all appears to be taken by the first
tenth of the first sequence, which can make the rate of progress initially
appear to be as low as one half of its ultimate value.
This also affects, of course, the expected completion time for the unit.
server: 171.64.122.117:8080
This is the IP and port address of the data server from which the work unit
was downloaded.
It is also the address to which the unit must eventually be uploaded if it
is still active.
project: 724, "p724_Abeta21-43-amberGS"
This is the number and name of the project which the work unit is part
of.
The project name is not stored in queue.dat, and old core log files
often get deleted by the client, so it is common for the project name
not to be known except for the current unit.
If the project name can be found only in emprotx.dat, then it will be
printed here, but in single quotation marks.
In this case the name is, strictly speaking, only a guess: it is simply
the first name qd found in emprotx.dat which starts with "P" or "p"
followed by a string of digits matching the project number.
On rare occasions the project names change, and there is no guarantee
that the name selected here is the name of the actual finished unit.
This issue will soon be moot, since emprotx.dat is being replaced by
emprotz.dat, which doesn't contain the name string at all.
Folding: run 1, clone 19, generation 4
If the unit is not using Core_ca (the Genome core), these numbers,
along with the project number, uniquely identify the work unit.
If it is necessary to notify Stanford of the work unit ID, those four
numbers (project, run, clone, generation) should always be given.
Genome: unit 36476, 7554, 656
[see index 1]
If the unit is using Core_ca (the Genome core), these numbers are
printed, because they are stored in queue.dat, but as far as we know
they cannot be used to uniquely identify the work unit.
The sequence name and initial seed value are needed to identify a
Genome unit, and they aren't stored in queue.dat.
benchmark 797
When the client starts, it calculates this benchmark number.
It is a measure of floating point processing power, but it is also strongly
affected by any other programs which happen to be running on the machine at
the time.
It's normal for the benchmark number to vary over a wide range.
misc: 500, 300
These numbers are present in queue.dat, but it isn't known for sure
what they are.
Most likely they are related to the revision levels of the client which
are suitable for this work unit.
issue: Mon Aug 9 23:36:42 2004
This time stamp is placed on the work unit by the work server when it assigns
the unit.
It is an epoch 1970 value, based on the server's clock.
The servers lately have been accurately synchronized with NTP.
If a unit, for any reason, gets lost so it isn't either completely finished
or ready for upload when the next unit is requested, the server will often
notice the previous assignment and send the same unit again, with the
original issue time stamp.
begin: Mon Aug 9 23:41:10 2004
This time is recorded by the client when it finishes downloading a new unit
to process.
It is an epoch 2000 value, based on the local machine's clock.
If the clock is set accurately and the work unit is not a duplicate, the
"begin" time will be only a few seconds after the "issue" time, the
difference being simply the time it took to receive the unit over the Internet.
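The two epochs can be handled like this, assuming (as is conventional)
that the epoch-2000 origin is 2000-01-01 00:00:00; qd performs this
conversion internally:

```python
from datetime import datetime, timedelta, timezone

EPOCH_2000 = datetime(2000, 1, 1, tzinfo=timezone.utc)

def from_epoch_2000(seconds):
    """Decode a raw epoch-2000 stamp ("begin", "end", "due") to UTC."""
    return EPOCH_2000 + timedelta(seconds=seconds)

def from_epoch_1970(seconds):
    """Decode an "issue" stamp, which is ordinary Unix time."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

# The two epochs differ by 946,684,800 seconds (30 years including
# 7 leap days), so raw values from the two schemes are never comparable
# without conversion.
print(from_epoch_2000(0) - from_epoch_1970(0))
```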
end: Sun Aug 15 18:19:58 2004
This time is recorded by the client when the core returns with an indication
that the processing has finished and the results file should be uploaded to
the server.
It is an epoch 2000 value, based on the local machine's clock, just like
the "begin" time.
expect: Tue Sep 28 22:59:03 2004
[see index 7]
This time, printed only for the current unit, is not stored in
queue.dat, but instead is a projection made by qd if the progress can
be determined.
This date shows when the unit will finish if processing continues at
exactly the speed it has run so far since the unit began.
See the note above under "fraction complete" explaining how the rate of
calculation may not be accurately determined before the unit is 2%
done.
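The projection is straightforward extrapolation. This is a sketch of
the description above; `now` stands in for the modification time of the
core log file:

```python
from datetime import datetime, timedelta

def expected_finish(begin, now, fraction_done):
    """When the unit will finish if it keeps the average speed it has
    shown since it began."""
    return begin + (now - begin) / fraction_done

# Hypothetical numbers: 17 hours in at 11% done projects a finish
# roughly 6.4 days after the unit began.
begin = datetime(2004, 9, 22, 16, 16, 56)
print(expected_finish(begin, begin + timedelta(hours=17), 0.11))
```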
due: Thu Aug 26 23:41:10 2004 (17 days)
This time is calculated by the client when it downloads a unit.
It is determined by adding the expiration period to the "begin" time,
if this expiration data is sent as part of the work unit.
The interval, which qd shows in parentheses after the date, is rounded
to the nearest day, or to the nearest hour if it is less than two days.
Genome units had no deadlines, nor do a few Folding units designated for
machines which are either slow or only infrequently connected to the
Internet.
All other Folding units do, but not all work units have the expiration period
fields filled in.
The stored time is an epoch 2000 value, based on the local machine's clock.
There are actually two sorts of deadline associated with a work unit.
The one reported here, which is the only one known to the client, is what the
Stanford project summary web page calls the "Final Deadline", after which time
the client may delete the unit and no credit will be given even if it is
completed and returned. The other deadline, shown on the project summary as
"Preferred", indicates when the unit will be considered late enough that
the server might send another copy of it to someone else, which means that
a unit returned after that time might have little scientific value.
NOTE that if the machine clock runs, or is set, forward to a time past the
"due" date, the client might think the unit is past due, and delete it.
expire: Sat Oct 2 01:47:11 2004 (14 days)
[see index 6]
This time is equivalent to the "due" date, but it is reported as
"expire" when the client has not calculated it itself.
Normally this means the work unit didn't contain any expiration data,
so the client couldn't calculate it, but qd was able to find out how
much time is allowed by reading the emprotx.dat or emprotz.dat file.
This distinction might be important, since often the deadline stored in
the EM III data file is the shorter "Preferred" one, but in any case
the unit will never be in danger of deletion, since the client doesn't
know any deadline at all.
core URL: http://www.stanford.edu/~pande/Linux/x86/Core_78.fah (V1.68)
This full URL is actually constructed by qd (and the client) from the
directory URL and the core number, both of which are stored in
queue.dat.
It can be used to verify that the server knows the correct client OS
type, and also to determine which core is required for this work unit.
At present, Core_65 is Tinker (Folding), Core_78 is Gromacs (Folding),
Core_79 is double-precision SSE2 Gromacs (Folding), Core_82 is PMD
Amber (Folding), Core_96 is QMD (Folding), and Core_ca used to be SPG
(Genome).
The core version which was most recently run on this unit is shown in
parentheses after the URL, if it can be determined.
It is parsed by qd from the core log file, which is deleted if the unit
uploads on the very first try, so on some systems which have permanent
connections to the Internet, the core version will only rarely be shown
at indices other than the current one.
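The URL construction can be sketched like this. The assumption here is
that the core number is stored as an integer and rendered as two
lowercase hexadecimal digits, which matches the sample's Core_78 and
Core_ca; qd's exact formatting may differ:

```python
def core_url(directory_url, core_number):
    """Rebuild the full core URL from the two pieces stored in
    queue.dat: the directory URL and the core number."""
    return f"{directory_url}Core_{core_number:02x}.fah"

print(core_url("http://www.stanford.edu/~pande/Linux/x86/", 0x78))
# http://www.stanford.edu/~pande/Linux/x86/Core_78.fah
```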
CPU: 1,0 x86; OS: 4,0 Linux
The CPU and OS types are detected by the client and sent to the servers
when requesting assignments.
They can influence what sort of work will be assigned, and which core
URL is given to the client.
Each of these types is reported as a pair of numbers, the first of
which is a basic type defined by Cosm, and the second of which is a
subtype determined by the client.
The most detail is available for Windows systems running on Intel CPUs.
The numbers are sometimes stored in little-endian order in queue.dat,
and sometimes in big-endian order, with no clearly consistent pattern.
If qd recognizes the pair of numbers, it prints a string which
interprets them.
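Because the byte order is not consistent, a reader of the file has to
try both interpretations of each raw word; how qd actually
disambiguates them is internal to qd, but the decoding itself looks
like this:

```python
import struct

def read_u32_both_orders(raw):
    """Decode 4 raw bytes as both a little-endian and a big-endian
    unsigned 32-bit integer; the caller then decides which reading
    makes sense."""
    (le,) = struct.unpack("<I", raw)
    (be,) = struct.unpack(">I", raw)
    return le, be

print(read_u32_both_orders(b"\x01\x00\x00\x00"))  # (1, 16777216)
```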
assignment info (le): Mon Aug 9 23:36:38 2004; AC1213B0
This time stamp and data word are sent to the client by the assignment server
along with the IP address of the work server and other information.
The data word is a checksum derived from the IP address and the time
stamp.
The client sends all three words to the work server as validation of the
assignment.
These values can be stored in
queue.dat in either
little-endian or big-endian byte order.
It is unknown why it can be either way, but the notation in parentheses,
either "(le)" or "(be)" indicates how it actually is stored in this entry.
CS: 171.67.89.100
This is the collection server IP address, which is not always present.
If it is given, it identifies an alternate server to which the results can
be returned if there are too many errors attempting to upload to the actual
work server.
If work is returned to the collection server, credit for it will occasionally
be delayed in the Stanford statistics, but this is better than having the
work deleted by the client if its deadline expires before the server can be
reached successfully.
upload failures: 1
This field is never printed for the current unit, and for anything
other than a unit pending upload it is printed only if the number of
failures is nonzero.
The client uses this field to decide when to try sending results to the
collection server instead of to the actual work server.
P limit: 5241856
This is a packet size limit, used to determine how much memory should be
allocated when returning results.
Its default size seems to be about five megabytes, but if "large units" are
selected, the client will set the value to ten times that size.
If the actual results file is bigger than the packet limit, the client will
be unable to return it to the server.
DL: Sat Sep 18 01:47:09 2004
[see index 6]
This field is usually absent if a collection server address is given for this
unit.
It seems to be a date roughly corresponding to the end of the download, from
the work server's perspective.
It is an epoch 2000 value, most likely based on the server's clock.
user: rph_iv
This is the UserName as configured when the unit started.
The UserName actually sent to the server when the unit is uploaded is what
is current at that time.
team: 0
This is the TeamNumber as configured when the unit started.
The TeamNumber actually sent to the server when the unit is uploaded is
what is current at that time.
ID: 3539220A7A46065D
This ID is used by the server to identify the user's machine, and to validate
the work unit when it is returned.
The number stored in queue.dat, and printed here, is really the sum of
the assigned UserID and the configured MachineID, as it was when the
unit started.
The ID actually sent to the server when the unit is uploaded is what is
current at that time on the machine which does the upload.
Ideally, it should match the ID to which the unit is assigned, to within the
allowed configurable range of the MachineID, or the uploaded unit might be
rejected.
Recent experiments with transferring work among machines have shown the
servers usually to be more permissive, but relying on this would probably
be a bad idea.
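The sample entry illustrates the arithmetic: with mach ID 1, the
printed ID 3539220A7A46065D implies an assigned UserID exactly one
less. The UserID below is derived from the sample by subtraction, not
independently known:

```python
user_id = 0x3539220A7A46065C  # hypothetical: the sample ID minus the MachineID
machine_id = 1
stored_id = user_id + machine_id
print(f"{stored_id:016X}")  # 3539220A7A46065D, as printed for this entry
```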
mach ID: 1
This is the MachineID as configured when the unit started.
When the MachineID and UserID are added together and sent to the server with an
uploaded unit, the client uses the MachineID which is current at that time.
work/wudata_08.dat file size: 1290111
This is used as one of the checks for a proper work unit download.
WU type: Folding@Home
This actually isn't a work unit type at all, but simply a text message which
the data server can insert in a downloaded unit.
It usually just says "Folding@Home", and sometimes nothing at all.
Average download rate 5.267 KB/s (u=4); upload rate 3.403 KB/s (u=4)
These rates are measured by the client as a benchmark of system network
performance.
They are a sliding window average weighted over the recorded number of
units.
This number is capped by the client at four, so the average tends to track
more recent performance.
It doesn't mean that only the last four units are averaged, but rather that
the stored value is weighted four times greater than the value from a new
unit, so in fact it determines the characteristic rate at which the effects
of older units decay.
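One reading of that description is an update rule like the following (a
reconstruction from the text, not the client's actual code):

```python
def update_average(old_avg, units, new_value):
    """Fold a new unit's measurement into the stored average, weighting
    the stored value by the recorded unit count, which is capped at 4.
    Once the cap is reached, each new unit contributes 1/5 of the
    result, so the influence of older units decays geometrically."""
    u = min(units, 4)
    new_avg = (old_avg * u + new_value) / (u + 1)
    return new_avg, min(u + 1, 4)

# With u already at the cap, a 6.0 KB/s unit moves a stored 5.267 KB/s
# average one fifth of the way toward 6.0.
print(update_average(5.267, 4, 6.0)[0])
```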
Performance fraction 0.819036 (u=4)
This value is calculated by the client as a benchmark of the system's
ability to complete work units quickly.
It is a sliding window average of the fraction of the deadline time remaining
when a unit is completed, weighted over the recorded number of units.
The number of units is capped at four, exactly as it is for the network
download and upload rates.
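For a single unit, the quantity being averaged would look like this (a
sketch of the description; the per-unit value is then fed through the
same capped weighted averaging used for the network rates):

```python
from datetime import datetime

def deadline_fraction_remaining(begin, end, due):
    """Fraction of the deadline period still left when the unit
    finished; this per-unit value is what gets averaged."""
    total = (due - begin).total_seconds()
    used = (end - begin).total_seconds()
    return 1.0 - used / total

# The sample's index 8 unit used about a third of its 17-day deadline,
# leaving roughly 0.66 of the period remaining.
begin = datetime(2004, 8, 9, 23, 41, 10)
end = datetime(2004, 8, 15, 18, 19, 58)
due = datetime(2004, 8, 26, 23, 41, 10)
print(round(deadline_fraction_remaining(begin, end, due), 3))
```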
Average pph: 0.329, ppd: 7.89, ppw: 55.2, ppy: 2663
This line appears at the bottom of the report as a projection made by
qd of how many points this machine should score over intervals of an
hour, day, week, and year, if it continues to earn points at the rate
it has earned them while calculating the units reported in the
preceding printout.
It is a rough indicator of how effective this machine is at calculating
Folding@home work units.
The final "ppy" figure corresponds to the "Annual score benchmark"
printed by qd prior to functional revision "018".
Status codes
Here is a list of all the possible status codes that can be printed for a
queue entry.
When qd was first written, the client had no publicly-available names,
so qd made up names according to the status of the work unit, from the
perspective of the user.
The following list is available if qd is started with the -h flag:
The status code for each queue entry may be interpreted as follows:
(0) empty
The queue entry has never been used, or has been completely cleared.
(0) deleted
The unit was explicitly deleted.
(0) finished
The unit has been uploaded. The queue entry is just history.
(0) garbage
The queue entry is available, but its history is unintelligible.
(1) folding now
The unit is in progress. Presumably the core is running.
(1) queued for processing
The unit has been downloaded but processing hasn't begun yet.
(2) ready for upload
The core has finished the unit, but it is still in the queue.
(3) DANGER will be lost if client is restarted!
Bug before V3b5, neglected to post status (1).
(3) abandoned
Bug before V3b5, neglected to post status (1), and client was restarted.
(4) fetching from server
Client presently contacting the server, or something failed in download.
If this state persists past the current unit, the queue entry will be
unusable, but otherwise things will go on as usual.
(?) UNKNOWN STATUS = ??
Something other than 0 to 4.
The numbers in parentheses are the actual value used for that status
code.
Code (0) is called four different names depending on other data found
in the queue entry.
Several months after all this was written and working as described, the
guys at Stanford added the -queueinfo option to the client, and they
gave their own names to these same codes, describing, from the
perspective of the client, what it needs to do about each queue entry.
Status code (3) really never happens any more, but here is what the
client calls the others.
(0) empty
The queue index is available for reuse by a new unit.
(1) active
The unit files are actively processing, and the core should be running.
(1) ready
The unit files are ready for processing but are queued, and the core
should be started to process them when earlier units are finished.
(2) finished
The core has finished the unit, and it needs to be uploaded.
(4) fetching
The index is allocated to a unit being fetched from a server.
The "fetching" status was not handled correctly until client version
5.0, which will free the index if it finds that status in the queue.
If an earlier client was restarted and discovered the "fetching" status
at one of the queue indices, it simply skipped over that index forever
afterward, never reclaiming it for further use.
It's unfortunate that the word "finished" is used with differing
meanings.
In the output from qd, "finished" means the unit is really all done,
and the information still stored is just history.
But when it's the client that says "finished", it means only that the
processing has finished; the work has not been sent back yet.