When the Site Server is moderately to heavily loaded, "Task Complete" events get lost. Any event can be affected, but losing the "Task Complete" event causes the most trouble.
We discovered a thread race condition whose effect was that the “task complete” event could get lost. This only happened on a Site Server that was under a certain level of load. The sequence of conditions required for the problem is as follows. The Site Server sends a task to a client machine, and the client machine performs the task and responds with the “task complete” message quickly (usually within 2 or 3 seconds). The Site Server has an “Event Processing Loop” with a 5 second wait in it, so if the task starts and completes during that 5 second delay, the “task has started” and “task has completed” events begin processing on separate threads, one right after the other. From that point the outcome depends on which thread wins the race. If thread one (“task has started”) gets ahead of thread two (“task has completed”), both events are processed in the correct order and all is fine.
If thread two gets ahead of thread one, both events are processed but out of order; we handle this case and the job continues. If the “task has completed” thread overwrites the “task has started” event, the job also continues.
However, if the “task has started” thread falls behind, but only just barely, it can overwrite the “task has completed” event. When this happens the client knows it is done with the task, but the Site Server believes the task has only started. The Site Server then waits 60 minutes for the task to “retry” and resends the task to the client.
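The two overwrite cases above can be illustrated with a small deterministic model. This is only a sketch, not the Site Server's code: it assumes a single per-task "pending event" slot where the last writer wins, and the names TaskSlot and final_state are hypothetical. It does not model the case where both events are processed separately.

```python
from typing import List, Optional


class TaskSlot:
    """Hypothetical per-task slot holding one pending event (last writer wins)."""

    def __init__(self) -> None:
        self.pending: Optional[str] = None
        self.state = "sent"

    def post(self, event: str) -> None:
        # Unsynchronized hand-off: a later write silently replaces an earlier one.
        self.pending = event

    def process(self) -> None:
        # The processing thread only ever sees whatever is left in the slot.
        if self.pending == "started":
            self.state = "running"
        elif self.pending == "completed":
            self.state = "done"
        self.pending = None


def final_state(write_order: List[str]) -> str:
    """Post events in the given order, then process the slot once."""
    slot = TaskSlot()
    for event in write_order:
        slot.post(event)
    slot.process()
    return slot.state


# "completed" overwrites "started": the job continues.
harmless = final_state(["started", "completed"])
# "started" lands last and overwrites "completed": the server still thinks
# the task is running, although the client has finished.
lost_completion = final_state(["completed", "started"])
```

In this model the only difference between the harmless case and the failing case is which write reaches the slot last, which is why the problem appears intermittently.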
This creates several different symptoms. The first is the 1 to 4 hour delay we saw, after which the task would sometimes still complete. The second is that the task would sometimes be delayed until it was removed from the NS as too old. The third, and most common, affects tasks that have the default 30 minute timeout set. These tasks reach the 30 minute “timeout” and are killed before the 60 minute “retry” comes into play. These tasks are reported as “failures” even though they may have completed successfully.
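The interaction between the timeout and the retry can be written as a short comparison. This is a sketch using only the two figures given above (30 minute default timeout, 60 minute retry wait); the function name is hypothetical, and the handling of a timeout exactly equal to the retry wait is an assumption.

```python
RETRY_WAIT_MIN = 60       # server-side wait before resending the task
DEFAULT_TIMEOUT_MIN = 30  # default task timeout


def outcome_after_lost_completion(timeout_min=DEFAULT_TIMEOUT_MIN,
                                  retry_wait_min=RETRY_WAIT_MIN):
    """Return (what happens first, minutes until it happens) once the
    completion event has been lost on the server side."""
    if timeout_min < retry_wait_min:
        # The task is killed and reported as failed before any retry occurs.
        return ("timeout", timeout_min)
    # Assumption: with an equal or longer timeout the retry happens first.
    return ("retry", retry_wait_min)


default_case = outcome_after_lost_completion()
long_timeout_case = outcome_after_lost_completion(timeout_min=120)
```

With the default settings the timeout always wins, which matches the third symptom: the task is reported as a failure even though the client may have finished it.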
The code fix for this issue moved the step that hands an event to the thread that processes events for a task. That step now sits where it is protected by the same mutex as the thread wakeup that causes the event to be processed. (Note that this thread is different from the two threads in the race condition described above.) This prevents one thread from overwriting the event from another thread. It will slow large “Jobs” with multiple tasks but speed up small or single-task items that are being processed on the Site Server.
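The general technique, handing over the event and waking the processing thread under one lock, can be sketched as follows. This is not the actual fix: the class and function names are hypothetical, and the use of a queue instead of a single slot is an assumption made so that the example clearly shows neither event being lost.

```python
import threading
from collections import deque


class TaskEventChannel:
    """Hypothetical channel: hand-off and wakeup share one lock."""

    def __init__(self):
        self._cond = threading.Condition()  # a lock plus a wakeup signal
        self._events = deque()

    def post(self, event):
        with self._cond:
            # Store the event and wake the processor inside the same
            # critical section, so two posters cannot interleave.
            self._events.append(event)
            self._cond.notify()

    def wait_and_drain(self, timeout=5.0):
        with self._cond:
            self._cond.wait_for(lambda: len(self._events) > 0, timeout=timeout)
            drained = list(self._events)
            self._events.clear()
            return drained


def run_demo():
    channel = TaskEventChannel()
    processed = []

    def processor():
        while len(processed) < 2:
            batch = channel.wait_and_drain()
            if not batch:  # timed out with nothing to do
                break
            processed.extend(batch)

    worker = threading.Thread(target=processor)
    posters = [threading.Thread(target=channel.post, args=(name,))
               for name in ("started", "completed")]
    worker.start()
    for poster in posters:
        poster.start()
    for poster in posters:
        poster.join()
    worker.join()
    return processed


events_seen = run_demo()
```

The two posting threads may still arrive in either order, but both events reach the processing thread. Out-of-order arrival is the case the article describes as already handled.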
Also of note: after the 5 second delay, the “Event Processing Loop” waits for a “one or more events present” event. In a lightly loaded or test environment this is where the loop spends most of its time, so the Site Server is ready to process the “task has started” event as soon as it is posted. The loop then goes through its 5 second delay before the “task has completed” event can be processed. This is why the issue never happens in a lightly loaded environment.
This issue has been reported to the Symantec Development Team. This issue will be addressed in the next major release (currently targeted for SMP 7.1 SP2 MP1 and ITMS 7.5).
There is a pointfix available. Please see attached "Pointfix_eTrack2903733_7.1_SP2v4.zip"
Installed ITMS 7.1 SP2 v4
HOW TO INSTALL THIS POINTFIX:
1. Download "Pointfix_eTrack2903733_7.1_SP2v4.zip".
2. Put the script and executables in one folder that contains no other files (for example, a new folder on the Desktop).
3. Install the Software Management Solution plug-in on the Remote Task Server.
4. Create a new software resource by importing the files. (Right-click in the “Installed Software” pane, select the files downloaded in step 1, set the Installation File to the *.cmd file, and save the Software Resource.)
5. Create a Quick Delivery task.
6. Select the Task Server(s).
7. Click “OK” on the Quick Delivery task.
This runs the Quick Delivery task on the Task Servers, executes the batch file, and updates the DLL.
This fix is for Remote Task Servers only. Reapply the fix whenever a new Remote Task Server is created.
Symantec Management Platform 7.1 SP2
Task Management 7.1 SP2
Pointfix_eTrack2903733_7_1_SP2v4.zip (699.6 KB)