About Event Queues in ITMS 7.1 SP2
New Event Queue features in 7.1 SP2
New Event Queue processing features in 7.1 SP2
How to use the EventFailureBackupFolder and EventCopyFolder settings
NSE Flow Diagrams
This article covers how Event Queues work in ITMS 7.1 SP2. Event Queues, or Message Queues, store and queue some Notification Server events while those events are waiting to be executed.
With the release of ITMS 7.1 SP2, the following general improvements were made:
For information about where Event Queues were located prior to the 7.1 SP2 release, see the following article: www.symantec.com/docs/HOWTO45754
Changes will be applied after all services are restarted. The recommended way to change path values is as follows:
NOTE: Do not redirect the EventQueues to the NSCap directory structure as it may stop NSEs from being processed. This is because old NSCap folders are monitored by an event dispatcher for legacy NSEs, which are put as files into subfolders. If you point the dispatcher to one of the subfolders, it will create files there and they will immediately be caught as new NSEs from legacy solutions and routed again to the dispatcher.
For more information about this issue, please see the following article: www.symantec.com/docs/TECH183959
With the release of ITMS 7.1 SP2, some improvements were made to accelerate and improve NSE processing in the Event Queues.
Because the existing queuing structure did not support attaching metadata to each event, the server-side changes are extensive. The queuing system had to be rewritten to register each event in the database and to dispatch events under the constraint that no two NSEs with the same source GUID and priority level are processed simultaneously. The one exception is the empty source GUID, which retains backward compatibility: these NSEs can be processed simultaneously. Additionally, NSEs with the same source GUID and priority level are processed in the exact order in which they were registered in the database.
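The dispatch constraints described above can be sketched roughly as follows. This is only an illustration, not the actual Notification Server code: the Event record, the in_flight set, and the choice to scan strictly in registration order are all assumptions made for the example.

```python
from collections import namedtuple

# Hypothetical record shape; the real queue table columns are not shown in this article.
Event = namedtuple("Event", "seq source_guid priority")

EMPTY_GUID = "00000000-0000-0000-0000-000000000000"


def next_dispatchable(pending, in_flight):
    """Return the first registered event that may start processing now.

    pending   -- events waiting in the queue
    in_flight -- set of (source_guid, priority) pairs currently being processed
    """
    for event in sorted(pending, key=lambda e: e.seq):  # registration order
        if event.source_guid == EMPTY_GUID:
            return event  # empty source GUID: may run in parallel
        if (event.source_guid, event.priority) not in in_flight:
            return event
        # An event with the same GUID and priority is running; this one waits.
    return None
```

Because events are scanned in registration order and every later event with a busy (source GUID, priority) pair is skipped as well, two events from the same source and priority can never overtake each other in this sketch.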
If you sort tmp files by date created, the events will be sorted in the order that they were sent to the server. By using priority and ignoring the Event Queue, you can cause certain events to be sent to the server sooner.
The default number of retries is three. This was put in place because, with strict queue ordering, the previous customer strategy of copying failed NSEs back into the inbox would break the order of the queue.
For more information, see the following section of this document: “How to use the EventFailureBackupFolder and EventCopyFolder settings”.
If you don’t see this setting in the registry, the setting is not in effect. However, you can define it manually by adding an EventFailureBackupFolder string value under the key HKEY_LOCAL_MACHINE\SOFTWARE\Altiris\eXpress\Notification Server.
As soon as you add it, the setting will be active. If any of the NSEs fail to process (with retries), the file will be put into that folder. The appropriate subfolder will also be created, depending on the type of exception.
You can define this setting manually by adding an EventCopyFolder string value under the key HKEY_LOCAL_MACHINE\SOFTWARE\Altiris\eXpress\Notification Server.
As soon as this setting is effective the NSE processing will copy each NSE to this folder. Please note that it may take some time for this setting to be effective as it must wait until the Core Settings checks values for changes in the registry.
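As an illustration of the two registry values described in this section, the following sketch generates the text of a .reg file that adds them as string values. The folder paths are placeholder examples, not defaults, and the helper name reg_file_text is made up for this example. Review the output before importing it on a real server.

```python
KEY = r"HKEY_LOCAL_MACHINE\SOFTWARE\Altiris\eXpress\Notification Server"


def reg_file_text(values):
    """Build .reg file text that adds string (REG_SZ) values under KEY."""
    lines = ["Windows Registry Editor Version 5.00", "", "[" + KEY + "]"]
    for name, folder in values.items():
        # Backslashes and double quotes must be escaped inside .reg string data.
        escaped = folder.replace("\\", "\\\\").replace('"', '\\"')
        lines.append('"{}"="{}"'.format(name, escaped))
    return "\r\n".join(lines) + "\r\n"


print(reg_file_text({
    "EventFailureBackupFolder": r"D:\NSEBackup",  # example path, an assumption
    "EventCopyFolder": r"D:\NSECopy",             # example path, an assumption
}))
```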
Q: Do you know why truncating ResourceMerge helped with the Queue processing? Should we have some type of check to avoid this type of issue in our code?
A: The ResourceMerge table is a weak spot we found recently. It keeps the records of merged resources pointing out what resource guids were before and after a merge.
The procedure that hangs loops through these records, following each resource to its parent until it finds the current resource guid. This logic can lock the whole DB if the records form a cycle, like this:
Resource A > Resource B
Resource B > Resource C
Resource C > Resource A
The only way this could happen is through a race condition in the resource merge logic in the C# code, or through a service crash. We don’t have any 100% reliable way to avoid situations like this.
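To show why cycled records are dangerous and how a lookup can guard against them, here is a minimal sketch. It is not the product's stored procedure: the merges dictionary stands in for the ResourceMerge table, and reading "Resource A > Resource B" as "A was merged into B" is an assumption.

```python
def resolve_current_guid(merges, guid):
    """Follow old -> new merge records until a GUID with no newer record is found.

    merges -- dict mapping a merged (old) resource GUID to the GUID it was merged into
    Raises ValueError instead of looping forever when the records form a cycle.
    """
    seen = {guid}
    while guid in merges:
        guid = merges[guid]
        if guid in seen:
            raise ValueError("cyclic merge records at " + guid)
        seen.add(guid)
    return guid
```

Without the seen set, the A > B > C > A example would never leave the loop, which matches the hang described above.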
Resource merges are mapped into a table now, which keeps track of old resource guids to current guid mapping, allowing a fast lookup of old resource guids. Resource association and data class importing have been changed to use this table to map the incoming resource guids to their merge targets, if any such targeting exists. The current implementation has one known limitation: If a resource merge occurs on a resource which is referenced as a foreign key, it will not be remapped.
Q: Is the message processing now single-threaded rather than multi-threaded as it traditionally was? Is there at least a separate thread for each queue since technically the messages are still sorted into queues within their table entries in dbo.EventQueueEntry? The behavior over the last two days seems to indicate a single-threaded processing model which does not seem very efficient given how easily our two servers became backlogged on NSEs due to processing issues on one of the servers.
A: The NSE processing is multithreaded. It was single-threaded in SP1, not vice versa. It has been slow recently (and looked single-threaded) only because DB locks occur while the resource merge logic is processing, which effectively prevents access to resource-specific tables. While the merge held the lock, none of the other NSE threads could do anything because the DB was locked.
Q: In the event of a database communication outage, which could last for quite some time and potentially require a restart of the SMP services, how quickly do the failure retries occur? Do the messages get placed back into the queue and then wait for a later retry interval or do they immediately re-submit? If they are immediately re-submitted, as they appeared to do today, then we will most likely lose all of the NSEs that were submitted during the outage. This is not a good idea. The CMDB could be missing data until the next full inventory; in the case of critical software updates and/or distributions I will have no idea if they were successfully installed.
A: The default retry limit for NSE is three times and the delay between tries is pretty small (RetryNumber * 100 ms). After that, if the option is set, the NSE will be backed up.
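The retry timing can be sketched as follows. This is an illustration only: it assumes one initial attempt followed by up to three retries, and that RetryNumber counts from 1, neither of which the answer states explicitly. The process callable is a placeholder for whatever handles the NSE.

```python
import time


def process_with_retries(process, retry_limit=3, sleep=time.sleep):
    """Call process(); on failure wait RetryNumber * 100 ms and try again.

    Returns True on success, False once retry_limit retries are exhausted
    (the point at which the NSE would be backed up, if that option is set).
    """
    for retry_number in range(retry_limit + 1):
        if retry_number:
            sleep(retry_number * 0.1)  # waits of about 100 ms, 200 ms, 300 ms
        try:
            process()
            return True
        except Exception:
            continue  # fall through to the next retry, if any remain
    return False
```

Under these assumptions the total wait before giving up is well under a second, which is why a long database outage is not bridged by the retries alone.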
If the DB is unavailable, the NSEs will not be put into the queue at all. Instead, a ‘server busy’ message will be returned to the client on the NSE post, and the client should handle this situation gracefully.
Q: How are the queues supposed to work now? I found that when I manually copy NSEs into the EventQueue\EvtQueue folder, nothing happens. I have to copy them into the eventqueue\EvtInbox folder. That process moves them into the EventQueue\EvtQueue folder directory. I see the client posting to the EventQueue\EvtQueue directory.
A: Do NOT put files into EvtQueue. The safest way (if you need to submit an NSE manually) is to use EvtInbox. The message will then be routed into EvtQueue automatically (no matter what size, if it’s over 3KB), and its priority will be set in the DB accordingly.
Q: Are bad and process folders no longer used?
A: There is no evidence of the process or bad folder being used for NSEs; instead, a mapping is defined in BadNSEFolders.config (located in the main NS Core configuration folder). This creates specific folders for the different types of exceptions that might have occurred if the NSE could not be routed in three retries. This works only when the NSE backup is set (EventFailureBackupFolder in Core Settings); the subfolders are created under the folder specified in this setting.
Q: Are EvtQFast, EvtQLarge, EvtQPriority, and EvtQSlow no longer used?
A: Correct, they are not used internally. However, if someone puts NSEs into one of these folders manually, they will be put into the main queue in the DB and moved to EvtQueue. This is done for backward compatibility with older solutions that don’t use the newest NS API.
Q: Is EventQueue\temp still used for decompression of larger NSEs like in NS 6?
A: There is no evidence in the code to use the temp folder there. The data from post.aspx will go to the SMP temporary folder first (if it’s over 3KB), then it is decompressed into the EvtQueue folder directly (with RANDOM-GUID.NSE file name), and then registered in Event Queue DB table with this file name.
Q: Describe how to use the feature Event Failure Backup folder. I see the setting in coresettings.config which seems to reference a registry setting. Do I create a registry string value with the value of where I want the backup folder to be? Exactly how is this implemented?
A: Yes, the code that queries this core setting reads it from the registry. If the value is a non-empty string, it is handled as follows:
1.1 - If the value is a rooted path, i.e. with a drive letter, then it is used as is (plus the subfolder for the exception, which we have mapped in BadNSEFolders.config).
1.2 - If the path is relative, then we create a folder under the EventQueue folder.
1.3 - The code writes the NSE itself into this place, along with an extra file with the same name and a .log extension. If an exception led to the event backup, the body of the log file will be the exception message itself.
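Steps 1.1 to 1.3 can be sketched as a small path calculation. This is a hedged illustration, not the product code: the function name, the event_queue_dir and exception_folder parameters, and the choice to replace the .nse extension with .log (rather than append to it) are all assumptions. The ntpath module is used so that Windows path rules apply on any platform.

```python
import ntpath  # Windows path semantics, usable on any platform


def backup_paths(setting, event_queue_dir, exception_folder, nse_name):
    """Return (nse_path, log_path) for a failed NSE, or None if the setting is empty."""
    if not setting:
        return None  # setting absent or empty: no backup is made
    if ntpath.isabs(setting):
        base = setting  # 1.1: rooted path is used as is
    else:
        base = ntpath.join(event_queue_dir, setting)  # 1.2: relative to EventQueue
    folder = ntpath.join(base, exception_folder)  # subfolder per exception type
    nse_path = ntpath.join(folder, nse_name)
    log_path = ntpath.splitext(nse_path)[0] + ".log"  # 1.3: same name, .log extension
    return nse_path, log_path
```

For example, with a setting of D:\Backup and a hypothetical exception subfolder named Timeout, an NSE called abc.nse would be written to D:\Backup\Timeout\abc.nse with a log file next to it.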
Q: The default retry limit for NSEs is three. To me, default implies that this value can be changed. How do I modify this value to reduce the number of NSE retries? Under HKLM/Software/Altiris/eXpress/Notification Server I see some keys, but none of them seem to apply to the NSE retry limit.
A: The retry limit can be set in Core Settings with the EventRetryLimit entry. By default, it is absent from the settings, so the hardcoded default of three is used.
Q: If we won’t post files to the EvtQueue if the DB is not connected, then does the agent have a trigger that says the DB is not ready, so that the NSE will not be sent or will be delayed for X amount of time, or is the agent just going to continue to send NSEs to the NS regardless? If the files are sent regardless, do they just get rejected?
A: The server response should indicate to the client that the server is busy and can’t register and save the NSE that was posted. It’s up to the agent’s logic to handle this case, and the client should retry after some time.
Q: Do SQL deadlocks and timeouts also prevent NSE files from being posted? If files are posted and the DB connection gets dropped for some reason is this going to cause the current set of NSE files in the EvtQueue to delete or become invalid?
A: If the connection is dropped when the NSE is not yet registered in the DB, the same “can’t handle NSE” response will be returned to the client. However, if the event is registered, it will not die or disappear if the connection drops. The file will stay in EvtQueue (if it’s large enough) and the entry for this NSE will stay in the DB until it is processed. If the DB is unstable or just too busy while the NSE is registering, the server will retry the registration 3 more times. However, since the HDD usage for queues is checked prior to saving the file, the server may return busy if there is no more space, and in that case it will not try to register again.
Q: Where is the queue limit held and what is the queue limit (queue size)?
A: The only limit for the NSE queue is in the Core Settings: MaxFileQSize(KB), which only limits the total size of all NSEs stored as files in EvtInbox. There are no limits on DB entries for NSEs. The default value of this setting is 512000 (500 MB).
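A rough sketch of how such a size limit could be applied is shown below. It is not the server's implementation: the function name and parameters are hypothetical, and treating a value of 0 as unlimited is an assumption taken from the wording of the registry question in this article.

```python
DEFAULT_MAX_FILE_Q_SIZE_KB = 512000  # default value quoted in the answer above


def can_accept(current_queue_bytes, incoming_bytes, max_kb=DEFAULT_MAX_FILE_Q_SIZE_KB):
    """Return True if storing the incoming NSE keeps the file queue within the limit."""
    if max_kb == 0:
        return True  # assumption: 0 means unlimited
    return current_queue_bytes + incoming_bytes <= max_kb * 1024


print(DEFAULT_MAX_FILE_Q_SIZE_KB / 1024)  # 500.0, i.e. 512000 KB is 500 MB
```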
Q: Is it recommended to put 0 (unlimited) into …/express/Notification Server/ MaxFileQSize(KB) regkey to process more NSEs?
A: Increasing the HDD space limit for the event queue is not a 100% solution for all customers who experience slow NSE processing. For most customers, the default 500 MB is more than enough to handle thousands of clients. However, you may still be able to leverage the MaxFileQSize setting in huge environments where tens of thousands of clients could send messages simultaneously. In such environments the server may be able to process these messages, yet be limited by this setting in how many it can receive and store. Here are two examples.
Example 1: If the NS is an 8 GB / 2 CPU computer handling 5,000 clients without many agents on them, then the 10k clients that are posting can cause trouble if they start to process some tasks and software deliveries at the same time. The server would handle their responses fine, but it would take time. In this case, it could help to increase the queue limit.
Example 2: If the NS is an 8 GB / 2 CPU computer handling 20,000 clients and you set up ITMS and all possible agents, the server might not be able to keep up with the load. In this case, increasing the setting will not help because, no matter how much you increase it, the clients will send more data than the server can handle.
NSE Dispatch Flow
NSE Incoming Flow