Automation has changed significantly since Deployment Solution 6.9 and, though there are elements that are the same, troubleshooting has, of necessity, also changed. What is the new flow of information and what tools do we have to troubleshoot this?
This article will cover the following topics:
- A Process Flow discussion for how a basic imaging task might work and where there could be breakdowns (subject to change as the product changes and support learns more)
- Additional troubleshooting considerations while in automation.
- Brief overview of the most significant changes from 6.9 to 7.1.
1. Process Flow: Sending a client into and out of Automation, including Task and DS communications.
For the purposes of this article, we will begin with the following assumptions about the automation environment:
- The computer has checked in to the Notification Server.
- Deployment Task Handlers have been distributed to the agent.
- PXE is being used, rather than Automation Folders.
- Client gets agent installed and fully registered with the NS. This includes the Client Task Agent (CTA) (which is installed by default with the core NS agent) and Deployment Task Handlers (DTH) - a policy which must be enabled in the console. No automation folders are used because we're using PXE/SBS, per the assumptions above.
- Client requests and receives a Client Task Server (CTS) from the NS to check in to. This process is discussed elsewhere, and there are some parallels you will see while in automation, but it will NOT be discussed in this document. If the client has a live Task Server, you're good to go. If not, troubleshoot Task Server communication before continuing.
- Job is created to do something in Automation, such as capture an image. This job is created in the console and then stored directly in the database by the Altiris Service. No troubleshooting on this step will be covered.
- Job is sent to the Task Server via the Object Host Service (AtrsHost), part of the CTS sub-agent. AtrsHost on the CTS then sends a "tickle" to the Client over port 50124 in near real-time, requesting that the CTA check in to get a task. If this "tickle" doesn't work, the CTA is designed to check in to AtrsHost every 30 minutes for work to do, so it will get the job within 30 minutes (this is a change from earlier releases, including v6, which had a 5-minute check-in). This interval can be shortened if necessary, and troubleshooting for this is covered in another KB.
- The CTA receives the job, analyzes it, sees it's destined for Deployment, and hands it off to the DTH on the client.
- The DTH reads the work to do, which we will assume is a Reboot To Automation task. IF there is an automation folder installed, it will flip the bits necessary in the BootLoader to make it boot from the Automation Folder. If not, then it will simply reboot and pray for a PXE response. DTH will also report back up to the CTS (it is not yet clear how; possibly through the CTA) that it has received the task and is executing it.
a) At this point, there is some confusion about reporting. Once the bits are flipped, we are not quite sure if the status of the reboot task is reported or not. Some testing points out that once flipped it reports up successful, which can be confusing if it then fails to restart and/or has another task pending. Other evidence suggests this status isn't returned until restart. More research is necessary to actually pin down what is happening.
- At this point, all the agents exit and the OS exits as well. Note that in the console, you may not have any indication of this restart, other than that the task was received. MOST of our evidence suggests that you will still see the reboot pending until it has actually finished. This differs significantly from DS 6.x.
- The computer will PXE boot if that option is enabled in the BIOS. Here, the process is very similar to DS 6.x. First, your PXE servers must be functional (see HOWTO21623). Second, your computer must be able to receive a PXE and DHCP response via the broadcast packet, meaning that you either have to have your PXE server on the same subnet, use Forced Mode with DHCP, or have IP helpers (the latter two are not supported by Altiris directly, though there are KBs to assist you in this process). Obviously, if you're booting from the folders or via Automation CD, this is moot.
- Once the PXE server sees the client system booting, it will issue an appropriate menu. If this is an unknown computer, AND the option to respond to unknown computers is enabled, it will deliver options for Initial deployment, which is not discussed in detail here. This is a known computer though, and it will receive a known boot option.
a) It should be noted here how the SBS/PXE server knows about the system, partially discussed here: HOWTO21720. The NSiSignal service collects all computers from the NS when it launches and hands this to the Server service. The Interface service, though, is critical here, because the same job triggers it to be aware of a system booting in its environment and how to respond. It interfaces directly with a Web service on the NS and hands this job information off to the SBS Server service. Thus, even a new computer not picked up by NSiSignal becomes "known" to the Server service.
b) As for which PXE servers are notified - this is chosen based on site information, and some research is still needed here. Essentially though, we look at the client task server, and the sites/subnets applied to that task server, which is obviously where the client system is connected, and send this information to all PXE servers in that "area" / site. What has not been tested yet though is what happens if you have PXE servers without Task services. This process was designed with the intention of having PXE services on all Task Server Site Servers, so there'd be a 1:1 match. If you configure it differently, we're confident it will not work correctly.
- The computer will then download the appropriate menu via the MTFTP service and load its preboot OS via RAMDrive. Though this can be Linux or WinPE, only the details of WinPE will be discussed here.
- WinPE will then load, and the first thing it does after loading drivers is to launch the Altiris Agent.
- The Altiris Agent launches, creates a log file in the x:\program files\altiris\altiris agent\logs folder, and tries to contact the NS. NOTE: Any settings in production are lost here. This is, for all intents and purposes, a completely new installation of the agent.
- The Agent will use PostEvent.asp to contact the NS, and will send up basic information to the NS. The NS will recognize the computer based on the "hash" the client sends up (which can be seen in the agent logs) and will return the known "guid" for this client. In the event this is a new computer, a new record would be created in the NS Database, a new GUID assigned, and that GUID sent back to the agent. This "new computer" is always called "MiniNTxxxxxxx" per the random naming convention of Windows PE. However, for this discussion, the computer is known, and no new name will be assigned.
a) Obviously, if the Network drivers did not load or bind correctly, this will fail. The only symptom you will see is no activity at all, which can be caused by various things. The WinPE session will load, the command prompt will minimize, and then it'll just sit there. If you get no activity at all after several minutes, it's time to troubleshoot the network with Ping or other tools.
- The SMP Agent then attempts to launch its sub-agent, called the PECTAgent, or Windows PE Client Task Agent. It should be noted here that, at this point, all Agent logging generally stops. The agent has done its job and has nothing else to do. IF you attempt to manually do other things, then the logs will grow, but normally the logs will reach this point, launch the PECTAgent, and then stop. This is working as designed.
a) If you look at the agent log at this point, you will either see a successful launch of the PECTAgent, or an error launching it. The Agent logs are in the normal place expected, but on the X drive instead of C. Failures of the PECTAgent to launch can be caused by several things, including no network drivers, problems with network drivers, and possibly issues with the agent (that are still being researched).
b) Note: while in automation, all messages sent to log files (i.e. Agent.log and/or PECTAgent.log) are also sent to a port on the wire. There is a RemoteTrace tool (KB?????) that will listen on this port for these messages on the server and capture them in real-time. This same tool also listens for messages from PXE services, and a few other things which we're still researching. This allows the administrator to "see" what is going on at the client without being physically present, BUT, obviously requires functioning network connectivity in WinPE.
- The PECTAgent will launch, then create a PECTAgent.log file in the same folder as the Agent.log file, and attempt to contact its task server. It does this by first contacting the NS to request a TaskServer, and then directly attaching to the Task Server.
a) Normally, because the computer has not been physically "moved" between the shut-down and restart, the client will get the same Task Server list in Automation as it received in production, and thus will attach to the same task server. It is possible that if the system is moved, or if there are too many Task Servers, it could connect to the wrong one. This has not been tested by support and we're not sure what would happen, with some concern that the job for automation might not be present on the "new" task server and thus it would have no scheduled work-to-do.
- The PECTAgent will then register with its CTS's AtrsHost service. It should in theory now report up that it has completed the last task (the reboot task from before) if it has not done so already. Again, we need more research in this area.
- The AtrsHost service will recognize the client, determine that there is more work to do, and tickle the CTA or PECTAgent to check in and get its work. For example, it might now hand the agent a script task for partitioning, or a ghost task for imaging. Either way, the PECTAgent will process these tasks in a very similar way to what it did in production.
a) What is not yet known is if it must download or run additional sub-agents. For instance, does it have to download the DTH subagent for imaging, or is this perhaps already in the WinPE image (far more efficient if so)? What about for other types of tasks? Can you send it SWD tasks or will they fail? Default tasks should work because they would be run by the PECTAgent just like the CTA in production. But what about inventory or SWD or other tasks? This has not been tested yet by Support.
b) NOTE: We've received reports that the PECTAgent is checking in every 5 minutes instead of following the default policy of 30 minutes. More research needs to be done on this issue. If you are seeing this symptom, please contact support.
- Assuming the PECTAgent does all the work, this agent will then execute the task. At this point, the PECTAgent sends a note back to the CTS that it's doing its job, and in the console, if you refresh, you'll see that it has started the next task.
- When the PECTAgent is done, it will again report up that it has finished, and request more work. The Console will now show that task as complete and give it whatever other work is listed. We'll assume the next job is a Reboot to Production task.
- The PECTAgent will then receive this task directly from the AtrsHost service, just as it has the other tasks, and report back that it has received it.
- The PECTAgent will then issue a Restart command to WinPE. IF this had all happened via Automation Folders, it would then also flip the bit to change how the computer boots. If via PXE, it will simply reboot. Because it's shutting down, no status update will be sent to the NS on the shutdown task, though some evidence suggests that it actually attempts to send this data up, possibly only when using Automation folders. We're still not sure on this.
- The computer will boot to PXE, and the PXE server will know there's no work to perform and send the client to production.
- The computer will boot in production, launch the NS Agent, check in, connect to the Task Server, and report up the completion of the previous task and ask for any more work to do. You may then have a post configuration task, or anything else, but we'll not discuss that in this document.
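The "tickle" step earlier in this flow depends on the Task Server being able to reach the client on port 50124. As a rough aid, the following Python sketch checks whether a TCP connection to that port can be opened from the Task Server side. It assumes the tickle travels over TCP, and the host name client01.example.local is a placeholder; neither detail is confirmed by this article. A failed connection only suggests that a firewall or routing problem may be blocking the tickle, in which case the client should still pick up the job at its next scheduled check-in.

```python
import socket

TICKLE_PORT = 50124  # port named in this article; use of TCP is an assumption


def port_is_reachable(host, port=TICKLE_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    host = "client01.example.local"  # hypothetical client name
    state = "reachable" if port_is_reachable(host) else "not reachable"
    print(f"{host}:{TICKLE_PORT} is {state}")
```

Run it from the Task Server (or a machine on the same network segment) while the client is in production, since that is where the tickle originates.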
2. Additional Troubleshooting Considerations.
Once in WinPE, there are a few things that can be done. The command prompt left minimized at the bottom of the screen will allow you to browse logs, edit files, restart agents, look at running processes, etc. Here are a few things:
- Running Task Manager. This can be useful to see if the agent processes are running, or other processes, even custom processes, are waiting. To do this, simply type taskmgr in the command prompt.
- Viewing Logs. The main logs are in x:\program files\altiris\altiris agent\logs. There will be an Agent.log and pectagent.log if all goes well. The easiest way to view them is with Notepad. For instance, if you change to that folder, you might type notepad pectagent.log to review what is in there.
- Modifying and re-trying the PECTAgent. There is a pectagent.ini file you can check in notepad, and you can verify the server it's pointing to (i.e. maybe it needs to be changed?) as well as what port it's working on. After making a change there, you can manually execute the agent to see if it'll connect and get the job.
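If reading the logs in Notepad becomes tedious, they can be copied off the X: drive and scanned elsewhere. The Python sketch below is one possible way to do that on a workstation (WinPE itself does not ship with Python). The keyword list is an assumption for illustration, not a documented log format, and the file names are simply local copies of the two logs mentioned above.

```python
from pathlib import Path

# Assumed markers of trouble; adjust after looking at real log content.
KEYWORDS = ("error", "fail", "unable")


def find_suspect_lines(log_path, keywords=KEYWORDS):
    """Return (line_number, text) pairs whose text contains any keyword."""
    hits = []
    text = Path(log_path).read_text(encoding="utf-8", errors="replace")
    for number, line in enumerate(text.splitlines(), start=1):
        lowered = line.lower()
        if any(word in lowered for word in keywords):
            hits.append((number, line.strip()))
    return hits


if __name__ == "__main__":
    # Hypothetical local copies of the logs taken from the X: drive.
    for name in ("Agent.log", "pectagent.log"):
        if Path(name).exists():
            for number, line in find_suspect_lines(name):
                print(f"{name}:{number}: {line}")
        else:
            print(f"{name}: not found")
```

The match is a plain case-insensitive substring search, so it will over-report; it is meant to narrow down where to look, not to replace reading the log.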
3. Additional Possibilities / Functionality
While in automation, remember this is a fully-fleshed-out WinPE environment, and you can do a lot of things, such as those mentioned in the above section, custom scripts, map drives and execute things remotely, etc. Be creative.
4. Differences to be aware of from DS 6.x
For those upgrading from DS in an earlier version (pre 7.x), here is a short list of some key differences that could cause confusion. Please become familiar with these, and the new process at the top, to help adjust to the new environment:
- One of the key differences is in WinPE, and in what you don't see. When the agent loads, it either works, or just sits and does nothing, and you may not see any errors at all, though some do show. The biggest confusion is how long to wait before something happens, such as the Ghost task running. Some things to expect:
  - You will not see any agent loading.
  - You may not see errors if the NIC fails to bind (you might, depending on where it fails).
  - If there's nothing to do (no task assigned or the computer was not recognized properly), you will see nothing except a minimized prompt - no errors.
- Jobs and tasks can be confusing compared to DS 6.9. They are more flexible now, but unlike the single-click capture jobs of DS 6.9, we now require a component-by-component build of the tasks you want done. For instance, in DS 6.9 you might select a capture image task, and that task contained all the components you needed, including Sysprep, what automation to run in, etc. In DS 7.x, you have to build a job and add a task for each step of the process, including Sysprep, reboot to automation, capture image, reboot to production, etc.
- Reboot tasks. DS 6.9 detected what environment you were in and would reboot for you if you were in the "wrong place." In DS 7.1, you have to make the call on when to reboot, meaning you'll have at least 2 reboot tasks in many jobs - one into automation, one out of it. If you forget these, things like imaging tasks will fail because they'll be attempted in production, or possibly in automation where they don't belong.
- Task Server and NS Agent communication instead of AClient/DAgent and DS Engine. The differences can be surprising and include:
  - Client policies take 4 hours to initiate instead of the "instant" processes in DS (tasks, though, are immediate, or close).
  - Services are all different, along with many of the logs (there's actually better/more logging in NS).
  - Agent names - no more AClient and DAgent. Now it's just the Altiris Agent.
  - Port changes (no more port 402).
- PXE setup and configuration. There is no longer a PXE configuration utility to control PXE. Now it's all done via policies in NS. Furthermore, the PXE configurations don't go out to the PXE servers in real-time like before.
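As an illustration of the reboot-task point in the list above, the following Python sketch walks an ordered list of task names and flags imaging steps that would run in the wrong environment. It is a planning aid only and does not talk to the product; the task names are illustrative labels chosen for this example, not product identifiers, and the assumption that every job starts in production is ours.

```python
# Task names here are illustrative labels, not product identifiers.
AUTOMATION_ONLY = {"capture image", "deploy image"}
TO_AUTOMATION = "reboot to automation"
TO_PRODUCTION = "reboot to production"


def check_job(tasks):
    """Return a list of problems found in an ordered list of task names."""
    problems = []
    environment = "production"  # assumption: the job starts in production
    for position, task in enumerate(tasks, start=1):
        if task == TO_AUTOMATION:
            environment = "automation"
        elif task == TO_PRODUCTION:
            environment = "production"
        elif task in AUTOMATION_ONLY and environment != "automation":
            problems.append(f"step {position}: '{task}' would run in production")
    if environment == "automation":
        problems.append("job ends in automation; no reboot to production")
    return problems


if __name__ == "__main__":
    job = ["sysprep", "reboot to automation", "capture image", "reboot to production"]
    print(check_job(job) or "job order looks consistent")
    print(check_job(["sysprep", "capture image"]))
```

Sketching a job on paper this way before building it in the console can catch a missing reboot task before the imaging step fails.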