


New Windows Dynamic Cache Service for 64-bit Windows 2003 servers


If you've ever had an issue where excessive use of the System File Cache caused working set trimming under low memory conditions, we have just released a new service that can help alleviate this problem: the Microsoft Windows Dynamic Cache Service.

More information regarding this new service can be found here, and a direct link to download it can be found here.

With Exchange 2007 servers also running into these issues, which I blogged about here, this service could potentially allow other 3rd party services that consume more than their fair share of the System File Cache to play nicely with Exchange 2007.

So if you find that Exchange performance is suffering because another service is taking up too much of the System File Cache, then this service may be just for you.

Hope this helps with some of your performance related issues.

Mike

Client RPC Dialog box questionnaire for Administrators


There are times when you are troubleshooting an Exchange Server issue where the server appears to be performing OK, but users are still complaining about the dreaded RPC dialog box and hangs in their client. Most of the time an Exchange administrator or helpdesk person needs to speak directly with the end user to determine what actions they were taking when the RPC dialog box occurred. Since there are numerous ways to trigger this dialog box, an administrator needs to understand the specific actions users were taking at the time of the problem. Often this has nothing to do with server-side performance problems, but rather with something installed on the client or something the user is doing.

I have created a simple document that users can fill out to give you some insight into a user's actions and the habits that are aggravating this RPC dialog box.

The document is password protected so that only the form fields can be filled in. The password is currently "Microsoft".

Please provide feedback regarding this document to help make this better.

Thanks,

Mike

Outlook 2007 Performance Improvements Hotfix


If you haven't heard already, we have released a pre-SP2 hotfix that helps improve Outlook performance and responsiveness in a big way. Here is an excerpt from the article.

Performance improvements

Performance and responsiveness are key concerns for all our customers. That is why we made the large performance tuning and optimization changes that are included in Office suite Service Pack 2 (SP2).

Outlook 2007 SP2 delivers performance improvements in four major areas:

  • General Responsiveness
    SP2 reduces I/O disk usage and UI response time.
  • Startup
    SP2 removes long operations from initial startup.
  • Shutdown
    SP2 makes Outlook exit predictably despite pending activities.
  • Folder/View Switch
    SP2 improves view rendering and folder switching.

Before you go out and apply this on your machine, you need to be warned about the first startup experience, as we rebuild the tables and indexes in your OST. If you have a large OST, this is going to take some time, but I can tell you that the wait is well worth it. It is actually an entirely new experience, at least for me anyway, since I have a good deal of email items in my folders. Switching between folders with large item counts is no longer painful, and this hotfix provides immediate viewing of these folders.

Grab the hotfix from the following article:

Description of the Outlook 2007 hotfix package (Outlook.msp): February 24, 2009
http://support.microsoft.com/kb/961752

Check out the plethora of improvements in this release in the following article as there are many.

Outlook 2007 improvements in the February 2009 cumulative update
http://support.microsoft.com/kb/968009

Hope this helps tame some of the larger mailboxes that you have.

Mike

Windows Desktop Search and the implications on WAN performance


Windows Desktop Search (WDS) is a great tool to help you search through the unwieldy plethora of documents or emails that you may have scattered across your desktop. With the addition of 3rd-party IFilter add-ins, it is even easier to find what you are looking for.

As of version 3.01, Desktop Search disables indexing of online mailboxes in a default installation due to the performance implications on the Exchange server side. Companies sometimes still need to run Outlook in Online mode, either because security requirements prohibit local OSTs or because they need real-time email for business purposes. Alongside those requirements, companies also need fast message/document retrieval, which Windows Desktop Search can certainly provide.

WDS does have group policy settings that allow online indexing of mailboxes, and a listing of all the settings for WDS 4 is at http://technet.microsoft.com/en-us/library/cc732491.aspx. The setting that allows indexing of Online mode Outlook profiles is "Enable Indexing uncached Exchange Folders". Once this is deployed via group policy, WDS will start indexing Online mode Outlook profiles. This of course could put a huge strain on the server if deployed to a large user base, since all of those users' data is being indexed. Recommended guidance states that you should deploy this policy to smaller subsets of users to prevent possible server performance problems. This is similar to the guidance Microsoft gives for cached mode deployments.

With that said, I would like to now take us down a road where certain combinations of WDS policies can not only affect Exchange server side performance, but can also have serious implications on WAN performance. If you currently have a centralized Exchange deployment and users are accessing all of their email across WAN circuits, read on.

Let’s say you have an administrative assistant running in cached mode who needs access to another user's complete mailbox, with a requirement that the data in that mailbox be easily discoverable. This requirement can easily be met by using Windows Desktop Search and is very common in law firms. A default Outlook 2007 installation has the "Download shared folders (exclude mail folders)" option selected for the user profile, so if this assistant had previously opened another user's non-mail folder such as Contacts, Calendar or Tasks, WDS will index those items without any issue. Unfortunately this does not meet the complete requirement, as we need to index all items in the other user's mailbox. After Full Mailbox permission is added for this assistant, they can add the other user's mailbox to their profile to view its data. Once you do this, you will see that WDS still will not find any email items unless you select a folder in that mailbox and then perform the search. Everything so far is the default behavior.

WDS has a feature that allows you to index online delegate mailboxes; it is deployed via the GPO setting "Enable Indexing of online delegate Mailboxes". Once this setting is deployed, any mail-related folder from another user's mailbox that has been added to a user's profile will get indexed. That seems like a good thing, no? Well, we all know that indexing any type of mailbox in online mode increases the load on the Exchange server, and if users are doing this over a WAN, you will also see increased WAN utilization while WDS makes direct RPC calls to the Exchange server to index this data. If this setting were deployed to a large user base while a number of profiles have other mailboxes added, you could potentially saturate the network circuit. Your network administrator would obviously not be too happy, and your users would then start complaining that email access is really slow, or Outlook might even get disconnected due to the saturation. Our best practices dictate that this setting should be deployed to smaller user bases at a time to prevent increased client traffic to the Exchange server.

Imagine deploying this policy to 1,000 users, all accessing Exchange across a WAN and all with an added mailbox in their profile. By default, WDS will only index 120 items per minute, which should help keep the indexing traffic under control. If all of those users' workstations were indexing at that rate, we would be seeing about 120,000 items per minute of traffic. Couple that with any attachments that WDS is configured to index, such as PDF or Word documents, and this will make for a very bad network day.

There are ways to change the number of items that are indexed per minute by modifying the GPO setting "Enable Throttling Online Mailboxes". Setting this policy to a lower value reduces the number of items indexed per minute per mailbox and should help keep some of the network traffic down to a minimum. The caveat is that it will take longer to index these mailboxes. Keep in mind that this is still direct RPC traffic to the Exchange server with a minimal amount of throttling.

To help reduce some of this overhead, Outlook 2007 has a registry entry (CacheOthersMail) that allows you to cache other users' mail folders in an OST file. This was first introduced in KB955572 and requires that you disable the downloading of headers. It was then rolled up into the Outlook 2007 post-SP1 Sept. 24, 2008 hotfix package (957909). If the indexing of delegate mailboxes policy has been deployed to these users and you add this Outlook registry key, you will see a mixture of traffic being generated by WDS: direct RPC traffic to the Exchange server, and Outlook FxGetBuffer function calls, otherwise known as Outlook sync (ICS). The Outlook sync traffic becomes more prevalent over time as the other user's mailbox is cached locally in the OST file. FxGetBuffer calls are a lot less expensive than direct RPC calls to the Exchange server, so deploying the CacheOthersMail registry key may help with overall WAN utilization during initial indexing. You still need to plan on increased WAN traffic, as sync traffic coming from many clients could also cause WAN degradation.

WDS Registry Reference

Registry data to index data in your mailbox if you have an Online mode profile
Key: HKLM\software\policies\Microsoft\windows\windows search
DWORD: PreventIndexingUncachedExchangeFolders
Value: 0

Registry data to index shared mailboxes:
Key: HKLM\software\policies\Microsoft\windows\windows search
DWORD: EnableIndexingDelegateMailboxes
Value: 1

Registry data to change the amount of mail items that are indexed per minute.
Key: HKLM\software\policies\Microsoft\windows\windows search
DWORD: EnableThrottlingOnlineMailboxes
Value: 120
Accepted Values (Default: 120, Min: 6, Max:600)

Outlook Registry Reference

Registry data to Cache others users mail data in an OST
One-off users
Key: HKCU\Software\Microsoft\Office\12.0\Outlook\Cached Mode
DWORD: CacheOthersMail
Value: 1

GPO deployed
Key: HKCU\Software\Policies\Microsoft\Office\12.0\Outlook\Cached Mode
DWORD: CacheOthersMail
Value: 1
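
If you want to stage these values on a test machine before building the group policy, the following PowerShell sketch stamps the same registry data described in the tables above. This is only an illustration of the keys listed in this post; in production you would normally deliver them through Group Policy, and the throttling value of 60 is just an example of running below the 120-item default.

# Sketch: apply the WDS policy values from the tables above on a single test machine
$wds = 'HKLM:\SOFTWARE\Policies\Microsoft\Windows\Windows Search'
if (-not (Test-Path $wds)) { New-Item -Path $wds -Force | Out-Null }
New-ItemProperty -Path $wds -Name PreventIndexingUncachedExchangeFolders -PropertyType DWord -Value 0 -Force | Out-Null
New-ItemProperty -Path $wds -Name EnableIndexingDelegateMailboxes -PropertyType DWord -Value 1 -Force | Out-Null
New-ItemProperty -Path $wds -Name EnableThrottlingOnlineMailboxes -PropertyType DWord -Value 60 -Force | Out-Null

# Outlook 2007: cache other users' mail folders in the OST (CacheOthersMail)
$olk = 'HKCU:\Software\Microsoft\Office\12.0\Outlook\Cached Mode'
if (-not (Test-Path $olk)) { New-Item -Path $olk -Force | Out-Null }
New-ItemProperty -Path $olk -Name CacheOthersMail -PropertyType DWord -Value 1 -Force | Out-Null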

One of the most taxing combinations of WDS settings with respect to Exchange server and WAN performance is deploying "Enable Indexing uncached Exchange Folders" and "Enable Indexing of online delegate Mailboxes" simultaneously. If you also index attachments, which is the default behavior, this puts an increased burden on network resources and could cause considerable downtime for your users. Deploying these settings needs to be carefully planned out, especially in centralized Exchange installations, to prevent the situations I describe above.

Troubleshooting Exchange 2007 Store Log/Database growth issues


One of the common issues we see in support is excessive database and/or transaction log growth. If you have ever run into one of these issues, you will find that they are not always easy to troubleshoot, as many tools are needed to help understand where the problem might be coming from. Customers have asked why the server allows these types of operations to occur in the first place and why Exchange is not resilient to this. That is not always an easy question to answer, as there are so many variables as to why this may occur in the first place: faulty Outlook add-ins, custom or 3rd party applications, corrupted rules, corrupted messages, online maintenance not running long enough to properly maintain your database, and the list goes on and on.

Once an Outlook client has created a profile against the Exchange server, it pretty much has free rein to do whatever it wants within that MAPI profile. This will be controlled mostly by your organization's mailbox and message size limits and some of the client throttling or backoff features that are new to Exchange 2007.

Since I have dealt with these types of problems in great detail, I thought it would be helpful to share some troubleshooting steps that may help you collect data on, detect, and mitigate these problems when and if you see them.

General Troubleshooting

Exchange 2007 SP2 RU2 and Later

  • Exchange 2007 SP2 RU2 adds a new feature to help track these log growth issues much more easily. All you have to do is set warning and error thresholds in the registry, and once the log growth problem starts occurring, you can simply view the application log for events showing which user has crossed the thresholds you set. Note: This is not set by default after installing SP2 RU2, so if you are in the middle of a log growth issue, adding the appropriate registry keys to the server will help provide additional insight into the problem. See http://support.microsoft.com/kb/972705 for more information on this new feature and how to determine what values to set these registry keys to.

Outlook 2007

  • A new Outlook 2007 fix has been created so that any email with an attachment sent via MAPISendMail now honors message size limits. Below is a brief description of how MAPISendMail can affect log growth on an Exchange server. 
    • When using the Send To Mail Recipient facility in Windows with an Online mode Outlook client to send a message/attachment over the maximum message size limit, Outlook streams the data to the store before performing any message size limit checking, thus creating log files for the full size of the attachment. By the time the Outlook message window comes up, the damage is already done on the Exchange server. If you add a recipient to the message and try to send the email, you then receive the error "The messaging interface has returned an unknown error. If the problem persists, restart Outlook". If you then save the message in the mailbox, it succeeds. If you then pull up the message and send it, you now get the error "The message being sent exceeds the message size established for this user".
    • If you attempt the same process using a cached mode client, Outlook opens a new message with the attachment without any limit checks. If you add a recipient and then send the message, it sits in the user's Outbox. Performing a send/receive on the client then generates the error "Task 'Microsoft Exchange - Sending' reported error (0x80040610): 'The message being sent exceeds the message size established for this user.'" This is expected behavior. If the user then deletes the message, it ends up in the user's Deleted Items folder, which is then synched to the server. Messages over the size limit that are imported or saved into a user's mailbox do not honor overall message size limits during the sync process.
    • If you use the Send to Mail option in any Office program, you get the same results: in Online mode, we stream the data to the server prior to checking size limits. Cached mode reacts the same way.

      To resolve this issue for your Outlook 2007 users, install 978401 on every client machine.

Builds earlier than Exchange 2007 SP2 RU2

  1. Use Exchange User Monitor (Exmon) server side to determine if a specific user is causing the log growth problems.

    • Sort on CPU (%) and look at the top 5 users that are consuming the most CPU inside the Store process. Check the Log Bytes column for excessive log growth from a potential user.
    • If that does not reveal a likely user, sort on the Log Bytes column to look for any users that could be contributing to the log growth.
    • If the user in Exmon shows up as a ?, this is indicative of a HUB/Transport-related problem generating the logs. Query the message tracking logs using the Message Tracking Log tool in the Exchange Management Console's Toolbox to check for any large messages that might be running through the system. See step 5.9 for a PowerShell script to accomplish the same task.
  2. If suspected user is found via Exmon, then do one of the following:

    1. Disable MAPI access to the user's mailbox using the following steps (recommended):

      • Run Set-CASMailbox -Identity <Username> -MAPIEnabled $false

      • Move the mailbox to another mailbox store. Note: This is necessary to disconnect the user from the store because of the Store mailbox and DSAccess caches; otherwise you could be waiting over 2 hours and 15 minutes for the setting to take effect. Moving the mailbox effectively kills the user's MAPI session to the server, and after the move, the user's access to the store via a MAPI-enabled client will be disabled.

    2. Disable the user's AD account temporarily

    3. Kill their TCP connection with TCPView

    4. Call the user and have them close Outlook while the problem is occurring for immediate relief.

  3. If closing the client down or killing their sessions seems to stop the log growth issue, then we need to do the following to see if this is OST or Outlook profile related:

    1. Have the user launch Outlook while holding down the Ctrl key, which will prompt whether you would like to run Outlook in safe mode. If launching Outlook in safe mode resolves the log growth issue, then concentrate on which add-ins could be contributing to this problem.

    2. If you can gain access to the users machine, then do one of the following:

      1. Launch Outlook to confirm the log file growth issue on the server.

      2. If log growth is confirmed, do one of the following

        1. Check the user's Outbox for any messages.

          1. If the user is running in cached mode, set the Outlook client to Work Offline. Doing this helps stop the message in the Outbox from being sent and sometimes causes the message to NDR.

          2. If the user is running in Online mode, try moving the message to another folder to prevent Outlook or the HUB server from processing it.

          3. After each one of the steps above, check the Exchange server to see if log growth has ceased.

        2. Call Microsoft Product Support to enable debug logging of the Outlook client to determine possible root cause.

      3. Follow the Running Process Explorer instructions in the article below to dump out the DLLs running within the Outlook process. Name the file username.txt. This helps check for any 3rd party Outlook add-ins that may be causing the excessive log growth.

        970920  Using Process Explorer to List dlls Running Under the Outlook.exe Process
        http://support.microsoft.com/kb/970920

      4. Check the Sync Issues folder for any errors that might be occurring

    3. Let’s attempt to narrow this down further to see if the problem is truly in the OST or something possibly Outlook Profile related:

      1. Run ScanPST against the user's OST file to check for possible corruption.

      2. With the Outlook client shut down, rename the user's OST file to something else and then launch Outlook to recreate a new OST file. If the problem does not occur, we know the problem is within the OST itself.

      3. If the problem recurs even with the newly created OST, recreate the user's profile to see if the problem might be profile related.

  4. Ask Questions:

    1. Is the user using any type of mobile device?

    2. Question the end user, if at all possible, to understand what they might have been doing at the time the problem started. It's possible that the user imported a lot of data from a PST file, which could cause log growth server side, or that there was some other erratic behavior tied to a user action.

  5. If Exmon does not provide the data that is necessary to get root cause, then do the following:

    1. Check current queues against all HUB Transport Servers for stuck or queued messages

      get-exchangeserver | where {$_.IsHubTransportServer -eq "true"} | Get-Queue | where {$_.DeliveryType -eq "MapiDelivery"} | Select-Object Identity, NextHopDomain, Status, MessageCount | export-csv HubQueues.csv

      Review queues for any that are in retry or have a lot of messages queued.

      Export out message sizes in MB in all Hub Transport queues to see if any large messages are being sent through the queues.

      get-exchangeserver | where {$_.ishubtransportserver -eq "true"} | get-message -resultsize unlimited | sort-object -property size -descending | Select-Object Identity,Subject,status,LastError,RetryCount,queue,@{Name="Message Size MB";expression={$_.size.toMB()}} | export-csv HubMessages.csv

      Export out message sizes in Bytes in all Hub Transport queues.

      get-exchangeserver | where {$_.ishubtransportserver -eq "true"} | get-message -resultsize unlimited | Select-Object Identity,Subject,status,LastError,RetryCount,queue,size | sort-object -property size -descending | export-csv HubMessages.csv

    2. Check users' Outboxes for any large, looping, or stranded messages that might be affecting overall log growth.

      get-mailbox -ResultSize Unlimited| Get-MailboxFolderStatistics -folderscope Outbox | Sort-Object Foldersize -Descending | select-object identity,name,foldertype,itemsinfolder,@{Name="FolderSize MB";expression={$_.folderSize.toMB()}} | export-csv OutboxItems.csv

      Note: This does not get information for users that are running in cached mode.

    3. Utilize the MSExchangeIS Client\Jet Log Record Bytes/sec and MSExchangeIS Client\RPC Operations/sec Perfmon counters to see if a particular client protocol may be generating excessive logs. If a particular protocol is found to be higher than the others for a sustained period of time, consider shutting down the service hosting that protocol. For example, if Outlook Web Access is the protocol generating the potential log growth, stop the World Wide Web Publishing Service (W3SVC) to confirm that log growth stops. If it does, collecting IIS logs from the CAS/MBX Exchange servers involved will help provide insight into what action the user was performing to cause this.

    4. Run the following command from the Management shell to export out current user operation rates:

      To export to CSV File:

      get-logonstatistics |select-object username,Windows2000account,identity,messagingoperationcount,otheroperationcount,progressoperationcount,streamoperationcount,tableoperationcount,totaloperationcount | where {$_.totaloperationcount -gt 1000} | sort-object totaloperationcount -descending| export-csv LogonStats.csv

      To view realtime data:

      get-logonstatistics |select-object username,Windows2000account,identity,messagingoperationcount,otheroperationcount,progressoperationcount,streamoperationcount,tableoperationcount,totaloperationcount | where {$_.totaloperationcount -gt 1000} | sort-object totaloperationcount -descending| ft

      Key things to look for:
      In the example below, the Administrator account was storming the testuser account with email.
      You will notice that there are two users active here: the Administrator, who is submitting all of the messages, and an entry whose Windows2000Account references a HUB server and whose Identity references testuser. The HUB server entry also has *no* UserName, so that is a giveaway right there. This can give you a better understanding of which parties are involved in these high rates of operations.

      UserName : Administrator
      Windows2000Account : DOMAIN\Administrator
      Identity : /o=First Organization/ou=First Administrative Group/cn=Recipients/cn=Administrator
      MessagingOperationCount : 1724
      OtherOperationCount : 384
      ProgressOperationCount : 0
      StreamOperationCount : 0
      TableOperationCount : 576
      TotalOperationCount : 2684

      UserName :
      Windows2000Account : DOMAIN\E12-HUB$
      Identity : /o= First Organization/ou=Exchange Administrative Group (FYDIBOHF23SPDLT)/cn=Recipients/cn=testuser
      MessagingOperationCount : 630
      OtherOperationCount : 361
      ProgressOperationCount : 0
      StreamOperationCount : 0
      TableOperationCount : 0
      TotalOperationCount : 1091

    5. Enable Perfmon/Perfwiz logging on the server. Collect data through the problem times and then review for any irregular activities. You can grab some pre-canned Perfmon import files at http://blogs.technet.com/mikelag/archive/2008/05/02/perfwiz-replacement-for-exchange-2007.aspx to make collecting this data easier.

    6. Run ExTRA (Exchange Troubleshooting Assistant) via the Toolbox in the Exchange Management Console to look for any possible Functions (via FCL Logging) that may be consuming Excessive times within the store process. This needs to be launched during the problem period. http://blogs.technet.com/mikelag/archive/2008/08/21/using-extra-to-find-long-running-transactions-inside-store.aspx shows how to use FCL logging only, but it would be best to include Perfmon, Exmon, and FCL logging via this tool to capture the most amount of data.

    7. Dump the store process during the time of the log growth. (Use this as a last measure once all prior activities have been exhausted and prior to calling Microsoft for assistance. These issues are sometimes intermittent, and the quicker you can obtain any data from the server, the better as this will help provide Microsoft with information on what the underlying cause might be.)

      1. Download Procdump 3.0 or greater from http://technet.microsoft.com/en-us/sysinternals/dd996900.aspx and extract it to a directory on the Exchange server

      2. Open a command prompt and change into the directory to which Procdump was extracted in step 1.

      3. Type procdump -mp -s 120 -n 2 store.exe d:\DebugData. This will dump the data to D:\DebugData; change this to whatever directory has enough space to hold two full dumps of the store.exe process. Check Task Manager to see how much memory store.exe is currently consuming for a rough estimate of the amount of space needed.

        Important: If Procdump is being run against a store that is on a clustered server, make sure that you set the Exchange Information Store resource to not affect the group. If the entire store dump cannot be written out in 300 seconds, the cluster service will kill the store service, ruining any chance of collecting the appropriate data on the server.

      4. Open a case with Microsoft Product Support Services to get this data looked at.

    8. Collect a portion of the store transaction log files (100 is a good number) during the problem period and parse them following the directions in http://blogs.msdn.com/scottos/archive/2007/11/07/remix-using-powershell-to-parse-ese-transaction-logs.aspx to look for patterns such as high counts for IPM.Appointment. This will give you a high-level overview of whether something is looping or messages are being sent at a high rate. Note: This tool may or may not provide any benefit depending on the data stored in the log files, but it will sometimes show MIME-encoded data that helps with your investigation.

    9. Export out Message tracking log data from affected MBX server

      Method 1
      Download the ExLogGrowthCollector.zip file attached to this post and extract it to the MBX server that experienced the issue. Run ExLogGrowthCollector.ps1 from the Exchange Management Shell. Enter the MBX server name that you would like to trace and the start and end times, and then click the Collect Logs button.

      [Screenshot: the ExLogGrowthCollector tool]

      Note: This script exports all mail traffic to/from the specified mailbox server across all HUB servers between the times specified. This helps provide insight into any large or looping messages that might have caused the log growth issue.

      Method 2
      Copy and paste the following into Notepad, save it as msgtrackexport.ps1, and then run it on the affected mailbox server. Open the resulting CSV in Excel for review. This is similar to the GUI version, but requires you to enter the values manually.

      #Export Tracking Log data from affected server specifying Start/End Times

      Write-host "Script to export out Mailbox Tracking Log Information"
      Write-Host "#####################################################"
      Write-Host
      $server = Read-Host "Enter Mailbox server Name"
      $start = Read-host "Enter start date and time in the format of MM/DD/YYYY hh:mmAM"
      $end = Read-host "Enter end date and time in the format of MM/DD/YYYY hh:mmPM"
      $fqdn = $(get-exchangeserver $server).fqdn
      Write-Host "Writing data out to csv file..... "
      Get-ExchangeServer | where {$_.IsHubTransportServer -eq "True" -or $_.name -eq "$server"} | Get-MessageTrackingLog -ResultSize Unlimited -Start $start -End $end  | where {$_.ServerHostname -eq $server -or $_.clienthostname -eq $server -or $_.clienthostname -eq $fqdn} | sort-object totalbytes -Descending | export-csv MsgTrack.csv -NoType
      Write-Host "Completed!! You can now open the MsgTrack.csv file in Excel for review"


      Method 3
      You can also use the Process Tracking Log Tool at http://msexchangeteam.com/archive/2008/02/07/448082.aspx to provide some very useful reports.

    10. Save off a copy of the application and system logs from the affected server and review them for any events that could be contributing to this problem.

    11. Enable IIS extended logging for the CAS and MBX server roles, adding the sc-bytes and cs-bytes fields, to track large messages being sent via IIS protocols and to track usage patterns.

Proactive monitoring and mitigation efforts

  1. In backup-less environments, if the “Do not permanently delete mailboxes and items until the store has been backed up” setting is checked on an Exchange 2003 database, or the RetainDeletedItemsUntilBackup parameter is set to $true on an Exchange 2007 database, then over time this setting can lead to consistent, steady store growth, because items are retained indefinitely and the whitespace in the database is continually consumed. Even with online maintenance running on the server, these pages in the database are never reclaimed to free up space while this flag is set.
  2. Check whether online maintenance for the database in question has been running nightly in the application log.
  3. Check whether any move mailbox operations are occurring that might be moving users to this database exhibiting the log growth issue.
  4. Increase Diagnostics Logging for the following objects depending on what stores are being affected:

    • MSExchangeIS\Mailbox\Rules
    • MSExchangeIS\PublicFolders\Rules
  5. Enable Client Side monitoring per http://technet.microsoft.com/en-us/library/cc540465.aspx
  6. Create a monitoring plan using MOM/SCOM to alert when the amount of log bytes being written hits a specific threshold and then alert the messaging team for further action. There are thresholds in the Exchange 2007 Management Pack that can help alert on these types of situations before the problem gets to the point of taking a database offline. Here are two examples of this.

    ESE Log Byte Write/sec MOM threshold
    Warning Event
    http://technet.microsoft.com/en-us/library/bb218522.aspx

    Error Event
    http://technet.microsoft.com/en-us/library/bb218733.aspx

    If an alert is raised, then perform an operation to start collecting data.
  7. Ensure http://support.microsoft.com/kb/958701 is installed at a minimum for each Outlook 2003 client to address known log/database growth issues for users streaming data to the information store after exceeding message size limits. This fix also addresses a problem where a client could copy a message from a PST into the Inbox that exceeds mailbox limits during the sync process, causing excessive log growth on the server.

    These hotfixes make use of the PR_PROHIBIT_SEND_QUOTA and PR_MAX_SUBMIT_MESSAGE_SIZE properties, which are referenced in http://support.microsoft.com/kb/894795

    Additional Outlook Log Growth fixes:
    http://support.microsoft.com/kb/957142
    http://support.microsoft.com/kb/936184

  8. Implement minimum Outlook client versions that can connect to the Exchange server via the Disable MAPI clients registry key on the server side. See http://technet.microsoft.com/en-us/library/bb266970.aspx for more information.

    To disable clients less than Outlook 2003 SP2, use the following entries on an Exchange 2007 server
    "-5.9.9;7.0.0-11.6568.6567"

    Setting this to exclude Outlook client versions older than Outlook 2003 SP2 will help protect against stream issues to the store. The reason is that Outlook 2003 SP2 and later understand the new quota properties that were introduced into the store in http://support.microsoft.com/kb/894795. Older clients have no idea what these properties are, so if a user sent a 600MB attachment on a message, the entire message would be streamed to the store, generating excessive log files, and would then get NDR’ed once the message size limits were checked. With SP2 installed, the Outlook client will first check whether the attachment size is over the quota set for the organization, immediately stop the send with a warning message on the client, and prevent the stream from being sent to the server.

    Allowing any client older than SP2 to connect to the store leaves the Exchange servers open to a growth issue.

  9. If Entourage clients are being utilized, then implement the MaxRequestEntityAllowed property in http://support.microsoft.com/kb/935848  to address a known issue where sending a message over the size limit could potentially create log growth for a database.
  10. Check to ensure File Level Antivirus exclusions are set correctly for both files and processes per http://technet.microsoft.com/en-us/library/bb332342.aspx
  11. Enable Content Conversion tracing on all HUB servers per http://technet.microsoft.com/en-us/library/bb397226.aspx . This will help log any failed conversion attempts that may be causing the log growth problem to occur.
  12. If POP3 or IMAP4 clients are connecting to specific servers, then implementing Protocol Logging for each on the servers that may be making use of these protocols will help log data to a log file where these protocols are causing excessive log growth spurts. See http://technet.microsoft.com/en-us/library/aa997690.aspx on how to enable this logging.
  13. Ensure online maintenance is completing a pass for each database within the past week or two. Query the application event logs for ESE events 700 through 704 to confirm. If log growth issues occur during online maintenance periods, this can be normal, as Exchange shuffles data around in the database; just keep this in mind when looking at these log growth problems.
  14. Check for any excessive ExCDO warning events related to appointments in the application log on the server (examples are 8230 or 8264 events). http://support.microsoft.com/kb/947014 is just one example of this issue. If problematic recurring meeting items are found, then try to regenerate the calendar data server side via a process called POOF. See http://blogs.msdn.com/stephen_griffin/archive/2007/02/21/poof-your-calender-really.aspx for more information on what this is.

    Event Type: Warning
    Event Source: EXCDO
    Event Category: General
    Event ID: 8230
    Description: An inconsistency was detected in username@domain.com: /Calendar/<calendar item> .EML. The calendar is being repaired. If other errors occur with this calendar, please view the calendar using Microsoft Outlook Web Access. If a problem persists, please recreate the calendar or the containing mailbox.

    Event Type: Warning
    Event ID : 8264
    Category : General
    Source : EXCDO
    Type : Warning
    Message : The recurring appointment expansion in mailbox <someone's address> has taken too long. The free/busy information for this calendar may be inaccurate. This may be the result of many very old recurring appointments. To correct this, please remove them or change their start date to a more recent date.

    Important: If 8230 events are consistently seen on an Exchange server, have the user delete/recreate that appointment to remove any corruption

  15. Add additional store logging per http://support.microsoft.com/kb/254606 to expose more performance counter data that can be collected with Perfmon. This allows us to utilize counters such as ImportDeleteOpRate and SaveChangesMessageOpRates to see what these common log growth rates are.
  16. Recommend forcing end dates on recurring meetings. This can be done through the registry key DisableRecurNoEnd (DWORD); a sample per-user command is shown after this list.

    For Outlook 2003:
    http://support.microsoft.com/kb/952144
    HKEY_CURRENT_USER\Software\Microsoft\Office\11.0\Outlook\Preferences

    For Outlook 2007:
    http://support.microsoft.com/kb/955449
    HKEY_CURRENT_USER\Software\Microsoft\Office\12.0\Outlook\Preferences
    Value: 1 to Enable, 0 to Disable
  17. Implement LimitEmbeddingDepth on the Exchange servers as outlined in KB 833607 to prevent log growth due to recursion looping. Note: The article states this is for Exchange 2000-2003, but per the source code the key is still valid in Exchange 2007.
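
As mentioned in item 16, here is a minimal sketch of setting DisableRecurNoEnd for a single Outlook 2007 user from PowerShell, using the registry path from KB 955449 listed above. Deploying the value through Group Policy is the more typical approach; this per-user form is only for illustration.

# Sketch: force end dates on new recurring meetings for the current user (Outlook 2007)
$prefs = 'HKCU:\Software\Microsoft\Office\12.0\Outlook\Preferences'
if (-not (Test-Path $prefs)) { New-Item -Path $prefs -Force | Out-Null }
New-ItemProperty -Path $prefs -Name DisableRecurNoEnd -PropertyType DWord -Value 1 -Force | Out-Null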

Known Issues

Exchange Server

SP1 Release Update 9 fixes

  • 959559 - Transaction log files grow unexpectedly in an Exchange Server 2007 Service Pack 1 mailbox server on a computer that is running Windows Server 2008
  • 925252 - The Store.exe process uses almost 100 percent of CPU resources, and the size of the public folder store increases quickly in Exchange Server 2007
  • 961124 - Some messages are stuck in the Outbox folder or the Drafts folder on a computer that is running Exchange Server 2007 Service Pack 1
  • 970725 - Public folder replication messages stay in the local delivery queue and cause an Exchange Server 2007 Service Pack 1 database to grow quickly

SP1 Release Update 8 fixes

  • 960775 - You receive a "Message too large for this recipient" NDR that has the original message attached after you restrict the Maximum Message Send Size value in Exchange Server 2007

SP1 Release Update 7 fixes

  • 957124 - You do not receive an NDR message even though your meeting request cannot be sent successfully to a recipient
  • 960775 - You receive a "Message too large for this recipient" NDR that has the original message attached after you restrict the Maximum Message Send Size value in Exchange Server 2007

SP1 Release Update 1 fixes

  • 947014 - An Exchange Server 2007 mailbox server randomly generates many transaction logs in an Exchange Server 2007 Service Pack 1 environment
  • 943371 - Event IDs 8206, 8213, and 8199 are logged in an Exchange Server 2007 environment

Outlook 2007

  • 970944 - Installing this hotfix package addresses an issue where log files are generated unexpectedly when a user is running Outlook 2007 in Cached Exchange mode and sends an e-mail message to recipients who have a corrupted e-mail address or a corrupted e-mail address type
  • 970777 - Additional log files are generated on the Exchange server unexpectedly when you send an e-mail message to recipients who have a corrupted e-mail address or a corrupted e-mail address type by using Cached Exchange mode in Outlook 2007 
  • 978401 - Description of the Office Outlook 2007 hotfix package (Outlook-x-none.msp): February 23, 2010 (Includes a MAPISendMAIL fix)

Outlook 2003

  • 958701 - Description of the Outlook 2003 Post-Service Pack 3 hotfix package (Engmui.msp, Olkintl.msp, Outlook.msp): October 28, 2008
  • 936184 - Description of the Outlook 2003 post-Service Pack 3 hotfix package: December 14, 2007
  • 897247 - Description of the Microsoft Office Outlook 2003 post-Service Pack 1 hotfix package: May 2, 2005

Entourage

  • 935848 - Various performance issues occur when you use Entourage for Mac to send large e-mail messages to an Exchange 2007 server

Windows 2008

  • 955612 - The "LCMapString" function may return incorrect mapping results for some languages in Windows Server 2008 and in Windows Vista

How to collect per request Performance Stats for IIS on Exchange 2007


Ever had a time where you were trying to troubleshoot an IIS performance-related issue on Exchange 2007 and the built-in performance counters were not giving you the data you needed to gain insight into the problem? I know I have run into these before, and they are not always easy to track, as we cannot easily see latencies at a per-request level.

As part of the default installation of Exchange 2007, you may also have seen IIS log entries similar to the following, but didn’t know what the appended data meant.

/owa/ev.owa oeh=1&ns=DatePicker&ev=GetFreeBusy&m=2009-04-01T00%3a00%3a00&fId=LgAAAADBC0ggZ4mHTKllH8Mc0937AQBmBiNCEaM7R53LcWBj0I1aAAAAAACrAAAC&prfltncy=98&prfrpccnt=6&prfrpcltncy=78&prfldpcnt=0&prfldpltncy=0&prfavlcnt=0&prfavlltncy=0

The fields I am calling out in this IIS log request are prfltncy, prfrpccnt, prfrpcltncy, prfldpcnt, prfldpltncy, prfavlcnt and prfavlltncy. These are the latency and count entries appended to the end of each call being made. There may only be a handful of these throughout the logs by default.

Luckily, there is a way to enable additional per-request tracing in the IIS logs to help you troubleshoot these types of performance problems. This tracing allows you to see per-request latencies for OWA, RPC and Availability requests.

To enable this additional logging, you would do the following:

Go to "Program Files\Microsoft\Exchange Server\ClientAccess\OWA". Edit web.config in Notepad. Add the following line of text under appSettings:
<add key="CollectPerRequestPerformanceStats" value="true"/>
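
If you prefer to script the change rather than edit the file in Notepad, here is a rough sketch that appends the key with PowerShell. The path assumes a default Exchange 2007 installation location; editing web.config by hand as described above works just as well.

# Sketch: append the CollectPerRequestPerformanceStats key to the OWA web.config
$path = 'C:\Program Files\Microsoft\Exchange Server\ClientAccess\OWA\web.config'
Copy-Item $path "$path.bak"                                   # keep a backup before editing
[xml]$config = Get-Content $path
$add = $config.CreateElement('add')
$add.SetAttribute('key','CollectPerRequestPerformanceStats')
$add.SetAttribute('value','true')
$config.configuration.appSettings.AppendChild($add) | Out-Null
$config.Save($path)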

After saving the web.config file, you should start seeing entries in the IIS logs similar to the above, but here is another log example:

/owa/default.aspx modurl=7&prfltncy=84212&prfrpccnt=37&prfrpcltncy=84011&prfldpcnt=9&prfldpltncy=30&prfavlcnt=0&prfavlltncy=0

In the above request, we can see that the RPC latencies are high (prfltncy=84212&prfrpccnt=37&prfrpcltncy=84011) , so this was most likely a bottleneck between the CAS and the backend Mailbox server. Now wasn’t that easy to determine where the potential bottleneck might lie?

Per Request Tracing Legend
Prfltncy - Overall latency for this request
Prfrpccnt - RPC request count
Prfrpcltncy - RPC latency
Prfldpcnt - LDAP request count
Prfldpltncy - LDAP latency
Prfavlcnt - Availability request count
Prfavlltncy - Availability latency

If you break one of these log requests down, here is how you would read it, based on the first request example above. (Note: This was a call to get Free/Busy data for a specific time period.)

  • prfltncy=98 - Overall Performance Latency for the request
  • prfrpccnt=6&prfrpcltncy=78 - 6 RPC requests with a latency of 78ms
  • prfldpcnt=0&prfldpltncy=0 - 0 LDAP requests with a latency of 0ms
  • prfavlcnt=0&prfavlltncy=0 - 0 Availability requests with a latency of 0ms

You can use any log parser of your choice (e.g. logparser.exe) to get further information, but this should help you understand some of the latencies down to a per-request level.
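
If you just want a quick look without a full log parser, here is a rough PowerShell sketch that scans a single IIS log for requests whose overall latency (prfltncy) exceeds a threshold. The log path and field positions are assumptions based on a default W3C logging configuration; check the #Fields header in your own logs before relying on the positions.

# Sketch: flag requests slower than 5 seconds based on the prfltncy value
$logFile   = 'C:\WINDOWS\system32\LogFiles\W3SVC1\ex090401.log'   # adjust to your own log
$threshold = 5000                                                 # milliseconds
Get-Content $logFile | ForEach-Object {
    if ($_ -match 'prfltncy=(\d+)' -and [int]$matches[1] -gt $threshold) {
        $fields = $_ -split ' '
        # With default W3C fields, cs-uri-stem and cs-uri-query are usually the 5th and 6th fields
        '{0}ms {1} {2}' -f $matches[1], $fields[4], $fields[5]
    }
}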

I hope this helps in your performance troubleshooting…..

The Case of the Mysterious Exchange Server Hang


Recently we had a case in which an Exchange 2003 server would hang and no longer accept any new RPC connections to the Information Store. The rest of the server seemed to be operating just fine, but it was the Store that was ultimately having the problems.

I took a look at the perfmon data that was provided and didn’t see anything out of the ordinary except for a small amount of RPC operations taking place on the server. The server did look like it was processing data, though, so this was quite intriguing to me. I did notice that one DC had a number of outstanding LDAP requests for an extended period of time, as shown below.

[Screenshot: Perfmon showing sustained outstanding LDAP requests against one DC]

We ended up taking some dumps of the Store, IIS processes, and LSASS to see what might be going on. The Store and IIS dumps were not that interesting. Looking at the LSASS dumps was an eye opener: we saw that over 150 threads were hung up calling into the Security Provider Manager (SPM) via SECUR32!CALLSPM. The beginnings of the stacks showed secur32!LsaAcceptSecurityContext calls, which were mostly client authentication calls to the server. More info on the AcceptSecurityContext calls can be found here.

There were almost 200 other threads calling netlogon!NlpUserValidateHigher, which essentially means we were trying to send user validation requests to a higher authority over the secure channel. Once we accept a validation request, we then attempt to connect to the DC over RPC to handle it. Debug analysis can be found on Dave Goldman’s blog here.

By default, Netlogon only allows 2 concurrent API calls for these authentication requests, which is controlled by a semaphore. If the 2 semaphore slots are tied up waiting for a response from the DC, all other requests start queuing, producing this mysterious hang effect on the Exchange server. This was our problem: the debug analysis showed that we had hit our maximum of 2 concurrent requests, most likely to an overloaded DC, leaving a backlog of authentication requests. This request queue is controlled by the MaxConcurrentApi setting. Each request has a default timeout of 45 seconds, so if requests were timing out, that would surely cause delays for other users. On healthy servers with good network connectivity, these authentication requests are extremely fast.

At this point we knew we were tied up in authentication calls to DCs, but we couldn’t determine which users’ logons were taking the most time. The debug information only shows a point in time. It was possible that a user was trying to authenticate against a down-level domain across a slow WAN link, but we couldn’t be sure yet.

To move forward, we enabled Netlogon Debug logging per http://support.microsoft.com/kb/109626 and let the problem occur again.

We opened the netlogon.log file and started reviewing the information. Prior to the problem, we can see that responses return in a timely manner. Notice that the Entered/Returns pairs happen within the same second.

07/31 11:36:11 [LOGON] SamLogon: Network logon of US\User1 from COMPUTER1 Entered
07/31 11:36:11 [LOGON] SamLogon: Network logon of US\User1 from COMPUTER1 Returns 0x0

07/31 11:36:11 [LOGON] SamLogon: Network logon of EUROPE\User2 from COMPUTER2 Entered
07/31 11:36:11 [LOGON] SamLogon: Network logon of EUROPE\User2 from COMPUTER2 Returns 0x0

07/31 11:36:11 [LOGON] SamLogon: Network logon of ASIA\User3 from COMPUTER3 Entered
07/31 11:36:11 [LOGON] SamLogon: Network logon of ASIA\User3 from COMPUTER3 Returns 0x0

As traffic increases, the response times are starting to get a little slower

07/31 11:53:56 [LOGON] SamLogon: Network logon of ASIA\User3 from COMPUTER3 Entered
07/31 11:54:14 [LOGON] SamLogon: Network logon of ASIA\User3 from COMPUTER3 Returns 0x0

07/31 11:53:57 [LOGON] SamLogon: Network logon of EUROPE\User2 from COMPUTER2 Entered
07/31 11:54:17 [LOGON] SamLogon: Network logon of EUROPE\User2 from COMPUTER2 Returns 0x0

Now we see a response time right at the 45-second timeout below.

07/31 11:57:02 [LOGON] SamLogon: Network logon of EUROPE\User2 from COMPUTER2 Entered
07/31 11:57:47 [LOGON] SamLogon: Network logon of EUROPE\User2 from COMPUTER2 Returns 0x0

Here is where our first netlogon timeout hit

07/31 11:57:03 [LOGON] SamLogon: Network logon of ASIA\User3 from COMPUTER3 Entered
07/31 11:57:48 [CRITICAL] EXDOMAIN: NlAllocateClientApi timed out: 0 258
07/31 11:57:48 [CRITICAL] EXDOMAIN: NlpUserValidateHigher: Can't allocate Client API slot.
07/31 11:57:48 [SESSION] I_NetLogonGetAuthData called: (null) EXDOMAIN(Flags 0x1) 
07/31 11:57:48 [LOGON] SamLogon: Network logon of ASIA\User3 from COMPUTER3 Returns 0xC000005E

0xC000005E = STATUS_NO_LOGON_SERVERS

Now we are seeing that we cannot allocate a Client API slot because the maximum request queue is busy servicing other requests.

07/31 11:58:55 [CRITICAL]EXDOMAIN: NlAllocateClientApi timed out: 0 258
07/31 11:58:55 [CRITICAL] EXDOMAIN: NlpUserValidateHigher: Can't allocate Client API slot.

07/31 12:38:08 [CRITICAL] EXDOMAIN: NlAllocateClientApi timed out: 0 258
07/31 12:38:08 [CRITICAL] EXDOMAIN: NlpUserValidateHigher: Can't allocate Client API slot.

Now we get to an actual DC timeout error as shown below.

08/01 17:21:24 [CRITICAL] NlPrintRpcDebug: Couldn't get EEInfo for I_NetLogonSamLogonEx: 1761 (may be legitimate for 0xc0000064)
08/01 17:21:24 [CRITICAL] EXDOMAIN: NlFinishApiClientSession: timeout call to \\DC1.domain.com.  Count: 2
08/01 17:21:24 [CRITICAL] EXDOMAIN: NlFinishApiClientSession: dropping the session to \\DC1.domain.com
08/01 17:21:24 [CRITICAL] EXDOMAIN: NlSetStatusClientSession: Set connection status to c000005e

We can see clearly now that DC1 was having problems servicing authentication requests from this Exchange server. This does not always mean the DC is overloaded; it could be a really slow down-level trust causing the problem, so additional investigation needs to be performed at this point. We just know that Exchange is the victim and the problem is elsewhere.

Troubleshooting methodologies

So what can we do at this point?

We can test secure channels for the different domains to see which might be failing. First, obtain the DC with which the secure channel is currently formed on the Exchange server by running nltest /sc_query:<domain>, replacing <domain> with the name of the domain the Exchange server resides in.

Once that DC is found, you will then run a command similar to the following for each of the domains:

nltest /server:DC1 /sc_query:ASIA
nltest /server:DC1 /sc_query:EUROPE
nltest /server:DC1 /sc_query:US

This will help fish out any down-level domains that could be causing authentication delays.
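
If there are many domains to check, a small loop saves some typing. This is just a sketch: DC1 and the domain list are placeholders for the secure channel DC and the domains in your own environment, and it assumes nltest.exe is available on the server.

$dc      = 'DC1'
$domains = 'ASIA','EUROPE','US'
foreach ($domain in $domains) {
    Write-Host "Secure channel check for $domain"
    & nltest.exe "/server:$dc" "/sc_query:$domain"
}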

You can also enable netlogon debug logging on the DC’s to help understand the traffic patterns there.

Installing the Server Performance Advisor on the Windows 2003 DC’s or using the Active Directory Diagnostics Data Collector in the Windows 2008 Reliability and Performance monitor will help fish out any potential bottlenecks.

Take netmon captures and search for NetrLogonSamLogonEx entries for Netlogon requests

For Windows 2003 servers, you can install the following hotfix to help track these types of issues faster. This hotfix adds new performance counters to help track access to these semaphores. Windows 2008 servers already have these counters built in to the OS.

New performance counters for Windows Server 2003 let you monitor the performance of Netlogon authentication
http://support.microsoft.com/default.aspx/kb/928576

The main ones you want to look at are the following:

Semaphore Holders: How many threads, on average, are holding the client semaphore

This is the number of threads with a Netlogon call outstanding to a DC that are blocked. Blocked could mean the slot is held open by a process, the network is down, and so on. When Semaphore Waiters is non-zero, some local process is waiting on LSASS for a response and the LSASS thread is blocked. This correlates to the MaxConcurrentApi setting.

By default this value should be less than 2 at any given time. If values at or above 2 are sustained, then either the Exchange server or the DCs are overloaded.

Average Semaphore Hold Time: The average time a thread holds the semaphore once it has acquired it

These values should normally be very low. Longer hold times mean that a potential bottleneck is occurring.

Semaphore Waiters: The average number of threads waiting to acquire the semaphore

This value should remain at 0 at all times. Short, bursty spikes are OK to see, as that simply means we had a large number of requests that were handled in a short period of time.
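
If you would rather watch these counters from a shell than from Perfmon, here is a sketch using Get-Counter (PowerShell 2.0 or later, so Windows 2008 era; on Windows 2003 collect the same counters with Perfmon or typeperf). The counter paths assume the Netlogon performance object added by the hotfix above.

$counters = '\Netlogon(*)\Semaphore Waiters',
            '\Netlogon(*)\Semaphore Holders',
            '\Netlogon(*)\Average Semaphore Hold Time'
Get-Counter -Counter $counters -SampleInterval 5 -MaxSamples 12 |
    ForEach-Object { $_.CounterSamples | Format-Table Path, CookedValue -AutoSize }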

In some instances on heavily loaded servers, you may want to adjust MaxConcurrentApi to a value of 5 on both the Exchange servers and the DCs to widen the pipe, that is, to increase the number of authentication requests that can be in flight at any given time. Bumping this setting up may alleviate the problem altogether, but it can also prolong the issue by masking some other underlying problem. It’s always best to understand where the problem is coming from before making a major change like this, which may also increase overall processor utilization on the Exchange server and your domain controllers.

In this instance, we set MaxConcurrentApi to 5 on the DCs and Exchange servers, and this appears to have reduced the number of occurrences of this problem.
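
For reference, MaxConcurrentApi lives under the Netlogon parameters key, and the Netlogon service has to be restarted for a change to take effect. Here is a minimal sketch of one way to set it, assuming you have already decided that raising the limit is appropriate for your environment (apply it to both the Exchange servers and the DCs):

$key = 'HKLM:\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters'
New-ItemProperty -Path $key -Name MaxConcurrentApi -PropertyType DWord -Value 5 -Force | Out-Null
Restart-Service Netlogon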

This particular problem not only affects Exchange servers, but also affects other applications such as ISA server. More information on this can be found here

I hope this provides some insight into some of the underlying dependency problems that you may see with Exchange.

That is all for now.


Exchange 2007 Memory usage and helpful troubleshooting tips


In support, we get a lot of statements that Exchange is using all or most of the memory on a given server. Some may say that is a bad thing, but for Exchange 2007 it is actually a good thing. In this blog, I would like to explain some of the misconceptions about Exchange’s memory usage in relation to overall/allocated memory and the paging file and its usage. I previously blogged about Exchange 2007 memory usage in Understanding Exchange 2007 Memory Usage and its use of the Paging File, but it appears that more clarification is needed in this area. I am also going to show some real-world screenshots from customers’ actual perfmon log files that show good and bad behavior, and how this might help you troubleshoot what type of memory issue you might have, if any.

So let’s start with the paging file and its usage, as that appears to be a common question that comes up all of the time. Some of the questions stem from PF Usage in Task Manager, shown below on a Windows 2003 server, and from server monitoring software reporting it as a problem. PF Usage in Task Manager is the total number of committed pages in the system. This is not how much of the paging file is currently being used; it is merely the amount of page file space that has been allocated should the OS need to page out currently committed bytes.

[Screenshot: Windows Server 2003 Task Manager showing PF Usage]

In Windows 2008, Task Manager shows different terminology: PF Usage has been removed and replaced with just the word Memory.

[Screenshot: Windows Server 2008 Task Manager showing the Memory value]

There are other counters that show PF usage as well: Paging File\% Usage, which shows overall usage, and Process\Page File Bytes, which shows per-process page file allocation. The % Usage counter is about the same as what Task Manager's PF Usage shows. It is just the amount of space that has been allocated should committed bytes need to be paged out, and it doesn't indicate whether the paging file is currently being utilized. Paging File\% Usage is a counter that monitoring software commonly flags as a potential problem, but in reality it might not be. Other factors need to be looked at, beyond the amount of page file usage, to get a clear indication of whether there is truly a problem.

Generally, page file usage should remain under 80% at all times, but there are times when the OS needs to make use of the paging file, and one example is a working set trim operation. The following picture shows an example of a working set trim operation for store.exe, where Paging File\% Usage increases at the same time the working sets are getting trimmed to satisfy some other application or driver request for a contiguous block of memory. You will also notice that PF usage never really bounces back after something like this happens; it remains around a sustained average until the server is rebooted. Unless you are getting close to the maximum Memory\% Committed Bytes In Use, we shouldn't be too concerned with PF usage unless we are also seeing high paging activity going on.

image

With that said, you would not use PF usage in Task Manager or Paging File\% Usage to determine if the paging file is currently being used. What you would use to monitor this is the amount of Memory\Pages/sec that are occurring. This counter is the sum of the Memory\Pages Input/sec and Memory\Pages Output/sec counters and also includes access to the system cache for file-based operations to resolve hard page faults. Hard page faults occur when a page from a process is requested but does not exist in memory, which means we have to pull that data directly from the paging or backing file. If the page is elsewhere in memory, then this is called a soft fault. These two counters will help you understand whether you are writing data (Pages Output) to the paging file or reading data (Pages Input) from the paging file, either of which might be affecting overall Exchange server performance. Hard page faults can result in significant delays in processing data on the server.

Counter Definitions
Memory\Pages/Sec - Pages/sec is the rate at which pages are read from or written to disk to resolve hard page faults. This counter is a primary indicator of the kinds of faults that cause system-wide delays. It is the sum of Memory\Pages Input/sec and Memory\Pages Output/sec. It is counted in numbers of pages, so it can be compared to other counts of pages, such as Memory\Page Faults/sec, without conversion. It includes pages retrieved to satisfy faults in the file system cache (usually requested by applications) and non-cached mapped memory files.

Memory\Pages Input/sec - Pages Input/sec is the rate at which pages are read from disk to resolve hard page faults. Hard page faults occur when a process refers to a page in virtual memory that is not in its working set or elsewhere in physical memory, and must be retrieved from disk. When a page is faulted, the system tries to read multiple contiguous pages into memory to maximize the benefit of the read operation. Compare the value of Memory\Pages Input/sec to the value of Memory\Page Reads/sec to determine the average number of pages read into memory during each read operation.

Memory\Pages Output/Sec - Pages Output/sec is the rate at which pages are written to disk to free up space in physical memory. Pages are written back to disk only if they are changed in physical memory, so they are likely to hold data, not code. A high rate of pages output might indicate a memory shortage. Windows writes more pages back to disk to free up space when physical memory is in short supply.  This counter shows the number of pages, and can be compared to other counts of pages, without conversion.
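
If you want to spot-check these counters without opening Perfmon, here is a minimal sketch using the Get-Counter cmdlet. This assumes PowerShell 2.0 or later is installed on the box; the counter paths are the same ones discussed above.

    # Take a dozen 5-second samples; sustained Pages Input/sec or Pages Output/sec
    # values in the thousands mean real paging traffic, not just reserved page file space.
    Get-Counter -Counter '\Memory\Pages/sec',
                         '\Memory\Pages Input/sec',
                         '\Memory\Pages Output/sec' -SampleInterval 5 -MaxSamples 12 |
        ForEach-Object { $_.CounterSamples | Select-Object Path, CookedValue }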

Recommended guidance states that the size of the paging file should be RAM + 10MB for optimal performance, and that it should be a static size and not system managed. Having a paging file set to system managed could cause page file fragmentation, which could affect performance under memory pressure, but Exchange generally should not be making use of the paging file for normal operations. If virtual memory is shown to be problematic or high because other applications on the server require it, you can increase the size of the paging file to RAM * 1.5 to help back all of the committed pages and relieve some of the memory pressure. If you are still having problems at that point, check for potential memory leaks within the processes on the server.
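
As a quick way to compare the current page file configuration against that guidance, here is a small WMI-based sketch. Note that Win32_PageFileSetting only returns explicitly configured page files, so an empty result (or zero sizes) is itself a hint that the page file is system managed; verify on your build, as behavior differs slightly between Windows 2003 and 2008.

    # Installed RAM in MB, plus the recommended static page file size (RAM + 10 MB).
    $ramMB = [int]((Get-WmiObject Win32_ComputerSystem).TotalPhysicalMemory / 1MB)
    "Installed RAM: {0} MB    Recommended static page file: {1} MB" -f $ramMB, ($ramMB + 10)

    # Explicitly configured page files (InitialSize/MaximumSize are in MB).
    Get-WmiObject Win32_PageFileSetting | Select-Object Name, InitialSize, MaximumSize

    # Current page file usage as reported by the OS (also in MB).
    Get-WmiObject Win32_PageFileUsage | Select-Object Name, AllocatedBaseSize, CurrentUsage, PeakUsage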

High paging in excess of 10,000/sec or more could indicate severe memory pressure or a working set trimming problem that I talked about previously in http://blogs.technet.com/mikelag/archive/2007/12/19/working-set-trimming.aspx.

The amount of available memory is another question that comes up regularly. The main performance counter to monitor for available memory is Memory\Available MBytes. This is the amount of physical memory that is available for process or system use. It is the sum of the Free, Zero, and Standby (cached) page lists. If you are on a Windows 2008 server and run Process Explorer viewing System Information, you will see these page lists referenced. Available RAM on any given Exchange 2007 server should not go below 100MB. After crossing the 100MB threshold, you are putting your server in a state vulnerable to working set trims whenever the Virtual Memory manager needs to process a memory allocation request and sufficient RAM is not available to service that request. Another counter to check to cross-correlate why available memory is low is Memory\System Cache Resident Bytes, which is part of the overall system cache that is viewable via the Memory\Cache Bytes counter.

image

The above picture is a depiction of how System cache can affect available memory leading up to a working set trim. Notice in yellow that the Store cache remains consistent prior to the trim, so we know that Exchange did not cause this, but rather some other application. This could be some application making use of the file cache causing this increase. A simple file copy operation of a very large file from this server to another server will cause this problem. You can tame this system cache problem by using the Windows Dynamic Cache service shown at http://blogs.msdn.com/ntdebugging/archive/2009/02/06/microsoft-windows-dynamic-cache-service.aspx. In the above case, it was Antivirus software making use of memory mapped files.
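
To keep an eye on this pattern yourself, here is a minimal sketch that watches available memory alongside the system cache counters discussed above. It assumes PowerShell 2.0 or later for Get-Counter, and the warning threshold is just a variable you can tune.

    # Watch for available memory heading toward the 100MB floor while the system cache grows -
    # the precursor to a working set trim.
    $floorMB = 100   # the floor discussed above; raise it if you want an earlier warning
    Get-Counter -Counter '\Memory\Available MBytes',
                         '\Memory\Cache Bytes',
                         '\Memory\System Cache Resident Bytes' -SampleInterval 60 -MaxSamples 10 |
        ForEach-Object {
            $avail = ($_.CounterSamples | Where-Object { $_.Path -like '*available mbytes' }).CookedValue
            if ($avail -lt $floorMB) { Write-Warning "Available memory is down to $avail MB" }
            $_.CounterSamples | Select-Object Path, CookedValue
        }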

Note: If available RAM is around 100MB, please do not RDP into the server and fire up the EMC for administration purposes. This will exhaust all RAM on the server and cause working set trim issues. Got to love that one, eh?

Next, I would like to talk about Committed Memory. There are two main counters that I look at when troubleshooting memory related issues to determine if we are truly running out of memory on a server. These counters are Memory\Committed Bytes and Memory\Commit Limit.

Memory\Committed Bytes is the amount of committed virtual memory, in bytes. Committed memory is the physical memory which has space reserved on the disk paging file(s). This counter displays the last collected value and is not an average.

Memory\Commit Limit is the amount of virtual memory that can be committed without having to extend the paging file(s). It is measured in bytes. Committed memory is the physical memory which has space reserved on the disk paging files. (There can be one paging file on each logical drive.) If the paging file(s) can be expanded, this limit increases accordingly. This counter displays the last collected value and is not an average. The Commit Limit is calculated by taking the amount of total RAM and adding it to the paging file sizes. This sum gives you the overall Commit Limit on any given server.
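
Here is a small sketch that pulls both values and reports how close the server is to its commit limit. Get-Counter is assumed to be available (PowerShell 2.0+); the same math applies if you read the values from Task Manager or Process Explorer instead.

    # How much headroom is left before new allocations start failing?
    $samples   = (Get-Counter -Counter '\Memory\Committed Bytes', '\Memory\Commit Limit').CounterSamples
    $committed = ($samples | Where-Object { $_.Path -like '*\committed bytes' }).CookedValue
    $limit     = ($samples | Where-Object { $_.Path -like '*\commit limit' }).CookedValue
    "Committed: {0:N0} MB of {1:N0} MB ({2:P1} of the commit limit)" -f ($committed / 1MB), ($limit / 1MB), ($committed / $limit)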

There are a few ways to view the current values of the Commit Limit and Committed Bytes. In Task Manager, you could view the Commit Charge (K) area as shown in the above screenshot. You can view these counters in Perfmon, and of course using Process Explorer shown below.

 image

 

Real World Scenarios

Now that we have all of this knowledge, let’s take a look at some real world examples here.

The below picture shows a normal working server where the Store working set remains consistent throughout the lifetime of the store process because the cache is warmed up, or fully populated. This is where you get maximum performance from your Exchange server, since you are caching all of the data in memory instead of having to rely on paging information to/from the disks. You will also notice that available memory is just under 2GB. The amount of committed bytes is also nowhere close to the Commit Limit on the server.

image

The following example shows that our Committed Bytes is just about equal to the overall Commit Limit on the server. Any new memory allocations could fail, causing instability on your Exchange server. This problem was attributed to an incorrectly configured paging file on the server.

image

The next example shows an actual Store memory leak. As you can see, the Committed Bytes (blue), Private Bytes (pink) and Virtual Bytes (yellow) for Store are all increasing toward the overall Commit Limit (green). This occurred due to a recursive operation within the store process exhausting all of the memory on the server. A recursive operation can be thought of as a process where one of the steps is to repeat the process itself; it is similar to a loop with no end condition and no way to break out.

image

I hope this clears up some of the misconceptions behind the common phrase “Exchange is using all the memory”.

Exchange 2007 SP2 Auditing Whitepaper


Exchange 2007 SP2 has introduced some new Mailbox Access Auditing features to help log events when users access folders and messages either in their own mailbox or in another user's mailbox. I wrote a whitepaper on these new features at http://technet.microsoft.com/en-us/library/ee331009.aspx. This new access auditing will log accesses to messages and folders, which some customers have been wanting for a long time. So if you attempt to access another user's folder and open or read a message, Exchange will now log events in the new Exchange auditing log on the server. This only shows you the path of access to messages and folders; it does not specifically log deletions of messages in users' folders.

The whitepaper also discusses how you can set up auditing to track configuration changes to Exchange-related objects in Active Directory, so that if an administrator makes a change to an Exchange configuration object that causes an outage, these events are logged in the domain controllers' security event log. If your DCs are Windows 2008, you can see what the previous value was and what the newly changed value is, so if you need to change it back to the way it was before the outage, you have a rolling log of all of these changes.

If you have some time and wanted to read more about it, see the above link for more details. This took a lot of time and effort on my part to pull this together and test most of the configuration auditing pieces to ensure that we were logging the correct data. Hope you enjoy it.

New ADAccess Performance counters included with Exchange 2007 SP2


Exchange 2007 SP2 has a new set of ADAccess performance counters that only show performance data from domain controllers in the same site as the Exchange server. This new object is MSExchange ADAccess Local Site Domain Controllers. Previously, you had to use the MSExchange ADAccess Domain Controllers(*)\Local site flag counter to detect in Performance Monitor whether a domain controller was in the local site.

Here is a listing of the new counters. They are very similar to the MSExchange ADAccess Domain Controllers counters, but only for local DCs.

\MSExchange ADAccess Local Site Domain Controllers(*)\LDAP Read calls/Sec
\MSExchange ADAccess Local Site Domain Controllers(*)\LDAP Search calls/Sec
\MSExchange ADAccess Local Site Domain Controllers(*)\LDAP Searches timed out per minute
\MSExchange ADAccess Local Site Domain Controllers(*)\LDAP Fatal errors per minute
\MSExchange ADAccess Local Site Domain Controllers(*)\LDAP Disconnects per minute
\MSExchange ADAccess Local Site Domain Controllers(*)\User searches failed per minute
\MSExchange ADAccess Local Site Domain Controllers(*)\Bind failures per minute
\MSExchange ADAccess Local Site Domain Controllers(*)\Long running LDAP operations/Min
\MSExchange ADAccess Local Site Domain Controllers(*)\LDAP Pages/Sec
\MSExchange ADAccess Local Site Domain Controllers(*)\LDAP VLV Requests/Sec
\MSExchange ADAccess Local Site Domain Controllers(*)\Number of outstanding requests
\MSExchange ADAccess Local Site Domain Controllers(*)\DsGetDcName elapsed time
\MSExchange ADAccess Local Site Domain Controllers(*)\gethostbyname elapsed time
\MSExchange ADAccess Local Site Domain Controllers(*)\Kerberos ticket lifetime
\MSExchange ADAccess Local Site Domain Controllers(*)\LDAP connection lifetime
\MSExchange ADAccess Local Site Domain Controllers(*)\Reachability bitmask
\MSExchange ADAccess Local Site Domain Controllers(*)\IsSynchronized flag
\MSExchange ADAccess Local Site Domain Controllers(*)\GC capable flag
\MSExchange ADAccess Local Site Domain Controllers(*)\PDC flag
\MSExchange ADAccess Local Site Domain Controllers(*)\SACL right flag
\MSExchange ADAccess Local Site Domain Controllers(*)\Critical Data flag
\MSExchange ADAccess Local Site Domain Controllers(*)\Netlogon flag
\MSExchange ADAccess Local Site Domain Controllers(*)\OS Version flag
\MSExchange ADAccess Local Site Domain Controllers(*)\LDAP Read Time
\MSExchange ADAccess Local Site Domain Controllers(*)\LDAP Search Time

Unfortunately, upgrading to SP2 from an earlier service pack does not reload the MSExchange ADAccess counters, so you will have to do this manually. If you are installing Exchange using the SP2 binaries, you will have these new counters by default. To reload the MSExchange ADAccess counters, do the following (a consolidated sketch of these commands appears after the list):

  • Ensure that no other monitoring software is currently collecting performance counter data
  • Open a command prompt and change directory to the \Program Files\Microsoft\Exchange Server\Bin\perf\AMD64 directory
  • To unload the performance counters, type the following:
    unlodctr “MSExchange ADAccess”
  • To Reload the counters, type the following:
    lodctr dscperf.ini
  • Restart the Exchange services to successfully reload the counters. Note: This step is very important, as the Exchange processes hold open handles to the original counters that are only released when the Exchange services are restarted.
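
Put together, the sequence looks roughly like this when run from an elevated PowerShell (or Exchange Management Shell) prompt on a default install path; adjust the path if Exchange is installed elsewhere. The typeperf check at the end is just one way to confirm the new object registered.

    # Unload and reload the ADAccess counters (default install path assumed).
    cd 'C:\Program Files\Microsoft\Exchange Server\Bin\perf\AMD64'
    unlodctr "MSExchange ADAccess"
    lodctr dscperf.ini

    # Quick check that the new local-site object is now registered.
    typeperf -q "MSExchange ADAccess Local Site Domain Controllers"

    # Remember: the running Exchange services still hold handles to the old counter
    # registration, so restart the Exchange services before expecting the new
    # counters to show up in Perfmon.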

For all of you that are collecting performance counters via WMI, you may notice that these new counters do not appear to be loaded. You can verify this by running perfmon /wmi to see if they are there. If they are not, you can transfer the PDH settings over to WMI by running wmiadap /f.

Enjoy!!

How to monitor and troubleshoot the use of Nonpaged pool memory in Exchange Server 2003 or in Exchange 2000 Server


This article is a high level overview on how to troubleshoot current Nonpaged pool memory usage on an Exchange server.  It explains what could be done to help mitigate some of the underlying problems that may be consuming Nonpaged pool memory and demonstrates tools that can be used to help track down processes or drivers consuming the most amount of memory.

Nonpaged pool memory is a limited resource on 32-bit architecture systems. It is dependent on how the server is set up to manage memory and is calculated at system startup. The amount of nonpaged pool allocated on a given server is a combination of overall memory, running/loaded drivers, and whether the /3GB switch has been added to the boot.ini file.

Nonpaged pool memory is used for objects that cannot be paged out to disk and have to remain in memory as long as they are allocated. Examples of such objects are network card drivers, video drivers and antivirus filter-level drivers. By default, without the /3GB switch, the OS will allocate 256MB of RAM on a server for the nonpaged pool. When the /3GB switch is added and the server is rebooted, this essentially halves the amount of nonpaged pool memory on a given server to 128MB of RAM. The Windows Performance team has a table listed in http://blogs.technet.com/askperf/archive/2007/03/07/memory-management-understanding-pool-resources.aspx that discusses what the maximum pool memory resources can be on any given server. This link also discusses how to view the maximum amount of pool memory on any given server using Process Explorer. For Exchange servers, it is recommended to add the /3GB switch to the boot.ini file, with the exception of pure HUB or Front End (FE) servers, to allocate more memory to user processes. As you can see, this limits how much you can load within that memory space. If this memory has been exhausted, the server will start becoming unstable and may become inaccessible. Unfortunately, since this memory cannot be paged in and out, you cannot resolve this problem without rebooting the server.

On Microsoft Windows 2003 64-bit operating systems, the Kernel Nonpaged pool memory can use as much as 128GB depending on configuration and RAM. This essentially overcomes this limitation. See 294418 for a list of differences in memory architectures between 32-bit and 64-bit versions of windows. Currently, the only version of Exchange that is supported on a 64-bit operating system is Exchange 2007, so when working with previous versions of Exchange we may still run into this Nonpaged pool limitation.
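
Before digging into the symptoms below, it is worth knowing that overall pool usage can be tracked with the Memory\Pool Nonpaged Bytes and Memory\Pool Paged Bytes counters. Here is a minimal WMI-based sketch that works on Windows 2003 as well, assuming PowerShell is installed on the server:

    # Point-in-time kernel pool usage in MB.  On a /3GB 32-bit server, sustained nonpaged
    # pool much above ~100MB deserves investigation (the hard ceiling there is ~128MB).
    $mem = Get-WmiObject Win32_PerfFormattedData_PerfOS_Memory
    "Nonpaged pool: {0:N1} MB   Paged pool: {1:N1} MB" -f ($mem.PoolNonpagedBytes / 1MB), ($mem.PoolPagedBytes / 1MB)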

Symptoms

When Nonpaged pool memory has been depleted or is nearing the maximum on an Exchange Server, the following functionality may be affected because these features require access to HTTP/HTTPS to function:

  1. Users connecting via Outlook Web Access may experience “Page cannot be displayed” errors.

    The issue occurs when nonpaged pool memory is no longer sufficient on the server to process new requests.  More information on troubleshooting this issue is available in the following KB article:
    Error message when you try to view a Web page that is hosted on IIS 6.0: "Page cannot be displayed"
    http://support.microsoft.com/?id=933844

    Note: If this resolves your OWA issue, it is recommended to determine what is consuming nonpaged pool memory on the server. See the Troubleshooting section of this document for help in determining what is consuming this memory.
  2. RPC over HTTP connections are slow or unavailable.

    If you experience difficulties when you use an Outlook client computer to connect to a front-end server that is running Exchange Server 2003, it can indicate a depletion of nonpaged pool memory. HTTP.sys stops accepting new connections when the available nonpaged pool memory drops under 20MB. More information on troubleshooting this issue is available in the following KB article:

    You experience difficulties when you use an Outlook client computer to connect to a front-end server that is running Exchange Server 2003
    http://support.microsoft.com/?id=924047
  3. The IsAlive check fails on Cluster

    The cluster IsAlive checks for the Exchange HTTP resource on a cluster server may fail, causing service outages or failovers. This is the most common scenario that we see for Exchange 2003 clusters. When there is less than 20MB of nonpaged pool memory, http.sys will start rejecting connections, affecting the IsAlive check.

    When nonpaged pool is becoming exhausted, the IsAlive check fails causing the resource to fail. Depending on your recovery settings for the HTTP resource in Cluster Administrator, we will try to either restart the resource or fail over the group. By default, we will try restarting the resource 3 times before affecting the group. If this threshold is hit, the entire group will fail over to another cluster node.
    To verify if nonpaged pool has been depleted, you can look in 2 possible locations. One is the cluster.log file and the other is the httperr.log

    Cluster.log
    For the cluster.log file, you may see an entry similar to the following:

    00000f48.00000654::2007/05/16-17:16:52.435 ERR Microsoft Exchange DAV Server Instance <Exchange HTTP Virtual Server Instance 101 (EXVSNAME)>: [EXRES] DwCheckProtocolBanner: failed in receive. Error 10054.

    Error 10054 is equivalent to WSAECONNRESET which is http.sys rejecting the connection.

    Httperr.log
    In the httperr.log that is located in the %windir%\system32\logfiles\httperr directory on the Exchange Server, you may see entries similar to the following.

    2007-05-16 16:44:56 - - - - - - - - - 1_Connections_Refused -
    2007-05-16 16:50:42 - - - - - - - - - 3_Connections_Refused -
    2007-05-16 16:50:47 - - - - - - - - - 2_Connections_Refused -
    2007-05-16 17:16:35 - - - - - - - - - 5_Connections_Refused -

    This confirms that http.sys is rejecting the connection to the server. Additional information regarding this logging can be found in the following article:

    Error logging in HTTP API
    http://support.microsoft.com/?id=820729

    Additional information for this issue is available in the following KB:

    Users receive a "The page cannot be displayed" error message, and "Connections_refused" entries are logged in the Httperr.log file on a server that is running Windows Server 2003, Exchange 2003, and IIS 6.0
    http://support.microsoft.com/?id=934878
  4. Random Server Lockups or Hangs
  5. Certain operations failing because of the lack of memory to support new operations.
    Check the Application and System logs where common operations might be failing.
Potential Workaround to provide immediate/temporary relief

If immediate relief is needed in any of these scenarios to prevent these rejections from occurring on a cluster server, you can add the EnableAggressiveMemoryUsage registry key on the server for temporary relief. When this is added, http.sys will start rejecting connections only when there is less than 8MB of nonpaged pool memory available, overriding the 20MB default value. See 934878 for more information on setting this key. Note: Please use this as a temporary method to get the Exchange cluster resources back online, and investigate the underlying cause of what is taking up the most nonpaged pool memory on the server. An ideal situation would be having 100MB or less of overall nonpaged pool memory consumed on any given server.
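
For reference, here is a sketch of adding that value from PowerShell. The key path and value name shown are my reading of KB 934878, so double-check the article before applying this in production.

    # Lower http.sys's nonpaged pool rejection threshold from ~20MB to ~8MB (temporary relief only).
    $path = 'HKLM:\SYSTEM\CurrentControlSet\Services\HTTP\Parameters'
    New-ItemProperty -Path $path -Name 'EnableAggressiveMemoryUsage' -Value 1 -PropertyType DWord -Force
    # A restart of the HTTP service (and the IIS services that depend on it), or a reboot,
    # is needed for the change to take effect - see the KB for the exact steps.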

Nonpaged Pool Memory Depletion events

When pool memory has been depleted, you may start receiving the following error in the System Event log stating that a specific pool memory has been depleted.

Event ID 2019
Event Type: Error
Event Source: Srv
Event Category: None
Event ID: 2019
Description:
The server was unable to allocate from the system NonPaged pool because the pool was empty.

If you are getting these events, then the server is most likely very unstable currently or will be very soon. Immediate action is required to bring the server back online to a fully functional state such as moving the cluster resources to another node or rebooting the server that has this problem.

Troubleshooting

There are a couple of different ways to view the amount of real-time pool memory usage that is currently being consumed and the easiest one is Task Manager. Once you pull up Task Manager, you will need to click the Performance tab and in the lower right hand corner, you will see the amount of pool memory usage that is highlighted. If nonpaged pool is 106MB or more, then there is a possibility that the cluster IsAlive checks for the HTTP resource are failing or close to failing.

image

You can also view Nonpaged and Paged Pool usage per process on the Processes tab in Task Manager. I’ve added the Paged Pool column since the same basic rules apply there too. To do this, select the Processes tab, select View on the menu, and then Select Columns. Add the Non-paged Pool, Paged Pool, and Handles columns as shown below.

image

Once this column is added, you can now view pool usage per process, which may help you track down what process is consuming the most memory. You can sort each column to look for the highest consumer. The Handles column is added to help determine if there is any process with a large number of handles that is consuming a larger amount of nonpaged pool memory. (Note: A high handle count may affect either paged or nonpaged pool memory, so keep this in mind when analyzing data.)

image

Another way of looking at handles for any given process is to use Process Explorer available here.  To add the handle count column, you would select View on the menu, then “Select Columns”, click the Process Performance tab, and then put a check box next to “Handle Count”. Click OK.

image
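
If you would rather capture the same per-process view from a script, here is a quick Get-Process sketch. Note that these are the per-process pool charges and handle counts, the same numbers Task Manager shows, not the kernel-wide pool totals.

    # Top ten processes by nonpaged pool usage, with paged pool and handle counts alongside.
    Get-Process |
        Sort-Object NonpagedSystemMemorySize64 -Descending |
        Select-Object -First 10 Name, Id, Handles,
            @{Name='NonpagedPoolKB'; Expression={[int]($_.NonpagedSystemMemorySize64 / 1KB)}},
            @{Name='PagedPoolKB';    Expression={[int]($_.PagedSystemMemorySize64 / 1KB)}} |
        Format-Table -AutoSize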

If you can’t determine from there what is consuming the memory, this may be a kernel related problem and not application specific. This will require some additional tools to determine what could be affecting the nonpaged pool memory.

One of the first things to look for is drivers that are more than 2 years old, which may have had issues in the past that have since been resolved in later driver releases. Running the Exchange Best Practices Analyzer tool (ExBPA) located here can help report any drivers that may be outdated or are known to have had issues previously. If ExBPA did not report any problems with the configuration of the server or any driver-related problems, further troubleshooting is necessary.

If the Windows Support tools are installed, you can use a tool called Poolmon to allow you to view what specific tags are consuming memory. More information regarding Poolmon can be found in the Windows Support Tools documentation here.  To run Poolmon, simply open up a command prompt and type “Poolmon” and then hit the “b” key to sort on the overall byte usage (Bytes) with the highest being at the top. Anything you see that is highlighted means that there was a change in memory for that specific tag.

In this view, you want to look at the top five consumers of memory which should be listed at the top. For the most part, you will be looking at the first two columns named Tag & Type.  The Tag is specific to a particular driver and the Type column indicates what type of memory is being used, nonpaged pool (Nonp) or paged pool (Paged) memory.  You will also be looking at the Bytes (shown in yellow) column. This column shows the bytes in use for the particular process Tag.

clip_image005

The Allocs and Frees columns can be used to determine if a tag is leaking memory. If there is a large difference between these two columns for a particular tag, then there may be a leak in that particular tag and should be investigated.

The file Pooltag.txt lists the pool tags used for pool allocations by kernel-mode components and drivers supplied with Windows, the associated file or component (if known), and the name of the component.

Where to get Pooltag.txt?

After installing the Debugging Tools for Windows located here, pooltag.txt can be found in the C:\Program Files\Debugging Tools for Windows\triage directory and normally has the most recent list of pool tags.

Pooltag.txt can also be obtained from the Windows Resource Kit:

http://www.microsoft.com/downloads/details.aspx?FamilyID=9D467A69-57FF-4AE7-96EE-B18C4790CFFD&displaylang=en

If the specific tag in question is not listed in pooltag.txt and is leaking memory, you can search for pool tags used by third-party drivers using the steps in the following article:

How to find pool tags that are used by third-party drivers
http://support.microsoft.com/default.aspx?scid=kb;EN-US;298102

Once you find which tag pertains to a specific driver, you would contact the vendor of that driver to see if they have an updated version that may help alleviate the memory leak issue.

Recommended remediation

  1. Install the recommended hotfixes for Windows 2003 server based clusters from 895092
  2. Run the Exchange Best Practices Analyzer (ExBPA) tool to ensure that the exchange server is configured optimally. (ie: SystemPages registry setting, any outdated network card drivers, video drivers or storage drivers (storport.sys or SAN drivers), Mount point drivers (mountmgr.sys), boot.ini settings, etc.)
  3. Ensure that Windows 2003 SP2 is installed. If SP2 is not installed, at a minimum, you need to apply the hotfix in 918976
  4. Ensure that the Scalable Networking Pack features have been disabled. See http://msexchangeteam.com/archive/2007/07/18/446400.aspx for more information on how this can affect Exchange Servers
  5. Upgrade ExIFS.sys to the version listed in 946799
  6. If using MPIO, ensure 923801 at a minimum is installed. 935561 is recommended. Also see 961640 for another known memory leak issue
  7. If Emulex drivers are installed, be sure to upgrade to the version listed here to help with nonpaged pool memory consumption.
  8. Disable any unused NICs to lower overall NPP memory consumption
  9. Update network card drivers to the latest version.
    • If Jumbo Frames are being used, be sure to set this back to the default setting or lower the overall frame size to help reduce NPP memory usage.
    • If Broadcom Drivers are being utilized and are using the Virtual Bus Device (VBD) drivers, be sure to update the drivers to a driver version later than 4.x. Check your OEM manufacturers website for updated versions or go to the Broadcom download page here to check on their latest driver versions.
    • Any changes to the Network Card receive buffers or Receive Descriptors from the default could increase overall NPP memory. Set them back to the default settings if at all possible. This can be seen in poolmon with an increase in MmCm pool allocations.
  10. Update video card drivers to the latest version. If any accelerated graphics drivers are enabled, go ahead and uninstall these drivers and switch the display driver to Standard VGA. Add the /basevideo switch to the boot.ini file and reboot the server.
  11. Check to see if the EnableDynamicBacklog setting is being used on the server, which can consume additional nonpaged pool memory. See 951162.
  12. If you are still having problems with NonPaged pool memory at this point, then I would recommend calling Microsoft Customer Support for further assistance with this problem.

    Additional Reading

    Nonpaged Pool is over the warning threshold (ExBPA Rule)
    http://technet.microsoft.com/en-us/library/aa996269(EXCHG.80).aspx

    Understanding Pool Consumption and Event ID: 2020 or 2019
    http://blogs.msdn.com/ntdebugging/archive/2006/12/18/Understanding-Pool-Consumption-and-Event-ID_3A00_--2020-or-2019.aspx

    3GB switch
    http://blogs.technet.com/askperf/archive/2007/03/23/memory-management-demystifying-3gb.aspx

     

    How to fix/repair broken Exchange 2007 counters


    I commonly get calls on the inability to see performance counters in Performance Monitor (perfmon) and the inability to query them through WMI. I thought I would take some time to write about how to look for any problems with Exchange Performance Counters and then provide some high level insight on how to possibly fix them. Most of this information applies to Windows 2003 servers.

    If the counters are not being shown at all, the first place to check is the registry to see whether the counters have been disabled. Here is a snippet of what one of the registry keys would look like:

    [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ESE\Performance]
    "Close"="ClosePerformanceData"
    "Collect"="CollectPerformanceData"
    "Library"="C:\\Program Files\\Microsoft\\Exchange Server\\bin\\perf\\%PROCESSOR_ARCHITECTURE%\\eseperf.dll"
    "Open"="OpenPerformanceData"
    "PerfIniFile"="eseperf.ini"

    If you also see a value named Disable Performance Counters in addition to the above default entries, and it is set to a nonzero value, the counters at one point had a problem loading and the operating system disabled them for whatever reason. Set the value to 0 and then close and open Perfmon again to see if you can see the counters again. More information on this Disable Performance Counters setting can be found here. If this works for you, then whew, that was an easy one….
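
    If you want to check for (and clear) that value without hand-editing the registry, here is a minimal sketch using the ESE key from the example above; substitute the Performance key of whichever service you are working on.

    # Look for the "Disable Performance Counters" value under a service's Performance key
    # and reset it to 0 so the OS will try to load the counters again.
    $key = 'HKLM:\SYSTEM\CurrentControlSet\Services\ESE\Performance'
    $val = Get-ItemProperty -Path $key -Name 'Disable Performance Counters' -ErrorAction SilentlyContinue
    if ($val -and $val.'Disable Performance Counters' -ne 0) {
        Set-ItemProperty -Path $key -Name 'Disable Performance Counters' -Value 0
        "Counters were disabled (value was $($val.'Disable Performance Counters')); reset to 0."
    }
    else {
        'Disable Performance Counters is not set or is already 0.'
    }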

    If the Performance key is missing for a particular service, then we have bigger problems. I am not sure what causes this key to get removed, but if the key is not there, Perfmon or WMI does not know how to load the counters. There are a few required pieces that you need to understand before we can load any performance counter, not just Exchange's. The key pieces needed to reload any performance counter are the following:

    • Performance key must be created under the specified service
    • Library path must be specified to the appropriate DLL for the service
    • A PerfIniFile must be specified which is the name of the ini file that will reload a specific services performance counters
    • Lastly, we need to have the Close, Collect, and Open values which specify what method is used to retrieve the Performance Counter Data. Note: This is unique to each service, so they will not always have the same information

    If we have these key pieces of information in the registry, we have the ability to reload that service's performance counters. Taking the example above for ESE, if we open a command prompt, navigate to the C:\Program Files\Microsoft\Exchange Server\bin\perf\AMD64 directory and then type lodctr eseperf.ini, this will reload the counters for ESE. If the counters were loaded successfully, we should now see that we have also added the First Counter, First Help, Last Counter, and Last Help values as shown below. These values correspond to specific data that was loaded into the Perflib library.

    image

    If everything went well and you reopen Perfmon, you should now hopefully see the counters loaded. If they have not loaded, refresh the registry to see if the Disable Performance Counters value shows back up. If not, check the application log for Perflib errors, which should provide additional information regarding why these counters did not load successfully.

    If you don’t know already, on Windows 2003 servers you can actually pull up performance counters using the command Perfmon /WMI. If you do not see the newly added counters, then they have not been synchronized with the WMI repository yet. To help force this along, you could run wmiadap /f to force the reload of all counters into the WMI repository.

    If this was successful, you will now see some additional Wbem entries as shown in the picture below.

    image

    Pulling up Perfmon /WMI again should hopefully show the counters that you are looking for. In some cases, monitoring software can still not pick up the newly added counters until the WMI service (Windows Management Instrumentation) has been restarted.

    If you ever wanted to unload Performance counters, one might think that you could simply unload the counters by running unlodctr eseperf.ini. Unfortunately, this will not work because the unlodctr utility requires that a service name be passed in instead of the ini file. To find the actual name of the service, you could simply open up eseperf.ini and at the top of the file, you should notice an entry similar to the following

    [info]
    drivername=ESE

    Ahh, there is the service name. Now if I run unlodctr ESE, this will now be successful. Doing this will remove the First Counter, First Help, Last Counter, Last Help values from the registry.
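
    A one-liner can pull that drivername value out of all of the perf ini files so you don't have to open each one by hand (default install path assumed):

    # Read the service name (drivername) that unlodctr expects from each counter ini file.
    Get-ChildItem 'C:\Program Files\Microsoft\Exchange Server\Bin\perf\AMD64\*.ini' |
        Select-String -Pattern '^drivername=' |
        Select-Object Filename, Line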

    Hopefully you are still with me at this time. Now what happens if the performance registry keys for all of your services went missing? What do you do then? Reinstall, or flatten the box and reinstall to get them back? Well, unfortunately, there is not a direct way of recreating these registry keys as they are created during the installation of Exchange.

    The majority of the folks just export the data from another server, clean out any of the data that references performance counter data from the old server and then import them on the affected server. This does in fact work and is what I am going to talk about next on how to recover from a complete Performance key meltdown.

    Attached to this post is a zip file that contains all of the Performance keys across various different role combinations such as MBX, CAS, HUB, HUB/CAS, HUB/CAS/MBX. I’ve done all of the dirty work for you, so all you have to do is to perform some very simple modification steps to the files and then you are in business.

    CAUTION!!!: DO NOT IMPORT these registry keys if the Performance registry keys already exist as it will overwrite the data that currently exists in the registry and could potentially break your Performance counters that are currently working. If you only need to reload the Performance key for a single service, then pull out the data for that specific service, save it to a reg file and then import only that data. Basically use it as a reference point to help get you back running again.

    If you feel the need to use these reg import files because all of the performance keys are missing for all services, simply open the file that pertains to the role that you have installed and verify that the paths to the library files are correct. By default, we install Exchange into the C:\Program Files\Microsoft\Exchange Server directory, so if Exchange was installed outside of the default directory, you will need to update the file manually. Let’s take the ESE performance key below:

    [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ESE\Performance]
    "Close"="ClosePerformanceData"
    "Collect"="CollectPerformanceData"
    "Library"="C:\\Program Files\\Microsoft\\Exchange Server\\bin\\perf\\%PROCESSOR_ARCHITECTURE%\\eseperf.dll"
    "Open"="OpenPerformanceData"
    "PerfIniFile"="eseperf.ini"

    Here you will see that library has the following value:

    "Library"="C:\\Program Files\\Microsoft\\Exchange Server\\bin\\perf\\%PROCESSOR_ARCHITECTURE%\\eseperf.dll"

    What you will need to do is replace the path with the correct path in which you have installed Exchange. If you installed Exchange on D: in the following directory (D:\\Program Files\\Microsoft\\Exchange Server\\bin), you would simply need to modify the first part of the path to show D:\\ instead of C:\\. A quick find and replace should work to hit all Performance keys. If you have installed it into another directory outside of the default paths, then you have a little more work to do to replace the path information. Just remember that for each backslash (\), you have to include double backslashes (\\) to allow for proper importing of the reg files.

    There are only a handful of entries you have to manually modify, so this really shouldn’t take too long. Once you have the paths changed, save the appropriate file as a .reg file and import it by double-clicking on the file. Then open Registry Editor to verify that the Performance keys look good and valid.
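
    If you would rather script the find and replace than do it by hand, here is a small sketch. The file name below is just a placeholder for whichever role file from the zip you are working with, and the literal double-backslash strings are replaced as-is.

    # Swap the default install drive for the actual one in the exported Performance keys.
    # 'MBX-PerfKeys.reg' is a placeholder name - use the file that matches your role mix.
    $regFile = '.\MBX-PerfKeys.reg'
    (Get-Content $regFile) | ForEach-Object { $_.Replace('C:\\Program Files\\Microsoft\\Exchange Server', 'D:\\Program Files\\Microsoft\\Exchange Server') } | Set-Content $regFile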

    Once the keys have been verified in the registry and look good, you can then run the PowerShell script to reload all of the Exchange performance counters. Simply copy the ReinstallAllPerCounters.ps1.txt file to the Exchange server and then remove the .txt extension from the file. Open the Exchange Management Shell and then run the script. The screenshot below shows each ini file attempting to be loaded. Of course, on my server I already had all of the performance keys, so it simply reported that the counters were already installed.

    clip_image002[6]

    Note: If you would like to transfer this data to WMI, simply type Y when asked.

    Once this has completed, be sure to check the application event log for details on any counters that failed to load. If everything went well, voila, you should have most if not all of your Exchange Performance Counters back once again.

    If the counters are still not showing up in WMI for whatever reason, you can run two commands to clear the WMI ADAP cache and then re-sync the counters to hopefully kick-start things once again (on Windows 2003 these are winmgmt /clearadap followed by winmgmt /resyncperf).

    See http://msdn.microsoft.com/en-us/library/aa394525(VS.85).aspx for more information on some of the additional commands included with the winmgmt command.

    Hopefully this will help you out trying to get your Exchange performance counters going once again.

    The case of the slow Exchange 2003 Server – Lessons learned


    Recently we received a case in support with an Exchange 2003 server where message delivery was slow and the Local Delivery queue was getting backed up. The Local Delivery queue was actually reaching in to the two thousand range and would fluctuate around that number for extended periods of time.

    So we collected some performance data and all RPC latencies, disk latencies, CPU utilization and many of the other counters that we looked at did not show any signs of any problems. <Scratching Head>

    This is actually a common problem that I have seen where the server is responding OK to clients and everything else appears to be operating normally except for the local delivery queue that continually rises. Even disabling any Anti-virus software on the server including any VSAPI versions does not resolve the problem. So we essentially have a case of a slow Exchange server with no signs of performance degradation using any normal troubleshooting methods.

    The reason may not seem obvious, but let me show you what this common problem is that I have seen in these situations. This not only applies to Exchange 2003, but it also applies to later versions of Exchange.

    In some companies, they need to be able to journal messages to holding mailboxes, either on the same server or on a different server, to maintain a copy of all messages that are sent in the organization for compliance purposes. These journaling mailboxes can get quite large and require a special level of attention to ensure that the mailbox sizes and item counts for those mailboxes are kept within reasonable levels. They kind of defy what our normal recommendations/guidance state, because item counts in these folders can reach tens of thousands of items rather quickly, depending on the amount of mail that is sent within your organization.

    Generally, the special level of attention that I mentioned earlier for journaling mailboxes is often overlooked. For each journaling mailbox, you need to have a process that will not only back up the items in these folders, but also purge the data out of the mailbox once the backup has been taken. This purging process is necessary to maintain acceptable performance levels on an Exchange server. If these mailboxes are on their own server, user mailboxes are not normally affected. If these mailboxes are on the same server as user mailboxes, then this is where you might run into some problems.

    In this case that we received, we found a journaling mailbox with almost 1.5 million items that was 109GB in size, as shown in the below screenshot. Wow!! That is a lot of items in this one mailbox.

    huge journal mailbox-fixed

    If you tried to log on to this mailbox using Outlook, the client would most likely hang for 5-10 minutes trying to query the number of rows in the message table to generate the view that Outlook is trying to open. Once this view is created, you should be able to view the items and get back control of the Outlook client. You may think that you could simply go in and start removing/deleting items from this mailbox to start lowering the overall size of the mailbox. Try as you might, you will most likely end up trying to do this for days, since the performance impact of this number of items in the mailbox will make this a very painful process. Making any modifications to the messages in these folders will cause the message tables to be updated, which for this number of items is simply going to take an exorbitant amount of time.

    Our standard recommendation for Exchange mailboxes on Exchange 2003 servers is to have item counts under 5,000 items per folder. This guidance can be found in the Understanding the Performance Impact of High Item Counts and Restricted Views whitepaper here.

    A simple troubleshooting step would be to dismount the mailbox store that this mailbox resides in to see if the message delivery queues go down. If all of the queues flush for all other mailbox stores, you have now found your problem.

    If you absolutely need to get into the mailbox to view some of the data, an Outlook client may not be the way to go to do some housecleaning. An alternative is to use the MFCMAPI tool to view the contents of the mailbox. MFCMAPI will allow you to configure the tool to only return a certain number of items at any given time. If you pull up MFCMAPI’s options screen, you can change the throttling section to limit the number of rows that are displayed. If you were to put 4800 in the highlighted section below, you would essentially limit the number of rows or messages that are queried when the folder is opened to the number that you have entered. This will make viewing some of the information a little bit easier, but it would still be very cumbersome.

    clip_image002

    There are a couple of workarounds that you can do to clean this mailbox out.

    • If the data in the mailbox is already backed up, you could disable mail for that mailbox, run the cleanup agent and then create a new mailbox for the user. Note: the size of the database will still be huge and will increase backup and restore times even after you recreate the mailbox. If you are finding that the backup times are taking a long time, you may want to think about using the dial tone database in the next suggestion, or possibly moving the mailboxes on this store to a new database AFTER you have cleaned out the problem mailbox and then retiring the old database.
    • If the Mailbox Database houses only this one mailbox, you could simply dial tone that database starting with a fresh database. Instructions on how to do this can be found here
    • Purging the data out of the mailbox using Mailbox Manager or some 3rd party tool may work, but keep in mind that you will most likely experience a performance problem on the server while the information is cleaned out of the mailbox, and it could possibly take hours to run

    Long live that 109GB/1.5million item mailbox!!! :)

    Another way to possibly find the high item count user is to use the PFDavAdmin tool to export items counts in users mailboxes. Steps on how to do this can be found here.

    These cases are sometimes very tough to troubleshoot, as any performance tool that you might try to use to determine where the problem lies would not show anything on the surface. Since the Exchange server is still responding to RPC calls in a timely fashion, any expensive call such as a query rows operation against a folder like this will surely slow things down. If you see that things are slow on your Exchange 2003 server and perfmon does not show anything glaring, one of the first things that I check is item counts in users' mailboxes, looking for some top high-item-count offenders. Exchange 2007 can have other reasons for this slowness, but that would be another blog post in and of itself.

    So the moral of the story here is that should you have large mailboxes in your organization that are used as a journaling mailbox, a resource mailbox, or some type of automatic email processing application that might make use of Inbox rules to manipulate data in the mailbox, then whether those mailboxes are backed up or not, you need to be absolutely sure that the item counts in the folders of these mailboxes are kept to a reasonable size, or they will bring an Exchange server to a crawl trying to process email.

    Just trying to pass on some of this not so obvious information…….

    Improved PAL analysis for Exchange 2007


    I wanted to write a post regarding a lot of work that has gone into updating the Exchange Server 2007 PAL xml threshold files in order to make them more relevant and to more accurately report on Exchange Performance problems. This update couldn’t have been done without the help from Chris Antonakis who was one of the major contributors to all of these updates. Kudos to Chris on this one.

    There are some major updates that you need to be aware of when running the PAL tool to analyze Exchange performance problems, and the Mailbox role saw the biggest change in how to look at things.

    Shown below is the selection for the Mailbox Role threshold file which includes a few new questions. These questions will help break down performance problems specific to database drives, log file drives and pagefile drives in the resultant report. Previously, this was an all encompassing generic analysis which didn’t really give you the full picture of actual bottlenecks as there are latency differences between the database and log file drives.

    image

    Adding Database Drive letters is quite easy, and gathering the data for this input can be collected from various areas such as ExBPA and in the BLG file itself. These drive letters could also include Volume Mount Points.

    If you know the drive letters already, then that is great. Let’s say your database drives were Drive E:, Drive F:, and Drive G:; you would need to enter them separated by a semicolon, such as E:;F:;G: as shown in the screenshot above. You would also need to do this for the Log File Drives and the Page File Drives for a more accurate analysis.

    Using an ExBPA report of the server and the Tree Report view would be the best way to get the drive letter and volume mount point information, but sometimes a BLG file may provide enough information regarding volume mount points based on the naming convention that was used (keep in mind though that although the volume mount point is named “<Drive Letter:>\Logs” it may actually contain database files or no files at all). A screenshot below shows the Logical Disk counter that shows the volume mount point names. Unfortunately we don’t have a scripted way to pull the data out of the blg file at this time, so this is a manual process.

    image

    For the above information, assuming all the _DATA volume mount points contained Exchange databases, you would start entering data in the question as the following:

    S:\SG01_SG04_DATA;S:\SG05_SG08_DATA;S:\SG09_SG12_DATA

    You get the idea… Just remember that all drives and mount points need to be separated by a semicolon and you should be good.

    Now it’s important to note that we have included a catch-all Generic Disk analysis in case any of the drive questions were not answered. So, if you ran a report and forgot to enter any drive information, you will get an output similar to the following in the Table of Contents. This may lean you towards an actual disk-related problem due to the number of times an analysis crossed over a threshold. You will see that there were 527 disk samples taken in this perfmon, and all Database, Log and Page file drives have the same alert count in them. It is actually normal that this happens, because we now log a tripped threshold for each drive-type-specific analysis that has fallen through to the Generic Disk analysis. If you see this, then go directly to the Generic analysis to look at your disk analysis.

    image

    For each one of the thresholds that tripped in which drive letters were not entered, you will see an entry in the table similar to the following stating that no data was entered in the questions. You can either ignore this and view the Generic Disk analysis or re-run the analysis with the questions correctly answered, providing a more accurate analysis.

    image

    The same holds true for the Hub Transport and Client Access server disk analysis.

    Another question that was added to the Mailbox server role analysis was ClientMajority which specifies if the majority of the clients are running in cached mode or not. This setting directly affects the analysis of the MSExchange Database(Information Store)\Database Cache % Hit counter.

    image

    Database Cache % Hit is the percentage of database file page requests that were fulfilled by the database cache without causing a file operation, i.e. not having to read the Exchange database to retrieve the page.  If this percentage is too low, the database cache size may be too small.

    Here are the thresholds that were added for this particular analysis.

    • WARNING - Checks to see if majority of the clients are in Online Mode and if the Database Cache Hit % is less than 90%
    • ERROR - Checks to see if majority of the clients are in Online Mode and if the Database Cache Hit % is less than 75%
    • WARNING - Checks to see if majority of the clients are in Cached Mode and if the Database Cache Hit % is less than 99%
    • ERROR - Checks to see if majority of the clients are in Cached Mode and if the Database Cache Hit % is less than 85%

    The last question that was added was CCRInUse. This question helps differentiate analysis for CopyQueueLength and ReplayQueueLength between CCR and LCR replication since we have different recommended values for each configuration.

    image

    There was also an update for the HUB and HUB/CAS role threshold files where you can now specify drive information for both the Exchange Transport queue file drives and the Page File Drives.

    image

    Additionally the 64bit question was removed from all the Exchange Server 2007 PAL threshold files, since Exchange 2007 is only supported in production on a 64bit Windows operating system.

    It’s probably also important to point out that we’ve managed to get all of the thresholds corrected and updated and a number of new analysis rules added; however, we haven’t necessarily managed to update or include all of the associated rule and troubleshooting text that goes with each analysis rule. As we get some more time these will be updated; for now it is more important to migrate all the PAL 1.0 Exchange content to the new PAL 2.0 that will be available sometime in the near future.

    To download the latest XML files, go to the XML update page here or use the direct download here

    If you are interested in the other changes that were made to the 3 threshold files here they are below:

    MBX:

    • Change RPC slow packets (>2s) more than 0 to only trigger on average value as per online documentation.
    • Updated RPC Average Latency to warn on 25ms average (as per online guidance), warn on 50ms max and critical on 70ms max or average.
    • Added MSExchangeIS\RPC Client Backoff/sec to warn on greater than 5.
    • Modified MSExchangeIS Client: RPCs Failed: Server Too Busy to only create a warning for greater than 50 and removed the error alert for greater than 10 seeing as this counter is mostly useful to know if Server Too Busy RPC errors have ever occurred (since it is calculated since store startup)
    • Modified MSExchangeIS\RPC Requests to warn on 50 instead of 70 as higher than 50 is already too high and to then error on 70.
    • Removed the MSExchangeWS\Request/Sec counter from Web Related as MSExchangeWS does not exist on a MBX server.
    • Added _Total to instance exclusions for disk analysis.
    • Added _Total to instance exclusions for MSExchange Search Indices counters.
    • Added _Total to instance exclusions for various other counters.
    • Created a generic disk analysis for when either the log drives, database drives or pagefile drives is unknown.
    • Added in a warning alert threshold for Calendar Attendant Requests Failed when it is greater than 0.
    • Removed the System process exclusion for Process(*)\% Processor Utilization analysis as we do want to know if this is using excessive amounts of CPU as it can indicate a hardware issue
    • Configured the Privileged Mode CPU Analysis to work on _Total instead of individual processors.
    • Updated the Privileged Mode CPU Analysis to not continue if the Total Processor Time is not greater than 0, previously it did not continue if the Privileged Mode Time was not greater than 0. This meant we could get a divide by 0.
    • Updated the Privileged Mode CPU Analysis to warn on greater than 50% of total CPU and Total CPU is between 20 and 50
    • Added a warning alert for Processor\% User Time to fire if % User Time is greater than 75% as per online guidance.
    • Corrected Memory\Pages/Sec text of "Spike in pages/sec - greater than 1000" to read "greater than 5000"
    • Added IPv4\Datagrams/sec and IPv6\Datagrams/sec
    • Added TCPv4\Connection Failures and TCPv6\Connection Failures
    • Added TCPv4\Connections Established and TCPv6\Connections Established
    • Added TCPv4\Connections Reset and TCPv6\Connections Reset and set a threshold for both to warn on an increasing trend of 30
    • Added TCPv4\Segments Received/sec and TCPv6\Segments Received/sec
    • Updated MSExchange Database(Information Store)\Version buckets allocated to alert on greater than 11468 instead of 12000 i.e. 70% of 16384.
    • Collapsed all MSExchange ADAccess counters under MSExchange ADAccess category
    • Added _Global_ as an exclusion to .Net Related\Memory Leak Detection in .Net
    • Added _Global_ as an exclusion to .Net Related\.NET CLR Exceptions / Second
    • Updated .Net Related\.NET CLR Exceptions / Second to warn on greater than 100 exceptions per second.
    • Moved Process(*)\IO Data Operations/sec into an IO Operations category as it is not just disk related
    • Moved Process(*)\IO Other Operations/sec into an IO Operations category as it is not just disk related
    • Updated Network Packets Outbound Errors to alert on greater than 0 instead of 1
    • Updated Network Utilization Analysis to error on greater than 70%
    • Updated Memory\Page Reads/Sec to only warn on 100 average instead of 100 max, other thresholds of 1000 and 10000 still remain the same
    • Updated Memory\Pages Input/Sec's warning to read "More than 1000 pages read per second on average"
    • Updated Memory\Pages Input/Sec to not warn on max of 1000 (it is too low to warn on 1000 max)
    • Updated Memory\Pages Output/Sec's warning to read "More than 1000 pages written per second on average"
    • Updated Memory\Pages Output/Sec to not warn on max of 1000 (it is too low to warn on 1000 max)
    • Added a content indexing section for the Exchange 2007 indexing counters
    • Added analysis for ExSearch processor usage to warn on more than 1% and error on more than 5%
    • Added analysis for MSFTEFD* processor usage to warn on using more than 10% of the Store.exe processor usage
    • Updated .Net CLR Memory\% Time in GC to include * for process and exclude _Global. Removed 5% threshold and made 10 and 20% threshold warning and error conditions respectively.
    • Updated MSExchange Replication\ReplayQueueLength and CopyQueueLength Counters to exclude _Total
    • Modified MSExchange ADAccess Processes(*)\LDAP Search Time to warn on over 50 average and 100 max and critical on over 100 average and 500 max
    • Added threshold alerts for MSExchange ADAccess Processes(*)\LDAP Read Time
    • Added threshold alerts for MSExchange ADAccess Domain Controllers(*)\LDAP Read Time
    • Modified MSExchange ADAccess Domain Controllers(*)\LDAP Search Time to warn on over 50 average and 100 max and critical on over 100 average and 500 max and only if number of Search Calls/Sec is greater than 1
    • Added MSExchange Store Interface(*)\ConnectionCache out of limit creations and MSExchange Store Interface(*)\ConnectionCache active connections counters and thresholds

    HUB:

    • Removed the System process exclusion for Process(*)\% Processor Utilization analysis as we do want to know if this is using excessive amounts of CPU as it can indicate a hardware issue
    • Configured the Privileged Mode CPU Analysis to work on _Total instead of individual processors.
    • Updated the Privileged Mode CPU Analysis to not continue if the Total Processor Time is not greater than 0, previously it did not continue if the Privileged Mode Time was not greater than 0. This meant we could get a divide by 0.
    • Updated the Privileged Mode CPU Analysis to warn on greater than 50% of total CPU and Total CPU is between 20 and 50
    • Added a warning alert for Processor\% User Time to fire if % User Time is greater than 75% as per online guidance.
    • Removed Process\%Processor Time from the Process category as it is already included as part of Processor\Excessive Processor Use By Process
    • Modified Memory\Available MBytes to warn on less than 100MB and critical on less than 50MB
    • Added threshold alerts for Memory\% Committed Bytes in Use to warn on greater than 85% and critical on more than 90%
    • Added Memory\Committed Bytes
    • Corrected Memory\Pages Input/Sec to warn on greater than 1000 as it was set to warn on greater than 10
    • Added threshold alert for Memory\Pages Output/Sec to warn on greater than 1000
    • Corrected Memory\Pages/Sec text of "Spike in pages/sec - greater than 1000" to read "greater than 5000"
    • Modified Memory\Transition Pages Repurposed/Sec to warn on spikes greater than 1000 instead of 100
    • Modified Memory\Transition Pages Repurposed/Sec to critical on averages greater than 500 instead of 1000
    • Modified Memory\Transition Pages Repurposed/Sec to critical on spikes greater than 3000 instead of 1000
    • Added IPv4\Datagrams/sec and IPv6\Datagrams/sec
    • Added TCPv4\Connection Failures and TCPv6\Connection Failures
    • Added TCPv4\Connections Established and TCPv6\Connections Established
    • Added TCPv4\Connections Reset and TCPv6\Connections Reset and set a threshold for both to warn on an increasing trend of 30
    • Added TCPv4\Segments Received/sec and TCPv6\Segments Received/sec
    • Modified MSExchange ADAccess Processes(*)\LDAP Search Time to warn on over 50 average and 100 max and critical on over 100 average and 500 max
    • Added threshold alerts for MSExchange ADAccess Processes(*)\LDAP Read Time
    • Added threshold alerts for MSExchange ADAccess Domain Controllers(*)\LDAP Read Time
    • Modified MSExchange ADAccess Domain Controllers(*)\LDAP Search Time to warn on over 50 average and 100 max and critical on over 100 average and 500 max and only if number of Search Calls/Sec is greater than 1
    • Added MSExchangeTransport Queues(_total)\Messages Queued for Delivery Per Second
    • Removed all MSExchangeMailSubmission Counters as they are only on MBX
    • Removed MSExchange Database ==> Instances Log Generation Checkpoint Depth - MBX as this was for MBX role
    • Modified MSExchange Database ==> Instances(edgetransport/Transport Mail Database)\Log Threads Waiting to warn on greater than 10 and error on 50
    • Added an error alert for MSExchange Extensibility Agents(*)\Average Agent Processing Time (sec) to error on greater than 60 average
    • Collapsed all Database counters under MSExchange Database category
    • Collapsed all MSExchange ADAccess counters under MSExchange ADAccess category
    • Moved Process(EdgeTransport)\IO* counters into EdgeTransport IO Activity category
    • Updated MSExchange Database(*)\Database Page Fault Stalls/sec to MSExchange Database(edgetransport)\Database Page Fault Stalls/sec
    • Updated MSExchange Database ==> Instances(*)\I/O Database Reads Average Latency to MSExchange Database ==> Instances(edgetransport/Transport Mail Database)\I/O Database Reads Average Latency
    • Updated MSExchange Database ==> Instances(*)\I/O Database Writes Average Latency to MSExchange Database ==> Instances(edgetransport/Transport Mail Database)\I/O Database Writes Average Latency
    • Added _Total exclusions where necessary
    • Removed 64bit question
    • Added a question for pagefile drive
    • Added edgetransport as an exclusion to Memory\Memory Leak Detection
    • Added _Global_ as an exclusion to .Net Related\Memory Leak Detection in .Net
    • Added _Global_ as an exclusion to .Net Related\.NET CLR Exceptions / Second
    • Updated .Net Related\.NET CLR Exceptions / Second to warn on greater than 100 exceptions per second.
    • Moved Process(*)\IO Data Operations/sec into an IO Operations category as it is not just disk related
    • Moved Process(*)\IO Other Operations/sec into an IO Operations category as it is not just disk related
    • Updated Network Packets Outbound Errors to alert on greater than 0 instead of 1
    • Updated Network Utilization Analysis to error on greater than 70%
    • Updated Memory\Page Reads/Sec to only warn on 100 average instead of 100 max, other thresholds of 1000 and 10000 still remain the same
    • Updated Memory\Pages Input/Sec's warning to read "More than 1000 pages read per second on average"
    • Updated Memory\Pages Input/Sec to not warn on max of 1000 (this is too low to warn on 1000 max)
    • Updated Memory\Pages Output/Sec's warning to read "More than 1000 pages written per second on average"
    • Updated Memory\Pages Output/Sec to not warn on max of 1000 (this is too low to warn on 1000 max)
    • Updated .Net CLR Memory\% Time in GC to include * for process and exclude _Global. Removed 5% threshold and made 10 and 20% threshold warning and error conditions respectively.
    • Added all Store Interface counters.
    • Added MSExchange Store Interface(*)\ConnectionCache out of limit creations and MSExchange Store Interface(*)\ConnectionCache active connections counters and thresholds 

      CAS:

    • Created a new CAS file based off of the common updates in the new MBX xml
    • Updated ASP.NET\Request Wait Time to warn on greater than 1000 max and error on 5000 max
    • Updated ASP.NET Applications(__Total__)\Requests In Application Queue to error on 3000 rather than 2500
    • Updated MSExchange Availability Service\Average Time to Process a Free Busy Request to warn on 5 avg or max and error on 25 avg or max
    • Updated MSExchange Availability Service\Average Time to Process a Cross-Site Free Busy Request to warn on 5 avg or max and error on 25 avg or max
    • Updated MSExchange OWA\Average Response Time to warn on max greater than 100 and more than 2 OWA requests per second on average
    • Updated MSExchange OWA\Average Search Time to warn on max greater than 31000
    • Updated MSExchangeFDS:OAB(*)\Download Task Queued to warn on avg greater than 0
    • Moved Process(*)\IO Data Operations/sec into an IO Operations category as it is not just disk related
    • Moved Process(*)\IO Other Operations/sec into an IO Operations category as it is not just disk related
    • Updated ASP.Net Requests Current to warn on greater than 1000 and error on greater than 5000 (max size it can get to is 5000 before requests are rejected)
    • Added all Store Interface counters.
    • Added MSExchange Store Interface(*)\ConnectionCache out of limit creations and MSExchange Store Interface(*)\ConnectionCache active connections counters and thresholds

    HUB/CAS:

    • Combined both HUB and CAS XMLs for analysis of combined roles.
    • Added all Store Interface counters.
    • Added MSExchange Store Interface(*)\ConnectionCache out of limit creations and MSExchange Store Interface(*)\ConnectionCache active connections counters and thresholds

    Perfwiz for Exchange 2010


    NOTE: This version of Perfwiz has been replaced by a newly written script that is talked about in http://blogs.technet.com/b/mikelag/archive/2010/07/09/exchange-2007-2010-performance-data-collection-script.aspx

    1. Download the appropriate version of Perfwiz for your server

      How to download
      To download these XML files to your computer, right click the file of your choice, select Save Target As... , and then save it to a directory location of your choice on your Exchange Server

      Role Based
      Use these as a high level look into how the server is performing and if you need to branch out with more counters, use the Full Counter/Instance set below.

      Exchange_2010_Perfwiz-MBX.xml
      Other roles coming soon....

      All Counters/All Instances
      Use this counter set at your own discretion, as logging this many counters could potentially cause performance degradation on your server.

      Exchange_2010_Perfwiz-Full.xml
    2. Open Performance Monitor
    3. Expand Reliability and Performance and then expand Data Collector Sets
    4. Right click User Defined, Select New, and then Data Collector Set
    5. Enter a unique name for this Data Collector set (ie. ExPerfwiz), select Create from a template (Recommended) and then click Next
    6. Select the Browse button, navigate to the XML file that was saved in Step 1, select Open
    7. Select Next on the next screen
    8. Enter a root directory where you would like to store the performance log files. Click Next
    9. If you need to run this performance log under different credentials, enter it on this page. Click Finish
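
    If you prefer scripting this setup instead of clicking through the wizard, the same template import can typically be done from a PowerShell prompt with the built-in logman utility. This is only a sketch; the collector set name and XML path below are example values.

    # Import the Perfwiz template as a user-defined Data Collector Set, then start it
    logman import ExPerfwiz -xml "C:\PerfLogs\Exchange_2010_Perfwiz-MBX.xml"
    logman start ExPerfwiz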

    The effects of paged pool token memory usage and Outlook 2007’s shared folders feature


    In Exchange 2003, we are bound by 32-bit architecture memory limitations, and paged pool memory just so happens to be one of them. A very common issue that we in Exchange support see is paged pool memory depletion related to excessive token usage.

    For the most part, paged pool memory troubleshooting for Exchange 2003 can be found in http://support.microsoft.com/kb/912376. This article goes over what access tokens are, what symptoms are seen when paged pool memory usage is low, how token sizes are generally calculated, and a script that will allow you to export group memberships for users that are mail enabled. It does not go over some of the other reasons why paged pool memory may increase on any given Exchange 2003 server.

    There is also a great document “Addressing Problems Due to Access Token Limitations” at http://download.microsoft.com/download/8/f/3/8f36dfe4-47d0-4775-ad5a-5614384921aa/AccessTokenLimitation.doc that goes over Access Tokens in much greater detail.

    If you had used the scripts in 912376 to export the users’ group memberships, you might end up with data in Excel similar to the screenshot listed below.

    image

    Notice that we have a slew of users that have really high group membership. This can and will affect overall Token sizes when connecting to an Exchange server.

    If you ever wanted to calculate the approximate size of a user’s access token, the formula as listed in 912376 is the following:

    [12 x number of user rights] + [token overhead] + [44 x number of group memberships] = token size in bytes

    Trying to find the maximum number of security groups a user can belong to before we double the size, we would come up with the following ((4096-500) / 44 = 81.7). This means that a user can be a part of a maximum of approximately 80 security groups before the token size doubles to 8192. This is a rough guesstimate, but it should be enough to estimate overall token sizes for users in general terms if you wanted to. Microsoft does not recommend more than 100MB of paged pool memory for token allocations as it may impact Exchange and other applications installed on the server.
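
    To put the formula into perspective, here is a quick worked example in PowerShell. The numbers below are purely illustrative assumptions, not values taken from any particular environment.

    # Rough token size estimate using the KB 912376 formula; all inputs are example values
    $userRights    = 14      # assumed number of user rights
    $tokenOverhead = 500     # assumed fixed overhead in bytes
    $groupCount    = 120     # assumed number of security group memberships
    $tokenSize = (12 * $userRights) + $tokenOverhead + (44 * $groupCount)
    $tokenSize   # 5948 bytes in this example, already past the default 4096-byte allocation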

    One of the reasons that I believe there are many security groups that are mail enabled in any organization is because this is the default behavior when creating groups in Active Directory Users & Computers (ADUC). When you create a group in ADUC, the default type is a security group, not distribution, so over time you can end up with users that are a part of many security groups, which is a bad thing. If you let this go long enough, a user’s access token can bloat to high values, causing paged pool memory issues due to excessive token usage.

    One of the first things you need to do is ensure that the hotfix in http://support.microsoft.com/kb/912480 is applied on the Exchange server. This hotfix decreases the amount of connections that Outlook users make to an Exchange server.

    Each connection that a client makes to a server will increase overall token usage for that user due to the fact that we need to expand all group memberships to ensure that the user has permissions to the requested resource and doesn’t have a specific deny in any of the many groups they are a part of. If a client establishes a lot of TCP connections to a single process such as the store.exe process, you could exhaust paged pool memory resources on the server very easily every time a user needs to verify that they have permissions to access resources in other users’ mailboxes. The more connections you have open, the more token usage that will be consumed.

    So now I would like to bring to light a new feature in Outlook 2007 to help cache shared folders such as Calendars, Contacts, Tasks, and Notes. This specific feature is Download shared folders (excludes mail folders), which is turned on by default in any new Outlook 2007 profile. This allows easier and faster access to other users’ mailbox folders since the data will be cached locally in their Outlook profile. Here is what the option looks like in a user’s Outlook profile.

    image

    For each shared folder that is cached in a user profile, an additional connection will be opened to the Exchange server to sync that data to the user’s OST file. If a user is opening a lot of other folders, then you will see many connections to the Exchange server. To make it even better, http://support.microsoft.com/kb/955572 introduced a new feature where you can cache other users’ mail folders such as the Inbox by adding a value to the registry on each client computer called CacheOthersMail. Once this is implemented, any user can now have the ability to cache any other user’s mailbox folders should they have access to them.

    As you can see, this can significantly increase the number of connections to any given Exchange server, thus increasing overall token usage. If you have users that have high group memberships, then it may be advantageous to disable these features. This can be accomplished organization-wide via Group Policy by disabling the “Synchronizing data in shared folders” option listed in http://technet.microsoft.com/en-us/library/cc179175(office.12).aspx. Disabling this feature will cause the Outlook clients to make direct RPC connections to the Exchange server since that folder data is no longer cached locally, so it may take just a bit longer to pull up the data they are looking for.

    If you are migrating your users from Outlook 2003 to Outlook 2007 and are noticing increased paged pool token memory usage on an Exchange server, then this may be one of the reasons for it.

    One way to determine if an Outlook client has many connections open to the Exchange server is to run netstat -ao on the Exchange server and pipe it to a text file. If you view the list of connections from a single client IP address to the Store process, which in this case was running under port 1674, you may see many established connections as shown below.

    TCP    EXSERVER:1674       CLIENTCOMPUTER:3196     ESTABLISHED     4944
    TCP    EXSERVER:1674       CLIENTCOMPUTER:3202     ESTABLISHED     4944
    TCP    EXSERVER:1674       CLIENTCOMPUTER:3205     ESTABLISHED     4944
    TCP    EXSERVER:1674       CLIENTCOMPUTER:3210     ESTABLISHED     4944
    TCP    EXSERVER:1674       CLIENTCOMPUTER:3219     ESTABLISHED     4944
    TCP    EXSERVER:1674       CLIENTCOMPUTER:3234     ESTABLISHED     4944
    TCP    EXSERVER:1674       CLIENTCOMPUTER:3235     ESTABLISHED     4944
    TCP    EXSERVER:1674       CLIENTCOMPUTER:3238     ESTABLISHED     4944
    TCP    EXSERVER:1674       CLIENTCOMPUTER:3239     ESTABLISHED     4944
    TCP    EXSERVER:1674       CLIENTCOMPUTER:3241     ESTABLISHED     4944
    TCP    EXSERVER:1674       CLIENTCOMPUTER:3246     ESTABLISHED     4944
    TCP    EXSERVER:1674       CLIENTCOMPUTER:3252     ESTABLISHED     4944
    TCP    EXSERVER:1674       CLIENTCOMPUTER:3253     ESTABLISHED     4944
    TCP    EXSERVER:1674       CLIENTCOMPUTER:3257     ESTABLISHED     4944
    TCP    EXSERVER:1674       CLIENTCOMPUTER:3258     ESTABLISHED     4944
    TCP    EXSERVER:1674       CLIENTCOMPUTER:3260     ESTABLISHED     4944
    TCP    EXSERVER:1674       CLIENTCOMPUTER:3261     ESTABLISHED     4944
    TCP    EXSERVER:1674       CLIENTCOMPUTER:3263     ESTABLISHED     4944
    TCP    EXSERVER:1674       CLIENTCOMPUTER:3264     ESTABLISHED     4944
    TCP    EXSERVER:1674       CLIENTCOMPUTER:3277     ESTABLISHED     4944
    TCP    EXSERVER:1674       CLIENTCOMPUTER:3278     ESTABLISHED     4944
    TCP    EXSERVER:1674       CLIENTCOMPUTER:3280     ESTABLISHED     4944
    TCP    EXSERVER:1674       CLIENTCOMPUTER:3282     ESTABLISHED     4944
    TCP    EXSERVER:1674       CLIENTCOMPUTER:3283     ESTABLISHED     4944
    TCP    EXSERVER:1674       CLIENTCOMPUTER:3284     ESTABLISHED     4944
    TCP    EXSERVER:1674       CLIENTCOMPUTER:3285     ESTABLISHED     4944
    TCP    EXSERVER:1674       CLIENTCOMPUTER:3286     ESTABLISHED     4944
    TCP    EXSERVER:1674       CLIENTCOMPUTER:3293     ESTABLISHED     4944
    TCP    EXSERVER:1674       CLIENTCOMPUTER:3301     ESTABLISHED     4944

    If this type of pattern is seen for various clients, then this feature may be affecting your Exchange server’s paged pool memory usage, which could potentially lead to unresponsive servers.
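
    If you want a quicker way to spot the worst offenders in a large capture, a small PowerShell sketch like the one below can tally established connections per client. It assumes the netstat -ao output was redirected to a file called connections.txt and that 1674 is the Store port from your own capture; both are example values, not anything prescribed.

    # Count ESTABLISHED connections to the Store port per remote client from a saved netstat capture.
    # The third whitespace-delimited field of each netstat line is the remote address (host:port).
    Get-Content .\connections.txt |
        Where-Object { $_ -match "ESTABLISHED" -and $_ -match ":1674\s" } |
        ForEach-Object { ($_.Trim() -replace "\s+", " ").Split(" ")[2].Split(":")[0] } |
        Group-Object |
        Sort-Object Count -Descending |
        Select-Object Count, Name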

    In one of the cases that I had worked, disabling this feature via group policy dropped the runtime paged pool memory usage by 40-50MB. This is a huge memory savings that may keep your server afloat until you migrate to Exchange 2007 or Exchange 2010. Of course, those versions of Exchange are not susceptible to these types of nonpaged or paged pool memory issues since we have moved away from the 32-bit platform and will only support the 64-bit versions of Exchange in production.

    Once you have migrated these users to later versions of Exchange, you can then remove the group policy to enable this feature again.

    Of course there are many other reasons for excessive token usage on an Exchange server, but this is just one of those new features that may bite you in the end.

    Until next time…

    Exchange Server Profile Analyzer updated!!


    For any of you folks that might have created a hair ball on the floor by pulling your hair out trying to obtain data from the Exchange Profile Analyzer (EPA) tool, there is good news: a new version has just been released and should help resolve the majority of the issues that you might have run into previously. One of the major blockers was if you had changed your default recipient policy to include variables such as %s.%g@domain.com; this update resolves that problem, amongst a slew of other enhancements.

    Please note that this is still the DAV version of the tool and will not run against Exchange 2010 servers; currently this tool should only be run against Exchange 2003 or Exchange 2007 installations, in the hopes that you can obtain the user profiles for your users to allow proper planning/sizing of your Exchange 2010 servers.

    You can download the latest binaries from the following locations.

    32-bit - http://www.microsoft.com/downloads/details.aspx?familyid=8F575F60-BD80-44AA-858B-A1F721108FAD&displaylang=en
    64-bit - http://www.microsoft.com/downloads/details.aspx?familyid=C009C049-9F4C-4519-A389-69C281B2ABDA&displaylang=en

    Enjoy!!

    Perfmon data viewing nuance…


    Recently, I have been running into a perfmon nuance that I would like to call out. On any given performance troubleshooting day, which is almost every day, I open up Perfmon, select the BLG file of choice and then add the appropriate counters that I would like to view.

    In this instance, I was presented with the following view.

    image

    The odd thing is that all instances are linear in fashion, whether the counter values are increasing or decreasing. Something surely is not right. As you can see, this is not helpful at all.

    Since I know that this perfmon file has data in it, I needed to find a way to view this information, as it was crucial that we find the root cause of the issue with the data provided.

    One of the reasons this could happen is if there are long gaps of uncollected data in the perfmon file as data is being appended to the existing BLG file. So if I started a perfmon up for 1 second and then immediately stopped it, I would collect one sampling of data. If my perfmon data collection is set up to append data to the existing BLG file, then the next time the collection is started it will append the data to the existing BLG file. Since we average out the data over the last 100 samples, you can clearly see that the averages here will be way off and will show data similar to the above screen capture.

    So how do I overcome this? Well, first, you need to find out where the data gaps are in the perfmon file. To do this, you can use the relog utility that comes with Windows to dump out the BLG file to CSV format. To make this easier, I export just the \Processor(_Total)\% Processor Time counter, as that counter is normally collected in every perfmon file. The syntax of the command is below along with the associated result information.

    image

    In the above case, we can see that the blg contained information for 14 hours, but only 29 samples were taken. Based on that alone, the numbers do not add up.
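
    For reference, the command takes roughly this form; the input BLG file name below is just an example.

    # Dump a single counter from the BLG to CSV so the sample timestamps are easy to inspect
    relog .\ExchangePerf.blg -c "\Processor(_Total)\% Processor Time" -f CSV -o log.csv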

    If you open the log.csv file in notepad, you should get an output similar to the following. As highlighted below, you will see the actual gap in data collection which was throwing off the view in perfmon. This output will also give us the time interval at which this perfmon was taken, which was every 30 seconds. This time interval is extremely important to me when viewing perfmon data for any Exchange performance related issues.

    image

    The first data sampling is what we don’t need here, so we need to change our Time Range in Perfmon by moving the slider bar over to the right to approximately one hour before the data capture.

    image

    Once you have the slider where you want it, right-click the slider bar and then select Zoom To, as shown in the following pictorial.

    image

    What you should now see is some resemblance of performance data.

    image

    To remove the whitespace surrounding the capture, left-click and drag the time slider to the right until you have highlighted the data you want to look at. Once selected, right-click the window and select Zoom To to zoom in to that time frame.

    image

    What you should now see is the actual data you want to look at.

    image

    Another way to view this data is to use the relog utility to export specific timeframes from the log. Since we know the start and end times of the data we want to view, we can simply run the following command to export data between those times to an export.csv file. Once you pull up this export.csv file in perfmon, you will have the same view as the one listed above.

    image
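
    The time-bounded export takes roughly this form; the timestamps and file names are example values, so use the start and end times you identified in log.csv.

    # Export only the slice between the known start and end times to a new file perfmon can open
    relog .\ExchangePerf.blg -b "08/17/2010 09:30:00" -e "08/17/2010 23:30:00" -f CSV -o export.csv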

    I hope this helps you decipher whether or not a perfmon file has been corrupted or the data just does not look correct for whatever reason.

    Happy performance viewing!!

    Audit Exchange 2007 SP2 Auditing


    There have been a few cases coming through the support channels stating that auditing is not working for whatever reason. After reviewing the cases, we have found that this is due to users or groups in the Configuration Partition of Active Directory that have been granted Full Access to the containers in the tree or that have the All Extended Rights permission. Having these permissions will essentially bypass any auditing events from occurring.

    Let’s take a step back for a moment. When applying Exchange 2007 SP2 in an organization, the schema is extended with the right ms-Exch-Store-Bypass-Access-Auditing. If a user or group has previously been granted Full Control within the Configuration tree, that user or group will then take on an allow for this bypass auditing right, thus being exempt from being audited. This is not good for compliance reasons, and the end results will not contain audit entries for all users that can/will be accessing mailboxes.

    The other problem is that there is currently no way to lock down the ability for any administrator to add a specific Allow on an object in Active Directory for this bypass right, thus excluding them from being audited.

    Listed below are the *default* groups that have the Bypass Auditing right due to various permission settings:

    • Domain Admins
    • Enterprise Admins
    • Exchange Organization Administrators

    Once the Schema has been extended, there are 5 places to add auditing bypass entries in the configuration container in Active Directory as shown below.

    • Exchange Organization Container
    • Exchange Administrative Group Container
    • Exchange Servers Container
    • Exchange Server object
    • Exchange Database object

    When auditing is not working as expected, it would be a rather tedious process to check permissions throughout the configuration tree for these objects where the bypass extended right has been set. I have created a PowerShell script (AuditBypassPerm.ps1) that should help export permissions for each of these objects to make your job of finding permission problems that much easier.

    Before I go over the script, I want to describe some of the terms that you will need to know when looking through the output of this script. Objects in Active Directory can be assigned Access Rights using the System.DirectoryServices.ActiveDirectoryRights enumeration. These access rights control what type of permission a user or group has on a particular object. A listing of all the available access rights can be found at http://msdn.microsoft.com/en-us/library/system.directoryservices.activedirectoryrights.aspx.

    The three main rights that we are concerned with in relation to this auditing are the following:

    • ms-Exch-Store-Bypass-Access-Auditing = Bypass Exchange Access Auditing in the Information Store
    • GenericAll = The right to create or delete children, delete a subtree, read and write properties, examine children and the object itself, add and remove the object from the directory, and read or write with an extended right.
    • ExtendedRight = A customized control access right. This can be used to specifically apply a particular right such as ms-Exch-Store-Bypass-Access-Auditing or this could mean that you may have an allow right for All Extended Rights as shown below. All Extended Rights mean just that, All rights including the “Bypass Exchange Access Auditing in the Information Store” right.

       image

    So with that said, these are the three main rights that we need to concentrate on when we are trying to find a needle in the haystack. The next piece that we also need to be cognizant of is whether there is a specific deny, or whether one of the 3 rights is being inherited from some other object in the Configuration tree.
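
    If you want to eyeball the access control entries on one of the five objects yourself, a quick ADSI sketch like the one below can list them and flag the rights of interest. The distinguished name is purely an example, and this is not what AuditBypassPerm.ps1 does internally; it is only an illustration.

    # List ACEs on an Exchange configuration object and show who holds GenericAll or ExtendedRight.
    # For ExtendedRight entries, the ObjectType GUID identifies which extended right is granted.
    $obj = [ADSI]"LDAP://CN=First Organization,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=mikelaglab,DC=com"
    $acl = $obj.psbase.ObjectSecurity
    $acl.GetAccessRules($true, $true, [System.Security.Principal.NTAccount]) |
        Where-Object { $_.ActiveDirectoryRights -match "GenericAll|ExtendedRight" } |
        Select-Object IdentityReference, ActiveDirectoryRights, AccessControlType, ObjectType, IsInherited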

    In my example, I used an account called AuditTest1 to show how one would troubleshoot something like this. I granted a deny on the bypass right at the organization level so that this user’s account would be audited, but then at the Database object level, I granted the All Extended Rights right for this account. What this essentially did was bypass auditing for this user at the database level, thus logging no events for any mailbox access for mailboxes on that database.

    The syntax for the script is as follows:

    .\AuditBypassPerm.ps1 <MBXServername>

    After running the script, you will get output similar to the following

    image

    Once the script has completed, notepad will open up a text file which will provide information similar to the below pictorial. This allows you to visually see, for each of the 5 objects where the bypass permissions can be set, what rights were assigned to this user.

    image

    The key takeaway here is to view what groups/accounts have the ms-Exch-Store-Bypass-Access-Auditing, GenericAll, or ExtendedRight right set on them and, if one is set, to determine at what level in the Configuration tree a potential override has been set that would prevent specific accounts from being audited.

    If there is an inherited permission and you cannot view what rights have been set at each of the 5 object levels, then you will need to open ADSIEdit.msc and walk up the tree from that object until you find the object on which permissions have been changed.

    So that sounds great, but what happens when you have a user that is not listed in the tree, but is still not being audited? The main reason for this is that the user is a member of a group that has been granted one of these 3 rights somewhere in the tree.

    Since PowerShell V1 does not have a direct way to view a user’s group membership, I created another really small script to list out the groups that the user is a part of. This is not the same format that is listed above, but it will help provide a general understanding of what groups the user is a part of, which you can then compare to the output listed above for your troubleshooting efforts.

    The syntax for the group membership script is as follows:

    .\GetUserGroups.ps1 <Username>

    Note: This can be in simple format or domain\username format as shown below.

    image

    This set of scripts can be downloaded here
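
    If you just need a quick one-off check without downloading the scripts, a minimal sketch along these lines can list the direct group memberships for an account. This is not the author’s GetUserGroups.ps1; the account name is an example, and it only returns the groups in the memberOf attribute (nested groups and the primary group are not expanded).

    # Minimal sketch: list the groups in a user's memberOf attribute via DirectorySearcher
    $searcher = New-Object System.DirectoryServices.DirectorySearcher
    $searcher.Filter = "(&(objectCategory=user)(sAMAccountName=AuditTest1))"
    $result = $searcher.FindOne()
    if ($result -ne $null) {
        $result.Properties["memberof"] | ForEach-Object { $_ }
    }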

    I hope this helps untangle the inability to get auditing working for some users/groups as reviewing permissions is sometimes a very tedious task.

    Another question that comes up is how do I map the Message ID that is listed in the event when accessing the message to an actual message in a user’s mailbox? A sample event is listed below with the relevant parts highlighted.

    image

    So from that, we can see that we have a message ID of <6B83547937704D4EB0EFA4327EF0DEC82D8F92EC36@MIKELAGDC.mikelaglab.com> and this message was opened in the folder /Calendar.

    For every message on an Exchange 2007 server, we generate a unique Message ID that is stamped on the PR_INTERNET_MESSAGE_ID MAPI property of each message. This MAPI property is also equivalent to 0x1035001E.

    With MFCMAPI, you can find this message rather easily by creating a content restriction in the mailbox. To do this, you would need to create a MAPI profile for that user specifically on an administrative workstation or use a MAPI profile that has full access to all mailboxes.

    IMPORTANT: While going in with an administrative account performing these operations, if the message is touched in the mailbox while using MFCMAPI, an auditing event will be logged to the Exchange Auditing log. If you don’t want to log any events while doing your investigation, it may be best to log on with an approved account that has the Bypass Auditing right so that whatever actions you are taking inside a user’s mailbox are not audited, or use an account that is specific to finding what object was audited.

    Once you open the mailbox in MFCMAPI, you would then need to navigate to and open the folder that was listed in the auditing event as shown above. In this case, it was the Calendar folder. After the folder is opened, click Table on the menu and then select Create and Apply Content Restriction.

    image

    Next, we need to specify what Property Tag we are looking for, which in this example is 0x1035001E. Once this property tag number is entered, you will see that the Property Name field will show what we want to filter on. Click OK when you are finished.

    image

    In the Property Value field, enter <6B83547937704D4EB0EFA4327EF0DEC82D8F92EC36@MIKELAGDC.mikelaglab.com> including the < > characters as shown below. It is very important that the < > characters are entered; otherwise, the restriction will not return the message. Click OK when you are done.

    image

    The result should be the message that you are looking for.

    image

    That is all for now….

    Happy Auditing!!
