We're seeing poor performance across the board for our internal installation, and have a number of fronts we think we can make improvements on... but one question I had was what to expect for cache effectiveness values on:
/clearspace/admin/system-cache.jsp
I'm seeing numbers all over the board, but only rarely does clearspace mark any of them w/ the red background. The documentation seemed to suggest 90% cache effectiveness as a target; some of our caches are at ~17%, etc. But often-times, the caches aren't even full.
Thoughts?
Hi bretm,
There are a couple of things that can cause a document cache to be ineffective:
You can increase the lifespan for caches by editing the applicable value in the jive_startup.xml (requires restart). The values are stored as the number of milliseconds content should live in the cache. It may be appropriate in your case to bump them up for the caches that you see are ineffective. You also want to make sure that your caches have plenty of room--50 to 60% full is a good target to shoot for. Clearspace will highlight caches that are both ineffective and nearly full.
Would you mind if I made this into a public case? I'm sure other customers would benefit from this knowledge as well.
Thanks!
Nope, that's fine, make it public. But I'm still wondering why clearspace isn't flagging more caches than it is... can you provide a bit more clarity on when clearspace thinks a cache is a problem?
I'm seeing both full caches, and ineffective caches *NOT* get flagged... so I'm wondering if I need to tinker w/ those at all, or just leave them be...
Here's the exact formula that we use:
lowEffec = (hits > 500 && hitValue < 85.0 && freeMem < 20.0);
So, if a cache has over 500 hits and is less than 85% effective and is over 80% full, it will be flagged. If these conditions don't hold true, then you don't need to adjust the allocated memory for the cache. However, you may still want to increase the lifespan for your caches if you have plenty of memory to spare and don't want things to fall out of the cache.
Austen,
Can you point me at the right documentation for the jive_startup.xml cache lifetime parameter? Can't seem to find it anywhere...
Thanks!
Unfortunately, there isn't any documentation for this, but it should be fairly straightforward. For instance, if you want to modify the Document cache, search for this in your jive_startup.xml (please note, size may be different from what is shown below):
<DocumentCache>
<size>33554432</size>
<maxLifetime>43200000</maxLifetime>
</DocumentCache>
To increase the lifespan of the cache, you would need to increase the value of maxLifetime. The current setting is for 12 hours (12h * 60min/h * 60s/min * 1000ms/s = 43200000 milliseconds).
Thanks,
Austen
Our jive_startup.xml doesn't have any of that... here's what I'm seeing:
<cache>
<clustering>
<enabled>true</enabled>
</clustering>
</cache>
Do you have a raw list of all configurable properties somewhere? Or can I just use the list of caches in the system-cache.jsp page?
It looks like you haven't changed the default settings at all, so here's what you need to do:
Once you have customized the cache settings, all of the settings will be written to your jive_startup.xml and you should be able to locate the appropriate section for updating the cache lifetime.
Thanks!
Austen
Austen,
I'm not sure what it is about our environment, but I'm not seeing that behavior around jive_startup.xml. We control this file via puppet, so it's possible any saved changes are lost. Please provide an example jive_startup.xml file w/ the relevant cache fields so that I can move forward testing some fixes.
Thanks,
--Bret
Bret,
I'm reluctant to provide this because the sizes on my caches almost certainly vary from what you have, but here is a copy of what I have locally. You must copy/paste the entire <cache> section into the jive_startup.xml on each node. Please let me know if you have any more questions on this.
Thanks,
Austen
Hi Austen,
I always wonder why there is a maxLifetime property for cached objects. Clearspace / SBS does as far as I know invalidate objects which were modified and this works clusterwide. So I'd like to set the cache lifetime to -1 /maxInt. I wonder whether this is a bad idea.
Do you use weak hashmaps for the caches? This would allow one to set also the size to very big values without getting memory problems.
LG
A value of zero or less means unlimited. Not a bad idea, but it will mean that your caches consume more memory over time, so you'll have to plan accordingly.
We use Oracle's Coherence for our distributed caching layer. Once the cache is full, items will be evicted. The main concern when setting unlimited cache lifespans is the amount of memory that they will consume over time.
Austen,
I understand your concern, and thank you very much for providing the sample file. We're probably going to turn it into a general template, and use our existing cache sizes as the parameters ![]()
Here's another question, though: our two biggest performance problems at the moment are /clearspace/index.jspa and /clearspace/docs/* ... is there an easy way to tell which of the caches most affect those paths, and what kind of max lifetime would you look to use for each? We've got a fair bit of room memory-wise to increase cachine (only using about 2gb on an 8gb machine for the entirety of the container), but I'm worried we'll end up caching something that really needs to be dynamic.
Thanks!
Are you sure that the caching is really the problem here? Have you done any thread dumps to confirm that the DB traffic is the bottleneck?
The only way to determine what caches are used for a given request would be to trace through each line of code and identify which caches are being used. Cache management is primarily done in the Manager layer, though there are a few other places you'll find this logic in the application. One action may use several managers and each manager may have zero to many caches, not all of which may be used depending on the nature of the request.
We're not at all sure it's db-related; we're trying to track down as many avenues as possible.
Here's a scatterplot of request completion time in microseconds as measured by httpd against client IP address, on a logarithmic scale. Pretty much across the entire network topology, we're seeing some pretty horrible outliers on certain document pages, as well as main-page loads.
Blue == successful HTTP 200 return codes
Green == 401 Authorization required
Violet == client errors
Red == server errors
Greys == Not Modified
Our interpretation of this is as follows:
1) Our performance issues are not network-topology based. Outliers are spread pretty much across the board
2) We're not seeing good caching behavior; we're looking at addressing this via mod_expires and moving static content to our proxy layer
Load and cpu utilization on the mysql machine is incredibly light, and we're not near our db connection pool limit.
The only slightly noteworthy thing is that cpu utilization on our jboss server jumped a bit after upgrading to 2.5.6 from 1.10.5:
Memory on the app servers seems fine:
Threads look ok:
Anything jumping out at you? We have *got* to get the outliers down from where're they're at to come close to our internal SLA.
Bret,
Not a lot jumping out from what you have provided. The thing that would help the most here would be to have some thread dumps from your application server at the time you are experiencing the slowness. If you can get 3-5 thread dumps during a period where you are replicating the problem and upload them to this case, that will give us the information that we need to identify what is causing the performance problem. If you can provide a correlated thread level top output with each thread dump, that would be even better. Since the load on the DB is light, I have a feeling that your performance problems are not with the caching layer, but that remains to be determined.
Thanks,
Austen
Is there any other profiling data we can provide? Timing the thread-dumps to occur during the outliers is going to be non-trivial.
Given the following data on poorly performing urls, what would you tend to tackle if not the caches?
| Document Load Time(s) | Main page completion time (s) | |
| median | 8.11 | 19.2 |
| average | 10.81 | 23.82 |
| max | 148.18 | 93.95 |
I guess I need some more information:
A series of thread dumps over the time that you are seeing the performance problem will give us a snapshot of what is happening in the system. Without these, we are forced to guess as to what the problem might be. From what I understand about the behavior of your system, this doesn't appear to be your problem. One additional thing that could help would be to identify long running queries in the DB. Can you grab a report of the top 10 longest running queries?
Thanks,
Austen
austen.rustrum wrote:
I guess I need some more information:
A series of thread dumps over the time that you are seeing the performance problem will give us a snapshot of what is happening in the system. Without these, we are forced to guess as to what the problem might be. From what I understand about the behavior of your system, this doesn't appear to be your problem. One additional thing that could help would be to identify long running queries in the DB. Can you grab a report of the top 10 longest running queries?
Thanks,
Austen
1) I've witnessed it, on the order of seeing the main page request for a small to medium-sized document take ~30s to complete. Haven't see the 190s+ ones personally, though.
2) Inside the corporate firewall, w/ a 1gb network connection and very low latency, here's what I'm seeing in firebug:
3) This is not running on a VM. It's running on 2 dedicated jboss app nodes... 8gb ram each, and I think 4 cores each. These are *not* slow machines, nor are they suffering from any other load.
4) Haven't broken it down to know if we're talking specific documents or not yet, but given the number of problem documents, I think it's unlikely; there are about 287 unique documents showing performance outside of our SLA target.
Definitely sounds like you have powerful machines, so something else is happening here and I don't think it is a caching problem. Do you get this same sort of performance when hitting the application servers directly? Do you know what the IDs for the 287 documents that have performance problems are? If so, can you take some thread dumps while accessing one of them? What sorts of customizations do you have? Any?
austen.rustrum wrote:
Definitely sounds like you have powerful machines, so something else is happening here and I don't think it is a caching problem. Do you get this same sort of performance when hitting the application servers directly? Do you know what the IDs for the 287 documents that have performance problems are? If so, can you take some thread dumps while accessing one of them? What sorts of customizations do you have? Any?
Hi,
the thread dumps do not look like standard Sun JVM dumps. I wonder which JVM you are using.
20 seconds to load a page is really bad. Did you see high CPU usage?
Does it perform better when you shutdown one server (no active cluster)?
LG
it2000 wrote:
Hi,
the thread dumps do not look like standard Sun JVM dumps. I wonder which JVM you are using.
20 seconds to load a page is really bad. Did you see high CPU usage?
Does it perform better when you shutdown one server (no active cluster)?
LG
It's a Sun JDK, the dumps were generated through the jmx-console and run through links -dump.
20s to load a page *is* really bad
And that's not even the worst of it.
We've not seen anything on cpu utilization or load that disagrees w/ the monthly graphs posted above. The box just seems to be sitting there ![]()
Performance doesn't appear to be very different w/ just one node vs two. For a long time, we were running just one node, and performance seemed to be just as bad (we don't have the same degree of data from that timeperiod though).
It's a Sun JDK, the dumps were generated through the jmx-console and run through links -dump.
20s to load a page *is* really bad
And that's not even the worst of it.
Agreed, that is abnormal and needs to be fixed. Can you please use the standard kill -3 method to take thread dumps? I have tools that can analyze thread dumps in this format to easily provide me a view into long running threads between sets of thread dumps. In pouring through these so far I haven't found much besides a quiet server sitting around for work. There is occasionally 1 thread that is actually doing something in Jive code, but other than that, everything is waiting for work. Very odd that you would see 20s for the application server to perform one request.
Adding an additional node won't give you any performance benefit due to the overhead in running the cluster. The only thing that you'll get is failover. It takes 3 or more nodes before you start to see a performance benefit.
Hi Austen,
as far as i can tell "the standard kill -3 method to take thread dumps" is out-dated. Today one wants to use "jstack" which is included in the JDK.
http://www.igniterealtime.org/community/docs/DOC-1033 is the JVM document used to identify Openfire/JVM issues.
Maybe one needs to install the JDK if only the JRE is installed. So one can also use JConsole: http://java.sun.com/javase/6/docs/technotes/guides/management/jconsole.html - this could be helpful.
https://visualvm.dev.java.net/ may be more useful as it makes better use of the MXBeans and may display blocked or waiting threads. Anyhow I did not use it yet.
LG
One would think that the Sun provided tools are better, but kill -3 actually provides more information and can be correlated with a thread level top output or ps output. Not the case with jstack. Plus, great tools like TDA have been built around the format provided by kill -3:
Perhaps your experience is different, but I've always been able to identify the source of the problem with a kill -3 thread dump.
Two additional things:
Thanks, looking forward to the results! If you could make a note of the specific URLs that were slow during your thread dumps, that will help me analyzing them. Also, I did mean spaces. We tend to use them synonymously around here, so sometimes I forget the correct term to use ![]()
austen.rustrum wrote:
Thanks, looking forward to the results! If you could make a note of the specific URLs that were slow during your thread dumps, that will help me analyzing them. Also, I did mean spaces. We tend to use them synonymously around here, so sometimes I forget the correct term to use ![]()
Thanks!
There are a lot of thread dumps to pour through and analyze here. Were they all taken during a time when there was a performance problem or were some of them taken during normal load? I'll have a look through them and get back to you on this tomorrow.
Here's my analysis from looking at the first thread dumps on application server #1:
console.log.dump1.1-4: Nothing happening
console.log.dump1.5-10: View communities action
console.log.dump1.11: ActionManagerProxy.getRecentActivity
console.log.dump1.12: RecentContentWidget and ActivityManagerProxy.getRecentActivity (same thread)
console.log.dump1.13: RecentContentWidget (same thread)
console.log.dump1.14: YourStatusUpdatesWidget and ActivityManagerProxy.getRecentActivity
console.log.dump1.15-19: ActivityManagerProxy.getRecentActivity (same thread)
console.log.dump1.19: CommunityProxy.getParentContainer
console.log.dump1.20: Execute Freemarker result
console.log.dump2.1-5: Nothing happening
console.log.dump2.6: Execute Freemarker result
console.log.dump2.7-19: Nothing happening
console.log.dump2.20: Downloading image
Thus far, the thread dumps on application server #2 seem to have a similar pattern emerging:
console.log.dump1.1: RenderUtils.renderSubjectToText, popularity determination worker (background task)
console.log.dump1.2: CommunityAction.getDocumentCount, RenderUtils.renderSubjectToText (same thread?), popularity determination worker
console.log.dump1.3-5: ActivityManagerProxy.getRecentActivity, popularity determination worker
console.log.dump1.6: CommunityActionsWidget, execute Freemarker result, popularity determination worker
console.log.dump1.7: PopularBlogPostsWidget, ActivityManagerProxy.getRecentActivity, popularity determination worker
console.log.dump1.8: PopularBlogPostsWidget (same thread), ActivityManagerProxy.getRecentActivity, popularity determination worker
console.log.dump1.9: IteratorProxy, ActivityManagerProxy.getRecentActivity, popularity determination worker
console.log.dump1.10: IteratorProxy, ActivityProxy.getJiveObject, popularity determination work
After reviewing the first 50 thread dumps, my main area of concern is with the DB. It seems as if there must be quite a few queries running without indexes. There are a few things that will help us troubleshoot this:
Thanks!
Austen
austen.rustrum wrote:
After reviewing the first 50 thread dumps, my main area of concern is with the DB. It seems as if there must be quite a few queries running without indexes. There are a few things that will help us troubleshoot this:
It grew organically. We did migrate from one db server to another, however. I believe import/export was a typical mysqldump operation.
I'm not certain, I'll ask our DBA's.
We've very reluctant to do this; it would take a good deal of work to appropriately scrub the data, and doing so may taint its usefulness for you. If absolutely necessary, we can discuss this further w/ Rajeev and our account rep on your side.
It does not appear to be directly related to single vs multi-node, based upon our logged data.
I think I mentioned this earlier in the thread, apologies if not... we have a custom AuthFilter that utilizes REMOTE_USER (coming from mod_auth_kerb) plus LDAP to instantiate user objects. We pretty much followed your developer docs on the subject. Aside from that, our only customizations are the fact that we package via RPM and deploy using puppet.
I'll include this in my request to the DBAs... when I hear back from them, I'll post more info.
Have you been able to make any headway understanding the cache effectiveness data I posted, or coming up w/ some recommended changes there? I'm looking to do a point release in the next week or so, so I really need to pull the trigger on any changes there.
As LG pointed out below, you need to increase the size for two of the caches to the size I have recommended below (document ID cache and the document versions cache). This can be done in the admin console under System > Settings > Caches > Edit Caches. I don't think we need to change the lifespans for any of the caches or run the datagram test, given the historical success stats.
bretm wrote:
I'll include this in my request to the DBAs... when I hear back from them, I'll post more info.
Here are the ten longest-running queries (sanitized) since we migrated from one db host to another:
Count: 1 Time=29.70s (29s) Lock=0.00s (0s) Rows=1.0 (1), clearspace_user[clearspace_user]@app02
SELECT taskID, parentObjectType, parentObjectID, subject, body, userID, assignorID, ownerID, creationDate, modificationDate, dueDate, completed FROM jivePTask WHERE taskID = N
Count: 1 Time=27.91s (27s) Lock=0.00s (0s) Rows=1.0 (1), clearspace_user[clearspace_user]@app02
SELECT COUNT(jiveSGroupMember.userID) FROM jiveUser, jiveSGroupMember WHERE N=N AND jiveSGroupMember.userID = jiveUser.userID AND jiveSGroupMember.groupID = N AND ( jiveSGroupMember.memberType = N ) AND jiveSGroupMember.userID = N
Count: 1 Time=26.30s (26s) Lock=0.00s (0s) Rows=0.0 (0), clearspace_user[clearspace_user]@app02
UPDATE jivePTask SET subject = 'S', body = 'S', assignorID = N, ownerID = N, modificationDate = N, dueDate = N, dueDateNull = N, completed = N WHERE taskID = N
Count: 1 Time=26.09s (26s) Lock=0.00s (0s) Rows=0.0 (0), clearspace_user[clearspace_user]@app02
INSERT INTO jiveWidgetFrmProp (frameID, name, propValue) VALUES (N,'S','S')
Count: 3 Time=24.73s (74s) Lock=0.00s (0s) Rows=0.7 (2), clearspace_user[clearspace_user]@2hosts
SELECT userID, username, passwordHash, name, firstName, lastName, nameVisible, email, emailVisible, userEnabled, creationDate, modificationDate, lastLoggedIn, email, isExternal, federated, status FROM jiveUser WHERE username='S'
Count: 3 Time=24.55s (73s) Lock=0.00s (0s) Rows=0.0 (0), clearspace_user[clearspace_user]@app02
UPDATE jiveProperty SET propValue='S' WHERE name='S'
Count: 1 Time=24.16s (24s) Lock=0.00s (0s) Rows=5.0 (5), clearspace_user[clearspace_user]@app02
SELECT statValue, statDate FROM jiveStat WHERE statID = N AND communityID = N AND statDate >= N AND statDate <= N
Count: 1 Time=23.96s (23s) Lock=0.00s (0s) Rows=0.0 (0), wordpressmu_user[wordpressmu_user]@www02
UPDATE wp_281_posts SET `post_author` = 'S', `post_date` = 'S', `post_date_gmt` = 'S', `post_content` = 'S', `post_content_filtered` = 'S', `post_title` = 'S', `post_excerpt` = 'S', `post_status` = 'S', `post_type` = 'S', `comment_status` = 'S', `ping_status` = 'S', `post_password` = 'S', `post_name` = 'S', `to_ping` = 'S', `pinged` = 'S', `post_modified` = 'S', `post_modified_gmt` = 'S', `post_parent` = 'S', `menu_order` = 'S', `guid` = 'S' WHERE ID = 'S'
Count: 1 Time=22.74s (22s) Lock=0.00s (0s) Rows=1.0 (1), clearspace_user[clearspace_user]@app02
SELECT internalDocID, typeID FROM jiveDocument WHERE documentID='S'
Count: 4 Time=22.36s (89s) Lock=0.00s (0s) Rows=1.0 (4), clearspace_user[clearspace_user]@2hosts
SELECT id FROM jiveID WHERE idType=N
If you need some sample queries run, I can now do that myself.
Hi Bret,
I think it is pretty clear from these query times that there are some problems with the DB. Have you shown these queries to your DBA? Here's one that really grabs my attention:
Count: 1 Time=26.09s (26s) Lock=0.00s (0s) Rows=0.0 (0), clearspace_user[clearspace_user]@app02
INSERT INTO jiveWidgetFrmProp (frameID, name, propValue) VALUES (N,'S','S')
A simple insert took 26 seconds to complete. Several of the other queries that should be returning very quickly are also taking an exceedingly long time. Consider this one:
Count: 4 Time=22.36s (89s) Lock=0.00s (0s) Rows=1.0 (4), clearspace_user[clearspace_user]@2hosts
SELECT id FROM jiveID WHERE idType=N
The jiveID table should have less than 100 records in it and this query is returning only 1 of them, so there is no way that it should ever take this long for the query to complete. Can you please work with your DBA to validate that these numbers and, if so, to correct the DB problems?
Thanks,
Austen
austen.rustrum wrote:
Hi Bret,
I think it is pretty clear from these query times that there are some problems with the DB. Have you shown these queries to your DBA? Here's one that really grabs my attention:
Count: 1 Time=26.09s (26s) Lock=0.00s (0s) Rows=0.0 (0), clearspace_user[clearspace_user]@app02 INSERT INTO jiveWidgetFrmProp (frameID, name, propValue) VALUES (N,'S','S')A simple insert took 26 seconds to complete. Several of the other queries that should be returning very quickly are also taking an exceedingly long time. Consider this one:
Count: 4 Time=22.36s (89s) Lock=0.00s (0s) Rows=1.0 (4), clearspace_user[clearspace_user]@2hosts SELECT id FROM jiveID WHERE idType=NThe jiveID table should have less than 100 records in it and this query is returning only 1 of them, so there is no way that it should ever take this long for the query to complete. Can you please work with your DBA to validate that these numbers and, if so, to correct the DB problems?
Thanks,
Austen
I'll be working w/ the dba's a bit more; my concern though is w/ the counts; none of these has a particularly high occurence rate (if I'm interpreting the report accurately).
To put this in scale, in the past two weeks, we've had 1700+ requests to /clearspace/index.jspa that have taken longer than 10s to complete.
I'll continue to work w/ the DBA's to get a broader picture of db performance, but right now, I'm viewing these as outliers.
Can you provide what queries you'd like to ensure are functioning within a given range? I can then run some ad-hoc testing of these manually, in addition to the reports from the dba's.
Thanks,
--Bret
There are thousands of queries in the application, so giving a few sample queries will be a bit like looking for the needle in the haystack. Can you please upload your entire slow queries log to this case so I can do some analysis? I did see that the counts were low, but they are concerning nonetheless. The other queries that show up in that list should likewise be very quick to complete.
austen.rustrum wrote:
There are thousands of queries in the application, so giving a few sample queries will be a bit like looking for the needle in the haystack. Can you please upload your entire slow queries log to this case so I can do some analysis? I did see that the counts were low, but they are concerning nonetheless. The other queries that show up in that list should likewise be very quick to complete.
Our whole slow query log is a bit oversized (many gigs, dba's owe me some log rotation
), but the following seem to be the most likely candidates:
# Query 11: 8.40 QPS, 0.00x concurrency, ID 0x939EE24E862D3EE8 at byte 10746215707
# This item is included in the report because it matches --limit.
# pct total min max avg 95% stddev median
# Count 0 64
# Exec time 1 1259s 2s 52s 20s 35s 25s 6s
# Lock time 0 5ms 36us 163us 71us 138us 26us 54us
# Rows sent 0 0 0 0 0 0 0 0
# Rows exam 0 0 0 0 0 0 0 0
# Users 1 clearspace_user
# Databases 1 clearspace
# Time range 2009-02-10 12:52:29 to 2009-03-26 15:32:31
# Query_time distribution
# 1us
# 10us
# 100us
# 1ms
# 10ms
# 100ms
# 1s ###########################################
# 10s+ ################################################################
UPDATE jiveUser SET name='XXX', firstName=null, lastName=null, nameVisible=1, email='XXX', emailVisible=1, passwordHash='XXX', userEnabled = 1, creationDate=1235626296657, modificationDate=1236813631789, lastLoggedIn = 1236813631789, username = 'XXX', isExternal=0, federated=1, status = 6 WHERE userID=3469\G
# Query 19: 6.56 QPS, 0.00x concurrency, ID 0x5D077CA4494210DD at byte 10746215389
# This item is included in the report because it matches --limit.
# pct total min max avg 95% stddev median
# Count 0 29
# Exec time 0 443s 2s 54s 15s 32s 10s 7s
# Lock time 0 1ms 26us 45us 35us 38us 9us 33us
# Rows sent 0 0 0 0 0 0 0 0
# Rows exam 0 0 0 0 0 0 0 0
# Users 1 clearspace_user
# Databases 1 clearspace
# Time range 2009-01-28 19:25:44 to 2009-03-26 07:43:38
# Query_time distribution
# 1us
# 10us
# 100us
# 1ms
# 10ms
# 100ms
# 1s ################################################################
# 10s+ ###########################################################
UPDATE jiveID SET id = id + 1 WHERE idType=310 AND ((@updatedId := id) IS NULL OR (@updatedId := id) IS NOT NULL)\G
# Query 21: 7.57 QPS, 0.00x concurrency, ID 0x89B06AEF35066F71 at byte 10746214552
# This item is included in the report because it matches --outliers.
# pct total min max avg 95% stddev median
# Count 0 18
# Exec time 0 328s 2s 58s 18s 32s 12s 13s
# Lock time 0 718us 20us 223us 39us 40us 7us 26us
# Rows sent 0 0 0 0 0 0 0 0
# Rows exam 0 0 0 0 0 0 0 0
# Users 1 clearspace_user
# Databases 0
# Time range 2009-01-30 11:47:41 to 2009-03-20 10:48:20
# Query_time distribution
# 1us
# 10us
# 100us
# 1ms
# 10ms
# 100ms
# 1s ########################################
# 10s+ ################################################################
DELETE FROM jiveDraftProp WHERE draftID = 19484\G
# Query 22: 6.55 QPS, 0.00x concurrency, ID 0x49A9C85C3F016806 at byte 4183439444
# This item is included in the report because it matches --outliers.
# pct total min max avg 95% stddev median
# Count 0 31
# Exec time 0 310s 398ms 199s 10s 5s 3s 4s
# Lock time 0 2ms 16us 118us 51us 84us 17us 47us
# Rows sent 0 0 0 0 0 0 0 0
# Rows exam 0 0 0 0 0 0 0 0
# Users 1 clearspace_user
# Databases 0
# Time range 2009-01-26 22:03:14 to 2009-03-24 12:12:41
# Query_time distribution
# 1us
# 10us
# 100us
# 1ms
# 10ms
# 100ms ############
# 1s ################################################################
# 10s+ ##
DELETE FROM jiveStat WHERE communityID = 2315\G
# Query 23: 6.86 QPS, 0.00x concurrency, ID 0x76D871A02A425819 at byte 9237182593
# This item is included in the report because it matches --outliers.
# pct total min max avg 95% stddev median
# Count 0 16
# Exec time 0 160s 2s 31s 10s 26s 6s 7s
# Lock time 0 682us 21us 66us 42us 60us 11us 44us
# Rows sent 0 0 0 0 0 0 0 0
# Rows exam 0 0 0 0 0 0 0 0
# Users 1 clearspace_user
# Databases 1 clearspace
# Time range 2009-01-28 09:25:41 to 2009-03-23 09:25:10
# Query_time distribution
# 1us
# 10us
# 100us
# 1ms
# 10ms
# 100ms
# 1s ################################################################
# 10s+ #############################
UPDATE jiveViewCount SET viewCount=127 WHERE objectType=102 AND objectID=18202\G
# Query 24: 15.42 QPS, 0.00x concurrency, ID 0x9912D8641A81158E at byte 3524340157
# This item is included in the report because it matches --outliers.
# pct total min max avg 95% stddev median
# Count 0 23
# Exec time 0 80s 2s 8s 3s 5s 2s 3s
# Lock time 0 472us 14us 31us 20us 27us 2us 19us
# Rows sent 0 0 0 0 0 0 0 0
# Rows exam 0 0 0 0 0 0 0 0
# Users 1 clearspace_user
# Databases 1 clearspace
# Time range 2009-02-11 02:09:36 to 2009-03-07 02:31:24
# Query_time distribution
# 1us
# 10us
# 100us
# 1ms
# 10ms
# 100ms
# 1s ################################################################
# 10s+
INSERT INTO jiveStat(statID,communityID,statDate,statValue) VALUES(303,2908,1234414800000,0.0)\G
(Full log attached.)
I'll be doing some more analysis on a smaller slice of data, trying to tie the slow queries up w/ the access log render times.
These numbers should be of great concern. All of those queries should be returning in sub-millisecond times, not multiple seconds and certainly not an average of 10-20s. Has your DBA done an analysis of the DB to see what the problem might be? The lock time on the queries is really low, so it doesn't appear to be a transactional issue.
I've engaged our DBAs, moving this to waiting on me until we hear back from them.
Removing auto-closing from this bug. Still waiting on DBA input, but some other changes went live that should help us pinpoint whether this is a db problem (slimmed down the slow-queries log, etc).
I've gotten some feedback from the DBA's; we're still seeing some issues crop up in our slow queries log. I've listed them below, w/ the corresponding explain plan from our stage env:
> Count: 2 Time=25.21s (50s) Lock=0.00s (0s) Rows=0.0 (0), xxx[xxx]@2hosts
> UPDATE jiveViewCount SET viewCount=N WHERE objectType=N AND objectID=N
mysql> EXPLAIN EXTENDED select viewCount=127 from jiveViewCount where objectType=102 AND objectID=2024;
+----+-------------+---------------+-------+---------------+---------+---------+-------------+------+----------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------------+-------+---------------+---------+---------+-------------+------+----------+-------+
| 1 | SIMPLE | jiveViewCount | const | PRIMARY | PRIMARY | 12 | const,const | 1 | 100.00 | |
+----+-------------+---------------+-------+---------------+---------+---------+-------------+------+----------+-------+
> Count: 6 Time=23.82s (142s) Lock=0.00s (0s) Rows=0.0 (0), xxx[xx]@app01
> UPDATE jiveUser SET name='S', firstName=null, lastName=null, nameVisible=N, email='S', emailVisible=N, passwordHash='S', userEnabled = N, creationDate=N, modificationDate=N, lastLoggedIn = N, username = 'S', isExternal=N, federated=N, status = N WHERE userID=N
mysql> EXPLAIN EXTENDED select name, firstName, lastName, nameVisible, email, emailVisible, passwordHash, userEnabled, creationDate, modificationDate, lastLoggedIn, username, isExternal, federated, status from jiveUser where userID=3000;
+----+-------------+----------+-------+---------------+---------+---------+-------+------+----------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------+-------+---------------+---------+---------+-------+------+----------+-------+
| 1 | SIMPLE | jiveUser | const | PRIMARY | PRIMARY | 8 | const | 1 | 100.00 | |
+----+-------------+----------+-------+---------------+---------+---------+-------+------+----------+-------+
> Count: 1 Time=21.85s (21s) Lock=0.00s (0s) Rows=1.0 (1), xxx[xxx]@app01
> SELECT bodyData FROM jiveDocumentBody WHERE bodyID=N
mysql> explain extended SELECT bodyData FROM jiveDocumentBody WHERE bodyID=20527;
+----+-------------+------------------+-------+---------------+---------+---------+-------+------+----------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+------------------+-------+---------------+---------+---------+-------+------+----------+-------+
| 1 | SIMPLE | jiveDocumentBody | const | PRIMARY | PRIMARY | 8 | const | 1 | 100.00 | |
+----+-------------+------------------+-------+---------------+---------+---------+-------+------+----------+-------+
> Count: 3 Time=20.38s (61s) Lock=0.00s (0s) Rows=0.0 (0), xxx[xxx]@2hosts
> UPDATE jiveID SET id = id + N WHERE idType=N AND ((@updatedId := id) IS NULL OR (@updatedId := id) IS NOT NULL)
Er, skip, weird syntax :)
> Count: 1 Time=15.80s (15s) Lock=0.00s (0s) Rows=0.0 (0), xxx[xxx]@app01
> SELECT draftID, userID, draftType, objectType, objectID, subject, body, modificationDate FROM jiveDraft WHERE userID=N AND draftType = N AND objectType = N AND objectID = N
mysql> EXPLAIN EXTENDED select * from jiveDraftProp WHERE draftID = 19484;
+----+-------------+---------------+------+---------------+---------+---------+-------+------+----------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------------+------+---------------+---------+---------+-------+------+----------+-------+
| 1 | SIMPLE | jiveDraftProp | ref | PRIMARY | PRIMARY | 8 | const | 1 | 100.00 | |
+----+-------------+---------------+------+---------------+---------+---------+-------+------+----------+-------+
> Count: 2 Time=14.48s (28s) Lock=0.00s (0s) Rows=1.0 (2), xxx[xxx]@2hosts
> SELECT count(*) FROM jiveID
mysql> explain extended select count(*) from jiveID;
+----+-------------+--------+-------+---------------+---------+---------+------+------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------+-------+---------------+---------+---------+------+------+----------+-------------+
| 1 | SIMPLE | jiveID | index | NULL | PRIMARY | 4 | NULL | 57 | 100.00 | Using index |
+----+-------------+--------+-------+---------------+---------+---------+------+------+----------+-------------+
> Count: 1 Time=8.64s (8s) Lock=0.00s (0s) Rows=0.0 (0), xxx[xxx]@app01
> DELETE FROM jiveDraftProp WHERE draftID = N
mysql> EXPLAIN EXTENDED select * from jiveDraftProp WHERE draftID = 19484;
+----+-------------+---------------+------+---------------+---------+---------+-------+------+----------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------------+------+---------------+---------+---------+-------+------+----------+-------+
| 1 | SIMPLE | jiveDraftProp | ref | PRIMARY | PRIMARY | 8 | const | 1 | 100.00 | |
+----+-------------+---------------+------+---------------+---------+---------+-------+------+----------+-------+
Several requests yesterday took 100s+ to finish; I don't have full tally yet, or the ratio of slow requests to slow queries yet.
What is your experience when running these queries directly against the DB? Many of them are filtering only on the primary key, which is indexed and should have a constant time access. Is the DB running on a VM? What kind of memory/CPU/network stats do you see on the DB server? Is this running on MySQL 5.0.x? Which version of the connector are you running?
austen.rustrum wrote:
What is your experience when running these queries directly against the DB?
The selects run quite fast against the db. What we're seeing in the slow query logs is periodic slowness based upon mostly updates and deletes.
Is the DB running on a VM?
No.
What kind of memory/CPU/network stats do you see on the DB server?
The box is pretty much idle. DBAs are looking into a theory around getting mysql to tell us when it's doing some of its internal log rotations to see if that's related; but recent slow-query-log entries have only been on the order of 30'ish seconds, tops. Apache request times are much, much worse. Graphs to come.
Is this running on MySQL 5.0.x?
Nope: MySQL 5.1.26-rc-community-log , though we'll be upgrading to 5.1.30 very soon.
Which version of the connector are you running?
MySQL-AB JDBC Driver mysql-connector-java-5.0.7 ( $Date: 2007-03-09 22:13:57 +0100 (Fri, 09 Mar 2007) $, $Revision: 6341 $ )\
I'm becoming less convinced that the db is the biggest problem re: performance. While the 30s+ times are sucky, they're a drop in the bucket, and don't correspond well to the worst request times. I've attached a graph of the two.
Two things that I would like you to try:
austen.rustrum wrote:
Two things that I would like you to try:
Austen,
--Bret
Can you not replicate the problem with the same data in your staging environment? From the thread dumps,. it doesn't look like it takes any load to reproduce the issue. The point of this exercise is to eliminate as many parts of the equation as possible so we can identify the true culprit. By removing Apache from the equation, we can definitively say whether it has anything to do with the problem. MySQL connectors have been known to have bugs and it could be something in the JDBC layer causing a problem, so switching the connector would help us rule out a problem with the specific connector that you are using.
I'm closing this case at this point, we seem to have found the major performance stumbling block: mod_jk.
When we updated to mod_jk/1.2.28 on RHEL 5, that seems to have had a pretty massive performance improvement.
If there are subsequent performance issues, I'll open a new, more targeted ticket.
Many thanks to both you and Erik for working w/ us on this gnarled issue.
Hi Bret,
if you really want to take a look at the SQL queries you may use the very simple one: "SELECT id FROM jiveID WHERE idType=N"
Austen did write: "The jiveID table should have less than 100 records in it and this query is returning only 1 of them, so there is no way that it should ever take this long for the query to complete."
On http://dev.mysql.com/tech-resources/articles/4.1/prepared-statements.html and http://dev.mysql.com/doc/refman/5.1/en/sql-syntax-prepared-statements.html are examples how to run prepared statements from the command line.
You may return all valid IDs (less than 100) and then pick a random one, I assume that the ID 1 exists:
mysql> SELECT id form jiveID
mysql> PREPARE stmt_name FROM "SELECT id FROM jiveID WHERE idType=?";
mysql> SET @test_parm = 1;
mysql> EXECUTE stmt_name USING @test_parm;
mysql> DEALLOCATE PREPARE stmt_name;
Anyhow I still wonder whether Clearspace has the same issues when you use a JBoss/Tomcat HTTP connector. Which java options are you using (PermSize, GC settings, ...)? Long running garbage collections because of a small PermSize can cause such trouble, anyhow the JVM should report OutOfMemory errors sooner or later which does not seem to be the case for you.
LG
Hi Austen,
what could "Can you please provide a report of the 10 longest running queries from MySQL?" help? Looking at the stack traces I see only a few "com.mysql.jdbc.Connection.execSQL()" statements in the stack traces, so I would say that the SQL is executed very fast.
-------
There are 80 "ajp-xxx-8009-NN" threads, I wonder whether there is a bottleneck within Apache or mod_jk. Do you encounter the same performance problems when you use the HTTP connector of JBoss?
LG
The AJP threads are coming from the Apache connector and are mostly sitting around waiting for work to do and are nothing to be concerned with. In fact, the majority of the ~300 threads are pretty boring in the thread dumps, just sitting around and waiting for something to do. The threads that are actually doing something will show a stack that is somewhere in the Jive code. If you look at thread dumps over a period of time, you can search for long running threads (same thread ID in the same section of code). There were a few areas of concern from what I could see, with ActivityManagerProxy and ViewCommunitiesAction being the worst offenders. The problem isn't necessarily in the DB--it could simply be in the number of calls that are being made--but it definitely could be a DB problem, so we need to rule that out. By analyzing the long running queries, we'll be able to determine if there is anything in that layer that is cause for concern.
Hi,
just my two cents about cache sizes:
Document ID Cache 0.12 MB 0.12 MB 98.0%
Document Versions Cache 0.12 MB 0.12 MB 92.4%
I would double the size of both, and if they fill again >90% I'd double them again and again. Even setting them to 10 MB should not be a problem within your environment.
LG
it2000 wrote:
Hi,
just my two cents about cache sizes:
Document ID Cache 0.12 MB 0.12 MB 98.0%
Document Versions Cache 0.12 MB 0.12 MB 92.4%
I would double the size of both, and if they fill again >90% I'd double them again and again. Even setting them to 10 MB should not be a problem within your environment.
LG
@LG, any recommendations on the maxLifetimes for those caches? I'm thinking perhaps multi-day, and go for 4x current size to compensate, or something like that?
One more thing:
Since you have the memory capacity to do so, you can switch to the Large Site preset in the Edit Caches page. This will bump up the sizes on all of your caches to a total size of ~575MB and will increase the sizes of the two problem caches to .5MB. If you haven't had a lot of traffic go against the site and you expect to have some decent load, this would be a good idea.
austen.rustrum wrote:
One more thing:
Since you have the memory capacity to do so, you can switch to the Large Site preset in the Edit Caches page. This will bump up the sizes on all of your caches to a total size of ~575MB and will increase the sizes of the two problem caches to .5MB. If you haven't had a lot of traffic go against the site and you expect to have some decent load, this would be a good idea.
Austen,
I'll go ahead and move up to "Large", but that's still not going to up the effectiveness of many of the caches, since many of them are ineffective, but not full. I'm interpreting this as cached items reaching their maxLifetime limit, and getting evicted as a result.
I've gone through the cache effectiveness doc I uploaded earlier, and intend to increase the maxLifetime on highly-used, ineffective caches. I've attached the diff of jive_startup.xml; please review and let me know if you forsee any negative consequences. Some form of this change will be going into our next internal release.
Thanks,
--Bret
PS--the threading on this is getting whacked... is this something fixable on your side, or a preference I need to set?
the threading on this is getting whacked... is this something fixable on your side, or a preference I need to set?
Going to preferences and selecting there "flat view" for threads improve things a lot.
@LG: indeed, thanks!
I'd agree with that, except I would bump them each up to 1MB, especially if you use documents heavily. Also, once your caches start to get about 60% full, you should start thinking about increasing their size as they can roll over quickly.
Hi,
your graphs are not very helpful. To see if you have a bottleneck you need to run "vmstat 1" on your server and look for peaks - "vmstat 60" does not provide enough details to see why one request did take long. Also the memory usage does not show how often a GC does occur. You should start Openfire with more parameters like
-XX:+PrintGCTimeStamps
-XX:+PrintGCDetails
-Xloggc:/tmp/gc.log
and monitor also the /tmp/gc.log file. It could help to use another garbage collector, but one should look at this file before trying this.
LG
@Austen: Thanks for the info. If the cache uses a weak hashmap then one does not really need to care about memory. The JVM will purge the cache when needed - of course that's not a very good solution but better than an OutOfMemory error.
Austen,
Just saw this in the 2.5.9 documentation:
Do you think this may be part of our performance issues?
Bret,
It certainly won't hurt. On the long running ViewCommunitiesAction, I did see it engaged in various counting operations for subcommunities (doc counts, blog post counts, etc.), but that section of code specifically did not surface in the thread dumps I examined. As far as the cache effectiveness is concerned, have you tried running a load test against your staging instance? I'm guessing you have fairly low effectiveness because you have fairly low usage at the moment? Do you expect that to change in production or will the use be fairly limited? If the use is limited and you want to prevent the initial hit to the DB, increasing the lifespan for your caches makes sense. However, if not, your caches could fill up at a much quicker rate since objects won't be expiring from them as quickly. Just something to be aware of. Have you run any tests since increasing the sizes of the ineffective document ID and version caches? Also, how is your community tree structured? Is it relatively flat or is it fairly deep? Do you use a lot of negative permissions?
Thanks,
Austen
austen.rustrum wrote:
Bret,
It certainly won't hurt. On the long running ViewCommunitiesAction, I did see it engaged in various counting operations for subcommunities (doc counts, blog post counts, etc.), but that section of code specifically did not surface in the thread dumps I examined. As far as the cache effectiveness is concerned, have you tried running a load test against your staging instance? I'm guessing you have fairly low effectiveness because you have fairly low usage at the moment? Do you expect that to change in production or will the use be fairly limited? If the use is limited and you want to prevent the initial hit to the DB, increasing the lifespan for your caches makes sense. However, if not, your caches could fill up at a much quicker rate since objects won't be expiring from them as quickly. Just something to be aware of. Have you run any tests since increasing the sizes of the ineffective document ID and version caches? Also, how is your community tree structured? Is it relatively flat or is it fairly deep? Do you use a lot of negative permissions?
Thanks,
Austen
Hi Bret,
I'd still like to know whether you encounter the same performance problems when you use the HTTP connector of JBoss.
LG
@Austen: As there are no locks in the thread dumps, no long-running SQL's I wonder whether it helps to change the JVM and the application server ...
We don't have this deployed in production, and it'd be a major step to rule that out. We'd need more indication that it's in the mod_jk/ajp sphere before going down that path.
@LG: We can see in the thread dumps that there are long running threads on the application server, and these problems need to be resolved. From my analysis, the HTTP connector does not appear to be a factor in the performance problem. Can you explain what you are seeing that is causing you to think this? Also, what makes you think that there are not long running SQL queries? As I mentioned before, it is possible that it is not a problem, but in my experience, it is difficult to make that assessment by looking at thread dumps alone, especially if they were taken several seconds apart. DB metrics are a far more reliable source of information in this regard.
Hi Austen,
grep -i mysql console.log.dump* | grep -v "MySQL Statement Cancellation Timer"shows that there are a few SQL statements executed, anyhow max. 1 per dump. So I would say there are no long-running transactions.
"From my analysis, the HTTP connector does not appear to be a factor in the performance problem."
I assume you mean the AJP connector. One thing I notice is that ~65 AJP threads are "RUNNABLE" and ~15 are "WAITING (on object monitor)" - I have no idea why. Anyhow this may be completely irrelevant if Apache is the bottleneck. So testing this with a HTTP connector (port 808x) simply makes sure that Apache/AJP is not causing trouble.
LG
Hi LG,
Here's what a connection to the DB looks like in a stack trace:
java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at com.mysql.jdbc.util.ReadAheadInputStream.fill(ReadAheadInputStream.java:113)
at com.mysql.jdbc.util.ReadAheadInputStream.readFromUnderlyingStreamIfNecessary(ReadAheadInputStream.java:160)
at com.mysql.jdbc.util.ReadAheadInputStream.read(ReadAheadInputStream.java:188)
- locked <0x00002aaad308da50> (a com.mysql.jdbc.util.ReadAheadInputStream)
at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1994)
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2411)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2916)
at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1631)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:1723)
at com.mysql.jdbc.Connection.execSQL(Connection.java:3250)
- locked <0x00002aaad2f89030> (a java.lang.Object)
at com.mysql.jdbc.Connection.setTransactionIsolation(Connection.java:5704)
- locked <0x00002aaad2fed148> (a com.mysql.jdbc.Connection)
at sun.reflect.GeneratedMethodAccessor98.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.logicalcobwebs.proxool.WrappedConnection.invoke(WrappedConnection.java:162)
at org.logicalcobwebs.proxool.WrappedConnection.intercept(WrappedConnection.java:87)
at $java.io.Serializable$$EnhancerByProxool$$b37f4c28.setTransactionIsolation(<generated>)
at com.jivesoftware.base.database.ConnectionPool.createCon(ConnectionPool.java:257)
at com.jivesoftware.base.database.ConnectionPool.getConnection(ConnectionPool.java:164)
at com.jivesoftware.base.database.DefaultConnectionProvider.getConnection(DefaultConnectionProvider.java:106)
at com.jivesoftware.base.database.dao.JiveDataSource.getConnection(JiveDataSource.java:34)
at org.springframework.jdbc.datasource.DataSourceUtils.doGetConnection(DataSourceUtils.java:113)
at org.springframework.jdbc.datasource.DataSourceUtils.getConnection(DataSourceUtils.java:79)
at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:577)
at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:641)
at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:670)
at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:678)
at org.springframework.jdbc.core.JdbcTemplate.queryForObject(JdbcTemplate.java:721)
at org.springframework.jdbc.core.simple.SimpleJdbcTemplate.queryForObject(SimpleJdbcTemplate.java:169)
at com.jivesoftware.base.database.dao.JiveJdbcOperationsTemplate.queryForObject(JiveJdbcOperationsTemplate.java:174)
at com.jivesoftware.community.impl.dao.UserContainerDAOImpl.getContainer(UserContainerDAOImpl.java:80)
at com.jivesoftware.community.impl.UserContainerManagerImpl.getUserContainer(UserContainerManagerImpl.java:49)
at com.jivesoftware.community.JiveContainerManagerImpl.getJiveContainer(JiveContainerManagerImpl.java:111)
at com.jivesoftware.community.impl.DbBlog.getJiveContainer(DbBlog.java:1894)
at com.jivesoftware.community.util.JiveContainerPermHelper.isContainerModerator(JiveContainerPermHelper.java:67)
at com.jivesoftware.community.util.CommentPermHelper.getCanModerateComments(CommentPermHelper.java:46)
at com.jivesoftware.community.proxy.CommentManagerProxy.getCommentCount(CommentManagerProxy.java:58)
at com.jivesoftware.community.proxy.CommentManagerProxy.getCommentCount(CommentManagerProxy.java:49)
I'm not sure what the MySQL Statement Cancellation Timer is, but it looks like maybe a background thread to kill off connections that are not responding. In any case, it appears that it misses DB activity. Also, as I pointed out above, thread dumps are a snapshot in time. A query that runs for 3 seconds is still a long-running query, especially if it is run multiple times per page request. Thread dumps that are spaced out over time may or may not catch that.
I assume you mean the AJP connector. One thing I notice is that ~65 AJP threads are "RUNNABLE" and ~15 are "WAITING (on object monitor)" - I have no idea why. Anyhow this may be completely irrelevant if Apache is the bottleneck. So testing this with a HTTP connector (port 808x) simply makes sure that Apache/AJP is not causing trouble.
Yes, thank you for catching that, I did mean AJP. It would be a good test to see if hitting the app server directly helps any, but empirical evidence thus far suggests that the problem is on the app server. As for the difference in the runnable vs. waiting--perhaps only a few connections are actively waiting for work and the rest remain dormant until the load on the app server increases? Just a thought...
Hi Austen,
I really wonder what "WAITING" does mean. I did take a look at app01/console.log.dump.1.5 - 1.20 and did search for "ajp-10.5.30.13-8009-8". This seems to be an active or passive AJP thread which seems to take ages to complete. I have no idea if this is one request or not, anyhow it seems that 1.5-1.10 do work/hang in "com.jivesoftware.community.action.ViewCommunitiesAction.execute(ViewCommunitiesAction.java:53-55)".
In the dumps 1.11-1.17 it seems that "com.jivesoftware.community.web.struts.FreemarkerResult.doExecute" works/hangs with "WAITING" while in 1.18-1.20 the thread is RUNNABLE.
It's quite unusal to get dumps which look similar, so one may really wonder whether the JVM has a problem. I also wonder that the line numbers are always displayed. For Jive SBS you have a setup script which sets ulimits, and other system parameters. I wonder wheterh this could help.
LG
PS: Also http://markmail.org/message/tzq5pyt476yzh5xo#query:org.apache.tomcat.util.net.JIoEndpoint%24Worker.await(JIoEndpoint.java%3A416)+page:1+mid:qwox7atferka7kma+state:results describes a problem which sounds similar: "... a POST request ... can take 250s ... while this same request can pass OK in hundreds of ms. ..." There the situation can be improved with setting the #file ulimit from 1024 to 4096:
" ... to persist on reboot edit your /etc/security/limits.conf and add follow lines:
* soft nofile 4096
* hard nofile 10240
..."
Hi,
This thread piqued my interest as well. Here are some suggestions.
1. Considerable number of ajp processors are waiting for work from the webserver.
Please consider setting the connectionTimeout and connection_pool_timeout if applicable,
to recycle the connections.
http://tomcat.apache.org/connectors-doc/generic_howto/timeouts.html
Also, please review the following link, JBOSS's default mod_jk configuration seem
to be inadequate
http://www.jboss.org/community/docs/DOC-11543 /* yay, using clearspace */
2. Also, it may be worthwhile to check the Min(Max)Servers configuration set (if any) in
Apache configuration as well.
3. Are you using the Apache Tomcat APR Native Library ? If so, please try taking that out and
see whether it gives any relief.
http://tomcat.apache.org/tomcat-6.0-doc/apr.html
4. mysql - any queries showing up in slow query log (if setup to log) ? If so, please check the access paths,
may be time to optimize some heavily acted upon tables.
Best Regards,
-senthil
senthilv wrote:
Hi,
This thread piqued my interest as well. Here are some suggestions.
1. Considerable number of ajp processors are waiting for work from the webserver.
Please consider setting the connectionTimeout and connection_pool_timeout if applicable,
to recycle the connections.
http://tomcat.apache.org/connectors-doc/generic_howto/timeouts.html
Also, please review the following link, JBOSS's default mod_jk configuration seem
to be inadequate
http://www.jboss.org/community/docs/DOC-11543 /* yay, using clearspace */
2. Also, it may be worthwhile to check the Min(Max)Servers configuration set (if any) in
Apache configuration as well.
3. Are you using the Apache Tomcat APR Native Library ? If so, please try taking that out and
see whether it gives any relief.
http://tomcat.apache.org/tomcat-6.0-doc/apr.html
4. mysql - any queries showing up in slow query log (if setup to log) ? If so, please check the access paths,
may be time to optimize some heavily acted upon tables.
Best Regards,
-senthil
1) We not even close to hitting maxThreads (currently set to 250); AJP threads started is ~70 to 80, and the number of threads busy tends to max out around 60-70'ish. Additionally, we do have the connection timeout set to 600000.
2) What are you looking for in this area? We have tons of spare httpd processes at our proxy layer... general configs are:
The proxies also serve other applications as well via separate vhosts, and are quite speedy there.
3) Removing APR isn't really an option w/out some clearer data to suggest that's the cause.
4) We just got some feedback on the queries, see my previous post.
Cheers,
--Bret
Jive combines collaboration software, community software & social networking software into the leading SBS solution.
© Copyright 2000–2009 Jive Software. All rights reserved.
915 SW Stark St., Suite 400, Portland, OR 97205