|
Rank: Advanced Member Groups: Member
Joined: 7/14/2014 Posts: 40
|
We're in the process of evaluating EO.PDF (I've already exchanged several e-mails with Jack about other aspects). While writing a new test application for the product I noticed a number of "zombie" Rundll32.exe instances running that were spawned by EO.PDF. I just updated to the latest download of EO.Total on the site (107) and the EO.Pdf assembly shows 5.0.84.2 as its version number.
I know that EO.PDF spawns rundll32.exe instances for purposes of loading the native code packed in the .NET assembly and performing the actual conversion from HTML to PDF. From reading on these forums I understand that these will remain running for a while until they time out from inactivity (I think one post suggested ~5mins is the timeout limit for these), that the process will spawn a number of these as needed (1 per concurrent conversion request), and that the process will manage references to these in order to reuse them for future requests. Under normal conditions, these child processes (rundll32.exe instances) seem to be deleted once the application ends. [I haven't tested idle timeouts yet.]
However, if I pass a very large HTML file (one that I know from previous attempts will fail to convert), the test application catches the exception thrown from the ConvertUrl call so that I can report the failure. When the test application then exits, the rundll32.exe instance(s) are not destroyed. Instead they become apparent "zombies" and continue to linger. After 90 minutes I still see these instances in memory and I have to kill each one manually. Each is using over 1.7GB of memory.
For purposes of my test application, I was initially calling for 4 conversions concurrently using tasks to concurrently spawn the conversions. The application waits for all of the tasks to complete before exiting. What I get when the application exits is 4 zombie rundll32.exe instances.
I then changed the test application to stagger the requests using a 30-second sleep between iterations of the task-spawning loop. What I saw then was at most 2 instances of rundll32.exe in memory at one time (as it takes about 46 seconds to reach the out-of-memory situation), and I did see the rundll32.exe instances getting cleaned up after running out of memory. The exception seems to be one instance that returns an "Attempted to read or write protected memory" exception instead of the "This session is no longer valid" exception that the other requests report. When the test application exits, this one instance becomes a zombie.
I then changed the test application further to sleep for 30secs after the wait for all tasks to complete. In this case the one instance that reported "Attempted to read or write protected memory" stayed in memory until the process exited, briefly was listed as a separate process, and then disappeared. However next time I repeated this situation it just stayed in memory as a zombie... After some further investigation it seems that if the last request reports the "Attempted to read or write protected memory" exception then it goes zombie after the test application exits but if it's an earlier request (3rd) that throws that exception then it will get deleted when the test application terminates.
Is there some way that I can get the child process ID for each conversion request so that I can kill off rundll32.exe after an exception if it doesn't self destruct after some reasonable time?
Is there some way to force the EO.PDF assembly to destroy all idle spawned rundll32 instances?
Is there something special I should be doing when catching the exception from the conversion process?
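For readers with the same first question, one approach that needs no support from EO.PDF is to snapshot the rundll32.exe process IDs before the conversions start and treat any IDs that appear afterwards as cleanup candidates. The sketch below uses only System.Diagnostics. The ChildProcessTracker class and its method names are hypothetical, and it assumes that nothing else on the machine starts rundll32.exe in the meantime, which will not hold on every system.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

// Hypothetical helper, not part of EO.PDF: tracks process instances
// that appeared after a baseline snapshot was taken.
public static class ChildProcessTracker
{
    // Returns the IDs of all processes currently running under the given name
    // (pass the name without the ".exe" extension, e.g. "rundll32").
    public static HashSet<int> Snapshot(string processName)
    {
        var ids = new HashSet<int>();
        foreach (Process p in Process.GetProcessesByName(processName))
        {
            using (p) ids.Add(p.Id);
        }
        return ids;
    }

    // IDs present in the current snapshot that were not in the baseline.
    public static List<int> NewSince(HashSet<int> baseline, HashSet<int> current)
    {
        return current.Where(id => !baseline.Contains(id)).OrderBy(id => id).ToList();
    }

    // Best-effort kill; a process may exit between the snapshot and this call.
    public static void KillAll(IEnumerable<int> ids)
    {
        foreach (int id in ids)
        {
            try
            {
                using (Process p = Process.GetProcessById(id))
                {
                    p.Kill();
                    p.WaitForExit(5000);
                }
            }
            catch (ArgumentException) { /* process is already gone */ }
            catch (InvalidOperationException) { /* exited while being killed */ }
            catch (System.ComponentModel.Win32Exception) { /* access denied or already terminating */ }
        }
    }
}
```

A possible usage: call Snapshot("rundll32") before starting the conversions, call it again after catching the exception and waiting a reasonable time, then pass NewSince(before, after) to KillAll.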
|
|
Rank: Administration Groups: Administration
Joined: 5/27/2007 Posts: 24,229
|
Hi,
Thank you very much for the detailed information. The design goal is that you should not need to monitor or kill the rundll32.exe process. When an "attempted to read or write protected memory" error occurs inside that process, the process may no longer be in a stable state and may no longer be capable of exiting cleanly. However, as long as the hosting .NET process is not corrupted, it should be able to monitor the runaway rundll32.exe and kill it if necessary. This is one of the main benefits of separating the native code into a separate process. So, as to your question, there should be no action required on your end to make sure they eventually exit. We will investigate and see why they did not do so.
Having said that, it's possible for us to add an explicit call to clean up those exes. While the automatic cleanup should work, it may have a time delay. An explicit cleanup call would allow you to clean up all child processes immediately.
Thanks!
|
|
Rank: Advanced Member Groups: Member
Joined: 7/14/2014 Posts: 40
|
Update: After a few more runs I did catch a case where the 3rd request threw the "attempted to read or write protected memory" error and it still became a zombie after the application ended (previously I only saw that if the last request threw this particular error). Using SysInternals Process Explorer here is the call stack on the "zombie" rundll32.exe instance:
ntoskrnl.exe!KeWaitForMultipleObjects+0xc0a
ntoskrnl.exe!KeAcquireSpinLockAtDpcLevel+0x732
ntoskrnl.exe!KeWaitForMutexObject+0x19f
ntoskrnl.exe!PoStartNextPowerIrp+0xba4
ntoskrnl.exe!PoStartNextPowerIrp+0x1821
ntoskrnl.exe!KeAcquireSpinLockAtDpcLevel+0x93d
ntoskrnl.exe!KeWaitForMutexObject+0x19f
win32k.sys+0xcd877
win32k.sys+0xcd911
win32k.sys+0xe04de
ntoskrnl.exe!KeSynchronizeExecution+0x3a23
wow64cpu.dll!TurboDispatchJumpAddressEnd+0x6c0
wow64cpu.dll!TurboDispatchJumpAddressEnd+0x676
wow64.dll!Wow64SystemServiceEx+0x1ce
wow64.dll!Wow64LdrpInitialize+0x42a
ntdll.dll!RtlUniform+0x6e6
ntdll.dll!RtlCreateTagHeap+0xa7
ntdll.dll!LdrInitializeThunk+0xe
USER32.dll!WaitMessage+0x15
mscorwks.dll+0x1b4c
mscorwks.dll!GetPrivateContextsPerfCounters+0x7e8e
mscorwks.dll+0x189c4
mscorwks.dll!CoUninitializeEE+0x11ac
mscorwks.dll!CoUninitializeEE+0x11df
mscorwks.dll!CoUninitializeEE+0x11fd
mscorwks.dll!CreateApplicationContext+0x6e3b
mscorwks.dll!CreateApplicationContext+0x6f9a
mscorlib.ni.dll+0x215638
mscorlib.ni.dll+0x2153e6
mscorlib.ni.dll+0x2152ce
mscorwks.dll+0x1b4c
mscorwks.dll+0x189c4
mscorwks.dll!CoUninitializeEE+0x11ac
mscorwks.dll!CoUninitializeEE+0x11df
mscorwks.dll!CoUninitializeEE+0x11fd
mscorwks.dll!CreateApplicationContext+0x4a26
mscorwks.dll!CreateApplicationContext+0x4b5a
mscorlib.ni.dll+0x22fad0
mscorlib.ni.dll+0x1c8d86
mscorlib.ni.dll+0x1c0e00
mscorlib.ni.dll+0x6b23aa
mscorlib.ni.dll+0x6b729b
mscorwks.dll+0x1b4c
mscorwks.dll+0x189c4
mscorwks.dll+0x18fb3
mscorwks.dll+0x18ff4
mscorwks.dll!CorExitProcess+0x26d76
mscorwks.dll!CorExitProcess+0x2907f
mscorwks.dll!CorExitProcess+0x29836
mscorwks.dll!CorExitProcess+0x297fa
mscorwks.dll!CorExitProcess+0x29686
mscorwks.dll!CorExitProcess+0x26a51
mscorwks.dll!CorExitProcess+0x26b87
mscorwks.dll!CorExitProcess+0x26c51
mscorwks.dll+0x1b4c
mscorwks.dll+0x189c4
mscorwks.dll!CoUninitializeEE+0x11ac
mscorwks.dll!CoUninitializeEE+0x11df
mscorwks.dll!CoUninitializeEE+0x11fd
mscorwks.dll!CreateApplicationContext+0x6e3b
mscorwks.dll!CreateApplicationContext+0x6f9a
mscorlib.ni.dll+0x215638
mscorlib.ni.dll+0x2153e6
mscorlib.ni.dll+0x2152ce
mscorwks.dll!CorExitProcess+0x17ac5
mscorwks.dll!CorExitProcess+0x17bd5
mscorwks.dll!CorExitProcess+0x17d3e
rundll32.exe+0x17c6
ntdll.dll!RtlInitializeExceptionChain+0x63
ntdll.dll!RtlInitializeExceptionChain+0x36
Process Explorer is showing some small CPU usage in this thread (and reporting 7 other threads that don't seem to be using any CPU). My guess is that the code checking the lock in the WaitForMultipleObjects call is what's using the minimal CPU time. If you want, I can try to grab a memory dump and upload it to you (somehow).
Just speculation here... My guess is that you're sending the kill signal to the rundll32 child process via the interprocess communication mechanism where it would normally just allow the child process to exit cleanly on its own instead of terminating the child process directly. In this case it's not doing that properly or it's doing it but going away before the child process has a chance to actually read the signal before the mechanism is closed down. This leaves the child then waiting in perpetuity for a signal that will never arrive.
|
|
Rank: Administration Groups: Administration
Joined: 5/27/2007 Posts: 24,229
|
Hi,
Thank you very much for the detailed technical information and suggestions. The IPC should be working fine in this case. The call stack indicates that rundll32.exe is trying to exit (as CorExitProcess has already been called) but for some reason it hangs. This can indeed happen, since after an access violation some internal data must already have been corrupted. There is no way for us to completely avoid access violation errors (for example, if your page uses a huge JavaScript string, it can easily cause a memory allocation failure and thus an access violation error). While a regular process can just exit easily in such cases, our process loads both .NET code and native code. The .NET exit sequence is much more complicated and it can hang when state data is corrupted. We will add additional unmanaged code to monitor the exit sequence and forcefully terminate the process if we detect a hang in that part.
In the meantime, you can try to run your converter in a separate AppDomain. When you are done with the conversion, you can unload that AppDomain. We specifically handle that situation and kill all rundll32.exe instances when your AppDomain is unloaded. However, this logic does not apply to your main AppDomain, so it will work only if you create a second AppDomain.
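The separate-AppDomain suggestion above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not EO's own sample: ConversionWorker and IsolatedConverter are hypothetical names, the actual EO.Pdf call is left as a comment so the sketch compiles without the library, and the code requires the .NET Framework (AppDomain.CreateDomain is not supported on .NET Core and later).

```csharp
using System;

// Hypothetical worker that lives inside the child AppDomain.
// Deriving from MarshalByRefObject lets the main domain call it through a proxy.
public class ConversionWorker : MarshalByRefObject
{
    public string Convert(string source, string target)
    {
        // In the real application the conversion would run here, for example:
        // EO.Pdf.HtmlToPdf.ConvertUrl(source, target);
        // Returning the domain name just shows where the call executed.
        return AppDomain.CurrentDomain.FriendlyName;
    }
}

public static class IsolatedConverter
{
    public static string ConvertInChildDomain(string source, string target)
    {
        // Use the same base directory so the child domain can load this assembly.
        AppDomainSetup setup = new AppDomainSetup
        {
            ApplicationBase = AppDomain.CurrentDomain.BaseDirectory
        };
        AppDomain child = AppDomain.CreateDomain("PdfConversion", null, setup);
        try
        {
            ConversionWorker worker = (ConversionWorker)child.CreateInstanceAndUnwrap(
                typeof(ConversionWorker).Assembly.FullName,
                typeof(ConversionWorker).FullName);
            return worker.Convert(source, target);
        }
        finally
        {
            // Unloading raises DomainUnload inside the child domain, which is
            // the point at which (per the reply above) the library cleans up.
            AppDomain.Unload(child);
        }
    }
}
```

One caveat with this pattern: an exception thrown inside the child domain has to be serializable to cross back into the main domain, so it may be safer to catch it inside the worker and return an error string instead.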
Thanks!
|
|
Rank: Advanced Member Groups: Member
Joined: 7/14/2014 Posts: 40
|
I've implemented each conversion request in a separate AppDomain as suggested using some other posts I found on this forum as a guide. It seems to work but as it stands it tears down the rundll32 instance after each and every conversion. Since you put resources/energy into reusing Rundll32 instances in the first place I assume that there's a non-negligible hit to performance for loading each instance that I'm now paying on every request by using this approach. So far however I haven't seen any further zombie instances of rundll32.exe. I'll test more tomorrow.
As my test application is currently written, I have dedicated long running worker threads pulling conversion requests from a queue so I could probably optimize this better such that AppDomains are only deleted/recreated when there is an exception from the conversion process. This would have been more involved if I had written it such that each conversion request was a separate task as then I would have had to implement my own cache for these child AppDomains.
Out of curiosity, why does the logic that is killing rundll32 instances only run in child AppDomain cases and not in the main AppDomain?
|
|
Rank: Administration Groups: Administration
Joined: 5/27/2007 Posts: 24,229
|
Hi,
You do not need to tear down the app domain on every conversion. You can do a number of conversions (for example, 100), then tear down the app domain and that will clean up everything. So you might have some run away processes for a short period of time but they will be cleaned up after you tear down the app domain.
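The recycling rule described above can be kept separate from the AppDomain plumbing. The sketch below is one hypothetical way to express it: RecyclePolicy is an invented name, the budget of 100 is only the example figure from this reply, and the "recycle on first failure" trigger is borrowed from the approach discussed earlier in the thread rather than from EO's recommendation. The class is not thread-safe, so it assumes one policy object per worker thread.

```csharp
using System;

// Hypothetical policy object: decides when a child AppDomain should be recycled.
public sealed class RecyclePolicy
{
    private readonly int _maxConversions;
    private int _count;
    private bool _faulted;

    public RecyclePolicy(int maxConversions)
    {
        if (maxConversions < 1) throw new ArgumentOutOfRangeException("maxConversions");
        _maxConversions = maxConversions;
    }

    // Call once after every conversion attempt.
    public void Record(bool succeeded)
    {
        _count++;
        if (!succeeded) _faulted = true;
    }

    // True when the domain has used up its budget or has seen a failure.
    public bool ShouldRecycle
    {
        get { return _faulted || _count >= _maxConversions; }
    }

    // Call after unloading the old domain and creating a fresh one.
    public void Reset()
    {
        _count = 0;
        _faulted = false;
    }
}
```

A worker loop would call Record after each conversion, and when ShouldRecycle is true it would unload the child AppDomain, create a new one, and call Reset.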
The reason that the rundll32 cleanup only runs for child AppDomains is that we rely on the DomainUnload event, which is only fired for child domains. The main app domain does not raise this event.
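The difference between the two events can be seen with a small .NET Framework console program. This is only an illustration of the standard AppDomain events, not of EO.PDF internals; the UnloadEvents class and the domain name "Child" are made up for the example.

```csharp
using System;

public static class UnloadEvents
{
    public static void Main()
    {
        // The default (main) domain never raises DomainUnload;
        // ProcessExit is the event it raises when the process shuts down.
        AppDomain.CurrentDomain.ProcessExit += (s, e) =>
            Console.WriteLine("ProcessExit in " + AppDomain.CurrentDomain.FriendlyName);

        AppDomain child = AppDomain.CreateDomain("Child");
        // Run Register inside the child domain so the handler is attached there.
        child.DoCallBack(Register);
        // Unloading the child domain raises its DomainUnload event.
        AppDomain.Unload(child);
    }

    // Executed in the child domain via DoCallBack.
    public static void Register()
    {
        AppDomain.CurrentDomain.DomainUnload += (s, e) =>
            Console.WriteLine("DomainUnload in " + AppDomain.CurrentDomain.FriendlyName);
    }
}
```

Run as a console application, this should report the DomainUnload handler firing for the child domain when it is unloaded, and the ProcessExit handler firing for the main domain as the program ends.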
We have also posted a new build that safeguards the .NET exit portion that seems to be causing the problem for you. Please see your private message for the download location. You can give that build a try and see if it resolves the problem for you.
Thanks!
|
|
Rank: Advanced Member Groups: Member
Joined: 7/14/2014 Posts: 40
|
eo_support wrote: You do not need to tear down the app domain on every conversion. You can do a number of conversions (for example, 100), then tear down the app domain and that will clean up everything. So you might have some run away processes for a short period of time but they will be cleaned up after you tear down the app domain.
Right - it was just the easiest approach to implement as a quick test. Given that these RunDLL32 instances are going zombie after exhausting their 32-bit process space, I really don't want to leave them around for any amount of time, as a series of them would quickly grind the machine to a halt by using up all physical memory and causing massive swapping. After my last post I actually changed my test to work as I was describing: one child AppDomain per worker thread, which it keeps reusing until there's an error response (then it recycles). That resulted in a lot more instances of RunDLL32 though: while each AppDomain only ever had a single request against it at a time, the second request still seemed to spawn another instance of RunDLL32. I think I remember reading in another thread that this is expected behavior, but I can't remember offhand whether there was a way to control/limit that. So likely what I need instead is a shared child AppDomain model which all workers can use but which is recycled after the first error event. That would actually make it easier to implement either the long-running worker/queue model or the short-duration task model.
eo_support wrote: We have also posted a new build that safe guarded the .NET exiting portion that seems to be causing problem for you. Please see your private message for the download location. You can give that build a try and see if it resolves the problem for you.
However now that you have a build for me to try, I guess I won't need the child AppDomain model anymore. Will put this build through testing shortly and let you know. Thanks for the quick turnaround!
|
|
Rank: Administration Groups: Administration
Joined: 5/27/2007 Posts: 24,229
|
CWoods wrote: However now that you have a build for me to try, I guess I won't need the child AppDomain model anymore. Will put this build through testing shortly and let you know. Thanks for the quick turnaround!
You are very welcome. We enjoy working with experienced developers like you. What you were doing makes perfect sense and you might still need that code since we weren't able to reproduce the exact problem, so the change was only made based on the call stack you provided (we did verify the code we put in works with some simulated scenarios).
|
|
Rank: Advanced Member Groups: Member
Joined: 7/14/2014 Posts: 40
|
No joy unfortunately with the new build... I've stripped the test program down to the very basics. Please PM me instructions on how to upload the "BigFile.html" that I'm using as the input.
Code: C#
using System;
using System.Collections.Generic;
using System.Drawing;
using System.IO;
using System.Text;
using System.Threading;
using System.Threading.Tasks;
using EO.Pdf;

// Win7 x64
// Build Info: VS2012, AnyCPU, Release Build, .Net Framework 4.5, [Optional] Uncheck "Prefer 32-bit"
namespace TestApp
{
    class Program
    {
        static DateTime g_start = DateTime.Now;

        // Writes a message prefixed with the elapsed time since startup.
        static void LogMsg(string format, params object[] args)
        {
            Console.WriteLine("[{0:d\\.hh\\:mm\\:ss\\.ffff}] {1}", DateTime.Now - g_start, string.Format(format, args));
        }

        static void Main(string[] args)
        {
            CancellationTokenSource cancel = new CancellationTokenSource();
            Console.CancelKeyPress += delegate(object sender, ConsoleCancelEventArgs e)
            {
                e.Cancel = true;
                cancel.Cancel();
            };

            // CONTRIVED [but potentially possible] Error Test Case
            // Multiple concurrent conversion requests for the same input/output file
            // Each request will return either of the following two errors:
            // a) EO.Pdf.Internal.k0: This session is no longer valid.
            // b) System.Exception: System.AccessViolationException: Attempted to read or write protected memory.
            // The latter error more often than not results in a "zombie" child RunDLL32 that doesn't die off
            // when the process terminates. The former occasionally results in the "zombie".
            Task[] workers = new Task[4];
            string source = @"C:\temp\BigFile.html";
            string target = @"C:\temp\BigFile.pdf";
            for (int i = 0; i < workers.Length; ++i)
            {
                // Copy the loop variable so each task captures its own worker number;
                // capturing i directly lets a task observe a later value of i.
                int workerNum = i;
                LogMsg("Starting Worker #{0}", workerNum);
                workers[i] = Task.Factory.StartNew(() => ConvertFile(source, target, workerNum), cancel.Token);
            }
            LogMsg("Waiting for all workers to complete...");
            Task.WaitAll(workers);
            LogMsg("Done.");
        }

        static void ConvertFile(string source, string target, int workerNum)
        {
            try
            {
                EO.Pdf.HtmlToPdf.Options.AutoFitX = EO.Pdf.HtmlToPdfAutoFitMode.ShrinkToFit;
                EO.Pdf.HtmlToPdf.Options.GeneratePageImages = false;
                EO.Pdf.HtmlToPdf.Options.NoCache = true;
                EO.Pdf.HtmlToPdf.Options.OutputArea = new RectangleF(0.25f, 0.25f, 8.0f, 10.5f);
                EO.Pdf.HtmlToPdf.Options.PageSize = EO.Pdf.PdfPageSizes.Letter;
                LogMsg("Worker {0}: Processing request...", workerNum);
                EO.Pdf.HtmlToPdf.ConvertUrl(source, target);
            }
            catch (Exception ex)
            {
                LogMsg("Worker {0}: Exception caught: {1}", workerNum, ex);
            }
            finally
            {
                LogMsg("Worker {0}: Request Completed.", workerNum);
            }
        }
    }
}
|
|
Rank: Administration Groups: Administration
Joined: 5/27/2007 Posts: 24,229
|
Hi,
Thank you very much for the test code. Unfortunately we have not been able to reproduce the problem here. Can you send us the test file? You can find instructions here: http://www.essentialobjects.com/forum/test_project.aspx
Once we receive the test file, we will do some more testing here to see what we can find. In any case, performing the conversion in a separate AppDomain might be a good idea for you, since you will have absolute control that way. We are also adding code on our end to handle the ProcessExit event in addition to the DomainUnload event, so that the processes will be cleaned up even if you don't start a second AppDomain. However, using a second AppDomain gives you the option to perform additional logic after the child AppDomain has been unloaded, whereas relying on the main AppDomain to clean up automatically does not offer this benefit.
Thanks!
|
|
Rank: Advanced Member Groups: Member
Joined: 7/14/2014 Posts: 40
|
Sent e-mail with test file.
|
|
Rank: Advanced Member Groups: Member
Joined: 7/14/2014 Posts: 40
|
Any update on this?
|
|
Rank: Administration Groups: Administration
Joined: 5/27/2007 Posts: 24,229
|
Hi,
Not yet. We are working on this one and your other issue together. We will post again as soon as we have an update. I apologize for the delay.
Thanks!
|
|
Rank: Advanced Member Groups: Member
Joined: 7/14/2014 Posts: 40
|
No problem! Just making a status inquiry and appreciate the update.
|
|
Rank: Administration Groups: Administration
Joined: 5/27/2007 Posts: 24,229
|
Hi,
We were able to reproduce the zombie process problem with the big file you sent to us. We have posted a new build that should address this issue. In the new build:
1. The child process should always exit when such an error occurs;
2. An event entry will be added to the event viewer with the actual error message (most of the time it's an access violation error);
3. The calling process will throw an exception (most of the time it's a session invalid error);
Please see your private message for the download location.
Thanks!
|
|
Rank: Advanced Member Groups: Member
Joined: 7/14/2014 Posts: 40
|
New EOTotal 110 build (with EO.Pdf 5.0.86.2) does in fact seem to address the zombie issue (at least this particular incarnation).
1) I do see that the child RunDLL32.exe instance is exiting immediately on the error (no linger/idle timeout, no zombie). I do not see any signs of GDI leaks when this happens.
2) I do not see any events in the Application Event Logs nor any information in the trace debugger that contains the actual error message. Might be better to report them as inner exceptions if possible?
3) I am seeing "EO.Pdf.Internal.k0: This session is no longer valid." exceptions now instead of the previous "attempted to read or write protected memory" exceptions.
The GDI leak case does still happen when using additional AppDomains (I'll make a note of this on the GDI leak thread - I only mention it here as you previously stated you were working on both issues together).
Now that the "zombie" case has been resolved does this change at all the recommendations to call the converter from secondary AppDomain instance (completely setting aside the current GDI leaks)? Is there still a valid/good reason to use a secondary AppDomain (i.e. other cases where this could happen)?
|
|
Rank: Administration Groups: Administration
Joined: 5/27/2007 Posts: 24,229
|
Hi,
Yes. We did find the root cause of the GDI leak issue but the fix didn't make it into your build. We will have another build very soon that should fix the GDI leak for you.
The GDI leak indeed only happens with child app domains, but the problem is in our code, not in your code. However, since the child process now exits properly, it is no longer necessary for you to create child app domains. So that may no longer be a serious issue for you.
Thanks!
|
|
Rank: Advanced Member Groups: Member
Joined: 7/14/2014 Posts: 40
|
Quote: The GDI leak indeed only happens with child app domain, but the problem is in our code, not in your code. However since now the child process exits properly, it is no longer necessary for you to create child app domains. So that may no longer be a serious issue for you.
Right - that's why I was asking if your recommendation regarding running from secondary AppDomains had changed. I thought that likely I would no longer need to do this but I wanted to confirm this as you might say "yes - you still should do it for reasons X, Y, & Z...". Thanks for getting this addressed!
|
|
Rank: Administration Groups: Administration
Joined: 5/27/2007 Posts: 24,229
|
You are very welcome. We have posted build 5.0.87 (EO.Total version 2013.0.112), which should also work even if you still use child app domains. But yes, I can confirm that child app domains are no longer needed for your case.
Thanks!
|
|