Rank: Newbie Groups: Member
Joined: 10/10/2015 Posts: 7
My intention is to build a scraping engine on top of EO.WebBrowser, using the off-screen WebView component with ThreadRunner. To implement a pool of scrapers, I need to run multiple WebView controls in parallel. It seems that ThreadRunner is single-threaded and cannot handle more than one WebView at a time, so I am creating a separate ThreadRunner/WebView pair within each spawned thread.
My setup in brief: Windows 10 x64, VS 2015, .NET 4.6, 32-bit console project, EO.WebBrowser 2015.2.85.0 NuGet package.
Here's my sample console app:
Code: C#
using System;
using System.Threading.Tasks;
using EO.WebBrowser;
using static System.Console;
namespace eo_test
{
    internal class Program
    {
        private static void Main(string[] args)
        {
            //using (var globalRunner = new ThreadRunner())
            {
                var urls = new[]
                {
                    "http://httpbin.org/get",
                    "http://google.com/",
                    "http://bing.com/",
                    "http://yahoo.com/",
                    "http://facebook.com/",
                    "http://twitter.com/",
                    "http://cnn.com/",
                    "http://bbc.com/",
                    "http://aol.com/",
                    "http://web.com/"
                };
                var tasks = new Task[urls.Length];
                for (var i = 0; i < tasks.Length; i++)
                {
                    //var tmpRunner = globalRunner;
                    var url = urls[i];
                    tasks[i] = Task.Factory.StartNew(() => runBrowser( /*tmpRunner, */url));
                }
                WriteLine("Waiting for tasks to finish...");
                Task.WaitAll(tasks);
                WriteLine("All done. Press any key to exit...");
                ReadKey();
                WriteLine("Shutting down...");
            }
        }
        private static void croak(string host, string msg)
        {
            // {host,-16} left-aligns the host name in a 16-character column
            WriteLine($"{DateTime.Now.ToString("hh:mm:ss")} - {host,-16} - {msg}");
        }
        private static void runBrowser( /*ThreadRunner globalRunner, */ string url)
        {
            var host = new Uri(url).Host;
            host = host.Substring(0, host.LastIndexOf('.'));
            using (var runner = new ThreadRunner(host))
            {
                using (var view = runner.CreateWebView())
                {
                    var tmpView = view;
                    runner.Send(() =>
                    {
                        croak(host, "Loading....");
                        tmpView.LoadUrlAndWait(url);
                        croak(host, "Webpage loaded");
                        if (tmpView.CanEvalScript)
                        {
                            croak(host, "Title: " + (string) tmpView.EvalScript("document.title"));
                        }
                        return null;
                    });
                    croak(host, "Done.");
                    view.Close(true);
                }
                runner.Stop();
            }
        }
    }
}
Unfortunately, the code above never finishes. Of the given tasks, only the first one or two complete; after that, the application sits idle indefinitely.
Code:
Waiting for tasks to finish...
09:43:20 - aol - Loading....
09:43:20 - google - Loading....
09:43:20 - httpbin - Loading....
09:43:20 - cnn - Loading....
09:43:20 - facebook - Loading....
09:43:20 - bing - Loading....
09:43:20 - bbc - Loading....
09:43:20 - web - Loading....
09:43:20 - twitter - Loading....
09:43:20 - yahoo - Loading....
09:43:22 - httpbin - Webpage loaded
09:43:22 - httpbin - Title:
09:43:22 - httpbin - Done.
09:43:22 - bing - Webpage loaded
09:43:22 - bing - Title: Bing
09:43:22 - bing - Done.
09:43:24 - google - Webpage loaded
09:43:24 - google - Title: Google
09:43:24 - google - Done.
09:43:24 - facebook - Webpage loaded
09:43:24 - facebook - Title: Facebook - Log In or Sign Up
09:43:24 - facebook - Done.
Observations:
1) Even though several parallel threads have been spawned, the browser tasks are executed sequentially, as if in a single thread (evident in the console output).
2) After several minutes of waiting, I noticed in the VS debugger that some threads exit with a status code of 0. I have no idea whether these are CLR/TPL threads or native libCEF threads...
What am I doing wrong here?
Rank: Administration Groups: Administration
Joined: 5/27/2007 Posts: 24,229
Hi,
We have confirmed this to be an issue. Please change the code that creates and destroys the WebView from:
Code: C#
using (var view = runner.CreateWebView())
{
    .....
}
To:
Code: C#
var view = runner.CreateWebView();
try
{
    ....
}
finally
{
    view.Destroy();
}
The reason is that a using statement calls Component.Dispose, which places a lock on the WebView object. This undesired side effect causes the deadlock. The change above removes the deadlock but will still cause other issues; notably, you will get a "Native window not destroyed" error when your program exits. We are working on both issues (1. Component.Dispose places a lock on the WebView, 2. Native window not destroyed) and will post a new build as soon as possible.
Thanks!
Rank: Administration Groups: Administration
Joined: 5/27/2007 Posts: 24,229
|
Hi,
This is just to let you know that we have posted a new build that should address both problems; your original code should work fine with the new build. You can download the new build from our download page.
Thanks!
Rank: Newbie Groups: Member
Joined: 10/10/2015 Posts: 7
Whoa! That was fast! Now the application completes all the tasks:
Code:
Waiting for tasks to finish...
01:43:12 - aol - Loading....
01:43:12 - bbc - Loading....
01:43:12 - yahoo - Loading....
01:43:12 - cnn - Loading....
01:43:12 - web - Loading....
01:43:12 - twitter - Loading....
01:43:12 - facebook - Loading....
01:43:12 - httpbin - Loading....
01:43:12 - bing - Loading....
01:43:12 - google - Loading....
01:43:22 - httpbin - Webpage loaded
01:43:22 - httpbin - Title:
01:43:22 - httpbin - Done.
01:43:23 - bing - Webpage loaded
01:43:23 - bing - Title: Bing
01:43:23 - bing - Done.
01:43:35 - google - Webpage loaded
01:43:35 - google - Title: Google
01:43:35 - google - Done.
01:43:49 - facebook - Webpage loaded
01:43:49 - facebook - Title: Facebook - Log In or Sign Up
01:43:49 - facebook - Done.
01:43:53 - web - Webpage loaded
01:43:53 - web - Title: Website Builder | Web.com
01:43:53 - web - Done.
01:43:54 - twitter - Webpage loaded
01:43:54 - twitter - Title: Welcome to Twitter - Login or Sign up
01:43:54 - twitter - Done.
01:43:59 - yahoo - Webpage loaded
01:43:59 - yahoo - Title: Yahoo
01:43:59 - yahoo - Done.
01:44:03 - cnn - Webpage loaded
01:44:03 - cnn - Title: Breaking News, U.S., World, Weather, Entertainment & Video News - CNN.com
01:44:03 - cnn - Done.
01:44:05 - bbc - Webpage loaded
01:44:05 - bbc - Title: BBC - Homepage
01:44:05 - bbc - Done.
01:45:09 - aol - Webpage loaded
01:45:09 - aol - Title: AOL - News, Sports, Weather, Entertainment, Local & Lifestyle
01:45:09 - aol - Done.
All done. Press any key to exit...
Thanks! Version 15.2.88.0 seems to have eliminated the race condition.
Which brings me to the next, albeit harder, problem. The output above indicates the existence of a global lock in the EO.WebBrowser codebase, which causes the browser tasks to be performed sequentially, even though each of them has its own ThreadRunner instance. I want to execute multiple off-screen WebView tasks in parallel. I've modified my test code, but this doesn't fix the parallelism problem:
Code: C#
private static void runBrowser( /*ThreadRunner globalRunner, */ string url)
{
    var host = new Uri(url).Host;
    host = host.Substring(0, host.LastIndexOf('.'));
    using (var runner = new ThreadRunner(host))
    {
        using (var view = runner.CreateWebView())
        {
            var tmpView = view;
            runner.Post(() =>
            {
                croak(host, "Loading....");
                var task = tmpView.LoadUrl(url);
                task.OnDone(() =>
                {
                    croak(host, "Title: " + (string) tmpView.EvalScript("document.title"));
                    Thread.Sleep(50); // relinquish timeslice to other threads?
                    croak(host, "Cookie: " + (string) tmpView.EvalScript("document.cookie"));
                });
                task.WaitOne();
                croak(host, "Done.");
                tmpView.Close(true);
            });
        }
        // TODO: Wait for runner to finish??
        runner.Stop();
    }
}
How can I run multiple concurrent WebView tasks in a threadpool?
Rank: Newbie Groups: Member
Joined: 10/10/2015 Posts: 7
Okay, I have come up with a yet more convoluted experiment in my quest to achieve parallelism with ThreadRunner+WebView. Here's the new stress-testing code:
Code: C#
using System;
using System.Threading;
using System.Threading.Tasks;
using EO.WebBrowser;
using static System.Console;
namespace eo_test
{
    internal class Program
    {
        private static void Main(string[] args)
        {
            //using (var globalRunner = new ThreadRunner())
            {
                var urls = new[]
                {
                    "http://httpbin.org/get",
                    "http://google.com/",
                    "http://bing.com/",
                    "http://yahoo.com/",
                    "http://facebook.com/",
                    "http://twitter.com/",
                    "http://cnn.com/",
                    "http://bbc.com/",
                    "http://aol.com/",
                    "http://web.com/"
                };
                var motherTasks = new Task[urls.Length];
                for (var i = 0; i < motherTasks.Length; i++)
                {
                    //var tmpRunner = globalRunner;
                    var url = urls[i];
                    motherTasks[i] = Task.Factory.StartNew(() => runBrowser( /*tmpRunner, */url));
                }
                WriteLine("Waiting for tasks to finish...");
                Task.WaitAll(motherTasks);
                WriteLine("All done. Press any key to exit...");
                ReadKey();
                WriteLine("Shutting down...");
            }
        }
        private static void croak(string id, string msg)
        {
            WriteLine($"{DateTime.Now.ToString("hh:mm:ss")} - {id.PadRight(12)} - {msg}");
        }
        private static void runBrowser( /*ThreadRunner globalRunner, */ string url)
        {
            var host = new Uri(url).Host;
            host = host.Substring(0, host.LastIndexOf('.'));
            using (var runner = new ThreadRunner(host))
            {
                const int NUM_MINIONS = 4;
                var minions = new Task[NUM_MINIONS];
                var tmpRunner = runner;
                for (var i = 0; i < NUM_MINIONS; i++)
                {
                    var id = host + "-" + i;
                    minions[i] = Task.Run(() => runWebView(tmpRunner, url, id));
                }
                Task.WaitAll(minions);
                // TODO: Wait for runner to finish??
                runner.Stop();
            }
        }
        private static void runWebView(ThreadRunner runner, string url, string taskId)
        {
            using (var view = runner.CreateWebView())
            {
                var tmpView = view;
                runner.Post(() =>
                {
                    croak(taskId, "0> Loading....");
                    var urlTask = tmpView.LoadUrl(url);
                    urlTask.OnDone(() =>
                    {
                        // simulate a series of javascript DOM activities
                        croak(taskId, "1> Title: " + (string) tmpView.EvalScript("document.title"));
                        Thread.Sleep(0); // relinquish timeslice to other threads?
                        var cookie = (string) tmpView.EvalScript("document.cookie");
                        croak(taskId, "2> Cookie: " + cookie.Substring(0, 16));
                        Thread.Sleep(0);
                        var jsTask = Task.Factory.StartNew(() =>
                        {
                            var code = @"var e = document.createElement('div');
                                e.id = 'my-test-id';
                                document.body.appendChild(e);
                                document.getElementById('my-test-id').id;";
                            croak(taskId, "3> Custom JS: " + (string) tmpView.EvalScript(code));
                        });
                        jsTask.Wait();
                    });
                    urlTask.WaitOne();
                    croak(taskId, "4> Done.");
                    tmpView.Close(true);
                });
            }
        }
    }
}
In the new code, each ThreadRunner instance spawns multiple sub-tasks.
Observations:
* The application is still executing in sequential mode. In the runWebView() method, steps "1>" through "4>" are performed in a batch. I have deliberately sprinkled Thread.Sleep() calls between the steps, which should yield the CPU to other concurrent threads, but this doesn't achieve the desired effect.
* If the number of sub-tasks (see the NUM_MINIONS constant in the runBrowser() method) is reasonably low, the application completes normally. On my Windows 10 x64, 10 GB RAM, .NET 4.6 machine, the safe NUM_MINIONS value seems to be 6 or less.
* A NUM_MINIONS value greater than 6 causes the application to hang after completing the first batch of tasks (the runWebView() method executes successfully once for each ThreadRunner instance; any subsequent calls hang).
* If I set NUM_MINIONS to 15 or higher (again, this is on my machine; YMMV), one of two things happens:
- the "Channel disconnected" crash dialog appears, or
- the application hangs after completing the first batch of operations.
Rank: Newbie Groups: Member
Joined: 10/10/2015 Posts: 7
Apparently, some runWebView() tasks are failing silently. Slightly improvised code:
Code: C#
using System;
using System.Threading;
using System.Threading.Tasks;
using EO.WebBrowser;
using static System.Console;
namespace eo_test
{
    internal class Program
    {
        private const int NUM_MINIONS = 4;
        private static int completionCount;
        private static void Main(string[] args)
        {
            //using (var globalRunner = new ThreadRunner())
            {
                var urls = new[]
                {
                    "http://alexa.com/",
                    "http://google.com/",
                    "http://bing.com/",
                    "http://yahoo.com/",
                    "http://facebook.com/",
                    "http://twitter.com/",
                    "http://cnn.com/",
                    "http://bbc.com/",
                    "http://aol.com/",
                    "http://web.com/"
                };
                var motherTasks = new Task[urls.Length];
                for (var i = 0; i < motherTasks.Length; i++)
                {
                    //var tmpRunner = globalRunner;
                    var url = urls[i];
                    motherTasks[i] = Task.Factory.StartNew(() => runBrowser( /*tmpRunner, */url));
                }
                WriteLine("Waiting for tasks to finish...");
                Task.WaitAll(motherTasks);
                WriteLine("All done. Press any key to exit...");
                WriteLine($"Total tasks: {NUM_MINIONS*motherTasks.Length}");
                WriteLine($"Completed tasks: {completionCount}");
                ReadKey();
                WriteLine("Shutting down...");
            }
        }
        private static void croak(string id, string msg)
        {
            WriteLine($"{DateTime.Now.ToString("hh:mm:ss")} - {id.PadRight(12)} - {msg}");
        }
        private static void runBrowser( /*ThreadRunner globalRunner, */ string url)
        {
            var host = new Uri(url).Host;
            host = host.Substring(0, host.LastIndexOf('.'));
            using (var runner = new ThreadRunner(host))
            {
                var minions = new Task[NUM_MINIONS];
                var tmpRunner = runner;
                for (var i = 0; i < NUM_MINIONS; i++)
                {
                    var id = host + "-" + i;
                    minions[i] = Task.Run(() => runWebView(tmpRunner, url, id));
                }
                Task.WaitAll(minions);
                // TODO: Wait for runner to finish??
                runner.Stop();
            }
        }
        private static void runWebView(ThreadRunner runner, string url, string taskId)
        {
            using (var view = runner.CreateWebView())
            {
                var tmpView = view;
                runner.Post(() =>
                {
                    croak(taskId, "0> Loading....");
                    var urlTask = tmpView.LoadUrl(url);
                    urlTask.OnDone(() =>
                    {
                        while (!tmpView.CanEvalScript)
                        {
                            Thread.Sleep(500);
                        }
                        // simulate a series of javascript DOM activities
                        croak(taskId, "1> Title: " + (string) tmpView.EvalScript("document.title"));
                        Thread.Sleep(0); // relinquish timeslice to other threads?
                        var cookie = (string) tmpView.EvalScript("document.cookie");
                        croak(taskId, "2> Cookie: " + cookie.Substring(0, 16));
                        Thread.Sleep(0);
                        var jsTask = Task.Factory.StartNew(() =>
                        {
                            var code = @"var e = document.createElement('div');
                                e.id = 'my-test-id';
                                document.body.appendChild(e);
                                document.getElementById('my-test-id').id;";
                            croak(taskId, "3> Custom JS: " + (string) tmpView.EvalScript(code));
                        });
                        jsTask.Wait();
                    });
                    urlTask.WaitOne();
                    croak(taskId, "4> Done!");
                    Interlocked.Increment(ref completionCount);
                    tmpView.Close(true);
                });
            }
        }
    }
}
Note: the "httpbin.org" URL was removed since it didn't return a valid HTML document.
Output:
Code:
All done. Press any key to exit...
Total tasks: 40
Completed tasks: 38
facebook-1 and facebook-2 are failing for some reason. How do I enable WebView debug logging? The complete output log is here.
Rank: Administration Groups: Administration
Joined: 5/27/2007 Posts: 24,229
|
Hi,
This is just to let you know that we are still working on this issue. We will reply again when we have an update.
Thanks!
Rank: Administration Groups: Administration
Joined: 5/27/2007 Posts: 24,229
Hi,
This is just to let you know that we have posted a new build that addresses the deadlock issue. You can download the new build from our download page. As to the other issues:
1. The "Channel disconnected" issue. This can happen when the system runs out of memory. In that case the child process crashes and you receive the "Channel disconnected" error. This is simply a system overload, and you should lower the number of parallel tasks you are trying to execute;
2. The "sequence" issue. This is due to the fact that we use a single internal thread to drive notification messages. This, together with the fact that the time between load start and load complete (between steps 0 and 1) is much longer than the total time spent on the other steps after the load completes (steps 1, 2, 3, 4), creates the perception that tasks are running in sequence. That is not the case: even though the notification messages are driven by one thread, the actual work is still done by multiple threads (in fact, by multiple processes). You can confirm this by observing that a task that loads a simple page (such as Google or Bing) completes its iteration much sooner than a task that loads a more complicated page (such as Yahoo or CNN);
3. The "silent failing" issue. This is due to the following line in your code:
Code: C#
croak(taskId, "2> Cookie: " + cookie.Substring(0, 16));
We have observed that the cookie is sometimes shorter than 16 characters; in that case, the above line throws an exception, so the task never reaches its end. Hope this addresses all the issues. Please feel free to let us know if you still have any more issues.
Thanks!
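For illustration only, here is a minimal sketch of one way to avoid that exception. The helper name Truncate and the class CookiePreview are hypothetical and are not part of EO.WebBrowser or of the original test program; the idea is simply to clamp the requested length to the actual string length before taking the prefix.

```csharp
using System;

internal static class CookiePreview
{
    // Hypothetical helper: returns at most maxLength leading characters of value.
    // Unlike value.Substring(0, maxLength), it does not throw when value is
    // shorter than maxLength, and it treats null as an empty string.
    public static string Truncate(string value, int maxLength)
    {
        if (maxLength < 0) throw new ArgumentOutOfRangeException(nameof(maxLength));
        if (string.IsNullOrEmpty(value)) return string.Empty;
        return value.Length <= maxLength ? value : value.Substring(0, maxLength);
    }

    private static void Main()
    {
        Console.WriteLine(Truncate("sid=abc", 16));                    // prints "sid=abc"
        Console.WriteLine(Truncate("sessionid=0123456789abcdef", 16)); // prints "sessionid=012345"
        Console.WriteLine(Truncate(null, 16).Length);                  // prints 0
    }
}
```

In the test program, the failing line could then be written as croak(taskId, "2> Cookie: " + CookiePreview.Truncate(cookie, 16)), assuming the helper is added to the project.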
Rank: Administration Groups: Administration
Joined: 5/27/2007 Posts: 24,229
Hi,
Our additional tests revealed two issues with your test code that can cause other problems:
1. You should use runner.Send instead of runner.Post. If you use runner.Post, the WebView can be destroyed (due to the using statement) before the delegate passed to Post finishes;
2. You cannot use jsTask.Wait in your code. You can just remove jsTask altogether and call tmpView.EvalScript directly. This is because jsTask.Wait is called here by the ThreadRunner's thread. The ThreadRunner must never stop pumping messages; otherwise a deadlock may occur. All waiting functions provided by the WebView pump messages internally. For example, EvalScript waits until it's done and pumps messages while it is waiting, so it's not necessary to use Task.Factory.StartNew to create a separate task for it;
Once you make those two changes in your code, your sample should run without any problem.
Thanks!
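As a rough sketch of what the two suggested changes might look like when applied to the runWebView() method from the last test program: this is an untested drop-in replacement for that method only, and it relies on the croak() helper, the completionCount field and the using directives of that program. It uses only calls that already appear in this thread (runner.Send with a delegate that returns null, LoadUrlAndWait, CanEvalScript, EvalScript, Close); the assumption that Send blocks until the delegate has finished is inferred from the explanation above, and the cookie length clamp is an addition to avoid the Substring exception.

```csharp
private static void runWebView(ThreadRunner runner, string url, string taskId)
{
    using (var view = runner.CreateWebView())
    {
        var tmpView = view;
        // Change 1: Send instead of Post, so the using block is assumed not to
        // dispose the WebView before the delegate has finished.
        runner.Send(() =>
        {
            croak(taskId, "0> Loading....");
            tmpView.LoadUrlAndWait(url);
            if (tmpView.CanEvalScript)
            {
                croak(taskId, "1> Title: " + (string) tmpView.EvalScript("document.title"));
                // Clamp the preview length so a short cookie does not throw.
                var cookie = (string) tmpView.EvalScript("document.cookie") ?? "";
                croak(taskId, "2> Cookie: " + cookie.Substring(0, Math.Min(16, cookie.Length)));
                var code = @"var e = document.createElement('div');
                    e.id = 'my-test-id';
                    document.body.appendChild(e);
                    document.getElementById('my-test-id').id;";
                // Change 2: call EvalScript directly on the runner's thread;
                // no separate jsTask and no jsTask.Wait().
                croak(taskId, "3> Custom JS: " + (string) tmpView.EvalScript(code));
            }
            croak(taskId, "4> Done!");
            Interlocked.Increment(ref completionCount);
            return null;
        });
        // Mirrors the first sample in this thread, which closes the view after Send returns.
        view.Close(true);
    }
}
```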