SiteParser: Force a specific charset ??

doskabouter · November 13, 2010

ScRePt said:
Couldn't you add an option to the generic parser, like "forceUTF8Encoding", so that the matcher behaves the described way?
Implementing a duplicate of the generic parser just to change 10 lines (one more loop for parse, categories, subcategories) could lead to future incompatibilities ...

Not needed, I just figured it out:
put
(?<!class=2.*)(?<=class=1.*) before your regex, and it will only match between "class=1" and "class=2"
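A minimal sketch of how that prefix behaves, written in TypeScript because the JavaScript regex engine also allows variable-length lookbehind, as .NET does. The sample HTML, the link pattern, and the function name linksBetweenMarkers are made up for illustration. The "s" flag (the counterpart of RegexOptions.Singleline) is an assumption about how the pattern gets compiled; without it, "." stops at line breaks and the lookbehinds only see the current line.

```typescript
// Hypothetical page: only the links after "class=1" and before "class=2" are wanted.
const sampleHtml = [
  '<a href="/skip">before</a>',
  '<div class=1>',
  '<a href="/cat/a">A</a>',
  '<a href="/cat/b">B</a>',
  '</div>',
  '<div class=2>',
  '<a href="/other">after</a>',
  '</div>',
].join("\n");

function linksBetweenMarkers(html: string): string[] {
  // (?<!class=2.*)  no "class=2" anywhere before the match position
  // (?<=class=1.*)  a "class=1" somewhere before the match position
  // The rest is the "real" regex, here a simple (illustrative) link pattern.
  const pattern = new RegExp(
    '(?<!class=2.*)(?<=class=1.*)<a href="([^"]+)">',
    "gs" // g: find all matches, s: let "." cross line breaks (assumed)
  );
  const links: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(html)) !== null) {
    links.push(m[1]);
  }
  return links;
}

// Only the two /cat/ links are returned; /skip and /other are rejected.
console.log(linksBetweenMarkers(sampleHtml));
```

Note that the negative lookbehind rejects everything after the first "class=2", so this only works when the wanted block comes before that marker.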

ScRePt · November 13, 2010

Oh, you are good!!! It worked.
I bumped into another problem: the site seems to load its content dynamically. As a result, the DOM I am trying to parse is not the same as the "live" DOM I see in the browser. Did you ever run into this problem? How did you solve it?

Example link

(look for addPrototypeElement)

doskabouter · November 16, 2010

You could use an HTTP sniffer like Fiddler2 to figure out which HTML documents are loaded. One of those should contain the video.

ScRePt · November 17, 2010

Even if I sniff the video source, there is no way to parse the categories for the videos since the final DOM is not built and is not available for parsing. I am wondering if this was ever an issue for other sites.

offbyone · November 17, 2010

If the final page is built from secondary requests (e.g. using ajax), and the data coming from those requests is what you need in the first place, you should use the URLs of those requests with your regex, as they contain the data you need. If you need data from the primary HTML to make the ajax request, it's time to build your own util in C# that does so. This can in no way be solved with one generic util without making it overly complex.
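A rough sketch of the second case (primary HTML first, then the ajax request), in TypeScript rather than C#, just to show the flow a custom util would follow. Everything site-specific here is hypothetical: the loadContent("...") call, the category markup, the URLs, and the names parseCategories and Fetcher. The fetcher is passed in so the sketch runs without any network access.

```typescript
// A fetcher maps a URL to a response body. Injected so the sketch runs offline;
// a real util would do an HTTP GET here.
type Fetcher = (url: string) => string;

function parseCategories(primaryHtml: string, fetchBody: Fetcher): string[] {
  // Step 1: pull the secondary request URL out of the primary HTML.
  // The loadContent("...") call is a made-up stand-in for whatever the site uses.
  const urlMatch = /loadContent\("([^"]+)"\)/.exec(primaryHtml);
  if (urlMatch === null) {
    return []; // nothing to follow
  }

  // Step 2: request the secondary document, which carries the actual data.
  const body = fetchBody(urlMatch[1]);

  // Step 3: run the category regex against the secondary response,
  // not against the primary HTML.
  const categoryPattern = /<li class="category">([^<]+)<\/li>/g;
  const names: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = categoryPattern.exec(body)) !== null) {
    names.push(m[1]);
  }
  return names;
}

// Offline usage with fake pages (hypothetical URLs and markup).
const fakePages: Record<string, string> = {
  "/ajax/categories?id=5":
    '<ul><li class="category">News</li><li class="category">Sports</li></ul>',
};
const primary = '<script>loadContent("/ajax/categories?id=5")</script>';
const fakeFetch: Fetcher = (url) => fakePages[url] ?? "";

console.log(parseCategories(primary, fakeFetch));
```

In the first case from the post, steps 1 and 2 fall away: the secondary URL found with the sniffer is used directly as the page the regex runs against.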
