Getting Data from the Web

James · November 6, 2006

Hi to the MP development community,

(this includes all plugin and addon developers too)

I have just rebuilt WebEPG from the ground up, including a new HTML parsing engine. I have tried to keep this engine as general as possible so that it can be reused by any other plugins, etc. that need to get data from HTML web pages.

It is template based, so that site-specific information can be stored in config files and new sites can be added without the need to recompile.

The interface is very simple:

The constructor takes a template, the Type (typeof) of an IParserData class, and any arguments for that IParserData class.

Code:

    HtmlParser(HtmlParserTemplate template, Type dataType, params object[] dataArgs);

The main method parses a URL and returns how many times the template occurs on the page:

Code:

    int ParseUrl(HTTPRequest site);

The data for a given index is then returned by this method:

Code:

    IParserData GetData(int index);
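Since the constructor receives a Type plus constructor arguments rather than a ready-made object, the engine presumably creates one data object per template occurrence. Below is a minimal sketch of how that could be done with reflection. It is not the WebEPG source: the IParserData interface shown here is a stand-in that contains only the SetElement method described in this thread, and ChannelData and ParserDataFactory are hypothetical names used for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the engine's IParserData interface; only
// SetElement(string, string) is described in this thread.
public interface IParserData
{
    void SetElement(string tag, string value);
}

// Hypothetical data class that takes one constructor argument.
public class ChannelData : IParserData
{
    public string Channel { get; }
    public Dictionary<string, string> Elements { get; } = new Dictionary<string, string>();

    public ChannelData(string channel) { Channel = channel; }

    public void SetElement(string tag, string value) { Elements[tag] = value; }
}

public static class ParserDataFactory
{
    // Creates a data object from a Type plus constructor arguments,
    // mirroring the (dataType, dataArgs) parameters of the constructor.
    public static IParserData Create(Type dataType, params object[] dataArgs)
    {
        if (!typeof(IParserData).IsAssignableFrom(dataType))
            throw new ArgumentException("dataType must implement IParserData", nameof(dataType));

        return (IParserData)Activator.CreateInstance(dataType, dataArgs);
    }
}
```

Calling ParserDataFactory.Create(typeof(ChannelData), "BBC One") would then build a fresh ChannelData for each occurrence, which is one reason to pass a Type instead of a single shared instance.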

I will add more detailed documentation to the wiki in a few days.

I'm very interested to know whether this is useful, and in any suggestions for improvement.

/James

James · November 7, 2006

To try the parser engine out, you can use the util I created for WebEPG. Download it from this thread:

https://forum.team-mediaportal.com/showthread.php?p=89426#post89426

/James

patrick · November 8, 2006

James said:
To try the parser engine out, you can use the util I created for WebEPG. Download it from this thread:

https://forum.team-mediaportal.com/showthread.php?p=89426#post89426

/James

Hi,

Could you give example entries for all the fields?

I downloaded it but cannot figure out how to use it.
I put the URL in the "Get" textbox and clicked Load; it loaded the page source into the source textbox, but that is about as far as I get.
I am guessing Start and End are where to put HTML code from the beginning and end of the area of interest, but I have no idea what I should put in the Template tags and text before clicking Parse.

Thanks,
patrick

James · November 8, 2006

Hi patrick,

Thanks for your interest. I do plan to add detailed info to the wiki, but until I find some time I will give you the details here.

The start and end are simply search strings used to quickly filter out large parts of the HTML source, as sometimes the template is not totally unique on the page. They can also help speed up parsing on large pages.
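As a rough illustration of that filtering step, here is a minimal sketch that crops the page source to the region between the two search strings. It is not the engine's code: SourceFilter and Crop are hypothetical names, and the choices to keep the search strings in the result and to ignore case are assumptions.

```csharp
using System;

public static class SourceFilter
{
    // Returns the part of html from the first occurrence of start up to and
    // including the next occurrence of end. An empty search string, or one
    // that is not found, leaves that side of the source uncropped.
    public static string Crop(string html, string start, string end)
    {
        int from = 0;
        int searchFrom = 0;
        if (!string.IsNullOrEmpty(start))
        {
            int i = html.IndexOf(start, StringComparison.OrdinalIgnoreCase);
            if (i >= 0)
            {
                from = i;
                searchFrom = i + start.Length; // look for end after start
            }
        }

        int to = html.Length;
        if (!string.IsNullOrEmpty(end))
        {
            int j = html.IndexOf(end, searchFrom, StringComparison.OrdinalIgnoreCase);
            if (j >= 0) to = j + end.Length;
        }

        return html.Substring(from, to - from);
    }
}
```

The template then only has to be unique within the cropped region, not within the whole page.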

OK, the template is basically HTML tags and parser tags.

All HTML tags are supported, including comments.

Parser tags are ones I invented for marking the places where the interesting data is. I have 3 at the moment: <#xxxx>, <*xxxx> and <Zxx>.

In this post I will just cover the major one, <#xxx>.

A simple template would look like this:

Code:

<tr>
<td><#START></td>
<td><#TITLE></td>
<td>
</tr>

The parser searches for this pattern in the HTML source and reports the number of times it finds it.

When you ask it to parse a certain occurrence, it will get the text from the HTML source located where the <#START> and <#TITLE> tags are, and pass this into an IParserData object using the SetElement(string tag, string value) method.

In this case tag = "#START" or "#TITLE", and value will be the text located in the HTML source at this location. Characters can be put in front of and behind the <#> tags to remove part of the text.

So "-<#START>." will search for the '-' and '.' and pass what is between them as the value string into the SetElement method. To use more than one character in front and behind, you need to use the syntax <#TAGNAME:front,back>, where front and back are search strings (either can be empty). If no search strings/characters are given, then it will go to the next tag. Of course, extra parsing can be done in the IParserData object; you just need to create a new class that implements this interface.
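The front/back behaviour can be pictured with a small helper that cuts a value out of the raw element text. This is only a sketch of the idea, not the parser's implementation: ElementText and Between are hypothetical names, and treating a search string that is not found as "no cropping on that side" is an assumption.

```csharp
using System;

public static class ElementText
{
    // Returns the text after the first occurrence of front and before the
    // next occurrence of back. An empty search string, or one that is not
    // found, leaves that side of the text untouched.
    public static string Between(string text, string front, string back)
    {
        int from = 0;
        if (!string.IsNullOrEmpty(front))
        {
            int i = text.IndexOf(front, StringComparison.Ordinal);
            if (i >= 0) from = i + front.Length;
        }

        int to = text.Length;
        if (!string.IsNullOrEmpty(back))
        {
            int j = text.IndexOf(back, from, StringComparison.Ordinal);
            if (j >= 0) to = j;
        }

        return text.Substring(from, to - from);
    }
}
```

With this helper, the element text "Starts -20:15. Daily" and the search strings "-" and "." would give "20:15", which corresponds to writing "-<#START>." in the template.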

The tag names can be anything; they just need to be in your template, and the IParserData class must know what to do with them. I have made a very simple ParserData class which just stores each tag/value pair in a Dictionary, so the values can be retrieved by tag name later. It will accept any tag and value pair.
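A dictionary-backed data class along those lines might look roughly like this. It is a sketch, not the actual ParserData class: the interface is a stand-in containing only the SetElement method mentioned above, and DictionaryParserData and GetElement are assumed names.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for the engine's interface; only SetElement is
// described in this thread.
public interface IParserData
{
    void SetElement(string tag, string value);
}

public class DictionaryParserData : IParserData
{
    private readonly Dictionary<string, string> _elements = new Dictionary<string, string>();

    // Called by the parser for every <#TAG> found in the template.
    public void SetElement(string tag, string value)
    {
        _elements[tag] = value;
    }

    // Returns the stored value, or an empty string if the tag was never set.
    public string GetElement(string tag)
    {
        string value;
        return _elements.TryGetValue(tag, out value) ? value : string.Empty;
    }
}
```

Because it accepts any tag name, a class like this suits quick experiments; a site-specific class can do extra parsing per tag instead.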

The WebEPG IParserData class however looks like this:

Code:

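switch (tag)
{
  case "#START":
    // Parse the start time from the matched text.
    BasicTime startTime = GetTime(element);
    break;
  case "#TITLE":
    // Strip surrounding spaces, newlines and tabs from the title.
    _title = element.Trim(' ', '\n', '\t');
    break;
...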

It does extra parsing of the element values, for example trimming the spaces and other junk or parsing the time values from strings.

The Tags variable tells the parser which HTML tags are interesting; all other tags will be ignored. Each character in it is the first character of an HTML tag name.

So
"T" = all table tags
"I" = img
"D" = div
"!" = comment
etc.

I treat all table tags as one group. Multiple tags can of course be given, e.g. "TSD" (table, span, div).

So in this example I would use "T", as all the tags are table tags (i.e. starting with the letter T). This means that the real HTML source could have other tags in it, but the parser would still match it because it would just ignore those tags.
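To make the filtering concrete, here is a minimal sketch that keeps only the tags whose name starts with one of the characters in the Tags string. It is an illustration, not the engine's tokenizer: TagFilter and InterestingTags are hypothetical names, the regex is a simplification that stops at the first '>' (so comments or attribute values containing '>' are cut short), and taking the first letter literally means "T" would also keep tags such as <title>, which the real engine may treat differently.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagFilter
{
    // Matches an opening tag, a closing tag or a comment, and captures the
    // first character of its name ("!" for comments).
    private static readonly Regex TagPattern = new Regex(@"<\s*/?\s*([A-Za-z!])[^>]*>");

    // Returns the tags from html whose first character appears in tags,
    // compared without regard to case.
    public static List<string> InterestingTags(string html, string tags)
    {
        var result = new List<string>();
        foreach (Match m in TagPattern.Matches(html))
        {
            string first = m.Groups[1].Value;
            if (tags.IndexOf(first, StringComparison.OrdinalIgnoreCase) >= 0)
                result.Add(m.Value);
        }
        return result;
    }
}
```

For a row like <tr><td><b>News</b></td></tr>, filtering with "T" would keep the tr and td tags and drop the b tags, so the same template still matches if the site later adds or removes bold markup.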

The general rule is to use as few tags as required to make the template unique to the data. Using too many tags can mean that small changes to the page require template changes. Tags that define structure, such as table tags, are good choices because the structure doesn't often change.

I hope this helps. I will try to get a start on the wiki documentation soon.

/James

patrick · November 14, 2006

James,

Thanks for the explanation!!!

Taken me a little while to get back to this.

I think I am getting the test app to work a little better now.

Though I have not been able to get the Start and End to have any effect.

Question: say a single site has two different layouts (as far as parsing goes) for the same type of data. Is there an all-or-nothing test, so you would know to try another/next template if one is available? Or does the template just need enough specifics so that no matches would be found?

Also, I just ran into one question while trying the test app: are there any case-sensitivity restrictions?

Noticed that when I have in the TEXT textbox:

Code:

<P ALIGN=CENTER>
<A HREF=<#URL>><#TEXT></A>
</P>

With the HTML Source:

Code:

<P ALIGN=CENTER><B><A HREF="http://www.time.gov/">Time of Day</A></B></P>

<P ALIGN=CENTER><B><a href="http://www.time.gov/">Time of Day</a></B></P>

The #URL in the first case returned:

Code:

"http://www.time.gov/"

But in the 2nd case #URL returned:

Code:

<a href="http://www.time.gov/"

In both cases the correct #TEXT was returned.

Thanks again,
patrick

James · November 14, 2006

Hi Patrick,

patrick said:
Though I have not been able to get the Start and End to have any effect.

Question: say a single site has two different layouts (as far as parsing goes) for the same type of data. Is there an all-or-nothing test, so you would know to try another/next template if one is available? Or does the template just need enough specifics so that no matches would be found?

The start and end need to be in place before you load the html source from the site. The filtering in the test app is done on the loading of the page.

That depends on how big the differences are. I have support for variable templates with optional tags/sections.

However, you could also parse the page twice, once with each template.
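A minimal sketch of that "try each template in turn" idea follows. It is independent of the real HtmlParser class: TemplateFallback and FirstMatching are hypothetical names, and countMatches is an assumed callback standing in for "parse the page with this template and return the occurrence count".

```csharp
using System;
using System.Collections.Generic;

public static class TemplateFallback
{
    // Returns the index of the first template that occurs at least once on
    // the page, or -1 if none of them match.
    public static int FirstMatching(IList<string> templates, Func<string, int> countMatches)
    {
        for (int i = 0; i < templates.Count; i++)
        {
            if (countMatches(templates[i]) > 0)
                return i;
        }
        return -1;
    }
}
```

This only works if each template is specific enough to return zero occurrences on the layout it was not written for.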

I will need to look into this problem with #URL (do you have a URL you were testing with?). There is also a GetHyperLink method in HtmlParser, which I use instead of a template. There is also a SearchRegex method for searching with regex. Both of these methods operate on all the tags/text inside the template area (between the first and last tags of the template).

Cheers,

/James

James · November 14, 2006

Templates explained (Part 2).

In this post I will cover the <*xxxx> and <Zxx> tags.

Starting with the <Zxx> tag.

This tag is used to make a template for a variable structure and to deal with optional information. Some websites add extra information by changing the HTML structure (e.g. adding extra table rows).

With this tag, regex code can be used.

A <z> tag must also have an end tag </z>. Together these mark the start and end of the area which is considered optional.

Example:

Code:

<tr>
<td><#START></td> 
<td><#TITLE></td> 
<z(><td><#OPTIONAL></td></z)?>
</tr>

In this example the simple regex ( )? is used to indicate that this part is optional.

In regex, ? is the same as {0,1}: the preceding group occurs 0 or 1 times. At the moment the system has problems with any number greater than 1, as it causes an imbalance between the template and the source. (If really required, I can look into fixing this.)
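The equivalence of ( )? and ( ){0,1} can be checked with plain .NET regular expressions. The sketch below uses System.Text.RegularExpressions directly on a table row string; it does not go through the template engine, and OptionalGroupDemo and its patterns are made up for illustration.

```csharp
using System.Text.RegularExpressions;

public static class OptionalGroupDemo
{
    // Two mandatory cells followed by an optional third cell, written once
    // with ? and once with the equivalent {0,1} quantifier.
    public const string WithQuestionMark = @"^<td>[^<]*</td><td>[^<]*</td>(<td>[^<]*</td>)?$";
    public const string WithCounts = @"^<td>[^<]*</td><td>[^<]*</td>(<td>[^<]*</td>){0,1}$";

    public static bool Matches(string pattern, string row)
    {
        return Regex.IsMatch(row, pattern);
    }
}
```

Both patterns accept a row with two or three cells and reject a row with four, which is the behaviour the optional <z(> ... </z)?> section is meant to give a template.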

I have not needed or tried other regex code. It will accept any valid regex code, but whether it parses or not is another question.

For more details on regex try this site: http://www.regular-expressions.info/

This can be used in the test program.

Next, the <*xxx> tags. There are currently only 2 <*> tags: <*MATCH> and <*VALUE>. These tags must be used as a pair.

These tags require an extra list, which is passed into the HtmlParser class together with the template.

This list has a Match value and a Field value, both strings.

This tag set is used as follows:

Code:

Template:
<table>
<z(>
<tr>
<td><*MATCH></td> 
<td><*VALUE></td>
</tr>
</z)?>
<z(>
<tr>
<td><*MATCH></td> 
<td><*VALUE></td>
</tr>
</z)?>

List:
MATCH   FIELD
Time    #TIME
Date    #DATE

In this case the parser will try to match the text located by the <*MATCH> tag against the list of match strings, and then store the text located by the following <*VALUE> tag in the corresponding field.
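The lookup step can be sketched as a dictionary from match string to field name. This is an illustration of the idea only: MatchValueMapper and Map are hypothetical names, the rows are supplied directly instead of being read from a page, and trimming and skipping unknown labels are assumptions about how the real list is applied.

```csharp
using System;
using System.Collections.Generic;

public static class MatchValueMapper
{
    // matchList maps a label found by <*MATCH> to the field that should
    // receive the text found by the following <*VALUE>.
    // rows holds the (match text, value text) pairs found on the page.
    public static Dictionary<string, string> Map(
        IDictionary<string, string> matchList,
        IEnumerable<KeyValuePair<string, string>> rows)
    {
        var fields = new Dictionary<string, string>();
        foreach (var row in rows)
        {
            string field;
            if (matchList.TryGetValue(row.Key.Trim(), out field))
                fields[field] = row.Value.Trim();
            // Rows whose label is not in the list are skipped.
        }
        return fields;
    }
}
```

With the list above, a row pair ("Date", "6 Nov") would end up in #DATE and ("Time", "20:15") in #TIME, regardless of the order in which the rows appear on the page.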

At the moment it is not possible to test this in the test program, because it has no way of providing the required list.

Please report any problems or suggested improvements.

Cheers,

/James

patrick · November 14, 2006

James,

Thanks for the fast reply!

James said:
The start and end need to be in place before you load the html source from the site. The filtering in the test app is done on the loading of the page.

OK, that fixed the Start, but the End is still missed. Maybe that's me though, so I will try some more. I have been using HTML tags and will try something different.

James said:
This problem with the #URL I will need to check out. (do you have a URL which you were testing?).

No, I have been testing against a simple intranet page @ work.
Amazing how hard it is to find a "simple" page on the internet

Thanks again,
patrick

James · November 14, 2006

Patrick,

OK, the problem with the URLs is fixed. It was doing case-sensitive matches, which it wasn't supposed to.
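The difference between the two behaviours can be shown with a small comparison helper. This is a sketch of the general technique, not the actual fix in the engine; TagCompare and its StartsWith method are hypothetical names.

```csharp
using System;

public static class TagCompare
{
    // Returns true when source begins with templateText.
    // With ignoreCase set, "<a href=" in a page matches "<A HREF=" in a template.
    public static bool StartsWith(string source, string templateText, bool ignoreCase)
    {
        StringComparison mode = ignoreCase
            ? StringComparison.OrdinalIgnoreCase
            : StringComparison.Ordinal;
        return source.StartsWith(templateText, mode);
    }
}
```

A case-sensitive comparison fails to line up the lower-case <a href=...> source with the upper-case template text, which fits the case where the whole tag came back as the #URL value.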

I have also updated the test app so that HTML code can be pasted into the source box without coming from a website. This helped me test the fix.

Cheers,

/James

patrick · November 15, 2006

Hi James,

Me again

I think I may misunderstand the <Zxx> tags, as I cannot seem to make it work.
Assuming there is an unknown number of list items I was thinking I could just
allow optional items up to a certain number but I cannot get it to work.

HTML Source:

Code:

</tr> 
<tr>
<td>
<span class="bodytext">
2 to 2 1/2 pound strip loin, trimmed
<BR>Olive oil
<BR>Salt and freshly ground black pepper
<BR>Soft hoagie rolls, split 3/4 open
<BR>Provolone Sauce, recipe follows
<BR>Sauteed Mushrooms, recipe follows
<BR>Caramelized Onions, recipe follows
<BR>Sauteed Peppers, recipe follows

</span><p></p>

<span class="bodytext">
Place steak in freezer for 30 to 45 minutes; this makes it easier to slice the meat. Remove the meat from the freezer and slice very thinly. 

<P>Heat griddle or grill pan over high heat. Brush steak slices with oil and season with salt and pepper. Cook for 45 to 60 seconds per side.

<P>Place several slices of the meat on the bottom half of the roll, spoon some of the cheese sauce over the meat, and top with the mushrooms, onions, and peppers.

</span><p></p>
<span class="bodytext">

Tags:

Code:

SPBT

Template Text:

Code:

</tr>
<tr>
<td>
<span class="bodytext">
<#MRI0>
<BR><#MRIN1>
<BR><#MRIN2>
<BR><#MRIN3>
<BR><#MRIN4>
<BR><#MRIN5>
<z(><BR><#MRIN6></z)?>
<z(><BR><#MRIN7></z)?>
</span><p></p>
<span class="bodytext">
<#MRIS0>
<P><#MRIS1>
<P><#MRIS2>
</span><p></p>
<span class="bodytext">

No values are returned for #MRIN5, #MRIN6, #MRIN7

Can you see what I might be doing wrong?

PS I did upgrade to v3 of your test app.

Thanks again,
patrick
