Suggestion to use XBMC's XML scrapers for HTTP scraping

tourettes

Retired Team Member · Premium Supporter
    Quote:
    I can't speak for the MediaPortal guys, but on the Moving Pictures project, I spent a lot of time looking into the XBMC scraper system before we implemented our own generic Cornerstone Scraper Engine. I have not looked too closely at the new ScraperXML project (although I did take a peek, and by the way, it is written in Visual Basic, not C#). But if it works similarly to, or is based on, the older C++ scraper engine for XBMC, it has a couple of problems.

    Thank you for providing some deeper analysis of the scraper project (I just haven't had any free time to do so myself; also, my knowledge of this problem domain is pretty limited, maybe because I have always hated XML, HTML and their nasty friends :p).

    I would really like to see some common project between all (free) HTPC applications that could be used for the data pulling, but unfortunately, at least based on fforde's comments, ScraperXML looks much too limited for MediaPortal II's usage. So maybe MPII will contain the GPL data provider engine from Moving Pictures. Who knows? It's relatively easy to change the providers in use...
     

    Nicezia

    Portal Member
    I don't think I'm really qualified to defend the XBMC scraper format. However, every other scraper format I've seen seems to be limited by the need to have programming skills, and if not that, limited by the type of data it can gather: provisioned for only one type, with no allowance for expansion and no possibility of drop-in usability. Everyone's so proprietary and defensive about their formats. Personally, I started making this because I thought it'd be a good way to add an expandable, non-proprietary info scraper to media manager apps (like the one I'm creating myself) that requires the user only to know regular expressions, which one can learn in the span of an afternoon (but no one ever masters the damn things), and how to put together an XML file.


    There's nothing out there at the moment that has that kind of flexibility; hell, I'm even adding to the already available library of things that this format can scrape. If people don't want flexibility and want to stick to what they know, that's fine with me. If no one else cares to use XBMC's scraper format, I still will, because I see the potential for expansion. I can offer what I have as a standard, but I can't force it on anyone.

    And honestly, if I did want to go to bat for XBMC's scraper format: pulling info from the filesystem would be as easy as writing a scraper to do so, if I'm understanding the format properly. As for running their queries or whatever it is they wanted to run, at least ScraperXML returns XML-formatted info, against which they could run whatever kind of queries they wanted.
    Maybe when I'm done with XBMC scraper compatibility, I'll investigate your format and then add compatibility with it. Maybe the way to bring everyone under the same roof is to support all the proprietary systems in one library, and give the user a choice of how to update their information from online sources.

    Quote:
    1. The method of outputting results with the XBMC scripting engine is fairly cryptic. The script writer has to basically construct an XML document for the output. And what makes this even worse is this construction is embedded in an existing XML document, which means all special characters must be escaped. This dramatically complicates things, reducing the maintainability of existing scripts and making new scripts much more difficult to write.



    XBMC's scraper format requires less of a learning curve than programming. If necessary, I can provide a program that I've written that builds a RegExp block (with the expression, and the option to add all information from that block). All you have to do is write your info without worrying about escaping characters, and it spits out the RegExp. The options for both the RegExp block and the expression can be set by clicking their corresponding controls. It's what was originally going to be the scraper editor... I could probably even add predefined regular expressions, with descriptions, to make it easier. See image: http://i694.photobucket.com/albums/vv306/Nicezia/temp.jpg
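    For readers new to the format: a RegExp block pairs a regular expression with an output template, reading from one numbered buffer and writing the filled-in template to another. A minimal sketch (the buffer numbers, the <h1> pattern, and the <title> tag here are just illustrative):
    Code:
    <RegExp input="$$1" output="&lt;title&gt;\1&lt;/title&gt;" dest="2">
        <expression>&lt;h1&gt;([^&lt;]+)&lt;/h1&gt;</expression>
    </RegExp>
    The expression captures the contents of the page's <h1> tag from buffer $$1, and the output attribute (entity-escaped, since it lives inside an XML attribute) wraps capture \1 in a <title> tag and stores the result in buffer $$2.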
    Mind you, I am a self-taught amateur programmer, and as soon as I figure out a way to cascade information in tree form and make it manageable, I will be making a full-blown scraper editor. My goal as a whole is to make things EASIER to maintain.

    Secondly, even doing it programmatically, you would have to gather the information together in one place to return it to the program. Why not XML? Or do you evaluate each piece of info individually?


    Quote:
    2. The output of the search function is only a text based string and a url for the details page for the movie, tv show, etc. The string is not even consistent, sometimes it is title, sometimes it is title and year, sometimes it is official title, english title, then year. It's just not reliable. This would create difficulties with the auto approval system in Moving Pictures.


    Not true: the output of the CreateSearchUrl function is always a URL... The output of each function is consistent. I think his concern, like mine, is what is put into each function, and how to make a decision on which result is the best match to the input. I can see how that would be a valid concern. But the output of each function is consistent:
    CreateSearchUrl: <url>whatever-page-link</url> (I know some output without the <url> tags, but I think this should be standard for this function, and I have made it standard with all the new-type scrapers I'm working on)
    GetSearchResults: <results><entity><title/><url/><id/><what/><ever/><else/><you/><want/></entity>........</results>
    GetDetails: <details><all/><the/><details/><you/><could/><possibly/><gather/></details>
    I don't know why you consider it would be hard to auto-accept a search result based on its <title> or <id> or <url> or <aired> or whatever else; the information is returned as XML, which means it's queryable, even sortable and comparable.



    Quote:
    3. The way the scraper engine is written, it's just not that extensible. What if someone wanted to add XML parsing tools to be provided to the scraper? What if someone wants to execute XSLT queries on a web page found on the internet? Or what about pulling details from the filesystem rather than a URL? These are not features I particularly care about, but my point is it would not be easy to add these features with the way the XBMC stuff is coded. The core of the scraper engine would have to be modified and this would bring risk to existing functionality, because everything as it is, is lumped together in a big class / file.


    Valid arguments, I'd say, but then you can't possibly program to allow EVERYTHING to be done, or you run the risk of making it much too complicated for the COMMON user to work with. And let's face it: in the world of HTPC, the target user is a COMMON user. There are more non-programmers who want to use these apps than there are programmers (some straying because of the learning curve; believe me, I myself walked away from MediaPortal at one point because it was too hard to get it to do what I wanted it to do, even having gotten the recommended hardware and drivers. I seriously found MythTV easier to configure and maintain, though I haven't actually explored MediaPortal itself in a while, considering that Linux is now my system of choice). The majority of people, from my point of view, want a smaller learning curve for their HTPC, so they spend less time getting it to do what they want and more time enjoying the fact that it does what they want.
    __________________



    I must state for the record that I am writing this library independently. I have no affiliation with the XBMC team, nor do I speak for them; the statements I have made are my own opinions and reflect only my own thoughts.
     

    tourettes

    Retired Team Member · Premium Supporter
    Quote:
    Valid arguments, I'd say, but then you can't possibly program to allow EVERYTHING to be done, or you run the risk of making it much too complicated for the COMMON user to work with. And let's face it: in the world of HTPC, the user is a COMMON user. There are more non-programmers using these apps than there are programmers, and the majority of people, from my point of view, want a smaller learning curve for their HTPC so they spend less time getting it to do what they want and more time enjoying the fact that it does what they want.

    I would say that normal users shouldn't ever have to bother with the scraping templates / regexps / C++ or whatever the data retrieval system is using to gather the metadata. All they should have to do is select the source for the data, and that should be enough. It should be enough that the community has a few active users who can and will update the templates or whatever format is used.

    Something pretty closely related: a normal user shouldn't ever have to configure importers as complex as those in Meedio / MeediOS :p
     

    fforde

    Community Plugin Dev
    Quote:
    1. The method of outputting results with the XBMC scripting engine is fairly cryptic. The script writer has to basically construct an XML document for the output. And what makes this even worse is this construction is embedded in an existing XML document, which means all special characters must be escaped. This dramatically complicates things, reducing the maintainability of existing scripts and making new scripts much more difficult to write.
    Quote:
    XBMC's scraper format requires less of a learning curve than programming. If necessary, I can provide a program that I've written that builds a RegExp block (with the expression, and the option to add all information from that block). All you have to do is write your info without worrying about escaping characters, and it spits out the RegExp. The options for both the RegExp block and the expression can be set by clicking their corresponding controls. It's what was originally going to be the scraper editor... I could probably even add predefined regular expressions, with descriptions, to make it easier. See image: http://i694.photobucket.com/albums/vv306/Nicezia/temp.jpg
    You misunderstand me. Look at this:
    Code:
    <RegExp input="$$1" output="&lt;entity&gt;&lt;title&gt;\1&lt;/title&gt;&lt;url&gt;http://jadedvideo.com/yz_resultJAVA.asp?PRODUCT_ID=\2&lt;/url&gt;&lt;/entity&gt;" dest="5">
    Instead of just storing the URL in a variable (maybe named $URL), you have to build a wrapping XML block using escaped brackets. This is sloppy, unneeded, and complicates the process of writing a script. I don't care about a program you are writing that hides these problems. It still makes the scripts more difficult to work with.
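    Decoded, the escaped output attribute above is building this fragment (shown unescaped here only for readability; in the actual scraper it must stay entity-encoded inside the attribute, which is exactly the burden being described):
    Code:
    <entity>
        <title>\1</title>
        <url>http://jadedvideo.com/yz_resultJAVA.asp?PRODUCT_ID=\2</url>
    </entity>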
    Quote:
    2. The output of the search function is only a text based string and a url for the details page for the movie, tv show, etc. The string is not even consistent, sometimes it is title, sometimes it is title and year, sometimes it is official title, english title, then year. It's just not reliable. This would create difficulties with the auto approval system in Moving Pictures.
    Quote:
    Not true: the output of the search function is always a URL (or am I wrong?)... The output of each function is consistent. I think his concern, like mine, is what is put into each function, and how to make a decision on which result is the best match to the input. I can see how that would be a valid concern. But the output of each function is consistent:
    CreateSearchUrl: <url>whatever-page-link</url> (I know some output without the <url> tags, but I think this should be standard for this function, and I have made it standard with all the new-type scrapers I'm working on)
    GetSearchResults: <results><entity><title/><url/><id/><what/><ever/><else/><you/><want/></entity>........</results>
    GetDetails: <details><all/><the/><details/><you/><could/><possibly/><gather/></details>
    I don't know why you consider it would be hard to auto-accept a search result based on its <title> or <id> or <url> or <aired> or whatever else; the information is returned as XML, which means it's queryable, even sortable and comparable.
    The standard search function is limited in that it only returns a title and url item in the returned XML. To get around this, some scripts put the year in parentheses in the title tag. As you mentioned, XML should be a structured format, well defined and easily searchable by XSLT. The way XBMC handles search results, though, limits the amount of data that is returned, and the only workaround for this destroys any benefits gained from returning XML (even though I don't agree with the returned-XML approach anyway, as I describe in item #1). Additional information in search results, as I mentioned above, can be valuable for automatic matching purposes. Release year would be the most common value, but this could be different for different types of data retrieved (movies, TV shows, weather, etc.).
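    To make this concrete, here is the workaround next to a structured alternative (a sketch using a made-up movie; only <title> and <url> are standard in XBMC search results, so the <year> tag is hypothetical):
    Code:
    <!-- the workaround: year smuggled into the title string -->
    <entity><title>The Matrix (1999)</title><url>...</url></entity>

    <!-- a structured result that an auto-approval system could reliably compare -->
    <entity><title>The Matrix</title><year>1999</year><url>...</url></entity>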
    Quote:
    3. The way the scraper engine is written, it's just not that extensible. What if someone wanted to add XML parsing tools to be provided to the scraper? What if someone wants to execute XSLT queries on a web page found on the internet? Or what about pulling details from the filesystem rather than a URL? These are not features I particularly care about, but my point is it would not be easy to add these features with the way the XBMC stuff is coded. The core of the scraper engine would have to be modified and this would bring risk to existing functionality, because everything as it is, is lumped together in a big class / file.
    Quote:
    Valid arguments, I'd say, but then you can't possibly program to allow EVERYTHING to be done, or you run the risk of making it much too complicated for the COMMON user to work with. And let's face it: in the world of HTPC, the user is a COMMON user. There are more non-programmers using these apps than there are programmers, and the majority of people, from my point of view, want a smaller learning curve for their HTPC so they spend less time getting it to do what they want and more time enjoying the fact that it does what they want.
    No. Adding more features does not mean the method of writing a script is more complicated; it means the users have more tools to work with. People can easily start small with basic page retrieval and regex parsing, then move on to XSLT, looping, sorting, etc. as they become more experienced, improving their scripts. Requiring the use of additional options would be bad. Offering the use of additional options makes your scripting engine more flexible. My point, though, was not that the XBMC scraper didn't have enough options. It was that it is too difficult to expand upon. It does not have a framework for creating new aspects of the scripting language without modifying the core. I think this is a limitation.
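    To make the extensibility complaint concrete, this is the kind of step that cannot be dropped in today without modifying the engine's core (purely hypothetical syntax; no such <XSLT> block exists in the XBMC engine):
    Code:
    <!-- hypothetical: an XSLT step living alongside the existing RegExp blocks -->
    <XSLT input="$$1" stylesheet="extract-episodes.xsl" dest="2"/>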
     

    Nicezia

    Portal Member
    I am not here to convince you to use my library or to adopt the XBMC scraper format. All I'm saying is: the whole point of open source is that anyone can contribute to the process of making things better. Some people can't write scripts; some people can't program. (Eleven months ago I couldn't even read a script and tell you what it was doing. I couldn't read a program and understand the slightest thing about it, but I could write a simple little XML file to make XBMC scrape data from the site I WANTED my info from, because it was based on simple principles that can be learned in a day's time.) I think information gathering should not be confined to the elite who can write scripts and know a programming language. If you don't want to use it, that's fine with me. I'm just making an option, and doing everything I can to put the power of information gathering into the hands of the end user, not leaving it in the hands of those who can write a script or program.

    That's all I have to say.
    Have a nice day.
     

    fforde

    Community Plugin Dev
    Nicezia, I don't necessarily disagree with anything you just said. I am just pointing out why I am personally not interested in working on a community project that uses the XBMC scraper as a base. I think there are several problems with it (which I have pointed out above) that make it more difficult to expand and more difficult to script. And by more difficult to script, I mean more difficult for anyone to script. This includes people new to the scraper engine. I am not suggesting that information gathering should be "confined to the elite". I think a scripting engine should be as simple to use as possible. Which is sort of my point: the XBMC scraper engine is not as simple to use as possible.

    If a community project is created to design a common scraper engine, I am all for that. But I do not think the XBMC scraper engine is the best foundation. Please don't take this personally, and please do not take this as a criticism of XBMC in general. I am honestly trying to be objective here. If it makes you feel any better, I think the scraper engine in MediaPortal 1 sucks too. It basically is just C# code loaded at runtime. That has a lot of issues as well, and that is why on the Moving Pictures project we chose not to reuse it, in the same way we chose not to reuse the XBMC scraper engine.
     

    Nicezia

    Portal Member
    Well, it would be a nice idea to come up with something.

    Basically, right now it's a choice between the user having to know something about scripting or programming, or dealing with XBMC's XML format... I personally love XBMC's approach, because when I wrote my first scraper I didn't have to learn anything but regular expressions and how to set up the XML...


    I just remember thinking, back before I started teaching myself to program: I wish I could get XBMC to pull information on my movies from my favorite site. Then, after a trip to the wiki, I found out I could. My first thought was: yeah, but I'm going to have to learn how to script VB, or C++, or something like that. When I found out all I had to know was regular expressions and how to format an XML file (and how to follow a bouncing buffer), it thrilled me to be able to contribute, as well as to be able to make my program do what I wanted it to do.

    I think that kind of power should be in everyone's hands, and I know that the XBMC format has some difficulties; sure, I screwed up some entities writing the XML scrapers. So I'm doing everything possible to make it easy on anyone who wants to incorporate it, and easy for anyone who wants to modify or create a scraper in this format to do so...

    Open source, as I understand it, was brought about so that one wasn't reliant on the creators or maintainers of a project to get the program to do what they wanted. Me, I'm trying to extend a little of that feeling of freedom even to those who haven't got the skill for scripting.

    My library is an option, and while you don't have to adopt it as your standard, it's going to be out there with many tools to make it easier for the average user to get info from wherever HE wants, without having to break down and learn the principles of programming...

    Let's just say it's a personal quest for me: I am learning to program so others don't have to, and can still contribute.

    Quote:
    You misunderstand me. Look at this:
    Code:
    <RegExp input="$$1" output="&lt;entity&gt;&lt;title&gt;\1&lt;/title&gt;&lt;url&gt;http://jadedvideo.com/yz_resultJAVA.asp?PRODUCT_ID=\2&lt;/url&gt;&lt;/entity&gt;" dest="5">
    Instead of just storing the URL in a variable (maybe named $URL), you have to build a wrapping XML block using escaped brackets. This is sloppy, unneeded, and complicates the process of writing a script. I don't care about a program you are writing that hides these problems. It still makes the scripts more difficult to work with.

    Quote:
    The standard search function is limited in that it only returns a title and url item in the returned XML. To get around this, some scripts put the year in parentheses in the title tag. As you mentioned, XML should be a structured format, well defined and easily searchable by XSLT. The way XBMC handles search results, though, limits the amount of data that is returned, and the only workaround for this destroys any benefits gained from returning XML (even though I don't agree with the returned-XML approach anyway, as I describe in item #1). Additional information in search results, as I mentioned above, can be valuable for automatic matching purposes. Release year would be the most common value, but this could be different for different types of data retrieved (movies, TV shows, weather, etc.).
    Then it's as simple as putting in a RegExp to gather the release year, or whatever you want: add another regular expression and change the output to
    Code:
    <RegExp input="$$1" output="&lt;entity&gt;&lt;title&gt;\1&lt;/title&gt;&lt;url&gt;http://jadedvideo.com/yz_resultJAVA.asp?PRODUCT_ID=\2&lt;/url&gt;&lt;released&gt;\3&lt;/released&gt;&lt;/entity&gt;" dest="5">
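    For the \3 backreference to resolve, the block's expression would also need a third capture group for the year, along these lines (an illustrative pattern only, assuming a page that lists title, product id, and year in that order; the real regex depends entirely on the page's HTML):
    Code:
    <expression>&lt;h3&gt;([^&lt;]+)&lt;/h3&gt;.*?PRODUCT_ID=([0-9]+).*?\(([0-9]{4})\)</expression>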


    It's that simple: you can gather and return whatever information you want from any function, which is what I find is the beauty of XBMC scrapers. It's limited only by what the user tells it to return, and assigning a tag is no different from assigning a variable.

    I promise you I could rewrite the scrapers to return whatever information you wanted, as long as that info is on the given page, even if that info relies on another page, in another domain, in an XML file on the hard drive, or on the local network. Anything that can be loaded as a string can be used for information.


    However, I've just read a post over on the XBMC forum where spiff (the guy who maintains the scraper code) said, in response to your points:
    1) See Nicezia's answer.
    2) The nice thing is that it's all XML-driven. We can change anything; in fact, I'm very much prepared to change what info is passed into the scraper, etc. I haven't put much consideration into generality, since thus far the only thing I've had to worry about is XBMC. Just write up what you consider a sane standard, and it will be considered. Plus, since everything is XML, you can add attributes as you see fit without "hurting" other parsers which do not support them.
    3) Um, XBMC can scrape fine from local files: it's a URL, not necessarily HTTP. Also, I'd like to mention that the "big" class in question consists of approximately 500 lines of C++, of which 50 are comments and 200 are "stupid" initialization code, etc. As for functionality breaking: again, it's XML. It's a very limited "language", and I have still not had one thing break on me, other than the scrapers themselves, which is part of the nature of web scrapers.
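    As a concrete instance of spiff's third point, the URL a scraper chains to could just as well point at the local filesystem (the path here is made up):
    Code:
    <url>file:///home/user/videos/MyMovie.nfo</url>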
     

    fforde

    Community Plugin Dev
    If spiff or whoever wants me to clarify what I have said or thinks that I misunderstand the XBMC scraper engine, I am happy to chat with them, but I would prefer to keep things on a single forum. This is where the discussion was brought up and I have no desire to split the conversation across the internet.

    For what it's worth, I think the project you are trying to put together is admirable, and I hope you succeed in creating a stand-alone, cross-application scraper engine that is flexible and easy to use. I just think that the XBMC scraper is not the best base for accomplishing these goals, and for that reason I am not interested in participating.
     

    Nicezia

    Portal Member
    I also admire your project, and of course, as I said, I might look into your scraping format when I am satisfied that mine fully supports the XBMC protocols, and add that functionality to my project. I wouldn't mind supporting more than one format in my library; it's not about loyalty to a format, it's about putting the power to get things done within easy reach. I've looked over your code, and it wouldn't be hard to add in. (And for the record, my code right now is bloated and has a lot of unnecessary stuff, due to a lack of understanding, at the onset of this project, of the inner workings of XBMC's scraper code. I'm not so good with reading C++, so basically everything in my code is based on input from spiff on how things work in XBMC. As soon as I'm satisfied with everything, the code is going to be converted to C#.)
     

    fforde

    Community Plugin Dev
    Well, let me know if you are interested in including our scraper engine used by Moving Pictures. It is already compiled to a separate DLL, so it would probably require very little new coding; you could just link the library. Programmatically, the input and output are just dictionaries of key-value pairs, so it would be very easy to take that and convert it to whatever output you wanted to interface with other programs.
     
