Suggestion to use XBMC's XML scrapers for HTTP scraping

Discussion in 'Improvement Suggestions' started by Gamester17, February 13, 2008.

  1. Gamester17

    Gamester17 Portal Pro

    Joined:
    May 12, 2004
    Messages:
    98
    Likes Received:
    3
    Occupation:
    x86 Servers Technical Support Engineer
    Location:
    Sweden
    Ratings:
    +3 / 0
    Home Country:
    Sweden
    I would like to suggest that MediaPortal should start using XBMC's XML scrapers for HTTP media information scraping.



    XBMC today has a very nice generic API that lets anyone create and maintain scrapers without prior programming knowledge. The scrapers are XML files that use Perl Compatible Regular Expressions (PCRE RegExp) to parse HTTP websites (such as IMDb.com, TheTVDB.com, TV.com, AllMusic.com, and many more) for metadata about Movies, TV-Shows, Music-Videos, and Music, and XBMC parses that metadata before entering it into its database library. I think it would be great if MediaPortal could integrate the same parser API so that it could use XBMC scraper XML files as-is, and vice versa, making the scrapers cross-compatible between both applications. Later, other open source media centers (such as maybe MeediOS) might catch on, so that they could all share this library/interface and use the same RegEx XML scraper files.

    In theory, this scraper API should only require that you first integrate a basic XML parser and a PCRE RegEx parser into MediaPortal (which I assume already exist, in which case only a hook should be needed), and then convert XBMC's ScraperParser.cpp from C++ to C# so that MediaPortal can use it natively.
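    To make the idea concrete, here is a minimal Python sketch of an XML-plus-RegEx scraper. The XML layout, the field names, and the sample HTML are assumptions made up for illustration; this is not XBMC's actual scraper schema or a port of ScraperParser.cpp, and Python's `re` module is only similar to PCRE, not identical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, simplified scraper definition (NOT the real XBMC schema).
# Each <field> holds one regular expression whose first group is the value.
SCRAPER_XML = r"""
<scraper name="ExampleMovieSite">
  <field name="title"><![CDATA[<h1>(.*?)</h1>]]></field>
  <field name="year"><![CDATA[<span class="year">(\d{4})</span>]]></field>
</scraper>
"""

def load_patterns(xml_text):
    """Read the scraper XML and compile one regex per field."""
    root = ET.fromstring(xml_text.strip())
    return {
        field.get("name"): re.compile(field.text, re.S)
        for field in root.findall("field")
    }

def scrape(html, patterns):
    """Apply every field regex to the page and collect the first match."""
    result = {}
    for name, pattern in patterns.items():
        match = pattern.search(html)
        if match:
            result[name] = match.group(1).strip()
    return result

# Made-up page content standing in for a downloaded HTTP response.
SAMPLE_HTML = '<html><h1>The Breakfast Club</h1><span class="year">1985</span></html>'

details = scrape(SAMPLE_HTML, load_patterns(SCRAPER_XML))
print(details)
```

    The point of the design is that the parsing rules live in the XML file, so a broken scraper can be fixed by editing that file without touching or recompiling the application.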

    You can download the XBMC source code from the SVN, instructions can be found here:
    http://sourceforge.net/projects/xbmc

    The existing scrapers can be found in the SVN under "/branches/linuxport/XBMC/system/scrapers/"
    (Please note that IMDb.xml is the best of these scrapers to use as a reference).
    The C++ source code for the XBMC parser is in "/branches/linuxport/XBMC/xbmc/utils/ScraperParser.cpp"

    If you do not have an SVN tool, you can download via the web interface here:
    SourceForge.net XBMC SVN Repository - /trunk/XBMC/system/scrapers/video
    SourceForge.net XBMC SVN Repository - /trunk/XBMC/system/scrapers/music

    SourceForge.net XBMC SVN Repository - /trunk/XBMC/xbmc/utils/ScraperParser.cpp

    More about the scraper function can be found in the XBMC wiki:
    Category of wiki articles tagged as "Scraper" related
    Scraper.xml structure
    How To Write Media Info Scrapers for XBMC
    Scrap (Scrap.exe for testing of scrapers under Windows)
    How To use Scrapers
    TV Shows handling in XBMC
    Music Videos handling in XBMC


    What do you Team-MediaPortal developers think about this idea?

    PS! Those unfamiliar with XBMC will find a good overview in the Wikipedia article:
    http://en.wikipedia.org/wiki/XBMC
     
  3. FlipGer
    • Premium Supporter

    FlipGer Retired Team Member

    Joined:
    April 27, 2004
    Messages:
    2,658
    Likes Received:
    115
    Location:
    Leipzig, Germany
    Ratings:
    +115 / 0
    Home Country:
    Germany
    Hi,

    Thanks a lot for the hint. The developers will take a look into it. :)

    Flip.
     
  5. Gamester17

    Just a heads-up: a couple of XBMC developers are currently finalizing a new scraper that uses a slightly updated scraper API in XBMC, so if anyone plans to integrate this into MediaPortal today, it may be worth waiting a couple of weeks first (otherwise the work will have to be redone later).

    The new API will be PCRE (Perl Compatible Regular Expressions) compatible, which will allow PCRE RegEx to be used in the XML files and should make the RegEx faster, simpler, and more user-friendly for those working on XML scrapers. So if you would like to be one step ahead, you might want to implement a PCRE parser library and/or PCRE support in MediaPortal's existing RegEx parser.

    Perl Compatible Regular Expressions - Wikipedia, the free encyclopedia
    PCRE - Perl Compatible Regular Expressions

    PS! Note that some of the existing scrapers available in XBMC's SVN are not currently working. This is simply because the websites they scrape have changed, so someone needs to update those XML scrapers for them to work again (which no one has done yet if they are broken at any given time). However, if MediaPortal (and possibly MeediOS as well) starts using the same scraper API, then together we should be able to do a better job of keeping all the available scrapers up to date.
     
  6. Gamester17

    As I posted this FYI on the MeediOS forum, I thought I should post it here too:

    Team-XBMC plans (though not in the near future) to also implement the same scraper API and similar XML files for scraping music (and other audio file) metadata from the internet, and later maybe even for other types of metadata scraping (such as weather forecasts, XMLTV EPG TV-guide scraping, etc.), in order to make it, if possible, a unified scraper API throughout a media center application like XBMC. Again, that is not on the near-term roadmap, as many other things have a higher priority and there are only so many hours in a day. So far only the concept has been written down on paper, and we have put it aside for now.

    By the way, I think the optimal solution would be for our projects to someday agree on making each individual XML scraper 100% compatible, so that it could be used in each media center application without modification. That way we could in the future start a new common project (on sourceforge.net or code.google.com, for example) to host and maintain these "HTPC XML scrapers". It should then be simple to make each media center application automatically check for and download updated scrapers from that common project, which IMHO would be very user-friendly.
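    As a rough illustration of the automatic update check, the following Python sketch compares installed scraper versions against a list published by a common project. The file names, the dotted version format, and the idea of a version manifest are all assumptions for the example; no such shared project or manifest exists yet.

```python
def parse_version(text):
    """Turn a dotted version string such as "1.10" into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))

def scrapers_to_update(installed, available):
    """Return the scraper names that are missing locally or older than the published version."""
    return sorted(
        name
        for name, version in available.items()
        if name not in installed
        or parse_version(version) > parse_version(installed[name])
    )

# Hypothetical data: what is on disk versus what the shared project publishes.
installed = {"imdb.xml": "1.2", "tvdb.xml": "1.10"}
available = {"imdb.xml": "1.3", "tvdb.xml": "1.9", "allmusic.xml": "1.0"}

outdated = scrapers_to_update(installed, available)
print(outdated)
```

    Comparing versions as integer tuples rather than as strings avoids treating "1.9" as newer than "1.10".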
     
  7. Gamester17

    FYI: XBMC now uses the same generic scraper (importer) API that I described in the first post not only for Movies (and Porn), but also for TV-Shows, Music Videos, and Music. The scrapers automatically download Posters, Album Cover Art, Banners, Screenshots, and Fan Art from multiple sites within one scraper (importer), and the API has multi-lingual support.

    These two new HOW-TO guides are recommended reading for this:
    HOW-TO Write Media Info Scrapers (introduction): http://xbmc.org/wiki/?title=HOW-TO_Write_Media_Info_Scrapers_(introduction)
    HOW-TO Write Media Info Scrapers (the complete dummies guide): http://xbmc.org/wiki/?title=HOW-TO_Write_Media_Info_Scrapers_(the_complete_dummies_guide)

    Again, I think you should consider reusing this same API in MediaPortal so that we could share the scraper XML files 8)
     
  8. panic

    panic Portal Pro

    Joined:
    March 24, 2006
    Messages:
    67
    Likes Received:
    0
    Occupation:
    Student
    Ratings:
    +0 / 0
    Home Country:
    Germany
    Is this still being considered for addition to MP?
     
  9. Nicezia

    Nicezia Portal Member

    Joined:
    February 14, 2006
    Messages:
    5
    Likes Received:
    0
    Ratings:
    +0 / 0
    In case there is any interest remaining in bringing the XBMC scraper format to MP:

    I just wanted to inform you that I've actually created a .NET library that would make it easy to implement.
    So far it only has support for the XBMC movie scrapers, but I'm working on the other media types.

    I've made the process as simple as possible to implement. A program using the library simply needs to:

    A) Send the movie name (and optionally the year, which improves accuracy of matches) to the CreateSearchUrl function: CreateSearchUrl("The Breakfast Club", "1985")

    B) Select from the results returned

    C) Manage the details returned for the selected movie

    All data returned from the library is string data (in XML element format).
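    The three steps above can be sketched in Python as follows. Only the name CreateSearchUrl and the overall A/B/C flow come from the post; the search endpoint, the XML element names, and the canned responses are hypothetical stand-ins and do not reflect the library's real API or output.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical search endpoint, used only for illustration.
SEARCH_BASE = "http://example.com/find"

def create_search_url(title, year=None):
    """Step A: build the search URL from a title and an optional year."""
    query = title if year is None else f"{title} ({year})"
    return f"{SEARCH_BASE}?{urlencode({'q': query})}"

def parse_results(results_xml):
    """Step B: turn the XML string of search results into (title, url) pairs."""
    root = ET.fromstring(results_xml)
    return [(e.findtext("title"), e.findtext("url")) for e in root.findall("entity")]

def parse_details(details_xml):
    """Step C: turn the XML string of details into a plain dictionary."""
    root = ET.fromstring(details_xml)
    return {child.tag: child.text for child in root}

# Canned strings standing in for what a scraper library might return.
RESULTS_XML = (
    "<results><entity><title>The Breakfast Club (1985)</title>"
    "<url>http://example.com/title/1</url></entity></results>"
)
DETAILS_XML = "<details><title>The Breakfast Club</title><year>1985</year></details>"

url = create_search_url("The Breakfast Club", "1985")
candidates = parse_results(RESULTS_XML)
chosen_title, chosen_url = candidates[0]  # the host application picks a match
details = parse_details(DETAILS_XML)
print(url, chosen_url, details)
```

    Because everything crosses the library boundary as XML strings, the calling application stays responsible for converting values such as the year into its own types.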
     
  10. Gamester17

    Hope MediaPortal developers cooperate and join this collaboration effort for standards

    FYI: Nicezia's ScraperXML library now has support for Movies, TV Shows, Music Videos, and Music, and he is also planning scraper support for PC Games for XBMC's future game library.

    ScraperXML library C# .NET code is open source under GPL and can be downloaded here:
    http://sourceforge.net/projects/scraperxml

    More discussion about this library is taking place in the XBMC Community Forums:
    ScraperXML (Open Source XML Scraper .NET DLL Library), please help verify my work... - XBMC Community Forum

    MeediOS and Meedio plugin developers are also looking to implement this for all scraping purposes:
    MeediOS :: View topic - Suggestion to use XBMC's XML scrapers for HTTP scraping

    Also the "Unified Media Manager" project plans on using this ScraperXML library for metadata scraping:
    https://forum.team-mediaportal.com/...t-who-here-wants-help-code-new-project-59399/

    Hope that MediaPortal developers join this effort to create an open and common standard for metadata scraping and shared scraper XML files.

    :D
     
  11. fforde

    fforde Community Plugin Dev

    Joined:
    June 7, 2007
    Messages:
    2,666
    Likes Received:
    1,690
    Occupation:
    Software Engineer
    Location:
    Texas
    Ratings:
    +1,696 / 0
    Home Country:
    United States of America
    I can't speak for the MediaPortal guys, but on the Moving Pictures project I spent a lot of time looking into the XBMC scraper system before we implemented our own generic Cornerstone Scraper Engine. I have not looked too closely at the new ScraperXML project (although I did take a peek and, by the way, it is written in Visual Basic, not C#). But if it works similarly to, or is based on, the older C++ scraper engine for XBMC, it has a couple of problems.

    1. The method of outputting results with the XBMC scripting engine is fairly cryptic. The script writer has to basically construct an XML document for the output. And what makes this even worse is this construction is embedded in an existing XML document, which means all special characters must be escaped. This dramatically complicates things, reducing the maintainability of existing scripts and making new scripts much more difficult to write.
    2. The output of the search function is only a text-based string and a URL for the details page for the movie, TV show, etc. The string is not even consistent: sometimes it is the title, sometimes the title and year, sometimes the official title, English title, then year. It's just not reliable. This would create difficulties with the auto approval system in Moving Pictures.
    3. The way the scraper engine is written, it's just not that extensible. What if someone wanted to add XML parsing tools to be provided to the scraper? What if someone wants to execute XSLT queries on a web page found on the internet? Or what about pulling details from the filesystem rather than a URL? These are not features I particularly care about, but my point is it would not be easy to add these features with the way the XBMC stuff is coded. The core of the scraper engine would have to be modified and this would bring risk to existing functionality, because everything as it is, is lumped together in a big class / file.
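    To illustrate the escaping issue from point 1, here is a small Python sketch. The output template and the wrapper element names are assumptions chosen for the example and are not copied from a real XBMC scraper file.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

# Hypothetical XML snippet that a script writer wants the scraper to output;
# \1 and \2 stand for regex capture groups.
output_template = r"<details><title>\1</title><year>\2</year></details>"

# To embed that snippet as text inside another XML document,
# every markup character has to be escaped first.
embedded = escape(output_template)
print(embedded)

# Hypothetical scraper file containing the escaped template.
scraper_xml = f"<scraper><output>{embedded}</output></scraper>"

# Parsing the outer document gives back the original, readable template.
recovered = ET.fromstring(scraper_xml).findtext("output")
print(recovered)
```

    The escaped form is what the script writer has to read and maintain by hand, which is where the loss of readability comes from.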

    For these reasons I chose not to get involved with the XBMC scraper engine a while back. Instead we created the Cornerstone Scraper Engine (also GPL) that powers Moving Pictures. I think that a community effort to create a common data provider system for multiple HTPC apps is a good idea, but if the project is going to base the engine on the XBMC implementation, I am unfortunately not really interested in getting involved.
     