Suggestion to use XBMC's XML scrapers for HTTP scraping

Gamester17 · February 13, 2008

I would like to suggest that MediaPortal should start using XBMC's XML scrapers for HTTP media information scraping.

XBMC today has a very nice generic API that lets anyone create and maintain scrapers based on XML and Perl Compatible Regular Expressions (PCRE RegExp) without prior programming knowledge. These scrapers fetch metadata for Movies, TV-Shows, Music-Videos, and Music from HTTP websites (sites such as IMDb.com, TheTVDB.com, TV.com, AllMusic.com, and many more), and XBMC parses that metadata before entering it into its database library. I think it would be great if MediaPortal could integrate that same parser API so that it could use XBMC scraper XML files as is, and vice versa, making the scrapers cross-compatible with both applications. Later, other open source media centers (such as maybe MeediOS) might catch on, so that they could all share this library/interface and use the same RegEx XML scraper files.

In theory, this scraper API should only require that you first integrate a basic XML parser and a PCRE RegEx parser into MediaPortal (which I assume already exist, in which case only a hook should be needed), and then convert XBMC's ScraperParser.cpp from C++ to C# so that MediaPortal can use it natively.
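To make the idea concrete, here is a minimal sketch of a "regex rules stored in XML" scraper. The XML layout, the element names (`scraper`, `rule`, `expression`), and the `run_scraper` helper are all assumptions made up for illustration; this is not XBMC's actual scraper schema, and Python's `re` module stands in for a PCRE library.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, simplified scraper definition. This is NOT XBMC's real
# scraper schema; it only illustrates "regex rules stored in an XML file".
SCRAPER_XML = r"""
<scraper name="example">
  <rule output="title">
    <expression>&lt;h1&gt;(.*?)&lt;/h1&gt;</expression>
  </rule>
  <rule output="year">
    <expression>\((\d{4})\)</expression>
  </rule>
</scraper>
"""


def run_scraper(scraper_xml: str, page: str) -> dict:
    """Apply every rule's regular expression to the page text."""
    root = ET.fromstring(scraper_xml)
    result = {}
    for rule in root.findall("rule"):
        # The XML parser has already unescaped &lt; and &gt; here.
        pattern = rule.findtext("expression")
        match = re.search(pattern, page, re.DOTALL)
        if match:
            result[rule.get("output")] = match.group(1).strip()
    return result


# A canned page stands in for an HTTP response.
PAGE = "<html><h1>The Breakfast Club</h1><p>Released (1985)</p></html>"
details = run_scraper(SCRAPER_XML, PAGE)
print(details)
```

The host application only needs an XML parser and a regex engine; everything site-specific lives in the XML file, which is what would make such files shareable between media centers.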

You can download the XBMC source code from SVN; instructions can be found here:
http://sourceforge.net/projects/xbmc

The existing scrapers can be found in the SVN under "/branches/linuxport/XBMC/system/scrapers/"
(Please note that IMDb.xml is the best of these scrapers to use as a reference).
The C++ source code for the XBMC parser is in "/branches/linuxport/XBMC/xbmc/utils/ScraperParser.cpp"

If you do not have an SVN tool, you can download via the web interface here:
SourceForge.net XBMC SVN Repository - /trunk/XBMC/system/scrapers/video
SourceForge.net XBMC SVN Repository - /trunk/XBMC/system/scrapers/music

SourceForge.net XBMC SVN Repository - /trunk/XBMC/xbmc/utils/ScraperParser.cpp

More about the scraper function can be found in the XBMC wiki:
Category of wiki articles tagged as "Scraper" related
Scraper.xml structure
How To Write Media Info Scrapers for XBMC
Scrap (Scrap.exe for testing of scrapers under Windows)
How To use Scrapers
TV Shows handling in XBMC
Music Videos handling in XBMC

What do you Team-MediaPortal developers think about this idea?

PS! Those unfamiliar with XBMC will find a good overview in the Wikipedia article:
http://en.wikipedia.org/wiki/XBMC

FlipGer · February 13, 2008

Hi,

Thanks a lot for the hint. The developers will look into it.

Flip.

Gamester17 · February 14, 2008

FYI, I posted the same idea on MeediOS forum and got a discussion started:
MeediOS :: View topic - Suggestion to use XBMC's XML scrapers for HTTP scraping

They might have a few ideas for improvements or changes worth keeping track of.

Gamester17 · February 22, 2008

Just a heads-up: a couple of XBMC developers are currently finalizing a new scraper which uses the slightly updated scraper API in XBMC. So if anyone plans on integrating this into MediaPortal today, you should maybe wait a couple of weeks first (otherwise your work will have to be redone later).

The new API will be PCRE (Perl Compatible Regular Expressions) compatible, which will allow PCRE RegEx to be used in the XML files. That should make the RegEx faster, simpler, and more user-friendly for those working on XML scrapers. So if you would like to be one step ahead, you might want to implement a PCRE parser library and/or PCRE support in MediaPortal's existing RegEx parser.

Perl Compatible Regular Expressions - Wikipedia, the free encyclopedia
PCRE - Perl Compatible Regular Expressions

PS! Note that some of the existing scrapers in XBMC's SVN are not currently working. This is simply because the websites they scrape have changed, so someone needs to update those XML scrapers for them to work again (which, for any scraper that is broken at a given time, no one has done yet). However, if MediaPortal (and possibly MeediOS as well) starts using the same scraper API, then together we should be able to do a better job of keeping all the available scrapers up to date.

Gamester17 · February 23, 2008

As I posted this FYI on the MeediOS forum, I thought I should post it here too:

Team-XBMC plans (though not in the near future) to also use the same scraper API and similar XML files for scraping music (and other audio file) metadata from the internet, and later maybe for other types of metadata scraping as well (such as weather forecasts and XMLTV EPG TV-guide data), in order to make it, if possible, a unified scraper API throughout a media center application like XBMC. Again, this is not on the near-term roadmap, as many other things have higher priority and there are only so many hours in a day. So far only the concept has been written down on paper, and we have put it aside for now.

By the way, I think that the optimal solution would be if our projects someday could come to a compromise to make each individual XML scraper 100% compatible so they could be used in each media center application without modifications, that way maybe in the future we could start a new common project (like on sourceforge.net or code.google.com) where we could host and maintain these "HTPC XML scrapers", ...it should then be simple to make each media center application automatically check and download updated scrapers from that common project, which IMHO would be very user-friendly.

Gamester17 · August 30, 2008

FYI; XBMC now uses the same generic scraper (importer) API that I described in the first post not only for Movies (and Porn), but also for TV-Shows, Music Videos, and Music. The scrapers automatically download Posters, Album Cover Art, Banners, Screenshots, and Fan Art from multiple sites within one scraper (importer), and there is multi-lingual support.

These two new HOW-TO guides are recommended reading for this:
HOW-TO Write Media Info Scrapers (introduction): http://xbmc.org/wiki/?title=HOW-TO_Write_Media_Info_Scrapers_(introduction)
HOW-TO Write Media Info Scrapers (the complete dummies guide): http://xbmc.org/wiki/?title=HOW-TO_Write_Media_Info_Scrapers_(the_complete_dummies_guide)

Again, I think you should consider reusing this same API in MediaPortal so that we can share the scraper XML files 8)

panic · September 9, 2008

Is this still being considered for addition to MP?

Nicezia · May 21, 2009

In case there is any interest remaining in bringing the XBMC scraper format to MP:

I just wanted to inform you that I've actually created a .NET library that would make it easy to implement.
So far it only supports the XBMC movie scrapers, but I'm working on the other media types.

I've made the process as simple as possible to implement; a program using the library simply needs to:

A) Send the movie name (and optionally the year, which improves the accuracy of matches) to the CreateSearchUrl function: CreateSearchUrl("The Breakfast Club", "1985")

B) Select from the results returned

C) Manage the details returned for the selected movie

All data returned from the library is string data (in xml element format)
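The three steps above can be sketched roughly as follows. Only the name CreateSearchUrl comes from the post; the Python function names, the placeholder URL, and the shape of the returned XML are assumptions for illustration, and the real .NET library may look quite different.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote_plus


# Hypothetical stand-ins for the three steps. Only the name
# CreateSearchUrl appears in the post; its real signature may differ.
def create_search_url(title: str, year: str = "") -> str:
    """Step A: build a search URL from the title and optional year."""
    query = f"{title} {year}".strip()
    return "https://example.invalid/find?q=" + quote_plus(query)


def parse_results(results_xml: str) -> list:
    """Step B: turn the XML string of matches into (title, url) pairs."""
    root = ET.fromstring(results_xml)
    return [
        (entity.findtext("title"), entity.findtext("url"))
        for entity in root.findall("entity")
    ]


def parse_details(details_xml: str) -> dict:
    """Step C: turn the XML string of details into a dictionary."""
    root = ET.fromstring(details_xml)
    return {child.tag: child.text for child in root}


url = create_search_url("The Breakfast Club", "1985")

# Canned XML strings stand in for what a library might return
# (the element names here are assumed, not taken from the library).
results_xml = (
    "<results><entity><title>The Breakfast Club (1985)</title>"
    "<url>https://example.invalid/title/1</url></entity></results>"
)
choices = parse_results(results_xml)
chosen_title, chosen_url = choices[0]  # the user picks the first match

details_xml = (
    "<details><title>The Breakfast Club</title><year>1985</year></details>"
)
details = parse_details(details_xml)
print(url, chosen_title, details)
```

Because everything comes back as XML strings, the calling program decides how to parse and store the fields.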

Gamester17 · May 25, 2009

Hope MediaPortal developers cooperate and join this collaboration effort for standards

Nicezia said:
In the case that there is any interest remaining in bring the XBMC scraper format to MP

I just wanted to inform you that I've actually created a .net library that would make it easy to implement.
So far it only has support for the XBMC movie Scrapers, but I'm working on the other media types

I've made the process as simple as possible to implement a program using the library simply needs to

A) Send the Movie name (and optionally the year, which improves accuracy of matches) to the CreateSearchUrl function CreateSearchUrl("The Breakfast Club", "1985)

B) Select from the results returned

C) Manage the details returned for the selected movie

All data returned from the library is string data (in xml element format)

FYI; Nicezia's ScraperXML library now has support for Movies, TV Shows, Music Videos, and Music, ...he is also planning scraper support for PC Games for XBMC's future game library.

ScraperXML library C# .NET code is open source under GPL and can be downloaded here:
http://sourceforge.net/projects/scraperxml

More discussion about this library is taking place in the XBMC Community Forums:
ScraperXML (Open Source XML Scraper .NET DLL Library), please help verify my work... - XBMC Community Forum

MeediOS and Meedio plugin developers are also looking to implement this for all scraping purposes:
MeediOS :: View topic - Suggestion to use XBMC's XML scrapers for HTTP scraping

Also the "Unified Media Manager" project plans on using this ScraperXML library for metadata scraping:
https://forum.team-mediaportal.com/...t-who-here-wants-help-code-new-project-59399/

Hope that MediaPortal developers join this effort to create an open and common standard for metadata scraping and shared scraper XML files.

fforde · May 25, 2009

I can't speak for the MediaPortal guys, but on the Moving Pictures project, I spent a lot of time looking into the XBMC scraper system before we implemented our own generic Cornerstone Scraper Engine. I have not looked too closely at the new ScraperXML project (although I did take a peek, and by the way it is written in Visual Basic, not C#). But if it works similarly to, or is based on, the older C++ scraper engine for XBMC, it has a few problems.

The method of outputting results with the XBMC scripting engine is fairly cryptic. The script writer has to basically construct an XML document for the output. And what makes this even worse is this construction is embedded in an existing XML document, which means all special characters must be escaped. This dramatically complicates things, reducing the maintainability of existing scripts and making new scripts much more difficult to write.
The output of the search function is only a text-based string and a URL for the details page for the movie, TV show, etc. The string is not even consistent: sometimes it is the title, sometimes the title and year, sometimes the official title, English title, then year. It's just not reliable. This would create difficulties with the auto-approval system in Moving Pictures.
The way the scraper engine is written, it's just not that extensible. What if someone wanted to add XML parsing tools to be provided to the scraper? What if someone wants to execute XSLT queries on a web page found on the internet? Or what about pulling details from the filesystem rather than a URL? These are not features I particularly care about, but my point is it would not be easy to add these features with the way the XBMC stuff is coded. The core of the scraper engine would have to be modified and this would bring risk to existing functionality, because everything as it is, is lumped together in a big class / file.
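The escaping problem described in the first point can be illustrated with a small sketch. The template text and the `<output>` wrapper element are made up for illustration and are not taken from an actual XBMC scraper file; the point is only that XML stored inside XML has to be escaped character by character.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# What the script writer wants the scraper to emit for each match
# (hypothetical template; \1 stands for the first regex capture group).
template = "<details><title>\\1</title></details>"

# To embed that template inside an XML scraper file, every special
# character has to be escaped first.
escaped = escape(template)
print(escaped)

# Hypothetical wrapper element standing in for the scraper file.
wrapped = f"<output>{escaped}</output>"

# Parsing the file gives the original template back.
recovered = ET.fromstring(wrapped).text
print(recovered == template)
```

The escaped form is what a script writer would have to read and edit by hand, which is where the maintainability concern comes from.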

For these reasons I chose not to get involved with the XBMC scraper engine a while back. Instead we created the Cornerstone Scraper Engine (also GPL) that powers Moving Pictures. I think that a community effort to create a common data provider system for multiple HTPC apps is a good idea, but if the project is going to base the engine on the XBMC implementation, I am unfortunately not really interested in getting involved.
