[Finished] Scraper improvements

MJGraf · November 20, 2013

Ok ok, understood, guys

I've done so much in the Backend lately that I feel comfortable enough to do some more work in that area. Priority is most likely as follows:
1. There's still a bug in the UPnP subsystem that Morpheus has found but I think I know the solution already (looks like a race condition)
2. I'd like to do a little rework in MediaLibrary (related to MP-405) to ensure that every granular action (such as detaching a client) is executed in one database transaction. But I'm not sure yet whether this interferes with Valks' rework on virtual data access. @Valks are you around? Any plans to continue your work in the next weeks?
3. Then I think it's time for a rework of ImportWorker to (a) maybe make it multithreaded and (b) finally implement a two-pass import to make the first import much faster
4. Finally the MetadataExtractors have to be improved
I know, this is still a long way to go, but as usual in MP2, when we do it, we do it right

As to the fourth point: does anyone know whom to ask about the scraper implementation in MP1 / MovingPictures? I'm not talking about the scrapers themselves here but about the parser that reads the scraper files and executes them. @RoChess can you help here? All the MetadataExtractors in MP2 are currently written in C#, but from MP1 I take it that we have a lot of people developing very good scripted scrapers, and as I understand from some posts these guys wouldn't feel comfortable in C#. It would be nice IMHO if we could have scripted MetadataExtractors in MP2 as well.
Does that sound like a plan?

RoChess · November 20, 2013

@MJGraf, that would be more of a question for fforde, but I do know that @trevor used the MovPic scraper system and integrated it into mvCentral. That might be an easier way to look at the parsing engine code (I didn't look, but it might all be in a single changeset on his SVN project site).

From a functional point of view, it is roughly split into three sections. First, the search node gets all the results that match a search based on title/tt-ID/etc., and the results are fed back to MovPic in an array so that the user can confirm/correct the matches found. Then the selection is fed to the details node to obtain all the textual info (summary/crew/etc.), and finally to the artwork nodes to get cover/backdrop/etc.

Scraper scripts can complement each other when nodes are missing (or info is not found). For example, for IMDb+ I added a cover node, but I rely on the TMDb backdrop node to get the actual fanart. Adjusting the node order allows one script to overrule another, so Fanart.tv can be used for covers even when IMDb+ supports getting covers.
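To make the node-order idea concrete, here is a minimal sketch in C#. It is not the MovingPictures engine: `ScraperScript`, `NodeResolver` and the node names are hypothetical and only model "ask each script in priority order, skip scripts that lack the node or find nothing".

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical model of a scraper script: a name plus the nodes it implements.
public sealed class ScraperScript
{
    public string Name { get; }

    // Node name (e.g. "cover", "backdrop") -> function from movie ID to a result,
    // or null when the node finds nothing.
    public Dictionary<string, Func<string, string?>> Nodes { get; }

    public ScraperScript(string name, Dictionary<string, Func<string, string?>> nodes)
    {
        Name = name;
        Nodes = nodes;
    }
}

public static class NodeResolver
{
    // Walks the scripts in priority order and returns the first non-null result
    // from a script that implements the requested node.
    public static string? Resolve(
        IEnumerable<ScraperScript> scriptsInPriorityOrder, string node, string movieId)
    {
        foreach (var script in scriptsInPriorityOrder)
        {
            if (!script.Nodes.TryGetValue(node, out var run))
                continue; // this script has no such node, try the next one

            var result = run(movieId);
            if (result != null)
                return script.Name + ": " + result;
        }
        return null; // no script could provide this node
    }
}
```

With an IMDb+ script that only has a cover node and a TMDb script that has both cover and backdrop nodes, the order IMDb+, TMDb resolves "cover" from IMDb+ and "backdrop" from TMDb; swapping the order makes TMDb win for covers too.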

I actually planned to look at how MediaPortal 1.x was doing scrapers for MyVideos in C# (when free time permits), and see if it is in my wheelhouse to convert IMDb+ over. If you plan to add an XML engine to MP2, similar to what MovPic currently uses, then I can use that time to focus more on improving IMDb+.

And you are right, @Merlyn for example created FilmInfo+, a German variation of IMDb+, and has the time to maintain the XML based scraper-script, but would not be able/interested to do the same if it was in C#.

morpheus_xx · November 21, 2013

Short note about "foreign covers": this can be changed quite easily. I currently also download "neutral" covers, which evidently are often Russian or similar.

We could use only English as a fallback if no cover is available in the current culture.
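A minimal sketch of that fallback rule, assuming each downloaded cover carries a two-letter language code (or none for "neutral" covers). `CoverSelector` and the tuple shape are hypothetical; in a real client the current language could come from `CultureInfo.CurrentUICulture.TwoLetterISOLanguageName`.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

public static class CoverSelector
{
    // Each cover has a two-letter language code, or null for a "neutral" cover.
    public static string? Pick(
        IEnumerable<(string? Language, string Url)> covers,
        string currentLanguage)
    {
        var list = covers.ToList();

        // Prefer the user's language; otherwise fall back to English only.
        // Neutral covers are deliberately never selected.
        return FirstIn(list, currentLanguage) ?? FirstIn(list, "en");
    }

    private static string? FirstIn(
        List<(string? Language, string Url)> covers, string language)
    {
        foreach (var cover in covers)
        {
            if (string.Equals(cover.Language, language, StringComparison.OrdinalIgnoreCase))
                return cover.Url;
        }
        return null;
    }
}
```

Returning null when neither the current language nor English is available leaves it to the caller to decide whether showing no cover is better than showing a foreign one.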

morpheus_xx · November 21, 2013

To the "cover for same movie name in different years" problem:
This is a known issue that consists of two parts:
- online lookup: if an MDE extracted a year and a name, this combination is used for the lookup. If an IMDBid was found, it is preferred.
- cover loading: the FanartImageSource only uses a single name to find a cover. This leads to ambiguous lookup results (or none). The only unique key would be the IMDBid, if present.
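One way to picture the fix is a single key function shared by the online lookup and the cover loading. This is only a sketch under that assumption; `LookupKey` and the key format are hypothetical and not how FanartImageSource works today.

```csharp
#nullable enable
using System;

public static class LookupKey
{
    // Builds the most specific key available:
    // IMDb ID first, then title plus year, then the bare title.
    public static string Build(string title, int? year, string? imdbId)
    {
        if (!string.IsNullOrWhiteSpace(imdbId))
            return "imdb:" + imdbId.Trim();

        var name = title.Trim().ToLowerInvariant();
        return year.HasValue
            ? "title:" + name + "|" + year.Value
            : "title:" + name;
    }
}
```

Two movies with the same name but different years then map to different keys, and a bare-title key only appears when nothing better was extracted.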

MJGraf · November 21, 2013

Just realized that I misspelled Valk - so the question above was at @Valk

Thanks a lot RoChess. I had a look at the scraper engine of MovingPictures last night and to me it seems not too difficult to integrate it as a separate plugin into MP2 which only provides a "scraping-service" to other parts of MP2. That way all the integrated MetadataExtractors could make use of it. Another approach would be to integrate the scraper engine as kind of a dummy MetadataExtractor itself that needs the scripts to provide any results.

But this decision mainly depends on how we want to make our MetadataExtractors configurable by users. We currently only have the choices you all know: "Music", "Pictures", "Video", "Movies" and "Series", where choosing Movies or Series automatically also chooses Videos.
I personally like the idea of having such an easy selection within the MP2 client and I would therefore propose that we shift any more detailed configuration into settings, which for now can be changed via XML configuration files, later on via a configuration frontend.
Any suggestions on what should be configurable and how? Shall we offer "sources" for metadata and let the user choose "I want this metadata collected by this source"? If so, how fine-grained should this choice be? One source for every single attribute of a MediaItemAspect? One source for every MediaItemAspect would probably not be enough. A source could be "file name", "file tags", "separate local files" (such as Cover.jpg, folder.jpg), "MusicBrainz", "TheMovieDB", etc. The next thing could be that we can have multiple sources for one attribute, each with a priority.
That way a user could say "For movies I want the title from the 'file tags' and if there is no title in it I want it from 'file name'. The cover should be taken from 'file tags', if there is none take it from 'separate local files' and if there is none, try getting it from 'MusicBrainz'".
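The per-attribute priority idea could look roughly like this. It is a sketch only: `SourceResolver`, the dictionary layout and the attribute names are assumptions for illustration, not an existing MP2 API.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

public static class SourceResolver
{
    // priorities:     attribute name -> source names, highest priority first
    // valuesBySource: source name    -> (attribute name -> value that source found)
    public static string? Resolve(
        string attribute,
        Dictionary<string, string[]> priorities,
        Dictionary<string, Dictionary<string, string>> valuesBySource)
    {
        if (!priorities.TryGetValue(attribute, out var sources))
            return null; // nothing configured for this attribute

        foreach (var source in sources)
        {
            // Take the first source that actually delivered a non-empty value.
            if (valuesBySource.TryGetValue(source, out var values)
                && values.TryGetValue(attribute, out var value)
                && !string.IsNullOrEmpty(value))
            {
                return value;
            }
        }
        return null; // every configured source came up empty
    }
}
```

A configuration such as "Title" -> { "file tags", "file name" } then yields the tag title when present and the file-name title otherwise, which matches the example above.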
We just need ideas of all possible use cases to make sure that we do not make a design decision that limits us later on...
Of course we would need a good "standard" configuration that works out of the box ("MP2 feeling"...) but I suppose that this is one of the areas where people want at least the possibility to do as much configuration as possible.

Lehmden · November 21, 2013

Hi.
Thanks to all for having a look at this.

There is no "easy" way to get a well-shaped movie database, imho. This hasn't worked for MP1, XBMC or WMC. You have to put a lot of manual work into it if you want it "perfect". But that's OK with me. Those who want it "easy" can use "normal" online scraping as it is now. Maybe a lot of people are happy with that. But for those who want more, the best and easiest way is to use external tools like Ember or Tiny Media Manager. There you prepare your movie once and can use the data and artwork you've chosen as "the right one" with every HTPC software. No automatic process, however good it may be, can really know what the user wants. In the best case it "guesses" quite well. Normally the metadata is put into a so-called NFO file. In fact this is an XML file with all the necessary data... But we can use MKV metatags as well. At least tMM is able to write Matroska tags to XML, which can be added to the MKV with MKVPropedit really fast. I don't know if Ember can do the same, as I'm using tMM...

So we need a "two-way strategy" here, I think. Try to make the automatic online scraping as good as possible, and add an option to use the manually collected data a user has laboriously prepared. Then we add to the wiki that if a user wants it perfect, he unfortunately needs to put some manual work into it.

And we definitely need a "sorttitle" or "sort by" MIA. This is the only way I know to get the collection sorted the way I want it. Every other software (MovPic, MyVideos, XBMC,...) has such a field, as it really is essential. tMM and Ember support this as well.

Valk · November 21, 2013

MJGraf said:
Ok ok, understood, guys
2. I'd like to do a little rework in MediaLibrary (related to MP-405) to ensure that every granular action (such as detaching a client) is executed in one database transaction. But I'm not sure yet whether this interferes with Valks' rework on virtual data access. @Valks are you around? Any plans to continue your work in the next weeks?

Yes, I'm trying to find time to get back onto this ASAP. As it stands, I should be resuming work on Monday. As for whether it will be a problem, I'm not sure yet, but I'll look into it and get back to you.

riggnix · January 2, 2014

I modded the SeriesMetadataExtractor to understand season subfolders.
It's a pretty dirty "patch", so it will no longer work without subfolders.

It only understands this format: "Series\Season 1\S01E01 Title" (I actually use "Staffel" instead of "Season", but it should still work).

What I did: remove all regexes from the SeriesMatcher (MediaPortal\Source\Extensions\MetadataExtractors\SeriesMetadataExtractor\NameMatchers\SeriesMatcher.cs, lines 44-60) and use this regex instead:

Code:

new Regex(@"(?<series>[^\\]*)\\[^\\]*(?<seasonnum>\d+)[^\\]*\\S(?<seasonnum>\d+)E(?<episodenum>\d+)\s*-*\s*(?<episode>.*)\.", RegexOptions.IgnoreCase)

I thought I'd share it here for people who don't want to recompile.

Use ONLY with Alpha 4!
Just extract it to "C:\Program Files (x86)\Team MediaPortal" (or wherever you installed MP2).
You may want to back up the original file first (MP2-Server\Plugins\SeriesMetadataExtractor\SeriesMetadataExtractor.dll).
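For anyone who wants to check the expression against their own folder layout without touching the DLL, here is a small stand-alone sketch. The regex is copied unchanged from the post; the wrapper class `SeasonFolderMatcher` and the sample paths are made up for illustration. Note that "seasonnum" appears twice in the pattern; in .NET both share one group and `Value` returns the last capture, i.e. the number from the "SxxEyy" part of the file name.

```csharp
#nullable enable
using System;
using System.Text.RegularExpressions;

public static class SeasonFolderMatcher
{
    // Regex taken verbatim from the post above.
    private static readonly Regex Pattern = new Regex(
        @"(?<series>[^\\]*)\\[^\\]*(?<seasonnum>\d+)[^\\]*\\S(?<seasonnum>\d+)E(?<episodenum>\d+)\s*-*\s*(?<episode>.*)\.",
        RegexOptions.IgnoreCase);

    // Returns the parsed parts, or null if the path has no season subfolder.
    public static (string Series, int Season, int Episode, string Title)? Parse(string path)
    {
        var match = Pattern.Match(path);
        if (!match.Success)
            return null;

        return (
            match.Groups["series"].Value,
            int.Parse(match.Groups["seasonnum"].Value),   // last capture: from "S01"
            int.Parse(match.Groups["episodenum"].Value),
            match.Groups["episode"].Value.Trim());
    }
}
```

A path such as `Lost\Staffel 1\S01E02 - Tabula Rasa.mkv` parses to series "Lost", season 1, episode 2, title "Tabula Rasa", while `Lost\S01E02 Tabula Rasa.mkv` (no season folder) returns null, which is the limitation mentioned in the post.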

morpheus_xx · January 3, 2014

Thank you for sharing this! But removing all the other expressions would also remove the 90% of detections that currently succeed.

I would prefer if you only add another expression to the end of the list. If any of the existing regexps match an invalid series name before, please post it here (along with folder/filename example) and I will try to change it.

riggnix · January 3, 2014

morpheus_xx said:
Thank you for sharing this! But removing all the other expressions would also remove the 90% of detections that currently succeed.

I would prefer if you only add another expression to the end of the list. If any of the existing regexps match an invalid series name before, please post it here (along with folder/filename example) and I will try to change it.

That's what I did at first, but then it matched all my season folders as individual shows. It may help to move it up in the list, but I actually didn't bother to do that, because this one regex matches all my series.
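The ordering effect can be reproduced with a first-match-wins list. In this sketch `Generic` is a hypothetical stand-in for a "folder\SxxEyy" style expression, not one of the actual SeriesMatcher regexes, and `SeasonFolder` is a shortened form of the expression posted above.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class OrderedSeriesMatcher
{
    // Shortened version of the season-subfolder expression from this thread.
    public static readonly Regex SeasonFolder = new Regex(
        @"(?<series>[^\\]*)\\[^\\]*(?<seasonnum>\d+)[^\\]*\\S(?<seasonnum>\d+)E(?<episodenum>\d+)",
        RegexOptions.IgnoreCase);

    // Hypothetical generic pattern: takes the folder directly above the file as the series.
    public static readonly Regex Generic = new Regex(
        @"(?<series>[^\\]*)\\S(?<seasonnum>\d+)E(?<episodenum>\d+)",
        RegexOptions.IgnoreCase);

    // Tries the patterns in order and returns the series name from the first match.
    public static string? FirstSeriesName(IEnumerable<Regex> patterns, string path)
    {
        foreach (var pattern in patterns)
        {
            var match = pattern.Match(path);
            if (match.Success)
                return match.Groups["series"].Value;
        }
        return null;
    }
}
```

For `Lost\Season 1\S01E02 Title.mkv`, the order Generic, SeasonFolder reports "Season 1" as the series, while SeasonFolder, Generic reports "Lost". A flat `Lost\S01E02 Title.mkv` still resolves to "Lost" in the second order, because the season-folder pattern fails and the generic one takes over.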
