Sheesh... I'm getting a headache from all of this.
Well, it appears that the problem *is* Schedules Direct's Tribune Media data, but the data issues seem to have decreased dramatically since I started trying to figure this all out.
When I first started looking into this issue a couple of weeks ago, most first-run programs had previously-shown tags... in my initial tests (on a short list of stations... I'm OTA exclusively) I came up with nearly 400 instances where a first-run show had a previously-shown tag.
In the process of trying to pin down the source of the problem today I found a Perl script that easily downloads raw data from Schedules Direct. Picking through that file I realized that the first several first-run shows I spot-checked had the elusive "new" attribute correctly set to true.
I then cross-checked the raw data against a newly downloaded XMLTV file, and discovered that each of these shows had produced properly-formatted "programme" blocks... i.e. no previously-shown tag! By this time my head was spinning... so I went ahead and ran the new XMLTV file through my markNew.py script. Well, it turns out there are still problem shows, but only a fraction of what there were. In the end I found 14 first-run shows with previously-shown tags. I cross-checked these with the raw data file, and they were indeed missing the crucial "new='true'" attribute.
The legitimately mismarked shows boiled down to 3 series ("American Idol", "Dancing with the Stars" and "Are You Smarter Than a 5th Grader?"), a handful of sporting events (golf, poker, basketball & racing) and the Academy of Country Music Awards.
By the way, I encountered an unforeseen issue... my script was modifying about 50 other programme blocks -- turns out they were for shows that run one new episode a day with one or more repeats later. An example: "The Newshour with Jim Lehrer" airs one new episode each weekday at 4pm, then runs repeats every 2-4 hours until the next new episode. On that show SD's data is a mess (several of the repeats are marked "new"), but the point is that my script marked any repeats before midnight as "first run," because the date embedded in their previously-shown tag matched the "programme start" date.
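To make the date-matching problem concrete, here's a rough sketch of that kind of check. This is not the actual markNew.py; the sample listings, channel id, air times and the function name same_day_as_original are all made up for illustration. It assumes XMLTV-style programme elements whose previously-shown tag carries a start attribute:

```python
import xml.etree.ElementTree as ET

# Invented sample data: one 4pm airing, a 7pm repeat the same day,
# and a 1am repeat the next day. Channel id and offsets are hypothetical.
SAMPLE = """<tv>
  <programme start="20080415160000 -0700" channel="example.pbs">
    <title>The Newshour with Jim Lehrer</title>
    <previously-shown start="20080415" />
  </programme>
  <programme start="20080415190000 -0700" channel="example.pbs">
    <title>The Newshour with Jim Lehrer</title>
    <previously-shown start="20080415" />
  </programme>
  <programme start="20080416010000 -0700" channel="example.pbs">
    <title>The Newshour with Jim Lehrer</title>
    <previously-shown start="20080415" />
  </programme>
</tv>"""


def same_day_as_original(programme):
    """True if the previously-shown date equals the programme's start date."""
    shown = programme.find("previously-shown")
    if shown is None:
        return False
    # Compare only the YYYYMMDD prefix of each timestamp.
    original = shown.get("start", "")[:8]
    return bool(original) and original == programme.get("start", "")[:8]


root = ET.fromstring(SAMPLE)
flags = [same_day_as_original(p) for p in root.iter("programme")]
print(flags)
```

With this sample the 7pm repeat is flagged exactly like the 4pm airing, and only the repeat after midnight escapes, which is the behaviour described above.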
Rather than implement some twisted logic to fix the data for these repeated shows (which I seldom watch), the latest version of my script (attached) just ignores program entries belonging to categories that tend to have shows like these. You can customize the excluded categories as you see fit.
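For anyone who wants the general idea without opening the attachment, a category filter along these lines would do it. Treat this as a sketch only: the category names in EXCLUDED_CATEGORIES, the sample listings and the is_excluded helper are my own placeholders, not necessarily what the attached script uses, and the category text in your data may be spelled differently:

```python
import xml.etree.ElementTree as ET

# Hypothetical exclusion list; adjust to whatever your listings actually use.
EXCLUDED_CATEGORIES = {"news", "talk"}

# Invented sample data for illustration.
SAMPLE = """<tv>
  <programme start="20080415160000 -0700" channel="example.pbs">
    <title>The Newshour with Jim Lehrer</title>
    <category lang="en">News</category>
  </programme>
  <programme start="20080415200000 -0700" channel="example.fox">
    <title>American Idol</title>
    <category lang="en">Reality</category>
  </programme>
</tv>"""


def is_excluded(programme, excluded=EXCLUDED_CATEGORIES):
    """True if any of the programme's categories is in the excluded set."""
    categories = {
        (c.text or "").strip().lower() for c in programme.findall("category")
    }
    return not categories.isdisjoint(excluded)


root = ET.fromstring(SAMPLE)
# Only programmes that are NOT excluded would be candidates for re-marking.
candidates = [
    p.findtext("title") for p in root.iter("programme") if not is_excluded(p)
]
print(candidates)
```

Matching is case-insensitive here so that "News" and "news" are treated the same; a programme with no category at all is never excluded.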