Using Google news to fill your Autoblog

Status
Not open for further replies.
Emp - I'm better now. :)

I think it's actually been working as described for a while now, but I was doing one thing a little differently. Namely, instead of searching in news.g I was clicking one of the Category links on the left (e.g. 'Sports', 'Entertainment', etc.), then 'RSS'.

The category feed is a little different. In particular, one line in the resulting post HTML is very long and contains at least 2 matching URLs. The regex rips out everything from the first 'news.google.com', through all the good HTML in between, to the last 'amp;cid='. So ... a bunch of good content gets ripped out.

I still have to use PhpMyAdmin to fix up my RegEx's because " always ends up as \" in the DB, but I can live with that.
 


Remember kids, you can run it through feedburner to clean up the url. *hint hint*

F! Why didn't you say so earlier! To hell with rewriting. I'm pretty convinced now it's not working 100% of the time for anyone. That "double url" issue I posted above happens not only in G news categories (as I mentioned) but also in searches *sometimes*, e.g. "news google com/news?hl=en&ned=us&q=celebrity&btnG=Search+News" <-- the first few entries will have links to "all xx related items" or whatever, which hoses the regex.

Wonder if and/or when Feedburner (owned by G) will scan for this kind of feed.

By the way - the latest version of wp-o doesn't seem to work with WPMU. Talking about the bleeding edge release from right on the homepage of devthought.com where it says, "Update 2: Newest version here. WP-o-Matic is slowly reaching perfection".
 
I don't think Feedburner is a solution either. URLs inside the feed are still news.goog.

Here's a picture that kind of makes the problem easy to see (screenshot of Regex Buddy by the way).

[Screenshot: regexvn3.gif, hosted on ImageShack]

The entire table is the post wp-o creates (with no regex specified). The shaded areas show where the regex will replace Goog URLs: the light colors (e.g. light yellow) are the entire regex match, the darker ones (e.g. dark yellow) the group ($1) that would be used to rewrite. Notice the one outlined in red at the bottom; that is where the regex is grabbing too much text because there are multiple news.google. URLs on one line.

My regex is weak so if someone knows how to take care of the issue in red... The screenshot is using this:

Code:
http:..news.google.com.*amp;url=(.*).amp;cid=[^"]*
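For anyone hitting the part outlined in red: a minimal sketch of one possible fix, written in Python only to make it easy to try out. The sample line and the example.com URLs are made up for illustration, and the "lazy" pattern is just a suggestion, not something confirmed to work in wp-o. The idea is to stop each match at the closing quote of the link by using `[^"]` and non-greedy quantifiers instead of `.*`.

```python
import re

# Hypothetical sample: one line of post HTML holding two Google News
# redirect links (the real feed markup may differ).
line = (
    '<a href="http://news.google.com/news/url?sa=T&amp;ct=us&amp;'
    'url=http://example.com/story-one&amp;cid=111">Story one</a> '
    '<a href="http://news.google.com/news/url?sa=T&amp;ct=us&amp;'
    'url=http://example.com/story-two&amp;cid=222">Story two</a>'
)

# The pattern from the post: ".*" is greedy, so the match runs from the
# first news.google.com URL all the way to the last "amp;cid=" on the line.
greedy = re.compile(r'http:..news.google.com.*amp;url=(.*).amp;cid=[^"]*')

# Suggested variant: never cross a double quote, and match as little as
# possible, so every link on the line is handled separately.
lazy = re.compile(
    r'http://news\.google\.com[^"]*?amp;url=([^"]*?)&amp;cid=[^"]*'
)

# Replace each match with the captured target URL (group 1).
greedy_result = greedy.sub(r'\1', line)
lazy_result = lazy.sub(r'\1', line)

print(greedy_result)
print(lazy_result)
```

With the greedy pattern the first story and its link text disappear from the output; with the lazy one both links survive and point at their own targets. The same quantifier syntax exists in PHP's PCRE, but this has not been tried inside wp-o itself, and it does not handle target URLs that are percent-encoded.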
 

This worked perfectly for me! The formatting for some of the articles (from E!) was off, but for the most part, it is building the links correctly!

Now, to add MY tracking script (still to be written) to redirect them... :-)
 
EMP,
I have this for my regex box:
Code:
/http...news.*amp;url=(.*)&cid.*\"{1}/

But if my rss headline has an apostrophe (') it gets replaced with this:
Code:
;#39;

What can I do to keep the apostrophe?

Here's an example of my Google News RSS:

Baseball's Mitchell Report: Steroid Use Doesn't Discriminate, But …
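A possible explanation, offered as a guess: the `;#39;` looks like what is left of the HTML entity `&#39;` (the entity form of an apostrophe). Under that assumption, decoding entities in the title brings the apostrophe back. The snippet below is a minimal sketch in Python to show the idea; the sample title is adapted from the headline above, and the variable names are made up.

```python
import html

# Assumption: the feed delivers the apostrophe as the entity "&#39;".
raw_title = "Baseball&#39;s Mitchell Report: Steroid Use Doesn&#39;t Discriminate"

# html.unescape converts named and numeric entities back to characters.
clean_title = html.unescape(raw_title)

print(clean_title)
# prints: Baseball's Mitchell Report: Steroid Use Doesn't Discriminate
```

In PHP the rough counterpart would be `html_entity_decode($title, ENT_QUOTES)`. Whether wp-o gives you a place to hook that in is not something this thread confirms.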
 
Will this software get you nailed for dup content? Does this have any SEO value whatsoever?
 

You won't get penalized for dup content. Thousands of newspapers report AP news feeds without getting penalized.

I would say the biggest problem is that summary feeds scream spam; full-text feeds are where it's at.
 
Depends. I have run it with only one feed, and it 'seemed' to run forever. Actually, I think the refresh just was not working, as the number of posts had not gone up. I just clicked the wp-o-matic tab again, and started with the next one.


Now, if you are loading up with 10-50 feeds, then it's going to take some time! :-)
 
Never stops for me either, but I also get the impression that the script simply does not do the refresh.

I normally let it run in the background for however long I forget about it and then click on "manage" and.. lo and behold! There are the new articles.

::emp::
 