Dear PHP,

rage9, I don't see why we need to write in the same language. I don't know ruby, and php = lol. You use what you want, I use what I want, the best code wins. I'm thinking something along the lines of yellowpages, which has lots of data to scrape. Maybe reads a text file, that has a category per line to scrape, goes through all pages of that category, dump results to a CSV. I'd say important things are the ability to recover from a page not loading, proxy support. I can scale to a couple hundred threads, so good luck beating me on speed with php :)

Latency will most likely be the speed limitation if you're single threaded.
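For anyone following along, here is a rough sketch of the spec described above: one category per line in a text file, walk every page of each category, retry a page that fails to load, pass a proxy through, dump everything to a CSV. It is only an illustration, not anyone's actual entry. The `fetch` callable, the URL pattern, and the `(name, phone)` row shape are all made up, and the worker count is arbitrary.

```python
import csv
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_with_retry(fetch, url, proxy=None, retries=3, delay=0.0):
    """Call fetch(url, proxy) until it succeeds or the retries run out."""
    for attempt in range(1, retries + 1):
        try:
            return fetch(url, proxy)
        except Exception:
            if attempt == retries:
                raise
            time.sleep(delay)  # wait a bit before trying the same page again


def scrape_category(fetch, category, proxy=None, max_pages=1000):
    """Walk pages 1, 2, 3... of one category until a page comes back empty."""
    rows = []
    for page in range(1, max_pages + 1):
        # Hypothetical URL scheme, only here so the sketch has something to request.
        url = f"https://example.invalid/{category}?page={page}"
        results = fetch_with_retry(fetch, url, proxy)
        if not results:
            break
        rows.extend([category, *r] for r in results)
    return rows


def run(categories_path, csv_path, fetch, proxy=None, workers=8):
    """Read one category per line, scrape them in parallel, dump rows to a CSV."""
    with open(categories_path, encoding="utf-8") as f:
        categories = [line.strip() for line in f if line.strip()]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_category = list(
            pool.map(lambda c: scrape_category(fetch, c, proxy), categories)
        )
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for rows in per_category:
            writer.writerows(rows)
    return sum(len(rows) for rows in per_category)
```

`fetch` is injected on purpose: swap in whatever HTTP client and parser you like, and the retry, paging, and CSV parts stay the same.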

"Latency will most likely be the speed limitation if you're single threaded."

My PHP forks can chop up your multi threads! Scared yet? Should be! :rasta:
 


Libraries + templating system = framework.

As a developer, you end up rolling your own libraries for everything. You learn your best practices over time, you know everything about them. It may take you a little longer to get up to speed, but once you do you're the master of your domain and you can implement something just as fast, if not faster, than someone with a huge framework.

There are justified complaints about every library and framework out there, and you will always run into stylistic and strategic differences unless you are starting from scratch as a developer and adopt their standards.

Shoehorning your custom code into a framework may end up taking more spit and glue than doing it from scratch yourself.

The way I see it is, the person who writes their own libraries is in complete control, but it takes him longer to get up to speed. Perfect for a one-man show.

The person who uses a framework gets up to speed fast, but fights the restrictions of the framework on the little things... which may end up costing more time in the long run than developing their own framework. This is great for a large shop that needs many people to work together without a lot of getting up to speed, and have things work based on standards.

So yeah, it depends where you're starting from, how much you want to learn, how much control you want, and who you're working with and for.
 
rage9, I don't see why we need to write in the same language. I don't know ruby, and php = lol. You use what you want, I use what I want, the best code wins. I'm thinking something along the lines of yellowpages, which has lots of data to scrape. Maybe reads a text file, that has a category per line to scrape, goes through all pages of that category, dump results to a CSV. I'd say important things are the ability to recover from a page not loading, proxy support. I can scale to a couple hundred threads, so good luck beating me on speed with php :)

Latency will most likely be the speed limitation if you're single threaded.
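A quick way to see the latency point, as a toy sketch only: every "request" below is just a sleep standing in for network round-trip time, and the 50 ms figure is an assumption picked for illustration. With one worker the waits add up back to back; with many workers they overlap.

```python
import time
from concurrent.futures import ThreadPoolExecutor

LATENCY = 0.05  # pretend every request takes 50 ms round trip (assumed figure)


def fake_request(_):
    time.sleep(LATENCY)  # stands in for waiting on the remote server


def timed(workers, n_requests=20):
    """Return wall-clock seconds to finish n_requests with the given worker count."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(fake_request, range(n_requests)))
    return time.perf_counter() - start


if __name__ == "__main__":
    single = timed(workers=1)
    many = timed(workers=20)
    print(f"1 worker: {single:.2f}s, 20 workers: {many:.2f}s")
```

Whether the workers are threads or forked processes does not change the picture much here, since the time is spent waiting rather than computing.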

Haha, no doubt PHP couldn't stand next to python in terms of speed which is why I'd have to use Ruby. However by not using the same languages you wouldn't have a definable baseline to compare against IMHO.
 
Speed in my opinion is everything. You either write the best scraper you can in your chosen language or you don't.

It's that same attitude that causes people to use something like MVC, which ends up being way slower than traditional procedural code. It's the way it's always been. For some reason there is a large percentage of people who just don't care about speed. Speed is important.

People try and argue that what I write may only be fractions of a second faster than what they write, but when it gets down to your server getting pounded every fraction counts IMHO.

I'm not saying that a scraper constitutes that, but if you're scraping enough data I'd think you'd want your scraper running as fast as you can (within reason). If it's limited data, who cares, right?

I have an e-penis and want to debunk the myth that OO is so great, because oftentimes so many jerk offs use it over procedural for no other reason than they think it's better. I'm here to break that stereotype. If I only win by fractions of a second I'll still be successful.

i am confused, because once upon a time four weeks ago, i said --
^^ on that note, is requests per second really your ultimate goal?
unless you're one of the largest botters here, and your hardware costs are substantial, i'd bet the answer is "no". shit like "evading botter detection" and "more pretty graphs in the UI" are probably higher priority, especially if your app is built to scale horizontally.
write it first in a scripting language -- something high level that prototypes quickly, whatever you're most comfortable in -- and build it to meet your specs. once you've got a working bot that does what you want, look at your requests per second, and decide if you'd rather spend the time writing optimization (or rewriting in low level code) to speed it up, or if it's more cost-efficient just launch more servers.

in my opinion, building a bot that works and does what you want is much more important than optimizing for requests/sec from the start. most people will never finish their bot in ruby, let alone in c++. as knuth said, "premature optimization is the root of all evil."
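To make the "measure first, then decide" step concrete, here is a small sketch. The function names and the numbers in the usage lines are hypothetical, and `task` stands in for whatever one request cycle of your working bot looks like.

```python
import math
import time


def requests_per_second(task, n):
    """Run task() n times back to back and return the measured rate."""
    start = time.perf_counter()
    for _ in range(n):
        task()
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero reading
    return n / elapsed


def servers_needed(target_rps, measured_rps):
    """How many identical boxes it takes to hit target_rps by scaling out."""
    return math.ceil(target_rps / measured_rps)


if __name__ == "__main__":
    rate = requests_per_second(lambda: time.sleep(0.01), 50)
    print(f"measured about {rate:.0f} req/s")
    print("servers for 1000 req/s:", servers_needed(1000, rate))
```

With a measured rate in hand you can compare the cost of those extra servers against the time a rewrite or optimization pass would take.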

and you said
Exactly. You also have to take into consideration how fast and much bandwidth is available. You can only process data as fast as it comes in, so you're also hostage to the target server(s).

As "slow" as SSSLs like PHP or Ruby are, they still have more than enough balls to scrape at a very fast rate. Hell, if you really needed more speed you could compile PHP down into opcode like Facebook does to squeeze another ~50% out of it.

so i was like "cool, this guy gets it"
but then you say
Haha, no doubt PHP couldn't stand next to python in terms of speed which is why I'd have to use Ruby. However by not using the same languages you wouldn't have a definable baseline to compare against IMHO.

and i am like "lolwat?"

if you don't know python, and php is "too slow", why would you "have to use ruby" which you don't know so well? what language can you write this superspeed speeddemon speedfiend of a speed bot in? i'd imagine anyone that fully relies on their own set of libraries by now at least has half the code for this project lying around.

fuck the baseline. use your own language. what happened to "speed is everything"? this mattseh pussy is trying to preach the virtues of OO and higher level interpreted languages, wtf, destroy him with your procedural C!
 