Google Tech Talks
December 14, 2007
ABSTRACT
We present an algorithm, WITCH, that learns to detect spam hosts or pages on the web. Unlike most other approaches, it simultaneously exploits the structure of the Web graph, as well as page contents and features. This work is a collaboration with Olivier Chapelle and Carlos Castillo, both of Yahoo! Inc.
Speaker: Jake Abernethy
So pages are penalized for being both spam and outdated. Still sounds useful.
stcredzero 4 years ago
Good video; we don't get to see enough webspam detection lectures online (I guess for obvious reasons).
(I'm aware of AIRWeb etc., but I'm not sure I'm going to manage to get to Beijing.)
I'm interested in seeing an efficient MapReduce version of a conjugate gradient method. If anyone has references, could they message them to me?
timwintle 4 years ago
Ok... thanks... I was not aware of that paper! Sorry!
fabriziosilvestri 4 years ago
Hi Fabrizio, this talk is based on an upcoming paper:
Jacob Abernethy, Olivier Chapelle, Carlos Castillo: "WITCH: A New Approach to Web Spam Detection". 2007.
--
ChaTo
ChaTo1977 4 years ago
Yes, and perhaps that is due to the fact that Carlos Castillo was also a coauthor on this work! :P
thejakeyboy 4 years ago
It is quite surprising how this presentation presents many ideas that also seem to be contained in
Castillo, C., Donato, D., Gionis, A., Murdock, V., and Silvestri, F. Know your neighbors: web spam detection using the web topology. In Proceedings of SIGIR '07. (Amsterdam, The Netherlands, July 23 - 27, 2007). 423-430.
Well, I'm a little bit biased, I'm one of the co-authors :P
fabriziosilvestri 4 years ago
The algorithm proposed does seem very naive.
Which isn't to say that the presenter isn't MUCH smarter than I am - maybe just not as old and crusty :)
- Dave
davidjbullock 4 years ago
Considering how often expired domains seem to be snapped up and replaced with generic search pages - I'd have thought that good pages linking to bad would be quite common. It's happened to me.
neuronstorm 4 years ago
Why would good pages link to bad pages?
That's easy - the factor that I don't see in your graph is time. A domain that is valid at time A may not be valid at time B.
If site X links to site Y, and Y is valid at time A but not at time B, that doesn't imply that site X is a spam page.
davidjbullock 4 years ago