First, a warning: to those who read my blog for posts on Lean Manufacturing and Change Management, this article is on neither of those topics. Okay, read on. I concede: I know almost nothing about SEO, but other publications I read have been discussing this “Panda” or “Farmer” update that Google rolled out recently. So, I decided to read up and learn what’s going on. Full disclosure: I don’t know jack about Google either. Other than my 2006 job interview with Google, I am simply a user of Google search. I have no insider dealings. I’m just another guy. When the update first rolled out, many in the SEO world dubbed it the “Farmer” update because its aim was to devalue content farms and, by doing so, raise the relative value of high-quality sites by reducing the value of low-quality ones – pretty much sites that are like a neighborhood of dog poopy.
I guess a bunch of websites got affected by the update – big sites too. That’s pretty much what I know. But, what got lost in all the debate was this important question: Who the heck is Panda?
Who is Panda?
Well, we know that Panda is a Google engineer, as Amit Singhal explained in this Wired interview with him and Matt Cutts:
Wired.com: What’s the code name of this update? Danny Sullivan of Search Engine Land has been calling it Farmer because its apparent target is content farms.
Amit Singhal: Well, we named it internally after an engineer, and his name is Panda. So internally we called it big Panda. He was one of the key guys. He basically came up with the breakthrough a few months back that made it possible.
Here, Amit Singhal, a Google Fellow who heads up search ranking, verifies that “Panda” is a person – yeah, a real human being with a pretty cool name – and that the update was based on his breakthrough. So, if Panda is a person whose recent breakthrough led to a massive change in how Google values websites, then what we can learn about him might help the largely confused world make sense of the Panda (or Farmer, or whatever) update. So, what do we know about him? Can some knowledge of his background or research interests give us a hint as to how one can survive the dreaded update? Can our knowledge of Panda’s background help Black Hat SEOs better game Google? Obviously I’m not the best person to answer those questions, but here’s what we know about Panda, taken from a simple search on Google, LinkedIn, Facebook, and Twitter.
Who is Navneet Panda?
Based on his homepage, his resume, his Facebook profile, Google Buzz, and his LinkedIn profile we know a few things:
- Navneet Panda studied at the Indian Institute of Technology Kharagpur in the Department of Mathematics, where he earned an MSc in Mathematics and Computing (an integrated five-year course)
- Navneet Panda then went on to the University of California, Santa Barbara, where he earned a Ph.D. in Computer Science. His advisor was Edward Y. Chang.
It appears that before joining Google in 2007, he did summer internships at Intel and at the IBM T. J. Watson Research Center in New York. Navneet Panda has filed two patents, described below:
- Learning Concept Templates from Web Images to Query Personal Image Databases, Navneet Panda, Yi Y. Wu, Jean-Yves Bouguet, Ara Nefian (Filed with Intel, June 2007)
- Fast Approximate SVM Classification for Large-Scale Stream Filtering, Navneet Panda, Ching-Yung Lin and Lisa D. Amini (Filed with IBM, Sep 2005)
Below is a list of his publications, each followed by a short abstract, which might give us a sense of what was behind the Google Panda update:
- Efficient Top-k Hyperplane Query Processing for Multimedia Information Retrieval: A query can be answered by a binary classifier, which separates the instances that are relevant to the query from the ones that are not. When kernel methods are employed to train such a classifier, the class boundary is represented as a hyperplane in a projected space. Data instances that are farthest from the hyperplane are deemed to be most relevant to the query, and the ones that are nearest to the hyperplane to be most uncertain with respect to the query. In this paper, we address the twin problems of efficient retrieval of the approximate set of instances (a) farthest from and (b) nearest to a query hyperplane. Retrieval of instances for this hyperplane-based query scenario is mapped to the range-query problem, allowing for the reuse of existing index structures. Empirical evaluation on large image datasets confirms the effectiveness of our approach (link). (The first sketch after this list illustrates this ranking idea.)
- Concept Boundary Detection for Speeding up SVMs: Support Vector Machines (SVMs) suffer from an O(n²) training cost, where n denotes the number of training instances. In this paper, we propose an algorithm to select boundary instances as training data to substantially reduce n. Our proposed algorithm is motivated by the result of (Burges, 1999) that removing non-support vectors from the training set does not change SVM training results. Our algorithm eliminates instances that are likely to be non-support vectors. In the concept-independent preprocessing step of our algorithm, we prepare nearest-neighbor lists for training instances. In the concept-specific sampling step, we can then effectively select useful training data for each target concept. Empirical studies show our algorithm to be effective in reducing n, outperforming other competing downsampling algorithms without significantly compromising testing accuracy (link). (The second sketch after this list caricatures this selection step.)
- KDX: An Indexer for Support Vector Machines: Support Vector Machines (SVMs) have been adopted by many data-mining and information-retrieval applications for learning a mining or query concept, and then retrieving the top-k best matches to the concept. However, when the dataset is large, naively scanning the entire dataset to find the top matches is not scalable. In this work, we propose a kernel indexing strategy to substantially prune the search space and thus improve the performance of top-k queries. Our kernel indexer (KDX) takes advantage of the underlying geometric properties and quickly converges on an approximate set of top-k instances of interest. More importantly, once the kernel (e.g., Gaussian kernel) has been selected and the indexer has been constructed, the indexer can work with different kernel-parameter settings without performance compromise. Through theoretical analysis, and empirical studies on a wide variety of datasets, we demonstrate KDX to be very effective (link).
- Exploiting Geometry for Support Vector Machine Indexing: Support Vector Machines (SVMs) have been adopted by many data-mining and information-retrieval applications for learning a mining or query concept, and then retrieving the top-k best matches to the concept. However, when the dataset is large, naively scanning the entire dataset to find the top matches is not scalable. In this work, we propose a kernel indexing strategy to substantially prune the search space and thus improve the performance of top-k queries. Our kernel indexer (KDX) takes advantage of the underlying geometric properties and quickly converges on an approximate set of top-k instances of interest. More importantly, once the kernel (e.g., Gaussian kernel) has been selected and the indexer has been constructed, the indexer can work with different kernel-parameter settings without performance compromise. Through theoretical analysis, and empirical studies on a wide variety of datasets, we demonstrate KDX to be very effective (link).
- Hypersphere Indexer: Indexing high-dimensional data for efficient nearest-neighbor searches poses interesting research challenges. It is well known that when data dimension is high, the search time can exceed the time required for performing a linear scan on the entire dataset. To alleviate this dimensionality curse, indexing schemes such as locality sensitive hashing (LSH) and M-trees were proposed to perform approximate searches. In this paper, we propose a hypersphere indexer, named Hydex, to perform such searches. Hydex partitions the data space using concentric hyperspheres. By exploiting geometric properties, Hydex can perform effective pruning. Our empirical study shows that Hydex enjoys three advantages over competing schemes for achieving the same level of search accuracy. First, Hydex requires fewer seek operations. Second, Hydex can maintain sequential disk accesses most of the time. And third, it requires fewer distance computations (link). (The third sketch after this list illustrates this shell-pruning idea.)
- Active Learning in Very Large Databases: Query-by-example and query-by-keyword both suffer from the problem of aliasing, meaning that example-images and keywords potentially have variable interpretations or multiple semantics. For discerning which semantic is appropriate for a given query, we have established that combining active learning with kernel methods is a very effective approach. In this work, we first examine active-learning strategies, and then focus on addressing the challenges of two scalability issues: scalability in concept complexity and in dataset size. We present remedies, explain limitations, and discuss future directions that research might take (link).
- Formulating Context-dependent Similarity: Tasks of information retrieval depend on a good distance function for measuring similarity between data instances. The most effective distance function must be formulated in a context-dependent (also application-, data-, and user-dependent) way. In this paper, we present a novel method, which learns a distance function by capturing the nonlinear relationships among contextual information provided by the application, data, or user. We show that through a process called the kernel trick, such nonlinear relationships can be learned efficiently in a projected space. In addition to using the kernel trick, we propose two algorithms to further enhance efficiency and effectiveness of function learning. For efficiency, we propose an SMO-like solver to achieve O(N²) learning performance. For effectiveness, we propose using unsupervised learning in an innovative way to address the challenge of lack of labeled data (contextual information). Theoretically, we substantiate that our method is both sound and optimal. Empirically, we demonstrate that our method is effective and useful (link).
- Formulating Distance Functions via the Kernel Trick: Tasks of data mining and information retrieval depend on a good distance function for measuring similarity between data instances. The most effective distance function must be formulated in a context-dependent (also application-, data-, and user-dependent) way. In this paper, we propose to learn a distance function by capturing the nonlinear relationships among contextual information provided by the application, data, or user. We show that through a process called the kernel trick, such nonlinear relationships can be learned efficiently in a projected space. Theoretically, we substantiate that our method is both sound and optimal. Empirically, using several datasets and applications, we demonstrate that our method is effective and useful (link). (The final sketch after this list shows the kernel-trick distance these two papers build on.)
- Speeding up Approximate SVM Classification for Data Streams
- Improving Accuracy of SVMs by Allowing Support Vector Control
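I’m no machine-learning researcher, so take what follows with a grain of salt, but a few toy sketches might make the abstracts above a little more concrete. First, the hyperplane-ranking idea from the top-k retrieval paper, which is also the engine of the active-learning work: train an SVM, then rank unseen instances by their distance to the learned boundary. Instances far on the positive side are the best matches; instances hugging the boundary are the ones an active learner would ask a human to label. This is my own illustration using scikit-learn, not Panda’s code, and every name and number in it is made up.

```python
# A minimal sketch of hyperplane-based ranking, assuming scikit-learn.
# Not the published algorithm: it naively scans the whole pool, which is
# exactly the cost the indexing papers above try to avoid.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy stand-in for image features: two fuzzy clusters in 20 dimensions.
X_train = np.vstack([rng.normal(0, 1, (50, 20)), rng.normal(2, 1, (50, 20))])
y_train = np.array([0] * 50 + [1] * 50)
X_pool = np.vstack([rng.normal(0, 1, (500, 20)), rng.normal(2, 1, (500, 20))])

clf = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)

# decision_function returns each instance's signed (unnormalized) distance
# to the hyperplane in the kernel-induced feature space.
scores = clf.decision_function(X_pool)

k = 10
# Farthest on the positive side: deemed most relevant to the query concept.
most_relevant = np.argsort(scores)[-k:][::-1]
# Smallest absolute distance: most uncertain, i.e. what an active learner
# would send to the user for labeling next.
most_uncertain = np.argsort(np.abs(scores))[:k]

print("most relevant:", most_relevant)
print("most uncertain:", most_uncertain)
```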
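Second, the boundary-detection trick for cutting SVM training cost: since removing non-support vectors does not change the trained model, throw away instances that are unlikely to be support vectors before training. A simple stand-in for the paper’s nearest-neighbor preprocessing is to keep only points whose neighborhoods contain a mix of labels; the neighborhood size and data here are my own inventions, not the published algorithm.

```python
# A rough sketch of boundary-instance selection, assuming scikit-learn.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.svm import SVC

def select_boundary_instances(X, y, n_neighbors=10):
    """Keep points whose nearest neighbors include an opposite label."""
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X)
    _, idx = nn.kneighbors(X)
    keep = []
    for i, nbrs in enumerate(idx):
        # nbrs[0] is the point itself; check the rest for label mixing.
        if np.any(y[nbrs[1:]] != y[i]):
            keep.append(i)
    return np.array(keep)

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (1000, 5)), rng.normal(1.5, 1, (1000, 5))])
y = np.array([0] * 1000 + [1] * 1000)

boundary = select_boundary_instances(X, y)
print(f"kept {len(boundary)} of {len(X)} instances")

# Training on the reduced set is far cheaper, since SVM training cost
# grows roughly as O(n^2) in the number of instances n.
clf = SVC().fit(X[boundary], y[boundary])
```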
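Third, the geometric pruning behind the two indexer papers (KDX and the hypersphere indexer, Hydex). I won’t pretend to reproduce either one, but the concentric-hypersphere idea can be caricatured in a few lines: bucket points into shells around a pivot, then use the triangle inequality to skip any shell that provably cannot beat the best match found so far. The shell count, pivot choice, and data are all assumptions of mine.

```python
# A toy caricature of shell-based pruning for nearest-neighbor search.
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(size=(10_000, 32))
pivot = data.mean(axis=0)

# Precompute each point's distance to the pivot and bucket into shells.
radii = np.linalg.norm(data - pivot, axis=1)
n_shells = 32
edges = np.quantile(radii, np.linspace(0.0, 1.0, n_shells + 1))
shell_of = np.clip(np.searchsorted(edges, radii) - 1, 0, n_shells - 1)

def nearest_neighbor(query):
    q_r = np.linalg.norm(query - pivot)
    # Triangle inequality: a point in shell s (edges[s] <= r <= edges[s+1])
    # is at least this far from the query, wherever it sits in the shell.
    lower = np.maximum(np.maximum(edges[:-1] - q_r, q_r - edges[1:]), 0.0)
    best_d, best_i = np.inf, -1
    for s in np.argsort(lower):   # most promising shells first
        if lower[s] >= best_d:
            break                 # every remaining shell is provably farther
        members = np.where(shell_of == s)[0]
        if members.size == 0:
            continue
        d = np.linalg.norm(data[members] - query, axis=1)
        j = int(d.argmin())
        if d[j] < best_d:
            best_d, best_i = d[j], int(members[j])
    return best_i, best_d

print(nearest_neighbor(rng.normal(size=32)))
```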
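Finally, the kernel-trick distance that the last two papers build on: any kernel k implicitly maps instances into a projected space, and distances in that space can be computed from kernel values alone, without ever constructing the mapping. The context dependence comes from learning the kernel (or its parameters) from the application, data, or user rather than fixing it up front. A minimal sketch, assuming a Gaussian (RBF) kernel with an illustrative parameter:

```python
# Distance in the kernel-induced feature space, from kernel values only:
#   d(x, y)^2 = k(x, x) - 2 k(x, y) + k(y, y)
import numpy as np

def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel; gamma=0.5 is an illustrative choice."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def kernel_distance(x, y, k=rbf_kernel):
    # max(..., 0) guards against tiny negative values from rounding.
    return np.sqrt(max(k(x, x) - 2.0 * k(x, y) + k(y, y), 0.0))

x, y = np.array([1.0, 2.0]), np.array([2.0, 0.0])
print(kernel_distance(x, y))
```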
Here is what he lists as Research Projects on his resume:
Machine Learning:
- Development of indexing structures for support vector machines to enable relevant instance search in high-dimensional datasets
- Speeding up SVM training in multi-category large dataset scenarios
- Speeding up approximate SVM classification of data-streams
- Improving concept identification and classification for personal image retrieval
- Using idealizing kernels to develop distance metrics incorporating user preferences for high-dimensional data
- Design of a real-time web page classifier for text and image data
Grid Computing and Distributed Systems:
- Development of scheduling strategies for numerous large jobs in a grid environment under heavy load conditions using the Network Weather Service and Globus
- Development of scheduling strategies for executing compute-intensive jobs in a dynamically evolving simulated market of servers providing priced slots of CPU time for process execution
- Development of a distributed dictionary enforcing causal ordering
- Development of dynamic peer to peer system with query lookup modeling the CAN architecture
Computer Architecture:
- Design of a snoopy cache for a multiprocessor system
- Design of a superscalar instruction dispatch unit
Now What?
I don’t know. I’ll leave it to the SEO people to decide. I just write and don’t pay much attention to SEO, because I don’t know much about it. But at least now we can put a face to a generically named Google algorithm update called “Panda”. Now, when someone references a Google algorithm update as “Panda”, we can all say, under our breath, “Yeah, that Navneet Panda guy”.
Ralph du Plessis
This is a great post and certainly one that will please the Panda update team because it is a well thought out and researched, unique and engaging piece of content 😉
Assuming your “I don’t know shit about SEO…” comments were genuine and not in jest, I thought I’d add an FYI: Google always (or at least usually) names its big updates after someone on the team…
The last one was the May Day update (timely), but before that we had updates like Vince, Austin, Cassandra, Esmeralda, Dominic, and Jagger… in no particular order
This is quite common practice amongst the tech-minded. For example, a place I used to work named its servers after famous movie actresses… no idea why, but it was bloody confusing when you wanted to ask about the Hotels database and had to refer to it as “Audrey”…
Anyway… thanks for the post.
Jon
Those papers are all a bit heavy for me, but then his past research may not be what the Panda updates are about anyway. He has been at Google for four years, so maybe his earlier work led him there, but maybe it has developed in a whole new direction since. What is interesting is that the research is all about relationships between data, and I have had some success updating content with this in mind – but not a lot!