Datafeeds categories > your database

Icecube

if you import the feeds as they are into your database, you end up with several names for the same category
how do you deal with the thousands of different categories you find in datafeeds from multiple merchants and networks?


I mean technically: do you create some kind of association between the feed categories and your own categories (hence having to check which categories are in each datafeed), or do you create virtual categories using LIKE '%keyword%' queries?
 


There's no simple way to take care of this that I know of. It's going to be a bit labor-intensive if you're dealing with multiple datafeeds.

The way I've done it in the past (when using multiple datafeeds, on the same site, with multiple cats) is to use a LIKE match at parse time to rename each feed category to the "correct" name.

So let's say my site has "Men's Shirts". But datafeeds may have "Shirts (Men's)", "Shirts->Men's" (as a sub-cat), and "Shirts - Mens". If I'm working with the same merchants and I KNOW that these variations exist, I just hard-code my parser to look for the above cats and upon importing, give them all the category of "Men's Shirts".
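
A minimal sketch of that hard-coded rename step, assuming the parser hands you each feed category as a plain string. The names CATEGORY_ALIASES and normalize_category are made up for illustration, and the three variants are just the ones from the example above.

```python
# Hypothetical alias list: the variants come from the "Men's Shirts" example.
# A real list would come from looking through each merchant's feed.
CATEGORY_ALIASES = {
    "shirts (men's)": "Men's Shirts",
    "shirts->men's": "Men's Shirts",
    "shirts - mens": "Men's Shirts",
}


def normalize_category(raw_name):
    """Return the site's category for a feed category, or the raw name if unknown."""
    # Case-fold and collapse runs of whitespace so trivial differences still match.
    key = " ".join(raw_name.lower().split())
    return CATEGORY_ALIASES.get(key, raw_name.strip())
```

Anything not in the list falls through unchanged, which is the "let it create the variations" behaviour for categories you haven't looked at yet.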

If all of that seems like a hassle, just use one datafeed OR just say screw it and let it create the variations.

The more datafeeds, the more variations of category names, the more you're going to have to include logic within your script to sort it all out. And that requires a great deal of human interaction (i.e. you looking at all the datafeeds to see what the different names are).
 
A couple of months ago I tried to download all of the eBay categories and ended up stopping on level 3 at 3000+; that's when I realized how easily my stuff could get clusterfuck'd as I add new sources.

The problem with using LIKE is that it is just going to pick up variations of the same word groupings (Men's Shirts, Shirts->Men)

Then you have to also come up with an algorithm to deal with things like 'clothing' vs 'garments', where you would have to group terms that refer to the same (by your definition) category.

Here's something I have been looking at: instead of assigning similar category names to the same group, use brand names and, in some cases, even drill down into product attributes to assign categories.

You could use LIKE on the individual terms to catch those variations (model numbers in particular) and the same match algorithm as you would for category synonyms, just feed the extra terms into the input. Since the matching would be based on existing data, the more data you accumulate, the more accurate it will get.
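
One rough way to picture "matching based on existing data" is a term vote: every word in a product you have already categorized votes for that category, and a new product goes wherever its words point. This is only a sketch of the idea, not a tested classifier; tokenize, build_term_index and guess_category are hypothetical names, and simple exact-word matching stands in for the LIKE step.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric terms (brands, model numbers, etc.)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_term_index(categorized_products):
    """Map each term to a Counter of the categories it has been seen in."""
    index = defaultdict(Counter)
    for text, category in categorized_products:
        for term in set(tokenize(text)):
            index[term][category] += 1
    return index


def guess_category(index, text):
    """Return (category, share_of_votes), or (None, 0.0) if no term is known."""
    votes = Counter()
    for term in set(tokenize(text)):
        votes.update(index.get(term, {}))
    if not votes:
        return None, 0.0
    category, count = votes.most_common(1)[0]
    return category, count / sum(votes.values())
```

The share of votes is a crude stand-in for the "percent probability" idea: you could accept a guess automatically above some threshold and send the rest to manual review.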

Once you are matching things in a particular area to a certain percent probability, you should be able to eliminate some of those operations to save resources. A job for neural nets maybe?
 
yes I'm talking about feeds from multiple merchants from shareasale, cj, zanox etc

rather than hardcoding the terms in the script, you can create a table that holds the 'found value' and the 'correct value' and load it into an array before parsing the feed, just to make it more easily maintainable; however, this still requires manually checking the categories in every feed
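
a minimal sketch of that lookup table, using SQLite only because it ships with Python; the table name category_map and the columns found_value / correct_value are assumptions, not an existing schema

```python
import sqlite3


def load_category_map(conn):
    """Load the whole mapping table into a dict before parsing a feed."""
    rows = conn.execute("SELECT found_value, correct_value FROM category_map")
    return {found: correct for found, correct in rows}


def demo_connection():
    """Build an in-memory database with two example mappings (illustrative data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE category_map ("
        "found_value TEXT PRIMARY KEY, correct_value TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO category_map VALUES (?, ?)",
        [("Shirts (Men's)", "Men's Shirts"), ("Shirts - Mens", "Men's Shirts")],
    )
    return conn
```

during the import you would then look up each feed category in the dict and only hit the database once per feed, not once per product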

the handiest thing that comes to my mind is having, as the first step of the import process, a page that lists the categories in the feed on the left and your current category structure on the right: this way you can match the feed categories to your own by pairing them up, one category at a time.
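
the left-hand side of such a page could be fed by something like the sketch below, which lists only the feed categories that have no match yet; it assumes each parsed feed row is a dict with a "category" key and that category_map is a dict of matches already made (both are assumptions for illustration)

```python
from collections import Counter


def unmatched_categories(feed_rows, category_map):
    """Return (feed_category, product_count) pairs that still need a manual match."""
    counts = Counter(row["category"] for row in feed_rows)
    pending = [(name, n) for name, n in counts.items() if name not in category_map]
    # Most products first, then alphabetical, so the biggest categories top the page.
    return sorted(pending, key=lambda item: (-item[1], item[0]))
```

sorting by product count means the few categories that cover most of the feed get matched first, and the long tail can wait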

at the moment I'm just playing with wordpress, I opened this discussion to find the best method for a project I have in mind

I still believe the LIKE '%query%' thing is widely used for displaying items ( and it doesn't require you to deal with the category problem )

other ideas and opinions?
 
I'm not sure I got you

a problem with the LIKE solution ( as I intended to do it ) is that it wouldn't be possible to get multiple levels of categories

and you would obviously need a table to store keywords or matching expressions to retrieve items

example: the user clicks on "Men's Shirts", which leads him to mens-shirts.html, /mens-shirts/ or whatever format you're using
at this point you need 2 steps to retrieve the correct content from the db
1- retrieve the conditions for "mens-shirts" (which you could implement as a series of keywords, including negative keywords or whatever else you need to build your query; for example +shirt+men+man+male-women-woman-boy-child-tshirt, a string you could then turn into the query)
No clue how efficient this method would be when you have tens of thousands of products for sale
2- run the actual query that retrieves the products
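
a rough sketch of the two steps, under assumptions the thread doesn't settle: every +term must appear in the product name, every -term must not, and products live in a table called products with a name column (build_query and the schema are hypothetical)

```python
import re


def build_query(rule):
    """Turn a '+term-term' rule into a parameterized SELECT on products.name."""
    clauses, params = [], []
    for sign, term in re.findall(r"([+-])([^+-]+)", rule):
        clauses.append("name LIKE ?" if sign == "+" else "name NOT LIKE ?")
        params.append(f"%{term.strip()}%")
    # An empty rule matches nothing instead of returning the whole table.
    where = " AND ".join(clauses) or "0"
    return f"SELECT name FROM products WHERE {where} ORDER BY name", params
```

note that '%men%' also matches "women", which is why the negative keywords matter. on efficiency: a LIKE pattern with a leading wildcard generally can't use an ordinary index, so each term tends to mean a scan of the table; whether that hurts at tens of thousands of products is something to measure on your own data.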