Dirty Butter Posted March 13, 2015

I have created directories I do not want to be in the sitemap, although I do want them to show on the site. I have already Disallowed them in robots.txt. seo.class already has a section with other directories excluded:

$queryArray = array(
    'category' => $GLOBALS['db']->select('CubeCart_category', array('cat_id'), array('status' => '1')),
    'product' => $GLOBALS['db']->select('CubeCart_inventory', array('product_id', 'updated'), array('status' => '1')),
    // ORIGINAL (commented out; the hack below replaces this 'document' key)
    // 'document' => $GLOBALS['db']->select('CubeCart_documents', array('doc_id'), array('doc_parent_id' => '0', 'doc_status' => 1)),
    // BSMITHER HACK TO KEEP HIDDEN DOCUMENTS OUT OF SITEMAP
    'document' => $GLOBALS['db']->select('CubeCart_documents', array('doc_id'), array('doc_parent_id' => '0', 'doc_status' => 1, 'navigation_link' => 1)),
    // END BSMITHER HACK
    // Start SemperFi Addition - removed to keep askabout page from showing up in the sitemap
    // 'askabout' => $GLOBALS['db']->select('CubeCart_inventory', array('product_id'), array('status' => '1')),
    // End SemperFi Addition
);

I want to keep ALL subcategories starting with /animal-brands out of the sitemap.
havenswift-hosting Posted March 13, 2015

Hi. I don't know whether you are using the built-in sitemap or a third-party tool like the one we often install for clients, but the sitemap generation process should honour whatever is in the robots.txt file. So either there is a problem with the syntax in your robots.txt file, or the sitemap generation tool isn't working correctly.

Ian
Dirty Butter Posted March 13, 2015 (Author)

I use the built-in one, but Bsmither helped me so it creates sitemap.xml directly, rather than just the gzip version. Mine has never "honored" the robots file, but our store is in a subdirectory of the domain, so the robots file that "counts" is not on plushcatalog.
Dirty Butter Posted April 2, 2015 (Author)

Still having to manually delete the Animals alphabetical sub-directories that I do not want in sitemap.xml. Based on Ian's comment: DOES v6 use the robots.txt file to determine which files to include in the sitemap? If so, this is in the robots.txt file for the domain, and for the shop subdomain:

Disallow: /*animal-brands

If v6 does NOT use the robots.txt file, how can I tell v6 NOT to put the animal-brands subdirectories in the sitemap? I had assumed it would mean adding some code to seo.class. ALL our products have custom SEO metadata, and the store settings do NOT use directories or subdirectories as part of the product URL or SEO URL.
bsmither Posted April 2, 2015

Is it safe to use animal-brands (with anything or nothing following) as the key phrase? If so, try this edit in /classes/seo.class.php.

Find at the bottom of the file: private function _sitemap_link

Was:

if (!isset($input['url']) && !empty($type)) {
    $input['url'] = $store_url.'/'.$this->generatePath($input['id'], $type, '', false, true);
}

Now:

if (!isset($input['url']) && !empty($type)) {
    $generated_path = $this->generatePath($input['id'], $type, '', false, true);
    if( preg_match('/^animal-brands/', $generated_path) && $type == 'category' ) return false;
    $input['url'] = $store_url.'/'.$generated_path;
}

This adds a test for a category SEO path that starts with animal-brands and, if it matches, exits this function before it can add that node to the XML. Later, we can explore how to incorporate a robots file.
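For readers skimming the PHP above, the whole hack is an anchored prefix match plus a type check. This Python fragment is only a language-neutral illustration of that same logic (the function name and sample paths are made up for the demo, not CubeCart code):

```python
import re

def keep_in_sitemap(path, link_type):
    """Mirror of the hack: drop any category whose SEO path starts with 'animal-brands'."""
    if link_type == "category" and re.match(r"^animal-brands", path):
        return False  # equivalent to the early 'return false;' in _sitemap_link()
    return True

print(keep_in_sitemap("animal-brands/alphabet-a", "category"))      # False: excluded
print(keep_in_sitemap("other-brands-of-plush-animals", "category"))  # True: different prefix, kept
print(keep_in_sitemap("animal-brands-poster", "product"))            # True: only categories are filtered
```

Because the pattern is anchored with ^, only paths that begin with the phrase are dropped; a path merely containing it elsewhere is kept.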
Dirty Butter Posted April 2, 2015 (Author)

That worked perfectly! Thank you!!
bsmither Posted April 2, 2015

Can you tell me how you understand this directive to work:

Disallow: /*animal-brands

Specifically, how would it disallow just anything having animal-brands anywhere in the URL? The way I see it, you would need to be very conscious of not putting animal-brands in any SEO path.
Dirty Butter Posted April 2, 2015 (Author)

I DO have one sub-category of Other Animal Brands, but I made sure I did not use that phrase in the SEO URL; I used "other-brands-of-plush-animals" instead.
bsmither Posted April 2, 2015

I think this isn't so much of a Skin & Template Support issue. Just sayin'.

Let's make the sitemap generator look for and respect the directives contained within the robots.txt file.

First, have a robots.txt file.

Second, in /includes/global.inc.php, add this statement:

$glob['robots_path'] = '/root/server/path/to/your/site/robots.txt'; // consult your hosting provider if you can't determine what this should be

Third, in /classes/seo.class.php:

Find:

/**
 * Class instance
 *
 * @var instance
 */
protected static $_instance;

Add ABOVE:

/**
 * Robots Disallowed: CubeCart types
 *
 * @var array
 */
private $_robot_disallow_types = array('category','product','document'); // edit to suit

/**
 * Robots Disallowed: reg_expressions
 *
 * @var array
 */
private $_robot_disallows = array();

Find:

public function sitemap() {

Add AFTER:

if( $GLOBALS['config']->has('config', 'robots_path') && file_exists($GLOBALS['config']->get('config', 'robots_path')) ) {
    $robot_directive_array = file($GLOBALS['config']->get('config', 'robots_path'), FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    foreach( $robot_directive_array as $line ) {
        // \s* allows the usual space between the colon and the slash
        if( preg_match('#Disallow:\s*/(.+)#i', $line, $matches) ) $this->_robot_disallows[] = str_replace('*', '.*', $matches[1]);
    }
}

Find:

private function _sitemap_link($input, $updated = false, $type = false) {

Replace:

if (!isset($input['url']) && !empty($type)) {
    $input['url'] = $store_url.'/'.$this->generatePath($input['id'], $type, '', false, true);
}

With:

if (!isset($input['url']) && !empty($type)) {
    $generated_path = $this->generatePath($input['id'], $type, '', false, true);
    if( !empty($this->_robot_disallows) ) {
        foreach( $this->_robot_disallows as $disallowed ) {
            if( preg_match('/^'.$disallowed.'/i', $generated_path, $matches) && in_array($type, $this->_robot_disallow_types) ) {
                return false;
            }
        }
    }
    $input['url'] = $store_url.'/'.$generated_path;
}

This will use any Disallow: directive found in robots.txt.
These directives can get somewhat complicated: in a real robots.txt, directives that follow a User-agent: line apply only to that user agent. This code does not make that distinction, so Disallow: directives following a second/third/etc. User-agent: line will also get respected. Also, a bare Disallow: / directive (one that says not to index anything) is ignored; why make a sitemap for such a site? Also, Allow: directives are not respected.
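The parsing step above boils down to: read each line, capture whatever follows "Disallow: /", and widen each * into the regex wildcard .*; generated paths are then prefix-matched against those patterns. A standalone Python sketch of that transformation (illustrative only; the thread's real code is the PHP above, and the sample robots.txt content is invented for the demo):

```python
import re

def parse_disallows(robots_text):
    """Collect Disallow: patterns as regex fragments, '*' widened to '.*'.
    A bare 'Disallow: /' captures nothing and is skipped, as in the PHP version."""
    patterns = []
    for line in robots_text.splitlines():
        m = re.search(r"Disallow:\s*/(.+)", line, re.IGNORECASE)
        if m:
            patterns.append(m.group(1).strip().replace("*", ".*"))
    return patterns

def is_disallowed(path, patterns):
    """Anchored prefix match, mirroring preg_match('/^'.$disallowed.'/i', ...)."""
    return any(re.match("^" + p, path, re.IGNORECASE) for p in patterns)

rules = parse_disallows("User-agent: *\nDisallow: /*animal-brands\nDisallow: /\n")
print(rules)                                        # ['.*animal-brands']
print(is_disallowed("animal-brands/bears", rules))  # True: excluded from the sitemap
print(is_disallowed("plush-dogs", rules))           # False: kept
```

Note that because the leading * becomes .*, the pattern matches animal-brands anywhere in the path, not just at the start, which answers the earlier question about how /*animal-brands behaves.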
Dirty Butter Posted April 3, 2015 (Author)

I'm using the More ajax bar, rather than pagination numbers, on the storefront. It doesn't look like that creates any page=1, page=2 type pages the way v5 did. IF Google still "sees" these page=number patterns, or for someone using pagination on the front end, then I see one possible problem with the way robots.txt gets written. In order to keep page=1, page=2, etc. out of the sitemap, but retain page=all for each directory, this has been in our robots.txt:

Disallow: /*page=
Allow: /*page=all

Your code above would not find the page=all page for each category/subcategory. Google had complained in the past when I didn't allow page=all, and then complained if the metadata was the same for a category/subcategory page and its page=all page (you helped with that in another thread some time ago, but I have not tried to port that fix to v6 yet).
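A side note on why that Allow: line would get lost: after the * to .* conversion, the Disallow pattern alone already matches the page=all form. A tiny illustrative check (Python purely for demonstration; the candidate strings are hypothetical, and "toys" is an invented category name):

```python
import re

# '/*page=' after the thread's '*' -> '.*' conversion becomes '.*page='
disallow = ".*page="
for candidate in ("toys?page=2", "toys?page=all"):
    print(candidate, bool(re.match("^" + disallow, candidate, re.IGNORECASE)))
# Both report True: with only Disallow: honoured, page=all is excluded too.
```

So as long as the experiment honours only Disallow: lines, there is no way to carve out the page=all exception.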
bsmither Posted April 3, 2015

So far, I am only interested in the Disallow: directives. Since page= is not an SEO path (it's part of the querystring, and maybe canonical), it will not be caught. I do not understand how page=1, page=2, etc. could even get in the sitemap. A robot may find those links on its own and index them, but not from the sitemap -- unless I'm missing something.

I haven't experimented with the ajax More bar, but if it behaves as I think it does, it is javascript-powered. Thus, a spider, depending on its sophistication, may or may not execute that javascript as it scans the HTML. That being assumed, I wonder if a spider will eventually see all the content (More..., More..., More..., for however many times until it gives up) that a page=all page would otherwise provide.
Dirty Butter Posted April 3, 2015 (Author)

I would love to be able to Disallow page=all as well, as it's a nuisance issue that for some reason or other Google considers important. I'll try your code fix above, watch what Google says over the next month or so, and then return to this thread with how Google reacts to it.
Dirty Butter Posted April 3, 2015 (Author)

Well, it didn't work for me on either store. I thought it might be because our plush store is not at the root, but the estates store IS at the domain root. Neither sitemap omitted something I had intentionally disallowed from the respective robots.txt file as a test. Since dirtybutterestates.com is a simpler setup, I used this in global.inc.php:

$glob['robots_path'] = '/dirtybutterestates.com/robots.txt';

That usually means I didn't follow your directions correctly, so here are my edits:

private $_static_sections = array('saleitems', 'certificates', 'trackback', 'contact', 'search', 'login', 'register');

// BSMITHER ROBOTS SITEMAP CODE
/**
 * Robots Disallowed: CubeCart types
 *
 * @var array
 */
private $_robot_disallow_types = array('category','product','document'); // edit to suit

/**
 * Robots Disallowed: reg_expressions
 *
 * @var array
 */
private $_robot_disallows = array();
// END BSMITHER ROBOTS SITEMAP CODE

then

public function sitemap() {
    // BSMITHER ROBOTS SITEMAP CODE
    if( $GLOBALS['config']->has('config', 'robots_path') && file_exists($GLOBALS['config']->get('config', 'robots_path')) ) {
        $robot_directive_array = file($GLOBALS['config']->get('config', 'robots_path'), FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
        foreach( $robot_directive_array as $line ) {
            if( preg_match('#Disallow:\s*/(.+)#i', $line, $matches) ) $this->_robot_disallows[] = str_replace('*', '.*', $matches[1]);
        }
    }
    // END BSMITHER ROBOTS SITEMAP CODE

    // Generate a Sitemap Protocol v0.9 compliant sitemap (http://sitemaps.org)

and finally

private function _sitemap_link($input, $updated = false, $type = false) {
    $updated = (!$updated) ? time() : $updated;
    $store_url = (CC_SSL) ? $GLOBALS['config']->get('config', 'standard_url') : $GLOBALS['storeURL'];

    // ORIGINAL B4 BSMITHER ROBOTS SITEMAP CODE
    // if (!isset($input['url']) && !empty($type)) {
    //     $input['url'] = $store_url.'/'.$this->generatePath($input['id'], $type, '', false, true);
    // }

    // BSMITHER ROBOTS SITEMAP CODE
    if (!isset($input['url']) && !empty($type)) {
        $generated_path = $this->generatePath($input['id'], $type, '', false, true);
        if( !empty($this->_robot_disallows) ) {
            foreach( $this->_robot_disallows as $disallowed ) {
                if( preg_match('/^'.$disallowed.'/i', $generated_path, $matches) && in_array($type, $this->_robot_disallow_types) ) {
                    return false;
                }
            }
        }
        $input['url'] = $store_url.'/'.$generated_path;
    }
    // END BSMITHER ROBOTS SITEMAP CODE

    $this->_sitemap_xml->startElement('url');
bsmither Posted April 3, 2015

The example I gave for step 2:

'/root/server/path/to/your/site/robots.txt';

You have:

'/dirtybutterestates.com/robots.txt';

I will say that this is not the root-server-path, so the code is not finding the robots.txt file. If you have a cPanel, you may be shown what the root path to your website folder would be. In admin, Error Log, System Error Log tab, or perhaps the PHP error_log, there may be an entry that states an error -- something vaguely similar to:

Warning: Invalid argument supplied for foreach() in L:\WebServer\528_debug\admin\sources\products.index.inc.php on line 366

For my system, the L:\WebServer\528_debug would be the root-server-path to this particular installation. Or perhaps something vaguely similar to:

/home/my_hosting_account_name/public_html/
Dirty Butter Posted April 3, 2015 (Author)

OK, well, I thought I knew what the root server path was. I found it on cPanel, and now your code is working perfectly! Thank you, as always, for your patience with my ignorance!!
Dirty Butter Posted June 9, 2015 (Author)

Since I'm using this robots-obeying sitemap code of yours, I expected to be able to take certificates out of our sitemaps very easily. But it did not work. We have certificates disabled on both stores, so I really don't understand why they end up in the sitemap anyway. Any ideas?
bsmither Posted June 9, 2015

In our experiment with SEO->_sitemap_link(), there is this new code:

// if (!isset($input['url']) && !empty($type)) {
//     $input['url'] = $store_url.'/'.$this->generatePath($input['id'], $type, '', false, true);
// }
if (!isset($input['url']) && !empty($type)) {
    $generated_path = $this->generatePath($input['id'], $type, '', false, true);
    if( !empty($this->_robot_disallows) ) {
        foreach( $this->_robot_disallows as $disallowed ) {
            if( preg_match('/^'.$disallowed.'/i', $generated_path, $matches) && in_array($type, $this->_robot_disallow_types) ) {
                // From class header: $_robot_disallow_types = array('category','product','document'); // edit to suit
                // There is no $type for sale_items or certificates, so never a match.
                // Returning out early because of a match.
                return false;
            }
        }
    } else {
        // The _robot_disallows array is empty.
    }
    $input['url'] = $store_url.'/'.$generated_path;
}

So, we can see in the return false branch that we leave early only when the URL matches a robots exclusion line AND the type of link is a category, product, or document.

Looking at SEO->sitemap(), CubeCart calls the _sitemap_link() function with no type specified for the index, sale_items, and certificates pages. Thus, we do not leave early, and a link is created.

How do we get certificates to not be in the sitemap?

1. Add a test in SEO->sitemap() to not call _sitemap_link() if the certificates setting is disabled.
2. If enabled, still call _sitemap_link(), but with a special $type value, also added to _robot_disallow_types. (I haven't tested #2; it might confuse SEO->generatePath() with an unknown $type.) So,
3. Add to _robot_disallow_types the array element (boolean) false (that is, not in quotes).

Try setting this at the top of the class file:

$_robot_disallow_types = array('category','product','document', false);
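Option 3 hinges on the membership gate at the end of that preg_match condition. A hypothetical Python analogue of just that gate (Python's `in` test shown only to illustrate the idea; PHP's loose in_array accepts the boolean false element comparably for this particular set):

```python
def type_is_filtered(link_type, allowed_types=("category", "product", "document", False)):
    """Gate before dropping a link from the sitemap: only these types are eligible.
    The trailing False is meant to let typeless (static) links through the gate too."""
    return link_type in allowed_types

print(type_is_filtered("category"))   # True
print(type_is_filtered(False))        # True, because of the added False element
print(type_is_filtered("saleitems"))  # False
```

One caveat worth noting: in the PHP above, the whole block is guarded by !empty($type), so a false $type never reaches this gate at all, which may explain why the follow-up report found certificates still in the sitemap.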
Dirty Butter Posted June 9, 2015 (Author)

I wasn't sure where to put the new line (so obvious for you, but not for me). I put it here in seo.class.php, but it does not work. Certificates are Disallowed in robots.txt already, but a new sitemap still has them.

$_robot_disallow_types = array('category','product','document', false);
if (!isset($input['url']) && !empty($type)) {
    $generated_path = $this->generatePath($input['id'], $type, '', false, true);
    if( !empty($this->_robot_disallows) ) {
        foreach( $this->_robot_disallows as $disallowed ) {
            if( preg_match('/^'.$disallowed.'/i', $generated_path, $matches) && in_array($type, $this->_robot_disallow_types) ) {
                return false;
            }
        }
    }
    $input['url'] = $store_url.'/'.$generated_path;
}
bsmither Posted June 9, 2015

At the top of the class file, I mean to say in the file /classes/seo.class.php, near line 77:

private $_static_sections = array('saleitems', 'certificates', 'trackback', 'contact', 'search', 'login', 'register');

/** YOU MIGHT HAVE THESE NEXT TWO SECTIONS ALREADY
 * Robots Disallowed: CubeCart types
 *
 * @var array
 */
private $_robot_disallow_types = array('category','product','document', false); // edit to suit

/**
 * Robots Disallowed: reg_expressions
 *
 * @var array
 */
private $_robot_disallows = array();

/**
 * Class instance
 *
 * @var instance
 */
protected static $_instance;
Dirty Butter Posted June 9, 2015 (Author)

I'm too tired, and I have too many edits, to make sense of what you want me to do.

/** THIS IS LINE 71
 * Sitemap XML handle
 *
 * @var handle
 */
private $_sitemap_xml = false;

/**
 * Static URL sections
 *
 * @var array of strings
 */
// BEGINNING SEMPERFI TESTIMONIALS
// Old Code:
// private $_static_sections = array('saleitems', 'certificates', 'trackback', 'contact', 'search', 'login', 'register');
// New Code:
private $_static_sections = array('saleitems', 'certificates', 'trackback', 'contact', 'search', 'login', 'register', 'testimonials', 'addtestimonial');
// END SEMPERFI TESTIMONIALS

// BSMITHER ROBOTS SITEMAP HACK
/**
 * Robots Disallowed: CubeCart types
 *
 * @var array
 */
private $_robot_disallow_types = array('category','product','document'); // edit to suit

/**
 * Robots Disallowed: reg_expressions
 *
 * @var array
 */
private $_robot_disallows = array();
// END BSMITHER ROBOTS SITEMAP HACK