v6 How to modify sitemap



I have created directories I do not want in the Sitemap, although I do want them to show on the site. I have already Disallowed them in robots.txt.

 

seo.class already has a section with other directories Disallowed:

		$queryArray = array(
			'category' => $GLOBALS['db']->select('CubeCart_category', array('cat_id'), array('status' => '1')),
			'product'  => $GLOBALS['db']->select('CubeCart_inventory', array('product_id', 'updated'), array('status' => '1')),
			// ORIGINAL: 'document' => $GLOBALS['db']->select('CubeCart_documents', array('doc_id'), array('doc_parent_id' => '0', 'doc_status' => 1)),
			// BSMITHER HACK TO KEEP HIDDEN DOCUMENTS OUT OF SITEMAP
			'document' => $GLOBALS['db']->select('CubeCart_documents', array('doc_id'), array('doc_parent_id' => '0', 'doc_status' => 1, 'navigation_link' => 1)),
			// END BSMITHER HACK
			// Start SemperFi Addition (removed to keep the askabout page from showing up in the Sitemap)
			// 'askabout' => $GLOBALS['db']->select('CubeCart_inventory', array('product_id'), array('status' => '1')),
			// End SemperFi Addition
		);

I want to keep ALL Subcategories starting with /animal-brands out of the Sitemap.

Link to comment
Share on other sites

Hi

I don't know whether you are using the built-in sitemap or a third-party tool like the one we often install for clients, but the sitemap generation process should honour whatever is in the robots.txt file. So either there is a problem with the syntax in your robots.txt file, or the sitemap generation tool isn't working correctly.

Ian

Link to comment
Share on other sites

I use the built-in one, but Bsmither helped me so it creates sitemap.xml directly, rather than just the gzip version. Mine has never "honored" the robots file, but our store is in a subdirectory of the domain, so the robots file that "counts" is not on plushcatalog.

Link to comment
Share on other sites

  • 3 weeks later...

Still having to manually delete the Animals alphabetical sub-directories that I do not want in sitemap.xml. Based on Ian's comment: DOES v6 use the robots.txt file to determine which URLs to include in the Sitemap? If so, this is in my robots.txt file for the domain, and for the shop subdomain: Disallow: /*animal-brands
 

If v6 does NOT use the robots.txt file, how can I tell v6 NOT to put the animal-brands subdirectories in the sitemap? I had assumed it would mean adding some code to seo.class. ALL our products have custom SEO metadata, and the Store Settings do NOT use directories or subdirectories as part of the product URL or SEO URL.

Link to comment
Share on other sites

Is it safe to use animal-brands (with anything or nothing following) as the key phrase?

 

If so, try this edit in /classes/seo.class.php:

Find at the bottom of the file:
private function _sitemap_link
 
Was:
        if (!isset($input['url']) && !empty($type)) {
            $input['url'] = $store_url.'/'.$this->generatePath($input['id'], $type, '', false, true);
        }
 
Now:
        if (!isset($input['url']) && !empty($type)) {
            $generated_path = $this->generatePath($input['id'], $type, '', false, true);
            // Skip any category whose SEO path starts with "animal-brands"
            if (preg_match('/^animal-brands/', $generated_path) && $type == 'category') return false;
            $input['url'] = $store_url.'/'.$generated_path;
        }

This adds a test for a category SEO path that starts with animal-brands; when it matches, the function exits before that node can be added to the XML.
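If it helps to see the anchor at work, here is a tiny standalone check (the paths are made up for illustration):

<?php
// Illustration only: the ^ anchor means the pattern matches SEO paths that
// START with animal-brands, including everything nested beneath them.
var_dump(preg_match('/^animal-brands/', 'animal-brands'));        // int(1) -> excluded
var_dump(preg_match('/^animal-brands/', 'animal-brands/a'));      // int(1) -> excluded
var_dump(preg_match('/^animal-brands/', 'plush/animal-brands'));  // int(0) -> kept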

 

Later, we can explore how to incorporate a robots file.
 

Link to comment
Share on other sites

Can you tell me how you understand this directive to work:

Disallow: /*animal-brands

Specifically, would it disallow anything that has animal-brands anywhere in the URL?

 

The way I see it, you would need to be very conscious of not putting animal-brands in any SEO path.

Link to comment
Share on other sites

I think this isn't so much of a Skin & Template Support issue. Just sayin'.

 

Let's make the SiteMap generator look for and respect the directives contained within the robots.txt file.

 

First, have a robots.txt file.
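For example, a minimal robots.txt for the case in this thread might read (the Disallow path is just the one already being discussed):

User-agent: *
Disallow: /animal-brands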

 

Second, in /includes/global.inc.php, add this statement:

$glob['robots_path'] = '/root/server/path/to/your/site/robots.txt'; // consult your hosting provider if you can't determine what this should be

 

Third, in /classes/seo.class.php

Find:
    /**
     * Class instance
     *
     * @var instance
     */
    protected static $_instance;
 
Add ABOVE:
    /**
     * Robots Disallowed: CubeCart types
     *
     * @var array
     */
    private $_robot_disallow_types   = array('category','product','document'); // edit to suit
    /**
     * Robots Disallowed: reg_expressions
     *
     * @var array
     */
    private $_robot_disallows   = array();
 
 
Find:
    public function sitemap() {
 
Add AFTER:
        if ($GLOBALS['config']->has('config', 'robots_path') && file_exists($GLOBALS['config']->get('config', 'robots_path'))) {
            $robot_directive_array = file($GLOBALS['config']->get('config', 'robots_path'), FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
            foreach ($robot_directive_array as $line) {
                // Capture the path after "Disallow: /" (allowing whitespace after the colon)
                // and widen any * wildcards into the regex .*
                if (preg_match('#Disallow:\s*/(.+)#i', $line, $matches)) $this->_robot_disallows[] = str_replace('*', '.*', $matches[1]);
            }
        }
 
 
Find:
    private function _sitemap_link($input, $updated = false, $type = false) {
 
Replace:
        if (!isset($input['url']) && !empty($type)) {
            $input['url'] = $store_url.'/'.$this->generatePath($input['id'], $type, '', false, true);
        }
 
With:
        if (!isset($input['url']) && !empty($type)) {
            $generated_path = $this->generatePath($input['id'], $type, '', false, true);
            if (!empty($this->_robot_disallows)) {
                foreach ($this->_robot_disallows as $disallowed) {
                    // Use # as the pattern delimiter so a Disallow path containing / cannot break the expression
                    if (preg_match('#^'.$disallowed.'#i', $generated_path, $matches) && in_array($type, $this->_robot_disallow_types)) { return false; }
                }
            }
            $input['url'] = $store_url.'/'.$generated_path;
        }
This will use any Disallow: directive found in robots.txt. Real robots.txt files can get somewhat complicated, as directives that follow a User-agent: directive apply only to that user agent. This code does not distinguish between groups, so Disallow: directives following a second/third/etc User-agent: directive will also get respected.
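To make the conversion concrete, here is a self-contained sketch of what that loop produces (the robots.txt lines are hypothetical):

<?php
// Standalone sketch of the parsing step above: each Disallow path becomes
// a regex fragment, with * wildcards widened to .*
$lines = array(
    'User-agent: *',
    'Disallow: /animal-brands',
    'Disallow: /*page=',
);
$disallows = array();
foreach ($lines as $line) {
    if (preg_match('#Disallow:\s*/(.+)#i', $line, $matches)) {
        $disallows[] = str_replace('*', '.*', $matches[1]);
    }
}
print_r($disallows); // shows: [0] => animal-brands, [1] => .*page=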

 

Also, a bare Disallow: / directive (one that says to not index anything) is ignored, since the pattern requires at least one character after the slash. Why make a sitemap for such a site, anyway?

 

Also, Allow: directives are not respected.

Link to comment
Share on other sites

I'm using the More ajax bar, rather than Pagination numbers on the Storefront. It doesn't look like that creates any page=1, page=2 type pages the way v5 did.

 

IF Google still "sees" these page=number patterns, or for someone using Pagination on the front end, then I see one possible problem with the way robots.txt gets written. To keep page=1, page=2, etc. out of the sitemap, but retain page=all for each directory, this has been in our robots.txt:

 

Disallow: /*page=

Allow: /*page=all

 

Your code above would not find the page=all page for each category/subcategory.

 

Google had complained in the past when I didn't allow page=all, and then complained when the metadata was the same for a category/subcategory page and its page=all version. (You had helped with that in another thread some time ago, but I have not tried to port that fix to v6 yet.)
 

Link to comment
Share on other sites

So far, I am only interested in the Disallow: directives. Since page= is not part of an SEO path (it belongs to the querystring, and maybe the canonical URL), it will not be caught.
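A quick illustration with made-up values: the disallow patterns are tested against what generatePath() returns, which is the bare SEO path, never the querystring:

<?php
// Illustration: 'Disallow: /*page=' becomes the pattern '.*page=', but the
// sitemap code only ever tests it against bare SEO paths such as these:
var_dump(preg_match('#^.*page=#i', 'plush-animals'));    // int(0) -> kept
var_dump(preg_match('#^.*page=#i', 'animal-brands/a'));  // int(0) -> kept
// The '?page=2' portion of a URL never reaches the test, so it cannot match.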

 

I do not understand how page=1, page=2, etc could even get in the sitemap. A robot may find those links on its own and index them, but not from the sitemap -- unless I'm missing something.

 

I haven't experimented with the ajax More bar, but if it behaves as I think it does, it is javascript-powered. Thus, a spider, depending on its sophistication, may or may not execute that javascript as it scans the HTML.

 

That being assumed, I wonder whether a spider will eventually see all the content (More..., More..., More..., for however many times until it gives up) that a page=all page would otherwise provide.

Link to comment
Share on other sites

I would love to be able to Disallow page=all as well, as it's a nuisance issue that for some reason or other Google considers important. I'll try your code fix above and return to this thread over the next month or so to report how Google reacts.

Link to comment
Share on other sites

Well, it didn't work for me on either store. I thought it might be because our plush store is not at the root, but the estates store IS at the domain root. Neither sitemap omitted something I had intentionally disallowed from the respective robots.txt file as a test.

 

Since dirtybutterestates.com is a simpler setup, I used $glob['robots_path'] = '/dirtybutterestates.com/robots.txt'; in global includes.

 

That usually means I didn't follow your directions correctly :( so here are my edits:

	private $_static_sections = array('saleitems', 'certificates', 'trackback', 'contact', 'search', 'login', 'register');
// BSMITHER ROBOTS SITEMAP CODE
	/**
	 * Robots Disallowed: CubeCart types
	 *
	 * @var array
	 */
	private $_robot_disallow_types   = array('category','product','document'); // edit to suit
	/**
	 * Robots Disallowed: reg_expressions
	 *
	 * @var array
	 */
	private $_robot_disallows   = array();
// END BSMITHER ROBOTS SITEMAP CODE

then

	public function sitemap() {
// BSMITHER ROBOTS SITEMAP CODE
if( $GLOBALS['config']->has('config', 'robots_path') && file_exists($GLOBALS['config']->get('config', 'robots_path')) ) {
    $robot_directive_array = file($GLOBALS['config']->get('config', 'robots_path'), FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    foreach( $robot_directive_array as $line ) {
        if( preg_match('#Disallow:\s*/(.+)#i', $line, $matches) ) $this->_robot_disallows[] = str_replace('*', '.*', $matches[1]);
    }
}
// END BSMITHER ROBOTS SITEMAP CODE
		// Generate a Sitemap Protocol v0.9 compliant sitemap (http://sitemaps.org)

and finally


	private function _sitemap_link($input, $updated = false, $type = false) {
		$updated = (!$updated) ? time() : $updated;

		$store_url = (CC_SSL) ? $GLOBALS['config']->get('config', 'standard_url') : $GLOBALS['storeURL'];

		// ORIGINAL B4 BSMITHER ROBOTS SITEMAP CODE if (!isset($input['url']) && !empty($type)) {
		//	$input['url'] = $store_url.'/'.$this->generatePath($input['id'], $type, '', false, true);
		//}
// BSMITHER ROBOTS SITEMAP CODE
if (!isset($input['url']) && !empty($type)) {
    $generated_path = $this->generatePath($input['id'], $type, '', false, true);
    if( !empty($this->_robot_disallows) ) {
        foreach( $this->_robot_disallows as $disallowed ) {
             if( preg_match('#^'.$disallowed.'#i', $generated_path, $matches) && in_array($type, $this->_robot_disallow_types) ) { return false; }
        }
    }
    $input['url'] = $store_url.'/'.$generated_path;
}
// END BSMITHER ROBOTS SITEMAP CODE

		$this->_sitemap_xml->startElement('url');

Link to comment
Share on other sites

The example I gave for Step 2:

'/root/server/path/to/your/site/robots.txt';

 

You have:

'/dirtybutterestates.com/robots.txt';

 

I will say that this is not the root-server-path, so the code is not finding the robots.txt file. If you have cPanel, it may show you what the root path to your website folder would be.

 

In admin, Error log, System Error log tab, or perhaps the PHP error_log, there may be an entry that states an error -- something vaguely similar to:

Warning:  Invalid argument supplied for foreach() in L:\WebServer\528_debug\admin\sources\products.index.inc.php on line 366

For my system, the L:\WebServer\528_debug would be the root-server-path to this particular installation.

 

Or perhaps something vaguely similar to:

/home/my_hosting_account_name/public_html/
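As an untested shortcut, PHP can often report that root path itself. Assuming your host populates the usual server variable, Step 2 could instead read:

// Hypothetical alternative for global.inc.php: derive the path from the
// web server's document root instead of hard-coding it. Echo the value
// once to verify it; hosts configure this differently.
$glob['robots_path'] = $_SERVER['DOCUMENT_ROOT'].'/robots.txt';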

Link to comment
Share on other sites

  • 2 months later...

Since I'm using this robots-obeying sitemap code of yours, I expected to be able to take certificates out of our sitemaps very easily. But it did not work. We have certificates disabled on both stores, so I really don't understand why they end up in the sitemap anyway.

Any ideas?

Link to comment
Share on other sites

In our experiment with SEO->_sitemap_link(), there is this new code:

// if (!isset($input['url']) && !empty($type)) {
//  $input['url'] = $store_url.'/'.$this->generatePath($input['id'], $type, '', false, true);
// }

if (!isset($input['url']) && !empty($type)) {
  $generated_path = $this->generatePath($input['id'], $type, '', false, true);

  if( !empty($this->_robot_disallows) ) {
    foreach( $this->_robot_disallows as $disallowed ) {
      if( preg_match('#^'.$disallowed.'#i', $generated_path, $matches) && in_array($type, $this->_robot_disallow_types) ) {
        // From class header: $_robot_disallow_types = array('category','product','document'); // edit to suit
        // There is no $type for sale_items or certificates, so never a match.
        // Returning out early because of a match.
        return false;
      }
    }
  } else {
    // The _robot_disallows array is empty.
  }

  $input['url'] = $store_url.'/'.$generated_path;
}

So, we can see here at the return false statement that we will leave early if the URL matches a robots exclusion line and the type of link is a category, product, or document.

Looking at SEO->sitemap(), CubeCart calls the _sitemap_link() function with no type specified for the index, sale_items, and certificates pages. Thus, we do not leave early and a link is created.

How do we get certificates to not be in the sitemap?

1. Add a test in SEO->sitemap() to not call _sitemap_link() when certificates are disabled in the store settings (a rough sketch follows below).

2. But if enabled, still call _sitemap_link(), but with a special $type value that is also added to _robot_disallow_types.

(I haven't tested #2, but it might confuse SEO->generatePath() with an unknown $type.) So,

3. Add to _robot_disallow_types the array element (boolean) false (that is, not in quotes).

Try setting this at the top of the class file:

$_robot_disallow_types = array('category','product','document', false);
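And for option #1, here is a rough sketch of the guard in SEO->sitemap(). Note that the settings key below is my assumption, so verify what your store's configuration actually calls the gift-certificates switch:

// Hypothetical guard in SEO->sitemap(), wrapped around the existing call
// that emits the certificates link. 'gift_certs' is an ASSUMED key name;
// check your CubeCart_config data for the real one.
if ($GLOBALS['config']->get('config', 'gift_certs')) {
    // ... the existing $this->_sitemap_link(...) call for certificates stays here ...
}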
Link to comment
Share on other sites

I wasn't sure where to put the new line - so obvious for you, but not for me :(

$_robot_disallow_types = array('category','product','document', false);

I put it here in seo.class.php, but it does not work. Certificates are already Disallowed in robots.txt, but a new sitemap still has them.

$_robot_disallow_types = array('category','product','document', false);
if (!isset($input['url']) && !empty($type)) {
    $generated_path = $this->generatePath($input['id'], $type, '', false, true);
    if( !empty($this->_robot_disallows) ) {
        foreach( $this->_robot_disallows as $disallowed ) {
             if( preg_match('#^'.$disallowed.'#i', $generated_path, $matches) && in_array($type, $this->_robot_disallow_types) ) { return false; }
        }
    }
    $input['url'] = $store_url.'/'.$generated_path;
}

 

Link to comment
Share on other sites

Try setting this at the top of the class file:

$_robot_disallow_types = array('category','product','document', false);

At the top of the class file, I mean to say, in the file /classes/seo.class.php, near line 77:

	private $_static_sections = array('saleitems', 'certificates', 'trackback', 'contact', 'search', 'login', 'register');
	/** YOU MIGHT HAVE THESE NEXT TWO SECTIONS ALREADY
	 * Robots Disallowed: CubeCart types
	 *
	 * @var array
	 */
	private $_robot_disallow_types   = array('category','product','document'); // edit to suit
	/**
	 * Robots Disallowed: reg_expressions
	 *
	 * @var array
	 */
	private $_robot_disallows   = array();

	/**
	 * Class instance
	 *
	 * @var instance
	 */
	protected static $_instance;

 

Link to comment
Share on other sites

I'm too tired, and I have too many edits to make sense of what you want me to do.

	/** THIS IS LINE 71
	 * Sitemap XML handle
	 *
	 * @var handle
	 */
	private $_sitemap_xml  = false;
	/**
	 * Static URL sections
	 *
	 * @var array of strings
	 */
	// BEGINNING SEMPERFI TESTIMONIALS 
	// Old Code:
	// private $_static_sections = array('saleitems', 'certificates', 'trackback', 'contact', 'search', 'login', 'register');
	// New Code:
	private $_static_sections = array('saleitems', 'certificates', 'trackback', 'contact', 'search', 'login', 'register', 'testimonials', 'addtestimonial');
	// END SEMPERFI TESTIMONIALS 
// BSMITHER ROBOTS SITEMAP HACK
	/**
	 * Robots Disallowed: CubeCart types
	 *
	 * @var array
	 */
	private $_robot_disallow_types   = array('category','product','document'); // edit to suit
	/**
	 * Robots Disallowed: reg_expressions
	 *
	 * @var array
	 */
	private $_robot_disallows   = array();
// END BSMITHER ROBOTS SITEMAP HACK

 

Link to comment
Share on other sites
