
Why does Google sometimes "see" non-SEO URLs?


Dirty Butter


Google sometimes complains about 404s for the non-SEO version of some of our product listings.

dirtybutter.com/plushcatalog/shop.php/animals/infantino/infantino-blue-green-orange-dog-string-vibrate-crinkle-rattle/p_941

Why does this happen? Is there a way to stop it?

My SEO version does not have shop.php or categories or subcategories or the item number, so where are they "getting" these from???

 


This is a CC4 SEO URL.

Where do they get it? I have no idea, nor do I know how to find out.

But I would poke around Webmaster Tools. Maybe they have an "Enter a search result URL and we will tell you where we found it" kind of tool.


Google Webmaster Tools showed 67 of these 404 crawl errors for 7/10. My error log has about a hundred copies of the error below, all logged at the same time on 7/10. Google gives no more specific time for when each 404 occurred.

[10-Jul-2015 17:43:10 America/Chicago] PHP Warning:  preg_match() [<a href='http://docs.php.net/manual/en/function.preg-match.php'>function.preg-match.php</a>]: Unknown modifier '/' in /home3/butter01/public_html/plushcatalog/classes/seo.class.php on line 1049

 


This is 1029 to the end, with 1039 red bold:

	/**
	 * Create sitemap link
	 *
	 * @param string $input
	 * @param string $updated
	 * @param string $type
	 */
	private function _sitemap_link($input, $updated = false, $type = false) {
		$updated = (!$updated) ? time() : $updated;

		$store_url = (CC_SSL) ? $GLOBALS['config']->get('config', 'standard_url') : $GLOBALS['storeURL'];

		// ORIGINAL B4 BSMITHER ROBOTS SITEMAP HACK if (!isset($input['url']) && !empty($type)) {
		//	$input['url'] = $store_url.'/'.$this->generatePath($input['id'], $type, '', false, true);
		//}
// BSMITHER ROBOTS SITEMAP HACK 
if (!isset($input['url']) && !empty($type)) {
    $generated_path = $this->generatePath($input['id'], $type, '', false, true);
    if( !empty($this->_robot_disallows) ) {
        foreach( $this->_robot_disallows as $disallowed ) {
(line 1039)             if( preg_match('/^'.$disallowed.'/i', $generated_path, $matches) && in_array($type, $this->_robot_disallow_types) ) { return false; }
        }
    }
    $input['url'] = $store_url.'/'.$generated_path;
}
// END BSMITHER ROBOTS SITEMAP HACK 

		$this->_sitemap_xml->startElement('url');
		$this->_sitemap_xml->setElement('loc',  htmlspecialchars($input['url']), false, false);
		$this->_sitemap_xml->setElement('lastmod', date('c', $updated), false, false);
		$this->_sitemap_xml->endElement();
	}
}

 


preg_match('/^'.$disallowed.'/i', $generated_path, $matches)

When PHP substitutes the current value of $disallowed into the string, that value may contain a slash. So the result may be:

preg_match('/^path/to/cat/i', $generated_path, $matches)

The first slash inside the value closes the pattern early. PHP then takes ^path as the whole pattern and tries to read everything after it (to/cat/i) as modifiers, which produces the "Unknown modifier" warning.

We need to use a different delimiter: a character that will probably never appear in a path from the robots file.

Try:

preg_match('#^'.$disallowed.'#i', $generated_path, $matches)
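
As a minimal sketch of the difference, assuming a made-up robots entry and path (the real values come from your robots file and generatePath()), the following compares the original pattern, the '#' version, and a variant that also runs the entry through preg_quote(). The preg_quote() step is an optional extra, not part of the hack as posted, and it assumes the robots entries are literal paths rather than patterns:

```php
<?php
// Hypothetical values for illustration only.
$disallowed = 'animals/infantino';
$generated_path = 'animals/infantino/some-product.html';

// Original form: the slash inside $disallowed ends the pattern early, so the
// rest is read as modifiers. preg_match() warns and returns false.
// The @ only hides the warning here so the result can be inspected.
$broken = @preg_match('/^' . $disallowed . '/i', $generated_path);

// Suggested form: with '#' as the delimiter, slashes are ordinary characters.
$fixed = preg_match('#^' . $disallowed . '#i', $generated_path);

// Optional hardening: preg_quote() escapes regex metacharacters and, via its
// second argument, the '#' delimiter itself.
$quoted = preg_match('#^' . preg_quote($disallowed, '#') . '#i', $generated_path);

var_dump($broken, $fixed, $quoted);
```

Note that the original code treats a false return the same as "no match", so while the warning was being raised the disallow check was silently skipped for those entries.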

  • 4 weeks later...

Searching Google for "plushcatalog/shop.php" (with quotes) returns several hits from other sites that link to your site in the CC4 style, including plushmemories.com.

But the directives in .htaccess should find and correct for that:

#### Rewrite rules for SEO functionality ####
<IfModule mod_rewrite.c>
  RewriteEngine On
  RewriteBase /plushcatalog ### IMPORTANT! ###


  ######## START v4 SEO URL BACKWARD COMPATIBILITY ########
  RewriteCond %{QUERY_STRING} (.*)$
  RewriteCond %{REQUEST_FILENAME} !-f
  RewriteRule cat_([0-9]+)(\.[a-z]{3,4})?(.*)$ index.php?_a=category&cat_id=$1&%1 [NC]

etc

 

 


I don't actively manage plushmemories.com any more, as it got way too big and time consuming. I moved it to Facebook, where it's working quite well. I found two and fixed them in plushmemories, but one was in a comment and the Regex couldn't find any more. Obviously there are more somewhere. I have over a hundred of these error links listed on Google from the last few months.

I don't want to take plushmemories offline. The images and descriptions, plus the link to the Facebook group, etc., are too helpful to many people.


I wonder if the web server takes note of the fact that a subdirectory named /plushcatalog/ actually exists, then looks at the .htaccess file in that folder -- if the .htaccess file for CubeCart is there.

Have you tried putting CubeCart's .htaccess rewrite directives in the WP .htaccess file?


I have the plushcatalog htaccess merged into the domain one as best I could - that certainly doesn't mean I did it correctly.

## File Security
<FilesMatch "\.(htaccess)$">
 Order Allow,Deny
 Deny from all
</FilesMatch>
#### Apache directory listing rules ####
DirectoryIndex index.php index.htm index.html
IndexIgnore *

<IfModule mod_expires.c>
# Enable expirations
ExpiresActive On
# Default directive
ExpiresDefault "access plus 1 month"
# My favicon
ExpiresByType image/x-icon "access plus 1 year"
# Images
ExpiresByType image/gif "access plus 1 month"
ExpiresByType image/png "access plus 1 month"
ExpiresByType image/jpg "access plus 1 month"
ExpiresByType image/jpeg "access plus 1 month"
</IfModule>

<IfModule mod_deflate.c>
    <IfModule mod_headers.c>
        Header append Vary User-Agent env=!dont-vary
    </IfModule>
        AddOutputFilterByType DEFLATE text/css text/x-component application/x-javascript application/javascript text/javascript text/x-js text/html text/richtext image/svg+xml text/plain text/xsd text/xsl text/xml image/x-icon application/json
    <IfModule mod_mime.c>
    
        # DEFLATE by extension
        AddOutputFilter DEFLATE js css htm html xml
    </IfModule>
</IfModule>


#### Rewrite rules for SEO functionality ####
<IfModule mod_rewrite.c>
  RewriteEngine On

RewriteCond %{HTTPS} off


RewriteRule (.*) https://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]
RewriteBase /
  ######## START v4 SEO URL BACKWARD COMPATIBILITY ########
  RewriteCond %{QUERY_STRING} (.*)$
  RewriteCond %{REQUEST_FILENAME} !-f
  RewriteRule cat_([0-9]+)(\.[a-z]{3,4})?(.*)$ index.php?_a=category&cat_id=$1&%1 [NC]

  RewriteCond %{QUERY_STRING} (.*)$
  RewriteCond %{REQUEST_FILENAME} !-f
  RewriteRule prod_([0-9]+)(\.[a-z]{3,4})?$ index.php?_a=product&product_id=$1&%1 [NC]

  RewriteCond %{QUERY_STRING} (.*)$
  RewriteCond %{REQUEST_FILENAME} !-f
  RewriteRule info_([0-9]+)(\.[a-z]{3,4})?$ index.php?_a=document&doc_id=$1&%1 [NC]

  RewriteCond %{QUERY_STRING} (.*)$
  RewriteCond %{REQUEST_FILENAME} !-f
  RewriteRule tell_([0-9]+)(\.[a-z]{3,4})?$ index.php?_a=product&product_id=$1&%1 [NC]

  RewriteCond %{QUERY_STRING} (.*)$
  RewriteCond %{REQUEST_FILENAME} !-f
  RewriteRule _saleItems(\.[a-z]+)?(\?.*)?$ index.php?_a=saleitems&%1 [NC,L]
  ######## END v4 SEO URL BACKWARD COMPATIBILITY ########


  RewriteCond %{REQUEST_FILENAME} !-f
  RewriteCond %{REQUEST_FILENAME} !-d
  RewriteCond %{REQUEST_URI} !=/favicon.ico
  RewriteRule ^(.*)\.html?$ index.php?seo_path=$1 [L,QSA]
 
  #301 Redirect Old File
Redirect 301 https://dirtybutter.com/1969-chevrolet-caprice/photo-gallery-1969-chevrolet-caprice/  https://dirtybutter.com/1969-chevrolet-caprice/1969-chevrolet-caprice-photos/

Options -Indexes
ErrorDocument 404 /index.php?_a=404
ErrorDocument 400 /400-bad-request.html
ErrorDocument 401 /401-restricted-page.html
ErrorDocument 500 /500-server-problem.html
ErrorDocument 302 /302-moved-temporarily.html

# Use PHPcur as default
AddHandler application/x-httpd-phpcur .php
<IfModule mod_suphp.c>
    suPHP_ConfigPath /opt/phpcur/lib
</IfModule>

</IfModule>
RewriteCond %{HTTP_HOST} ^dirtybutter\.com$ [OR]
RewriteCond %{HTTP_HOST} ^www\.dirtybutter\.com$
RewriteRule ^plushcatalog\/$ "https\:\/\/dirtybutter\.com\/plushcatalog\/" [R=301,L]
RewriteCond %{HTTP_HOST} ^www\.dirtybutter\.com$
RewriteRule ^/?$ "https\:\/\/dirtybutter\.com\/" [R=301,L]


# BEGIN WordPress
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
RewriteRule ^index\.php$ - [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
</IfModule>

# END WordPress


I don't know why, but my issue is BACK! Google is seeing the v4-style URLs again. UGH!!!!! I have the WP .htaccess section LAST in my file. I would assume the server would deal with the plushcatalog redirects BEFORE it ever gets to the WP rules. WP is in a wp folder, but I'm using the redirect that makes it show on the domain as home. Is THAT the issue???


I've started from scratch again with this issue. This is what I did in debugging:

I changed my WP site so it was not on the domain root, but on dirtybutter.com/wp and put my old dirtybutter.com page up instead. This involved changing the htaccess file back to the default hostgator one.

When I tested one of the v4 urls, it still did not resolve to the v6 version, but showed the plushcatalog 404 page instead.

So whatever the issue is, it is NOT happening because of my WP install on the domain root.

So I put the WP install back on the root, as it was originally.

I DID change the root WP htaccess section to the Multi-site version, and now the error message for the v4 version link goes to the plushcatalog 404 page. ONE PART FIXED

But I AM still getting the error page, not the correct url for the product. It is STILL not redirecting to the SEO friendly URL.


If your 'troublesome' URLs are still of this syntax:

...vibrate-crinkle-rattle/p_941

then the .htaccess rule that deals with CC4 URLs is not going to work, because it looks for prod_ followed by the ID, and these URLs end in p_ followed by the ID. The rewrite rule is:

prod_([0-9]+)(\.[a-z]{3,4})?$ index.php?_a=product&product_id=$1&%1

I have to say that the URLs you are getting are not the standard CC4 SEO style. They may come from an SEO mod for CC3 or early CC4.

You can try to catch that style of SEO URL by using this rewrite rule:

Instead of:
RewriteRule prod_([0-9]+)(\.[a-z]{3,4})?$ index.php?_a=product&product_id=$1&%1 [NC]

Use:
RewriteRule p(rod)?_([0-9]+)(\.[a-z]{3,4})?$ index.php?_a=product&product_id=$2&%1 [NC]

This now looks for:

p, then rod (which may or may not be there), then the underscore, then the product ID (one or more digits from 0 to 9), then optionally a period and three or four letters. Because another set of parentheses was added, the product ID is now captured by the second set ($2 instead of $1).
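
To sanity-check the pattern before touching .htaccess, here is a rough sketch that runs the same regular expressions through PHP's preg_match(). It assumes mod_rewrite and PHP treat these simple patterns the same way (both use PCRE), uses the 'i' modifier to stand in for [NC], and the helper name and sample paths are made up for illustration:

```php
<?php
// Hypothetical helper: returns the product ID a rewrite pattern would capture,
// or null when the pattern does not match the requested path.
function capture_product_id(string $pattern, string $path, int $group): ?string
{
    // '#' delimiters avoid clashes with slashes; 'i' mimics the [NC] flag.
    if (preg_match('#' . $pattern . '#i', $path, $m) === 1) {
        return $m[$group];
    }
    return null;
}

$old_rule = 'prod_([0-9]+)(\.[a-z]{3,4})?$';
$new_rule = 'p(rod)?_([0-9]+)(\.[a-z]{3,4})?$';

// Sample request paths, relative to the store folder.
$old_style = 'shop.php/animals/infantino/some-product/p_941';
$cc4_style = 'prod_941.html';

var_dump(capture_product_id($old_rule, $old_style, 1)); // old rule: no match
var_dump(capture_product_id($new_rule, $old_style, 2)); // new rule: ID from group 2
var_dump(capture_product_id($new_rule, $cc4_style, 2)); // standard CC4 links still match
```

Because the pattern is not anchored at the start, any path that ends in p_ plus digits will be caught. That is probably what you want here, but it is worth knowing.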


THAT DID IT!! I don't use category/subcategory in my URLs, and I've never noticed one of these odd URLs for the other URL types in the SEO htaccess defaults. There may have been some, but they were buried in all the hundreds of product URLs, and I just didn't see them.

I've been at the point of throwing my laptop over this (felt like it, but didn't :) ), but as long as I was blaming it on the WP install I was never going to get it fixed. We'll see if any of the other URLs need to be tweaked, so I'll wait a while to change this to Resolved.

THANK YOU - THANK YOU - THANK YOU - THANK YOU - THANK YOU - THANK YOU - THANK YOU - THANK YOU - THANK YOU - THANK YOU - THANK YOU - THANK YOU -

Well, I just marked a lot of URLs as fixed on Google, but there were two of a different type:

https - dirtybutter.com/plushcatalog/extra/prodImages.php?productId=821

I don't have, and frankly don't remember ever having, a folder named extra - how should I word a Redirect to take care of these?

There were only two of those, first seen by Google just a couple of days ago, so I'm going to mark them as "fixed" and see if they show up again, before I worry about them. As much as I've been fiddling with stuff, trying to find the issue, no telling how they were created.

 


/plushcatalog/extra/prodImages.php?productId=821

I recognize this as CC3's method of showing the secondary/additional images that are associated with a product. This link on CC3's View Product page opens a new popup window with a gallery of sorts.

We can try to write a rewrite rule to capture this and deliver the product's page. I will need to study this because we are looking at the querystring (?productId=821) and not the URL, although we will need to trigger on the URL.

Add somewhere alongside the v4 SEO rules:

RewriteCond %{QUERY_STRING} ^productId=([0-9]+)$
RewriteRule /extra/prodImages\.php$ index.php?_a=product&product_id=%1 [NC]
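
A rough way to check this pair before uploading it is to imitate the two steps in PHP: test the query string the way the RewriteCond does, then test the path the way the RewriteRule does. This is only a sketch. The function name is made up, and it assumes the path passed in is whatever mod_rewrite hands to the rule in your setup:

```php
<?php
// Hypothetical stand-in for the RewriteCond + RewriteRule pair above.
// Returns the rewritten target, or null if the request would pass through untouched.
function rewrite_prod_images(string $path, string $query): ?string
{
    // RewriteCond %{QUERY_STRING} ^productId=([0-9]+)$   (the capture becomes %1)
    if (preg_match('#^productId=([0-9]+)$#', $query, $cond) !== 1) {
        return null;
    }
    // RewriteRule /extra/prodImages\.php$ ... [NC]
    if (preg_match('#/extra/prodImages\.php$#i', $path) !== 1) {
        return null;
    }
    return 'index.php?_a=product&product_id=' . $cond[1];
}

// Path as the domain root .htaccess would see it (no leading slash).
var_dump(rewrite_prod_images('plushcatalog/extra/prodImages.php', 'productId=821'));
// Path as an .htaccess inside /plushcatalog/ would see it.
var_dump(rewrite_prod_images('extra/prodImages.php', 'productId=821'));
// Extra query parameters fail the condition because of the ^...$ anchors.
var_dump(rewrite_prod_images('plushcatalog/extra/prodImages.php', 'productId=821&x=1'));
```

In an .htaccess file, mod_rewrite strips the directory prefix, including the leading slash, before matching. So if the rule sits in the .htaccess inside /plushcatalog/, the second call suggests the leading slash in the pattern would stop it from matching, and ^extra/prodImages\.php$ may be the safer form there.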

 
