
[Resolved] Catalog CSV export problem


cutecube


Running 6.06. When I export a catalog as CSV, the resulting file has HTML paragraph markup in it. I right-click on the '1' in the export screen and click 'Save As', and it saves the file as a CSV file. Spreadsheets don't seem to like this and won't display the description field. Why are those tags there? Am I doing something wrong?

"Fairy's Hanging Heart Key Necklace","1","1","SPI016TX82","0.000","<p>
    Silver birthstone key necklace. Customize with your choice of birthstones.</p>

 

I have to remove the <p> and </p>, remove what looks like a tab after the first <p>, and delete the character after the period in the description field, in order for it to display properly in a spreadsheet.

[screenshot attachment: spreadsheet.jpg]


Technically, the CSV file syntax allows a field to contain any manner of characters if that field is enclosed in quotes. A quote itself is escaped by using a pair of quotes together.
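As an aside, PHP's own fputcsv() follows that same convention. A throw-away example (not CubeCart code, and the values are made up) shows what the 'legal' quoting looks like:

	<?php
	// Illustration only -- not CubeCart code. fputcsv() follows the same
	// RFC 4180 convention: wrap the field in quotes and double any embedded quote.
	$out = fopen('php://output', 'w');
	fputcsv($out, array('Fairy\'s "Heart" Necklace', "Line one\nLine two", '19.95'));
	// Prints: "Fairy's ""Heart"" Necklace","Line one
	// Line two",19.95
	fclose($out);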

That does not help you in this situation, though. Some spreadsheet applications do not honor this syntax rule; I know that MS Excel 2003 and prior do not. Perhaps other applications, or later versions of Excel, do.

As a workaround, we can either make a code edit to CubeCart or compose a code snippet. A snippet would add a "tab-delimited file" export option (filename.tab), for which CubeCart's current export code already strips all HTML and other print-control characters.


Not to sound belligerent, but I've been using CSV files for over 35 years and fully understand their structure. Yes, I'm that old. What I don't understand is why those characters are in there in the first place. They look like they were meant for display on a web page. With those added characters in the raw CSV file, it is almost useless for anything you want to do with it. And why is this only associated with the 'Description' field?

I use OpenOffice/LibreOffice Calc and I've tried every combination of import parameters to get it to work, but after 24 tries with different combinations I still couldn't make it import properly, so it has nothing to do with quotes, tabs, commas, periods, spaces, etc.

In my opinion, the operation that creates the CSV export file is incorrectly designed. I've never seen a CSV file contain those characters in my life.

I can write a quick Bash script to correct the problems. However, I'd like to see the problems corrected at the source. I don't know who might be in charge of that particular code, but if they can tell me where it's located, I'd be happy to look at it to see what is wrong with the code and maybe correct it.


The product's Description (and the new Short Description) field is composed in the Rich Text Editor provided on the Description tab of CubeCart's administration Add/Edit Product screen. Having the resulting HTML source code come out of the editor 'pretty-printed' is an apparent decision by the programmers of that editor (CKEditor). I do not know if there is an editor config setting that switches off the pretty-printing - that is, one that outputs a single long line of HTML source code.

For export formats other than CSV, CubeCart's export code removes the pretty-printing and HTML tags; it does the same to the product's Name field.

Please open the following file for editing:

/admin/sources/products.export.inc.php

Find near line 75:

	if ($extension == 'csv') {
		$result['name']   = str_replace('"', '""', $result['name']);
		$result['description'] = str_replace('"', '""', $result['description']);
	} else {
		$result['name']   = preg_replace('#[\s]{2,}#', ' ', str_replace(array("&nbsp;", "\t", "\r", "\n", "\0", "\x0B"), '', strip_tags($result['name'])));
		$result['description'] = preg_replace('#[\s]{2,}#', ' ', str_replace(array("&nbsp;", "\t", "\r", "\n", "\0", "\x0B"), '', strip_tags($result['description'])));
	}

Make those lines look like:

//	if ($extension == 'csv') {
//		$result['name']   = str_replace('"', '""', $result['name']);
//		$result['description'] = str_replace('"', '""', $result['description']);
//	} else {
		$result['name']   = preg_replace('#[\s]{2,}#', ' ', str_replace(array("&nbsp;", "\t", "\r", "\n", "\0", "\x0B"), '', strip_tags($result['name'])));
		$result['description'] = preg_replace('#[\s]{2,}#', ' ', str_replace(array("&nbsp;", "\t", "\r", "\n", "\0", "\x0B"), '', strip_tags($result['description'])));
//	}

I appreciate your opinion and viewpoint. I assert this is not a "bug", but rather how CubeCart's programmers decided to implement the CSV file format according to RFC-4180 (which is non-binding, of course).

I have not had a need to get intimately familiar with CSV. The few times I've loaded a CSV in Excel have always resulted in frustration and disappointment, because the spreadsheet made assumptions about the data that spoke to the application's intended use -- crunching numbers and 'what-if' analysis -- rather than serving as an ersatz text editor.

If I were to find myself working with CSV files, I would use an app specifically built for those files.


Thanks a lot, bsmither. I appreciate you taking the time to create those changes. I've noticed that you really put a lot of effort into helping people here.

I may be wrong, but those two code snippets look identical to me. Saved each and did a 'diff' on them. Nothing showed different.

Before I go off half-cocked and do something to mess things up, can you explain why the 'export' code was designed that way? I have not looked at the actual database field, nor have I used a hex editor to examine the resulting CSV export file to determine exactly which characters prevent the file from loading into a normal spreadsheet program. Removing those extra characters by hand allows the file to load in any spreadsheet program, so I can't figure out why it was designed the way it is.

What did the designers expect the resultant export csv file to be used for?


"Those two code snippets look identical to me."

The group of statements that follows "Make those lines look like" has double slashes in front of five of the seven statements.

In PHP, a double slash starts a comment that runs to the end of the line, and comments are not executed. Commenting out five of the seven statements therefore leaves the two unedited statements as the only ones that will be executed.
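As a trivial example (not taken from any CubeCart file):

	// this entire line is a comment and is ignored by PHP
	$total = 1 + 2; // everything after the double slashes is ignored as well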

"Saved each and did a 'diff' on them. Nothing showed different."

That is very disconcerting. I would like to see a screenshot of how your browser is displaying the post above.

"Can you explain why the 'export' code was designed that way?"

Previously, I said:
[This is] how [the] CubeCart programmers decided to implement the CSV file format according to RFC-4180.

I fault myself for speaking for the programmers when I actually have no knowledge at all as to why they coded the export code to allow for troublesome (but not illegal) characters in a CSV format. I was only assuming that since the output follows RFC-4180, they must have been using RFC-4180 to code the export format.

"What did the designers expect the resultant export csv file to be used for?"

(Not speaking for the programmers) Perhaps a far wider range of applications than just a spreadsheet - and perhaps with the intent to maintain data integrity by transferring that data without any kind of modification (that would otherwise be needed to fit the unyielding and unknown requirements of any given application).

"[I have not been able] to determine exactly what characters are there that prevent the file from loading into a normal spreadsheet program."

It is the newline characters (line feed -- LF:x0A on *nix, carriage return plus line feed -- CRLF:x0D0A on Windows, carriage return -- CR:x0D on classic Macintosh) found in the product description's pretty-printed HTML source that are being interpreted by the spreadsheet import as delimiters between records (spreadsheet rows). I have also seen the tab character (HT:x09) be interpreted as a field delimiter even though the spreadsheet import was specifically told that only commas not enclosed in quotes delineate fields.
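For what it is worth, here is a rough sketch of a clean-up pass - my own illustration, not anything shipped with CubeCart, and the filenames are placeholders. PHP's fgetcsv() honors the quoting rules, so it reads the embedded newlines correctly; the loop then strips the tags and collapses the whitespace before rewriting the file:

	<?php
	// Rough illustration only -- not part of CubeCart. Reads the exported CSV
	// (fgetcsv understands quoted fields containing newlines), strips the HTML
	// tags and extra whitespace, then writes a cleaned copy.
	$in  = fopen('products_export.csv', 'r');   // placeholder filename
	$out = fopen('products_clean.csv', 'w');    // placeholder filename
	while (($row = fgetcsv($in)) !== false) {
		foreach ($row as $i => $field) {
			$field   = strip_tags(str_replace('&nbsp;', ' ', $field));
			$row[$i] = trim(preg_replace('/\s+/', ' ', $field));
		}
		fputcsv($out, $row);
	}
	fclose($in);
	fclose($out);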


I'm sorry. I didn't realize that you meant to comment out those lines. I'm so used to looking at code that I copied the second set of lines and removed the double slashes, thinking that somehow they related to how you pasted the code. I feel really stupid now.

Thanks for your insight. Although I would think that they would store the data as clean as possible so it could be used in many different ways, I guess they had their reasons, which escape me.

I'll probably write a simple SQL statement to extract the data, clean it, and put it into a CSV file.
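Something along those lines - just a sketch, and the table name, column names, and credentials below are guesses I would have to verify against the actual schema:

	<?php
	// Just a sketch of the idea. The table name, column names, and credentials
	// are guesses and would need to be checked against the real database.
	$pdo  = new PDO('mysql:host=localhost;dbname=store;charset=utf8', 'user', 'pass');
	$rows = $pdo->query('SELECT name, product_code, price, description FROM CubeCart_inventory');
	$out  = fopen('products_clean.csv', 'w');
	fputcsv($out, array('name', 'product_code', 'price', 'description'));
	foreach ($rows as $row) {
		$clean = trim(preg_replace('/\s+/', ' ', strip_tags($row['description'])));
		fputcsv($out, array($row['name'], $row['product_code'], $row['price'], $clean));
	}
	fclose($out);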

Thanks again and keep up the great work!

P.S. How do I mark this as solved?


"I would think that they would store the data as clean as possible so it could be used in many different ways."

Perhaps.

I am with the school of thought that XML files are far better at maintaining data integrity in the transport phase, the import/interpretation phase, and when re-purposing that data.


   Although I would think that they would store the data as clean as possible so it could be used in many different ways

Generally speaking, data should always be stored exactly as it was entered. Otherwise, you end up making a lot of assumptions while 'cleaning' that could destroy the data integrity.

For example, say you are cleaning phone numbers to match 123-4567-8901 format; well what if I have an extension? What if I live in a different country? Etc. etc. It forces you to think of a lot of 'possible' cases that you can't hope to ever fully encompass, so storing the data as it was entered is the only way to maintain its integrity.

Then, when you actually want to DO something with that data, that's when you clean it up to your expected format. If there are any problems, e.g. an entry doesn't fit the formatting, you can inspect what was input and adjust your formatting code accordingly, whereas if you had forced it into that format from the beginning, you wouldn't be able to do that.
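A contrived illustration of that 'clean at the point of use' idea (nothing to do with CubeCart):

	<?php
	// Contrived illustration: keep the raw input, normalize only where it is needed.
	$stored = '+1 (234) 567-8901 ext. 12';        // kept exactly as the user entered it
	$digits = preg_replace('/\D+/', '', $stored); // digits-only view for this one use
	echo $digits;                                 // prints 1234567890112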

With the HTML example, if you stripped all the HTML code, how would you go about displaying that data back as HTML? All of the formatting data the user had entered would be lost.

Anyway, hope that gives some insight into why a programmer might go the route chosen by CubeCart. In my opinion, it is the only choice. ;)


   Although I would think that they would store the data as clean as possible so it could be used in many different ways

With the HTML example, if you stripped all the HTML code, how would you go about displaying that data back as HTML? All of the formatting data the user had entered would be lost.

Anyway, hope that gives some insight into why a programmer might go the route chosen by CubeCart. In my opinion, it is the only choice. ;)

Completely agree!


As any good programmer will tell you, you should always keep the data and the structure separate. That way, you can format the data any way you want using HTML, C++, etc. You should never store structure (HTML or any other code) with the data. The presentation of the data should be under the complete control of the presentation layer; no formatting information should ever be stored in the data.

The HTML should be in the webpage to display the data in whatever manner you want. The data (in the database) should have no presentation code stored with it. That's just normal and proper program design.


Sort of. (I am still working with Microsoft Word 2003, because I hate the Ribbon user interface.)

But doesn't the new DOCX format have content separate from presentation? I once went digging through the DOCX format and, if I recall, there was a separate file for the markup presentation and the file holding the content was in something that looked like an XML file (if not exactly an XML file).

I assert the exact gripe of the OP is that the CKEditor output is pretty-printed. The intra-field line endings are the obstacle to a successful import into the spreadsheet. I suppose if CKEditor did not pretty-print its output, this conversation wouldn't have happened.

And, FWIW, HTML markup does not constitute the presentation layer. The HTML tags inform the rendering application what something could be, but it is entirely the purview of the rendering application to display that content in a style based on its own needs or on that of style sheets.

Also, FWIW, pretty-printing is a preference of programmers. In HTML, there is (probably) absolutely no need for indentation and line endings in the code.


"And, FWIW, HTML markup does not constitute the presentation layer. The HTML tags informs the rendering application what something could be, but it is entirely the purvue of the rendering application to display that content in a style based on its own needs or from that of style sheets."

Technically correct. :-)

I just did a quick Google search for CKEditor alternatives, which turned up at least 15 other applications. I didn't do any research beyond that.

In CubeCart, what is the purpose of having a WYSIWYG editor in the first place? Aren't you just typing a description of the product? Why would you need to embellish it? Couldn't you do that in CSS, where it's supposed to be done?

