Reading ASCII characters

Now I'm curious about this ASCII character redefinition idea. You say that I correctly surmised the situation of the program taking one of those characters to mean "stop reading this line", and I assume that another has been altered to mean "end of this field". So my question is, which characters? Are there more? If you could just give me the values of all ASCII characters with double identities when read in from a C++ program I'd appreciate it.

This is more than just idle curiosity, you see. I'm hoping to use this in a program that will take a text document (as scanned in from a page and run through an OCR) and break the text down into its individual words and symbols. It will then take the resulting word list, throw out all duplicates, compare the document's word list against one's own word list, and spit out which specific words from the document one needs to have defined.

In order to do this, I need to know all I can about exactly how to read in characters from a file on disk. So far I know only how to read in a file that was written out in a particular controlled sequence (like the Stockitem inventory files), not one that will contain random words and symbols. Any advice on how to read a file character by character? If you take the time to answer that for me I'll be happy to send you a copy of the finished program so you can see what you've started.

-- Mike Mannakee (B3FEETBACK@AOL.COM), May 09, 1999

Answers

The original definition of the line feed was that it caused the teletypewriter to advance to the next line, whereas the carriage return character returned the carriage to the beginning of the line. As I said, these are historical artifacts from the dawn of computing. The application here is that the newline, consisting of either a line feed character alone or a carriage return followed by a line feed (depending on the operating system), is used in text files to indicate the end of a line. Thus, if you write a file with the data elements separated by newlines, it will be readable in most text editors. This simplifies figuring out what your program is doing.
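To make the two characters concrete, here is a minimal sketch showing their ASCII values and one way to cope with both end-of-line conventions. The names line_feed, carriage_return, and strip_carriage_returns are my own, chosen for illustration. The function assumes the text is already in a string and that a carriage return immediately followed by a line feed should count as a single end of line.

```cpp
#include <string>

// The two control characters discussed above, with their ASCII values.
const char line_feed = '\n';        // value 10
const char carriage_return = '\r';  // value 13

// Returns a copy of text in which every carriage return that is
// immediately followed by a line feed has been dropped, so that each
// line ends with a single line feed. A carriage return that stands
// alone is left untouched.
std::string strip_carriage_returns(const std::string& text)
{
  std::string result;
  for (std::string::size_type i = 0; i < text.size(); ++i)
    {
    bool cr_before_lf = text[i] == carriage_return &&
                        i + 1 < text.size() &&
                        text[i + 1] == line_feed;
    if (!cr_before_lf)
      result += text[i];
    }
  return result;
}
```

A stream opened in text mode on a system that uses the two-character convention normally does this translation for you, so a step like this tends to matter only when the file came from a different kind of system or was opened in binary mode.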

There is a list of these "control characters" in the back of most programming reference manuals. Most of them aren't used very much any more, because we don't use teletypewriters for output; controlling the output of such devices was their main function.

All you should need is the newline and space characters to write your program. Not that you shouldn't know about the other control characters, but they aren't relevant to your question.

To read arbitrary characters from a file, I would use "get", which takes one argument of char type and reads the next character from the input stream. For example, if you're reading from a stream called instream, it would go like this:

char c;  // receives each character in turn

while (instream.get(c))
  {
  // process c
  }

When there aren't any more characters left in the input stream, the condition will be false, so the loop will end.
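As a rough sketch of how that loop might be put to work on the word-list problem, the function below reads a stream one character at a time and collects each distinct word. The names collect_words and collect_words_from are hypothetical, and I am assuming that a "word" is any run of letters or digits, with everything else (spaces, newlines, punctuation) acting as a separator. You would probably want to adjust that rule for apostrophes, hyphens, and upper versus lower case.

```cpp
#include <cctype>
#include <istream>
#include <set>
#include <sstream>
#include <string>

// Reads instream one character at a time with get() and returns the
// distinct words found. Assumption: a word is a run of letters or
// digits; any other character ends the current word.
std::set<std::string> collect_words(std::istream& instream)
{
  std::set<std::string> words;  // a set keeps only one copy of each word
  std::string current;          // the word being built up
  char c;

  while (instream.get(c))
    {
    if (std::isalnum(static_cast<unsigned char>(c)))
      current += c;
    else if (!current.empty())
      {
      words.insert(current);
      current.clear();
      }
    }

  if (!current.empty())         // the last word may have no separator after it
    words.insert(current);

  return words;
}

// Convenience wrapper for trying the function on text held in memory.
std::set<std::string> collect_words_from(const std::string& text)
{
  std::istringstream instream(text);
  return collect_words(instream);
}
```

To read from disk instead, you could pass collect_words a std::ifstream opened on your document. Because std::set discards duplicates as they are inserted, the "throw out all duplicates" step comes for free.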

-- Steve Heller (stheller@koyote.com), May 09, 1999.


Great. Now, when you say that get takes an argument of char type, do you mean that I have to pass it a char as an argument and it will return the next char in the stream? Does that mean I could pass it "short" or some other type as an argument and it would read in the next data as a short or whatever type I request? I wasn't sure what you meant.

Now, in your book you ask to let you know if I'd be interested in reading a book on class design, and of course the answer is yes. I actually expect to see a lot more books out of you as time passes. Your work will come to be acknowledged as the start of the revolution in technical references. Tell your editor to just shut up and publish whatever you want to give him. He'll profit from it.

-- Mike Mannakee (B3FEETBACK@AOL.COM), May 09, 1999.


The argument to get

The get function I used takes a single char variable as its argument and stores the next character from the stream in that variable. It works only with characters, because it is meant for reading unformatted data one character at a time, so passing it a short will not work. Of course, it is possible to write a function that reads arbitrary types in the same way, but you'd have to do that yourself. You could even call it "get"; so long as it has a different argument list, the compiler can tell it from the one in the library that takes a char.
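Here is one way such a do-it-yourself function might look. This is only a sketch: get for a short, put_raw, and round_trip are my own illustrative functions, not part of the library. It assumes the short was written to the stream as raw bytes by a program running on the same kind of machine, since the size and byte order of a short differ between systems.

```cpp
#include <istream>
#include <ostream>
#include <sstream>

// A user-written "get" for short. It copies sizeof(short) raw bytes
// from the stream into value. The stream is returned so that the
// result can be tested in a condition, just like the library's get.
std::istream& get(std::istream& instream, short& value)
{
  return instream.read(reinterpret_cast<char*>(&value), sizeof value);
}

// The matching writer: stores the raw bytes of a short in the stream.
std::ostream& put_raw(std::ostream& outstream, short value)
{
  return outstream.write(reinterpret_cast<const char*>(&value),
                         sizeof value);
}

// Writes a short to an in-memory stream and reads it back, to show
// the two functions working together.
short round_trip(short original)
{
  std::stringstream stream;
  put_raw(stream, original);

  short result = 0;
  get(stream, result);
  return result;
}
```

Note that this is for data stored in binary form. If the file contains numbers written out as text, such as "123", the usual >> operator is the right tool instead.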

As for my next book, it is going to be sort of a "technical memoir". It's about my 30 years in programming and how I would have done some of the projects I've worked on if I'd had C++ available. For more recent projects, it will discuss how I would have done them if I'd known C++ better.

Unfortunately, that may be my last book. Editorial interference isn't the problem, though. The difficulty is that I can't get my publisher to promote my books, so they don't sell enough copies for me to make a living. They have a unique product that gets raves from customers, but they won't spend any money to tell other potential customers about it. This is very frustrating, as you can easily imagine.

-- Steve Heller (stheller@koyote.com), May 09, 1999.

