Tokenizing - CompWisdom
About us  |  Why use us?  |  Press  |  Contact us

 

Topic: Tokenizing



  
 Token (parser) - Wikipedia, the free encyclopedia
In computing, a token is a primitive block of a structured text.
A lexical analyzer such as lex reads a text, and outputs the tokens.
Generally, whitespace is ignored, but sometimes it is important and tokenized.
http://en.wikipedia.org/wiki/Token_(parser)   (218 words)

  
 Tech Tips: June 23, 1998
In the string tokenizing example above, a string is tokenized, and the tokens are retrieved and printed in turn.
In this example tokens are numeric values that represent fields of some type.
In the example, a string of numeric test data is set up, and a StringTokenizer object is established to parse the string.
http://java.sun.com/developer/TechTips/1998/tt0623.html   (616 words)

  
 Motivation
Tokenization is a separate programming task (separate from reading input on the one hand, and parsing on the other hand).
Not only do we argue for considering tokenization as a separate task, we also argue that there are good reasons to implement such a tokenizer by means of a specialized program which constructs such a tokenizer on the basis of a number of regular expressions.
A further disadvantage for integrating reading and tokenizing is that such an integrated approach must take extreme care that in the context of errors in the tokenizer the input channel is properly reset.
http://odur.let.rug.nl/vannoord/papers/elex_prolog/node1.html   (828 words)

  
 SpamBayes: Background Reading
Before any change to the tokenizer or the algorithms was checked in, it was necessary to show that the change actually produced an improvement.
Tokenizing of headers is an area of much experimentation.
It's likely that remaining "easy wins" in the tokenizing are buried in some new way of tokenizing the headers.
http://spambayes.sourceforge.net/background.html   (2095 words)

  
 [No title]
The apparatus of claim 23, wherein if the document is of a mixed token type, the tokenizing mechanism is configured to use a first set of tokenizing instructions to tokenize a first section of the document, and to use a second set of tokenizing instructions to tokenize a second section of the document.
In one embodiment of the present invention, tokenizing the document involves using a first set of tokenizing instructions to tokenize a first section of the document, and using a second set of tokenizing instructions to tokenize a second section of the document.
The computer-readable storage medium of claim 12, wherein tokenizing the document involves using a first set of tokenizing instructions to tokenize a first section of the document, and using a second set of tokenizing instructions to tokenize a second section of the document.
http://www.wipo.int/cgi-pct/guest/getbykey5?KEY=01/50327.010712&ELEMENT_SET=DECL   (3691 words)

  
 Rule Sets
If two adjacent tokens are just text strings and neither is a tokenizing character, Sendmail creates the string by using the value of the first string, plus the value of the Blank Sub option, followed by the value of the second string.
If a match is not made by the fewest number of tokens, then Sendmail will add the next token and try to make the match again until a match is made or it runs out of tokens.
When a pattern matching operator can match multiple tokens, Sendmail uses a minimal matching algorithm.
http://alanpae.tripod.com/sendmail/Rule_Sets.htm   (3553 words)

  
 Elementary Language Processing: Tokenizing Text and Classifying Words
We further tokenize this into its component word tokens, and the result is stored as an attribute of the original token.
To summarize, we have just illustrated how, at its simplest, tokenization of a text can be carried out by converting the single string representing the text into a list of strings, each of which corresponds to a word.
Analogously, a parser will expect its input to be a sequence of word tokens rather than a sequence of individual characters.
http://nltk.sourceforge.net/tutorial/tokenization/nochunks.html   (5948 words)

  
 String Tokenizing
which return the number of tokens that are delimited by any white space in a given string, thus we know how many tokens there are and therefore can use this number as a loop parameter with which to process the string.
We can also use the string tokenizer to process input from a file line by line as shown in Table 9.
If we knew that the input file did not contain any blank lines, one way of avoiding the need to know in advance the number of lines in the input file, is to process the file until a line with no tokens is found and assume that this is the end of the file.
http://www.csc.liv.ac.uk/~frans/COMP101/AdditionalStuff/tokenizing1.html   (1293 words)

  
 Boost Char Separator
Whenever a delimiter is seen in the input sequence, the current token is finished, and a new token begins.
We also specify that empty tokens should show up in the output when two delimiters are next to each other.
We refer to delimiters that show up as output tokens as kept delimiters and delimiters that do now show up as output tokens as dropped delimiters.
http://www.boost.org/libs/tokenizer/char_separator.htm   (531 words)

  
 [No title]
Either the tokenizer needs to have an algorithm for redefining and automatically determining its characters classes (I can't convince myself that this is even a terminating algorithm or a computable problem since it edges toward arbitrary constraint satisfaction), or the user must character by character control the tokenizer as one can do in Common Lisp.
This buys static tokenizing and allows programmers to consistently see infix operator usage when scanning programs, no weird characters used in funny plces to make you read a line two or three times.
I claim this is much more functionality and hair to Dylan (even though it is a solved problem in Common Lisp) than is worth the small flexibility to introduce anything as an operator.
http://www.cs.cmu.edu/afs/cs.cmu.edu/project/gwydion-1/dylan-history/1994/01/01710   (877 words)

  
 CS 157 Introduction to Database Management Systems -FALL, 1998
In tokenizing your data, you need to combine the software tools and human judgment.
Take each sorted file from Home work #1 and tokenize it; the format is in the metadata table.
Take large sorted file from Home work #2 and tokenize it; the format is in the OracleSetup file.
http://www.mathcs.sjsu.edu/faculty/tylin/class/cs157A/auto_bk.html   (457 words)

  
 Tech Tips: July 23, 1998
The class uses an internal table to control how tokens are parsed, and this syntax table can be modified to change the parsing rules.
StreamTokenizer operates on input streams rather than strings, and each byte in the input stream is regarded as a character in the range '\u0000' through '\u00FF'.
The method resetSyntax is used to clear the internal syntax table, so that StreamTokenizer forgets any rules that it knows about parsing tokens.
http://java.sun.com/developer/TechTips/1998/tt0722.html   (647 words)

  
 [No title]
For tokenizing purposes, operators always divide one token from another, just as the characters in the master list did.
When two tokens of simple text are pasted together, the character defined by the
is a wildcard operator) or replacing tokens with others by position (
http://www.gamerz.net/rrognlie/batbook/1565928393_sendmail3-chp-18-sect-3.html   (655 words)

  
 Tokenizing
Either way, because the document is now "described" via the tokenizing process, the presentation of the data can now be separated from the logic required for the formatting of it.
So, the string: "This is a nice blog," is tokenized by your brain, by chunking out the various words from the mess of characters.
I always think of "tokenizing" as the recognition of tokens.
http://weblogs.asp.net/dneimke/archive/2003/06/12/8553.aspx   (911 words)

  
 Re: Tokenizing a String?
I have a tokenizer class that might work for you.
I feel like I could write a tokenizing >function on my own, but it would probably involving looping through, >examining every character of the string, and would be unusably slow.
But, it is fairly fast -- it can break a 4MB text file into 800,000 tokens in < 3 seconds -- and it might suit your needs.
http://support.realsoftware.com/listarchives/cgi-bin/mesg.cgi?a=realbasic-nug&i=p05200f05bb5987a71b53@[192.168.1.2]   (211 words)

  
 Rainbow
Rather than tokenizing the file directly, pass the file as standard input into this shell command, and tokenize the standard output of the shell command.
When indexing a file, rainbow turns the file's stream of characters into tokens by a process called tokenization or "lexing".
Once indexing is performed and a model has been archived to disk, rainbow can perform document classification.
http://www.cs.umass.edu/~mccallum/bow/rainbow   (2759 words)

  
 [z-machine] Tokenizing
But if you set up a line of text in memory, with capital letters, and used the tokenize opcode, then it would match dict words with capital letters.
So your '#Record' word would never match the input.
http://www.feelies.org/pipermail/z-machine/2005/000133.html   (217 words)

  
 XTF Under the Hood
After tokenizing, alll upper-case letters within tokens are converted to lower-case, which allows queries to be case-insensitive.
See Tokenizing below for details of how terms are parsed.
The following table gives examples of character strings that the tokenizer recognizes as terms.
http://xtf.sourceforge.net/WebDocs/HTML/XTF_Under_Hood/XTFUnderHood.html   (6753 words)

  
 ETL Functions For Narrative Data Mart Population
The token-recognition rules underlying search engines are quite basic compared to the more advanced software that can apply additional rules to the word-tokens output from a search engine.
For example, in an input document, if the token “D.C.” immediately follows the token “Washington,” our confidence is increased that the combination designates the name of a Location, rather than a Party such as “Washington Irving.”
Here’s a news flash: if you think you’re searching the web when you submit a search query using Yahoo!, Google or Alta Vista, you’re wrong—what you’re actually doing is querying a data mart.
http://www.b-eye-network.com/view/1238?jsessionid=af7ca00a74cc467154c27677a7d74ee4   (1523 words)

  
 CS362 2002F : Tokenizing with java.io.StreamTokenizer
In particular, you will investigate its default behavior, update it to tokenize Scheme, and consider its fitness for our lexical analyzer.
Are there other kinds of tokens you need to consider?
Using the general structure of your modified Scheme token printer and
http://www.math.grin.edu/~rebelsky/Courses/CS362/2002F/Labs/lab.04.html   (500 words)

  
 [Tutor] Issues with tokenizing a string
On Thursday 21 August 2003 12:25, Vicki Stanfield wrote: > I am trying to tokenize a string and then test on the values of the > tokens, some of which might be empty.
The wxTextCtrl will contain zero to several > space-delimited "tokens".
> > self.arguments = self.parameters.GetValue() > is what I use to read the string to tokenize.
http://mail.python.org/pipermail/tutor/2003-August/024863.html   (201 words)

  
 PHP Bugs: #24550: tokenizing & syntax highlighting crashes
I suspect the tokenizer does not properly prepare for parsing, but have no idea how to investigate further.
I need this one to be fixed for phpDocumentor 2.0 to work at all wiht __METHOD__ in source code.
Workx good with __FUNCTION__ constant, now I'm gonna use __FUNCTION__ in my script, but might be problem one day for others...
http://bugs.php.net/bug.php?id=24550   (412 words)

  
 Tokenizing infinite loop - GameDev.Net Discussion Forums
Posted - 1/9/2005 5:13:00 AM If your delimiter is just whitespace, there is a simple way to tokenize, by using a
Posted - 1/9/2005 3:53:42 AM // tokenize while(index != string::npos){ // push word string::size_type next_space = m_commandLine.find(' '); arguments.push_back(m_commandLine.substr(index, next_space)); // increment index if(next_space != string::npos) index = next_space + 1; else break; }
Posted - 1/9/2005 4:02:09 AM You aren't changing where it starts looking from, so it will find the first space every time.
http://www.gamedev.net/community/forums/topic.asp?topic_id=293276&whichpage=1&   (363 words)

  
 MLTT, RXRC: Lexical Tokenization - public cvs branch
A possibly non-deterministic tokenizer can be applied where several tokenization results are allowed for a given string.
This is slower but more powerful than the default deterministic tokenizing algorithm.
The output will be tokenized, one token on a line.
http://www.cis.upenn.edu/~cis639/docs/tokenize.html   (244 words)

  
 "Stripping" VBScript prior to tokenizing - dBforums
to create a wrapper for yylex that acts like a token buffer.
I'm working on being able to tokenize and parse VBScript using VBScript.
http://www.dbforums.com/showthread.php?t=567697   (2113 words)

  
 XML Little Languages
The lexical analysis phase of a language parser takes the tokens and ensures that the right tokens appear in the right place, and builds a tree representing the structure of the language as a result.
In general terms, a language parser splits the input language into recognizable tokens that it understands.
The tokenizing phase of an XML based little language is performed by the XML parser, and the recognized tokens we return are SAX events.
http://xml.sergeant.org/xmldevcon/littlelang.dkb?section=3   (158 words)

  
 tokenizing file parser example
tokenizing file parser example - by Borland Developer Support Staff
http://community.borland.com/article/0,1410,20493,0.html   (34 words)

  
 Signs on the Sand: Comment on Tokenizing in XSLT
A variety of problems have been solved using this functional tokenizer -- just have a look in xsl-list.
Or alternatively you could use a native binary written in c++ which is much faster, especially for large files.
Posted by: Dimitre Novatchev at May 2, 2003 05:59 PM
http://www.tkachenko.com/cgi-bin/mt-comments.cgi?entry_id=14   (96 words)

  
 Page 3
This is because a tokenizing transducer will be applied by the
it should be possible to write an upward oriented tokenizing transducer more
· The key rule for the first tokenizing example (p.
http://www.stanford.edu/~laurik/fsmbook/clarifications/tokfst-3.html   (117 words)

  
 SQL Server Forums at SQLTeam.com - Tokenizing a String
Programming today is a race between software engineers striving to build bigger and better idiot-proof programs, and the Universe trying to produce bigger and better idiots.
wondering if anyone knows an example of implementing a string tokenizer
SQL Server Forums at SQLTeam.com - Tokenizing a String
http://www.sqlteam.com/Forums/topic.asp?TOPIC_ID=50395   (230 words)

  
 Tokenizing
The parsing process is controlled by syntax tables and flags.
The attributes are white space, alphabetic, numeric, string quote, and comment character.
Each token that is parsed is placed in a category.
http://longwood.cs.ucf.edu/~workman/cop3330/JavaTutorial/j-javaio/j-javaio-12-5.html   (221 words)

  
 [No title]
Usually the query is fairly short, and therefore its vector is extremely sparse.
Product of token weights is zero and does not contribute to the dot product.
http://www.cs.utexas.edu/users/mooney/ir-course/slides/Implementation.ppt   (188 words)

  
 XML.com: XSLT 2 and Delimited Lists
Without storing the tokenized sequence in a separate variable as the previous example did, this example's
Let's look at a tokenizing example that attacks a more realistic problem, the SVG polygon element shown above.
The XPath 2.0 spec has more about the new sequences.
http://www.xml.com/pub/a/2003/05/07/tr.html   (1214 words)

  
 [No title]
This is incorrect; the reason the JavaScript-triggered loads prevent tokenizing is that they take place "right away" in other browsers, but that is not true of
Another problem caused by the same fix is that
* khtml/html/htmltokenizer.cpp: (HTMLTokenizer::write): Changed to use isScheduledLocationChangePending instead of isImmediateRedirectPending, because we do want to continue tokenizing if it's actually a redirect.
http://www.opensource.apple.com/darwinsource/10.3.6/WebCore-125.8.10/ChangeLog   (10641 words)

  
 Tokenizing Java Source Code (Java Developers Almanac Example)
can be used for simple parsing of a Java source file into tokens.
try { // Create the tokenizer to read from a file FileReader rd = new FileReader(
Tokenizing Java Source Code (Java Developers Almanac Example)
http://javaalmanac.com/egs/java.io/ParseJava.html   (70 words)

  
 Customized tokenization of domain specific text via rules corresponding to a speech recognition vocabulary (US6327561)
Method and apparatus for proofreading a document using a computer system which detects inconsistencies in style
Show 3 U.S. patent(s) that reference this one
The loading step of the inventive method can comprise: loading a speech recognition vocabulary; and, loading domain-specific tokenization rules corresponding to the speech recognition vocabulary.
http://www.delphion.com/details?pn=US06327561__   (348 words)

  
 IBM - PQ80244: URL DECODING IS TOKENIZING URLS BEFORE UNESCAPING
PQ81764: Configuring the trusted mode to determine private HTTP headers
* **************************************************************** * RECOMMENDATION: * **************************************************************** The Http Transport code is tokenizing first and then decoding, yielding a query string of name=value
Currently the appserver first unescapes the URL and then tokenizes it.
http://www.ibm.com/support/docview.wss?rs=180&uid=swg1PQ80244   (522 words)

  
 using stringstream for tokenizing a string
Here is a Tokenizer class from Codeproject, which can tokenize string separated by
#include #include #include int main() { std::string token, text("Here:is:some:text"); std::istringstream iss(text); while (getline(iss, token, ':')) { std::cout << token << std::endl; } return 0; } /* my output Here is some text */
//This is the default predicate for the Tokenize() function.
http://www.daniweb.com/techtalkforums/thread27905.html   (355 words)

  
 Incomprehensible (to me) tokenizing behavior
directory to your project and maintaining your own grammar-based tokenizer."
You need to copy StandardTokenizer.jj, change its package statement, add
http://java2.5341.com/msg/2700.html   (292 words)

  
 Parsing in c++ - dBforums
The string class (standard c++) has many functions for parsing and tokenizing.
Java has all that in standard SDK but I cannot find
http://www.dbforums.com/showthread.php?t=427063   (443 words)

  
 [Chapter 28] 28.3 The Workspace
The process of rewriting changes the tokens in the workspace:
The address to be rewritten is then tokenized and placed into that workspace.
The process of tokenizing addresses in the workspace is exactly the same as the tokenizing of rules that you saw before:
http://www.serve.com/~josh/books/tcpip/sendmail/ch28_03.htm   (131 words)

  
 kbAlertz: (815658) - This article describes how to use the XmlTextReader class to read the XML data from a file. The ...
This article describes how to use the XmlTextReader class to read the XML data from a file.
The XmlTextReader class provides direct parsing and tokenizing of the XML data.
This article describes how to do fast, tokenized stream access to the XML data instead of using an object model, such as the XML Document Object Model (DOM).
http://www.kbalertz.com/kb_815658.aspx   (1402 words)

  
 Journal - Team Oriented Tokenizing
Since I love macro wars (tongue-in-cheek), here's a first try of using just the stack for a recursion-safe tokenizer loop:
The trouble with the strtok routine is that it uses per-process data to keep track of the tokenizing: if you call some other code after the first call to strtok, you can never be sure that this code won't start a tokenizing session of its own, invaliding you sequence.
One such case are NULL-pointers for the STRING parameter; the macros don't handle these.
http://peisker.net/20041012.htm?s   (488 words)

  
 Fa-Fm
This is included here because I'm presently interested in the possibility of linking together a network of workstations as a distributed supercomputer running PVM, MPI or some other sort of software for distributed computing applications.
externally configurable language independent modules including phonesets, lexicons, letter-to-sound rules, tokenizing, part of speech tagging, intonation and duration;
It uses a token right media access control protocol.
http://stommel.tamu.edu/~baum/linuxlist/linuxlist/node17.html   (11838 words)

  
 Tokenizing a String Using PARSENAME
Author Eli Leiba brings us a way to split out portions of a string that contains tokens with a user defined function.
Read on to see how this is accomplished and the code used to perform the splitting.
Many postings and complaints about T-SQL deal with strings, but there are ways to work with it.
http://www.sqlservercentral.com/columnists/eLeiba/tokenizingastringusingparsename.asp   (125 words)

  
 Tokenizing
Easy to understand and to change, efficient, predictable.
sub tokens { my @tokens = split m{(\*\*
\d+ # Integer)}x, shift(); grep /\S/, @tokens; }
http://perl.plover.com/yak/regex/samples/slide077.html   (25 words)

  
 [No title]
Tokenizing (and parsing) should be unambiguous with the current definition.
We could define a syntax for "extensible mathematical operators" (i.e.
These would consist of characters that are illegal in normal >symbols, so there's no problem with tokenizing.
http://www.cs.cmu.edu/afs/cs.cmu.edu/project/gwydion-1/dylan-history/1994/01/01706   (440 words)

  
 Tokenizing lines - PSPP
This tokenizer recognizes two basic types of tokens.
Any number of hex digits is read and interpreted; only the lower 8 bits are used.
The second type is an identifier or string token.
http://www.gnu.org/software/pspp/manual/html_node/Tokenizing-lines.html   (173 words)

  
 Tokenizing a token - 3
string tokenization isn't as strong as it should be.
http://www.thescripts.com/forum/post50981-3.html   (72 words)

  
 public-qt-comments@w3.org from August 2003: by date
Re: [xslt2 func/op] tokenizing "abba" to ("a","b","b","a") Tobias Reif
http://lists.w3.org/Archives/Public/public-qt-comments/2003Aug   (886 words)

Compwisdom
 About us   |  Why use us?   |  Press   |  Contact us

 Copyright © 2006 CompWisdom.com Usage implies agreement with terms.