java.lang.Object
  weka.filters.Filter
    weka.filters.unsupervised.attribute.StringToWordVector
public class StringToWordVector
Converts String attributes into a set of attributes representing word occurrence information from the text contained in the strings. The set of words (attributes) is determined by the first batch filtered (typically training data).
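To illustrate the statement that the set of words is fixed by the first batch, here is a minimal, self-contained sketch in plain Java. It is not the Weka implementation: the class `FirstBatchDictionarySketch`, its methods, and the whitespace-only tokenization are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; names and tokenization are hypothetical, not Weka's code. */
final class FirstBatchDictionarySketch {
    // word -> column index; filled from the first batch and never extended afterwards
    private final Map<String, Integer> dictionary = new LinkedHashMap<>();

    /** Builds the dictionary from the first batch (typically the training data). */
    void buildDictionary(List<String> firstBatch) {
        for (String doc : firstBatch) {
            for (String word : doc.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    // size() is read before insertion, so a new word gets the next free index
                    dictionary.putIfAbsent(word, dictionary.size());
                }
            }
        }
    }

    /** Maps a later document onto the fixed dictionary; unseen words are dropped. */
    int[] toWordPresence(String doc) {
        int[] vector = new int[dictionary.size()];
        for (String word : doc.trim().split("\\s+")) {
            Integer index = dictionary.get(word);
            if (index != null) {
                vector[index] = 1;
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        FirstBatchDictionarySketch sketch = new FirstBatchDictionarySketch();
        sketch.buildDictionary(Arrays.asList("the cat sat", "the dog"));
        // "a", "and" and "bird" were not in the first batch, so they are ignored
        System.out.println(Arrays.toString(sketch.toWordPresence("a dog and the bird")));
        // prints [1, 0, 0, 1]
    }
}
```

Words that appear only in later batches (for example test data) contribute no attribute, which is why the training data is normally filtered first.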
| Constructor Summary | |
|---|---|
| StringToWordVector() | Default constructor. |
| StringToWordVector(int wordsToKeep) | Constructor that allows specification of the target number of words in the output. |
| Method Summary | | |
|---|---|---|
| java.lang.String | attributeNamePrefixTipText() | Returns the tip text for this property. |
| boolean | batchFinished() | Signify that this batch of input to the filter is finished. |
| java.lang.String | delimitersTipText() | Returns the tip text for this property. |
| java.lang.String | getAttributeNamePrefix() | Get the attribute name prefix. |
| java.lang.String | getDelimiters() | Get the value of delimiters. |
| boolean | getIDFTransform() | Gets whether the word frequencies in a document should be transformed into fij*log(num of Docs/num of Docs with word i), where fij is the frequency of word i in document (instance) j. |
| boolean | getLowerCaseTokens() | Gets whether the tokens are to be downcased. |
| boolean | getNormalizeDocLength() | Gets whether the word frequencies for a document (instance) should be normalized. |
| boolean | getOnlyAlphabeticTokens() | Gets whether the tokens are to be formed only from contiguous alphabetic sequences. |
| java.lang.String[] | getOptions() | Gets the current settings of the filter. |
| boolean | getOutputWordCounts() | Gets whether output instances contain 0 or 1 indicating word presence, or word counts. |
| Range | getSelectedRange() | Get the value of m_SelectedRange. |
| boolean | getTFTransform() | Gets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j. |
| boolean | getUseStoplist() | Gets whether the words on the stoplist are to be ignored (the stoplist is in weka.core.StopWords). |
| int | getWordsToKeep() | Gets the number of words (per class if there is a class attribute assigned) to attempt to keep. |
| java.lang.String | globalInfo() | Returns a string describing this filter. |
| java.lang.String | IDFTransformTipText() | Returns the tip text for this property. |
| boolean | input(Instance instance) | Input an instance for filtering. |
| java.util.Enumeration | listOptions() | Returns an enumeration describing the available options. |
| java.lang.String | lowerCaseTokensTipText() | Returns the tip text for this property. |
| static void | main(java.lang.String[] argv) | Main method for testing this class. |
| java.lang.String | normalizeDocLengthTipText() | Returns the tip text for this property. |
| java.lang.String | onlyAlphabeticTokensTipText() | Returns the tip text for this property. |
| java.lang.String | outputWordCountsTipText() | Returns the tip text for this property. |
| void | setAttributeNamePrefix(java.lang.String newPrefix) | Set the attribute name prefix. |
| void | setDelimiters(java.lang.String newDelimiters) | Set the value of delimiters. |
| void | setIDFTransform(boolean IDFTransform) | Sets whether the word frequencies in a document should be transformed into fij*log(num of Docs/num of Docs with word i), where fij is the frequency of word i in document (instance) j. |
| boolean | setInputFormat(Instances instanceInfo) | Sets the format of the input instances. |
| void | setLowerCaseTokens(boolean downCaseTokens) | Sets whether the tokens are to be downcased. |
| void | setNormalizeDocLength(boolean normalizeDocLength) | Sets whether the word frequencies for a document (instance) should be normalized. |
| void | setOnlyAlphabeticTokens(boolean tokenizeOnlyAlphabeticSequences) | Sets whether tokens are to be formed only from contiguous alphabetic character sequences. |
| void | setOptions(java.lang.String[] options) | Parses a given list of options controlling the behaviour of this object. |
| void | setOutputWordCounts(boolean outputWordCounts) | Sets whether output instances contain 0 or 1 indicating word presence, or word counts. |
| void | setSelectedRange(java.lang.String newSelectedRange) | Set the value of m_SelectedRange. |
| void | setTFTransform(boolean TFTransform) | Sets whether the word frequencies should be transformed into log(1+fij), where fij is the frequency of word i in document (instance) j. |
| void | setUseStoplist(boolean useStoplist) | Sets whether the words that are on the stoplist are to be ignored (the stoplist is in weka.core.StopWords). |
| void | setWordsToKeep(int newWordsToKeep) | Sets the number of words (per class if there is a class attribute assigned) to attempt to keep. |
| java.lang.String | TFTransformTipText() | Returns the tip text for this property. |
| java.lang.String | useStoplistTipText() | Returns the tip text for this property. |
| java.lang.String | wordsToKeepTipText() | Returns the tip text for this property. |
| Methods inherited from class weka.filters.Filter |
|---|
| batchFilterFile, filterFile, getOutputFormat, inputFormat, isOutputFormatDefined, numPendingOutput, output, outputFormat, outputPeek, useFilter |

| Methods inherited from class java.lang.Object |
|---|
| equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Constructor Detail |
|---|
public StringToWordVector()
public StringToWordVector(int wordsToKeep)
wordsToKeep - the number of words in the output vector (per class if assigned).

| Method Detail |
|---|
public java.util.Enumeration listOptions()
listOptions in interface OptionHandler
public void setOptions(java.lang.String[] options)
throws java.lang.Exception
-C
Output word counts rather than boolean word presence.
-D delimiter_characters
Specify set of delimiter characters
(default: " \n\t.,:'\\\"()?!")
-R index1,index2-index4,...
Specify list of string attributes to convert to words.
(default: all string attributes)
-P attribute_name_prefix
Specify a prefix for the created attribute names.
(default: "")
-W number_of_words_to_keep
Specify number of word fields to create.
Other, less useful words will be discarded.
(default: 1000)
-A
Only tokenize contiguous alphabetic sequences.
-L
Convert all tokens to lower case before adding to the dictionary.
-S
Do not add words to the dictionary which are on the stop list.
-T
Transform word frequencies to log(1+fij) where fij is frequency of word i
in document j.
-I
Transform word frequencies to fij*log(numOfDocs/numOfDocsWithWordi)
where fij is frequency of word i in document j.
-N
Normalize word frequencies for each document(instance). The frequencies
are normalized to average length of the documents specified in input
format.
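The -T and -I formulas above can be written out directly. The following is a minimal sketch, not Weka's code: the class and method names are hypothetical, and it assumes that "log" means the natural logarithm.

```java
/** Illustrative sketch of the -T and -I formulas; names are hypothetical. */
final class WordFrequencyTransformSketch {

    /** -T: log(1 + fij), where fij is the frequency of word i in document j. */
    static double tfTransform(double fij) {
        return Math.log(1.0 + fij); // assumption: natural logarithm
    }

    /** -I: fij * log(numOfDocs / numOfDocsWithWordi). */
    static double idfTransform(double fij, int numOfDocs, int numOfDocsWithWord) {
        // cast to double so the ratio is not truncated by integer division
        return fij * Math.log((double) numOfDocs / numOfDocsWithWord);
    }

    public static void main(String[] args) {
        // a word that occurs 3 times in a document
        System.out.println(tfTransform(3));
        // the same word, found in 10 of 100 documents
        System.out.println(idfTransform(3, 100, 10));
        // a word found in every document carries no weight under -I
        System.out.println(idfTransform(3, 100, 100));
    }
}
```

Under -I, a word that occurs in every document is weighted by log(1) = 0, so very common words are suppressed.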
setOptions in interface OptionHandler
options - the list of options as an array of strings
java.lang.Exception - if an option is not supported
public java.lang.String[] getOptions()
getOptions in interface OptionHandler
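Both setOptions and getOptions work with a plain array of strings using the flags documented for setOptions. As a rough sketch of how such an array might be assembled (the helper class `OptionsArraySketch` and its parameters are hypothetical; only the flag names -C, -W, -L and -S come from this page):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch; the helper itself is hypothetical, only the flags are documented. */
final class OptionsArraySketch {

    /** Assembles an option array in the form expected by setOptions(String[]). */
    static String[] buildOptions(boolean outputWordCounts, int wordsToKeep,
                                 boolean lowerCaseTokens, boolean useStoplist) {
        List<String> options = new ArrayList<>();
        if (outputWordCounts) {
            options.add("-C"); // word counts instead of boolean presence
        }
        options.add("-W");     // -W takes its value as the next array element
        options.add(Integer.toString(wordsToKeep));
        if (lowerCaseTokens) {
            options.add("-L");
        }
        if (useStoplist) {
            options.add("-S");
        }
        return options.toArray(new String[0]);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(buildOptions(true, 500, true, false)));
        // prints [-C, -W, 500, -L]
    }
}
```

Flags that take a value, such as -W, occupy two consecutive array elements; boolean flags occupy one.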
public boolean setInputFormat(Instances instanceInfo)
throws java.lang.Exception
setInputFormat in class Filter
instanceInfo - an Instances object containing the input instance structure (any instances contained in the object are ignored - only the structure is required).
java.lang.Exception - if the input format can't be set successfully
public boolean input(Instance instance)
throws java.lang.Exception
input in class Filter
instance - the input instance.
java.lang.IllegalStateException - if no input structure has been defined.
java.lang.NullPointerException - if the input format has not been
defined.
java.lang.Exception - if the input instance was not of the correct
format or if there was a problem with the filtering.
public boolean batchFinished()
throws java.lang.Exception
batchFinished in class Filter
java.lang.IllegalStateException - if no input structure has been defined.
java.lang.NullPointerException - if no input structure has been defined.
java.lang.Exception - if there was a problem finishing the batch.
public java.lang.String globalInfo()
public boolean getOutputWordCounts()
public void setOutputWordCounts(boolean outputWordCounts)
outputWordCounts - true if word counts should be output.
public java.lang.String outputWordCountsTipText()
public java.lang.String getDelimiters()
public void setDelimiters(java.lang.String newDelimiters)
newDelimiters - Value to assign to delimiters.
public java.lang.String delimitersTipText()
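To show what a delimiter set does to a string, here is a minimal sketch using `java.util.StringTokenizer`. It assumes that every character in the delimiter string acts as a separator and that the documented default is read as a Java string literal; the class name and constant are hypothetical, and this is not taken from the Weka source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

/** Illustrative sketch; assumes each delimiter character separates tokens. */
final class DelimiterTokenizerSketch {
    // assumption: the documented default, read as a Java string literal
    // (space, newline, tab, . , : ' backslash " ( ) ? !)
    static final String DEFAULT_DELIMITERS = " \n\t.,:'\\\"()?!";

    /** Splits text at any of the given delimiter characters; empty tokens are skipped. */
    static List<String> tokenize(String text, String delimiters) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(text, delimiters);
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello, world! (It's fine.)", DEFAULT_DELIMITERS));
        // prints [Hello, world, It, s, fine]
    }
}
```

Because the apostrophe is in the delimiter set, "It's" is split into "It" and "s"; a custom delimiter string passed to setDelimiters could avoid that.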
public Range getSelectedRange()
public void setSelectedRange(java.lang.String newSelectedRange)
newSelectedRange - Value to assign to m_SelectedRange.
public java.lang.String getAttributeNamePrefix()
public void setAttributeNamePrefix(java.lang.String newPrefix)
newPrefix - String to use as the attribute name prefix.
public java.lang.String attributeNamePrefixTipText()
public int getWordsToKeep()
public void setWordsToKeep(int newWordsToKeep)
newWordsToKeep - the target number of words in the output vector (per class if assigned).
public java.lang.String wordsToKeepTipText()
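The documentation describes the number of words as a target that the filter will "attempt to keep". One plausible reading, sketched below, is a count threshold: find the count of the N-th most frequent word and keep every word at or above it, so ties can push the result past N. This is an assumption for illustration, not a description of Weka's actual pruning; the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative sketch of threshold-based pruning; not taken from the Weka source. */
final class WordsToKeepSketch {

    /** Keeps every word whose count reaches the count of the N-th most frequent word. */
    static Set<String> prune(Map<String, Integer> counts, int wordsToKeep) {
        Set<String> kept = new TreeSet<>();
        if (counts.isEmpty() || wordsToKeep <= 0) {
            return kept;
        }
        // counts in ascending order
        int[] sorted = counts.values().stream().mapToInt(Integer::intValue).sorted().toArray();
        // position of the N-th largest count (or the smallest, if fewer than N words exist)
        int position = Math.max(0, sorted.length - wordsToKeep);
        int threshold = sorted[position];
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= threshold) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("a", 5);
        counts.put("b", 3);
        counts.put("c", 3);
        counts.put("d", 1);
        // "b" and "c" tie at the threshold, so three words survive a target of two
        System.out.println(prune(counts, 2)); // prints [a, b, c]
    }
}
```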
public boolean getTFTransform()
public void setTFTransform(boolean TFTransform)
TFTransform - true if word frequencies are to be transformed.
public java.lang.String TFTransformTipText()
public boolean getIDFTransform()
public void setIDFTransform(boolean IDFTransform)
IDFTransform - true if the word frequencies are to be transformed.
public java.lang.String IDFTransformTipText()
public boolean getNormalizeDocLength()
public void setNormalizeDocLength(boolean normalizeDocLength)
normalizeDocLength - true if word frequencies are to be normalized.
public java.lang.String normalizeDocLengthTipText()
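The -N option description says frequencies are normalized to the average length of the documents. The sketch below shows one way this could work, under the explicit assumption that a document's "length" is the Euclidean norm of its word-frequency vector; the page does not state which definition of length is used, and all names here are hypothetical.

```java
/** Illustrative sketch; assumes "document length" means the Euclidean norm. */
final class DocLengthNormalizationSketch {

    /** Euclidean norm of a word-frequency vector (assumed definition of length). */
    static double length(double[] frequencies) {
        double sumOfSquares = 0.0;
        for (double f : frequencies) {
            sumOfSquares += f * f;
        }
        return Math.sqrt(sumOfSquares);
    }

    /** Average length over a batch of documents; 0 for an empty batch. */
    static double averageLength(double[][] documents) {
        if (documents.length == 0) {
            return 0.0;
        }
        double total = 0.0;
        for (double[] doc : documents) {
            total += length(doc);
        }
        return total / documents.length;
    }

    /** Rescales a document so that its length equals the batch average. */
    static double[] normalize(double[] frequencies, double averageLength) {
        double[] result = new double[frequencies.length];
        double docLength = length(frequencies);
        if (docLength == 0.0) {
            return result; // an empty document stays all zeros
        }
        for (int i = 0; i < frequencies.length; i++) {
            result[i] = frequencies[i] * averageLength / docLength;
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] batch = {{3, 4}, {0, 10}};   // lengths 5 and 10
        double average = averageLength(batch);  // 7.5
        double[] scaled = normalize(batch[0], average);
        System.out.println(scaled[0] + ", " + scaled[1]); // prints 4.5, 6.0
    }
}
```

After rescaling, long and short documents contribute vectors of the same overall magnitude, so raw document size does not dominate the frequencies.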
public boolean getOnlyAlphabeticTokens()
public void setOnlyAlphabeticTokens(boolean tokenizeOnlyAlphabeticSequences)
tokenizeOnlyAlphabeticSequences - should be set to true if only alphabetic tokens should be formed.
public java.lang.String onlyAlphabeticTokensTipText()
public boolean getLowerCaseTokens()
public void setLowerCaseTokens(boolean downCaseTokens)
downCaseTokens - should be true if only lower case tokens are to be formed.
public java.lang.String lowerCaseTokensTipText()
public boolean getUseStoplist()
public void setUseStoplist(boolean useStoplist)
useStoplist - true if the tokens that are on a stoplist are to be ignored.
public java.lang.String useStoplistTipText()
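The three token-level switches (only alphabetic tokens, lower-case tokens, stoplist) can be pictured as successive filters on the token stream. The following sketch is illustrative only: the class name, the tiny stoplist, the ASCII-only definition of "alphabetic", and the case-insensitive stoplist match are all assumptions, and the real stoplist lives in weka.core.StopWords.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch of the token-level switches; not the Weka implementation. */
final class TokenFilterSketch {
    // hypothetical stand-in for the stoplist in weka.core.StopWords
    static final Set<String> STOPLIST = new HashSet<>(Arrays.asList("the", "a", "of"));
    // assumption: "alphabetic" means ASCII letters
    private static final Pattern ALPHABETIC = Pattern.compile("[A-Za-z]+");

    static List<String> tokens(String text, boolean onlyAlphabetic,
                               boolean lowerCase, boolean useStoplist) {
        List<String> raw = new ArrayList<>();
        if (onlyAlphabetic) {
            // take each maximal run of letters as a token
            Matcher matcher = ALPHABETIC.matcher(text);
            while (matcher.find()) {
                raw.add(matcher.group());
            }
        } else {
            for (String token : text.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    raw.add(token);
                }
            }
        }
        List<String> result = new ArrayList<>();
        for (String token : raw) {
            String word = lowerCase ? token.toLowerCase(Locale.ROOT) : token;
            // assumption: stoplist matching ignores case
            if (useStoplist && STOPLIST.contains(word.toLowerCase(Locale.ROOT))) {
                continue;
            }
            result.add(word);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("The price of 2 apples", true, true, true));
        // prints [price, apples]
    }
}
```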
public static void main(java.lang.String[] argv)
argv - should contain arguments to the filter:
use -h for help