A file can be represented in terms of its meaning, its keywords, all of its words, or its shingles (sentences or other contiguous pieces of text). Each of these units can in turn be collected as a set, a bag, a vector, or a distribution.
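So a file can be represented as a vector of words, a distribution over shingles (how likely it is that a random file contains a particular shingle), or a set of keywords. The unit (meaning, keywords, words, shingles) and the container (set, bag, vector, distribution) are orthogonal dimensions, so any unit can be paired with any container.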
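To make the pairing of unit and container concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the function names are hypothetical, words are split on whitespace, a shingle is taken to be k consecutive words, and the "distribution" is estimated as the fraction of files in a small corpus that contain each shingle.

```python
from collections import Counter


def words(text):
    """Unit: all words (assumed here to be lowercase whitespace tokens)."""
    return text.lower().split()


def shingles(text, k=2):
    """Unit: shingles, taken here as every run of k consecutive words."""
    toks = words(text)
    return [" ".join(toks[i:i + k]) for i in range(len(toks) - k + 1)]


def as_set(units):
    """Container: set. Duplicates are dropped."""
    return set(units)


def as_bag(units):
    """Container: bag. Duplicates are kept as counts."""
    return Counter(units)


def as_vector(units, vocabulary):
    """Container: vector. One count per vocabulary entry, in a fixed order."""
    counts = Counter(units)
    return [counts[term] for term in vocabulary]


def containment_frequency(corpus, k=2):
    """Container: distribution over shingles.

    For each shingle, the fraction of files in `corpus` that contain it,
    i.e. an estimate of how likely a random file is to contain it.
    """
    files_containing = Counter()
    for text in corpus:
        # Use a set so each file counts a shingle at most once.
        files_containing.update(set(shingles(text, k)))
    return {s: n / len(corpus) for s, n in files_containing.items()}
```

For example, `as_set(words(text))` and `as_bag(shingles(text))` apply different containers to different units of the same file, and `containment_frequency(["the cat sat", "the cat ran", "a dog ran"])` estimates the shingle distribution from three tiny files.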
PS: A set does not allow duplicate elements, but a bag can contain them.
-Dinu