Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
18 changes: 18 additions & 0 deletions INSTALLING.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
# Installing Topfew

Each Topfew [release](https://github.com/timbray/topfew/releases) comes with binaries built for both the x86 and ARM
flavors of Linux, MacOS, and Windows.

Topfew comes with a Makefile which is uncomplicated. Typing `make` will create an executable named `tf`,
created by `go build` with no options, in the `./bin` directory.

## Arch Linux

Topfew [is available](https://aur.archlinux.org/packages/topfew) in the
[Arch User Repository](https://wiki.archlinux.org/title/Arch_User_Repository) (AUR).
If you have an AUR pacman wrapper installed you can install it directly. Otherwise, to install Topfew as an Arch package:
```
git clone https://aur.archlinux.org/topfew.git
cd topfew
makepkg -i
```
37 changes: 37 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -70,6 +70,13 @@ If no fieldlist is provided, **tf** treats the whole input record as a single fi
Provides a regular expression that is used as a field separator instead of the default white space.
This is likely to incur a significant performance cost.

`-q, --quotedfields`

Some files, for example Apache httpd logs, use space-separation but also
allow spaces within fields which are delimited by `"`. The -q/--quotedfields
argument allows **tf** to process these correctly. It is an error to specify both
-p and -q.

`-g regexp`, `--grep regexp`

The initial **g** suggests `grep`.
Expand Down Expand Up @@ -114,6 +121,36 @@ Records are separated by newlines, fields within records by white space, defined

The field separator can be overridden with the --fieldseparator option.

## Case study: Apache access_log

Here is a line from an Apache httpd `access_log` file. For readability, the fields are
separated by line-breaks and numbered. Note that the fields are mostly space-separated, but that field 6,
summarizing the request and its result, is delimited by quote characters `"`.

```
1. 202.113.19.244
2. -
3. -
4. [12/Mar/2007:08:04:39
5. -0800]
6. "GET /ongoing/picInfo.xml?o=http://www.tbray.org/ongoing/When/200x/2007/03/10/Beautiful-Code HTTP/1.1"
7. 200
8. 137
9. "http://www.tbray.org/ongoing/When/200x/2007/03/10/Beautiful-Code"
10. "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2) Gecko/20070219 Firefox/2.0.0.2"
```

The fetch of `picInfo.xml` signals that this is an actual browser request, likely signifying that
a human was involved; the URL following the `o=` is the resource the human looked at. Here is a
**tf** invocation that yields a list of the top 5 URLs that were fetched by a human:

```shell
tf -g picInfo.xml -f 6 -q -s '\?utm.*' '' -s " HTTP/..." "" -s "GET .*\/ongoing" ""
```

Note the `-g` to select only lines with `picInfo.xml`, the `-q` to request correct processing
of quote-delimited fields, and the sequence of `-s` patterns to clean up the results.

## Performance issues

Since the effect of topfew can be exactly duplicated with a combination of `awk`, `grep`, `sed` and `sort`, you wouldn’t be using it if you didn’t care about performance.
Expand Down
14 changes: 13 additions & 1 deletion internal/config.go
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,7 @@ type config struct {
filter filters
width int
sample bool
quotedFields bool
}

func Configure(args []string) (*config, error) {
Expand Down Expand Up @@ -75,6 +76,8 @@ func Configure(args []string) (*config, error) {
}
case arg == "--sample":
config.sample = true
case arg == "--quotedfields" || arg == "-q":
config.quotedFields = true
case arg == "-h" || arg == "-help" || arg == "--help":
fmt.Println(instructions)
os.Exit(0)
Expand All @@ -101,6 +104,9 @@ func Configure(args []string) (*config, error) {
}
i++
}
if (config.fieldSeparator != nil) && config.quotedFields {
err = errors.New("only one of -p/--fieldseparator and -q/--quotedfields may be specified")
}

return &config, err
}
Expand Down Expand Up @@ -132,7 +138,8 @@ order of occurrences.
Usage: tf
-n, --number (output line count) [default is 10]
-f, --fields (field list) [default is the whole record]
-p, --fieldseparator (field separator regex) [default is white space]
-p, --fieldseparator (field separator regex) [default is white space]
-q, --quotedfields [default is false]
-g, --grep (regexp) [may repeat, default is accept all]
-v, --vgrep (regexp) [may repeat, default is reject none]
-s, --sed (regexp) (replacement) [may repeat, default is no changes]
Expand All @@ -151,6 +158,11 @@ Fields are separated by white space (spaces or tabs) by default.
This can be overridden with the --fieldseparator option, at some cost in
performance.

Some files, for example Apache httpd logs, use space-separation but also
allow spaces within fields which are quoted with ("). The -q/--quotedfields
allows tf to process these correctly. It is an error to specify both
-p and -q.

The regexp-valued fields work as follows:
-g/--grep discards records that don't match the regexp (g for grep)
-v/--vgrep discards records that do match the regexp (v for grep -v)
Expand Down
2 changes: 2 additions & 0 deletions internal/config_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -15,10 +15,12 @@ func TestArgSyntax(t *testing.T) {
{"--sed"}, {"-s", "x"}, {"--sample", "--sed", "1"},
{"--width", "a"}, {"-w", "0"}, {"--sample", "-w"},
{"--sample", "-p"}, {"--fieldseparator", "a["},
{"--fieldseparator", "x", "-q"}, {"--quotedfields", "-f", "z"},
}

// not testing -h/--help because it'd be extra work to avoid printing out the usage
goods := [][]string{
{"-q", "fname"}, {"--quotedfields"},
{"--number", "1"}, {"-n", "5"},
{"--fields", "1"}, {"-f", "3,5"},
{"--grep", "re1"}, {"-g", "re2"},
Expand Down
185 changes: 149 additions & 36 deletions internal/keyfinder.go
Original file line number Diff line number Diff line change
Expand Up @@ -21,78 +21,127 @@ const NER = "not enough bytes in record"
// does mean that the contents of the field are only valid until you call getKey again, and also that
// the keyFinder type is not thread-safe
type keyFinder struct {
fields []uint
key []byte
separator *regexp.Regexp
fields []uint
key []byte
separator *regexp.Regexp
quotedFields bool
}

// newKeyFinder creates a new Key finder with the supplied field numbers, the input should be 1 based.
// keyFinder is not thread-safe, you should clone it for each goroutine that uses it.
func newKeyFinder(keys []uint, separator *regexp.Regexp) *keyFinder {
func newKeyFinder(keys []uint, separator *regexp.Regexp, quotedFields bool) *keyFinder {
kf := keyFinder{
key: make([]byte, 0, 128),
}
for _, knum := range keys {
kf.fields = append(kf.fields, knum-1)
}
kf.separator = separator
kf.quotedFields = quotedFields
return &kf
}

// clone returns a new keyFinder with the same configuration. Each goroutine should use its own
// keyFinder instance.
func (kf *keyFinder) clone() *keyFinder {
return &keyFinder{
fields: kf.fields,
key: make([]byte, 0, 128),
separator: kf.separator,
fields: kf.fields,
key: make([]byte, 0, 128),
separator: kf.separator,
quotedFields: kf.quotedFields,
}
}

// getKey extracts a key from the supplied record. This is applied to every record,
// so efficiency matters.
func (kf *keyFinder) getKey(record []byte) ([]byte, error) {
// if there are no Key-finders just return the record, minus any trailing newlines
// chomp
if record[len(record)-1] == '\n' {
record = record[:len(record)-1]
}
// if there are no Key-finders the key is the record
if len(kf.fields) == 0 {
if record[len(record)-1] == '\n' {
record = record[0 : len(record)-1]
}
return record, nil
}
var err error
kf.key = kf.key[:0]
if kf.separator == nil {
field := 0
index := 0
first := true

// for each field in the Key
for _, keyField := range kf.fields {
// bypass fields before the one we want
for field < int(keyField) {
index, err = pass(record, index)
// no regex provided, we're doing space-separation
if kf.quotedFields {
// if we're doing apache httpd style access_log files, with some "-quoted fields
field := 0
index := 0
first := true

// for each field in the key
for _, keyField := range kf.fields {
// bypass fields before the one we want
for field < int(keyField) {
index, err = passQuoted(record, index)
if err != nil {
return nil, err
}
// in the special case where we might have just passed a quoted fields, we will
// advance index past the closing quote
if index < len(record) && record[index] == '"' {
index++
}
field++
}

// join(' ', kf)
if first {
first = false
} else {
kf.key = append(kf.key, ' ')
}

kf.key, index, err = gatherQuoted(kf.key, record, index)
if err != nil {
return nil, err
}
// in the special case where we might have just passed a quoted fields, we will
// advance index past the closing quote
if index < len(record) && record[index] == '"' {
index++
}
field++
}
} else {
// basic space-separation
field := 0
index := 0
first := true

// join(' ', kf)
if first {
first = false
} else {
kf.key = append(kf.key, ' ')
}
// for each field in the Key
for _, keyField := range kf.fields {
// bypass fields before the one we want
for field < int(keyField) {
index, err = pass(record, index)
if err != nil {
return nil, err
}
field++
}

// attach desired field to Key
kf.key, index, err = gather(kf.key, record, index)
if err != nil {
return nil, err
}
// join(' ', kf)
if first {
first = false
} else {
kf.key = append(kf.key, ' ')
}

field++
// attach desired field to Key
kf.key, index, err = gather(kf.key, record, index)
if err != nil {
return nil, err
}

field++
}
}
} else {
// regex separator provided, less code but probably slower
allFields := kf.separator.Split(string(record), -1)
for i, field := range kf.fields {
if int(field) >= len(allFields) {
Expand All @@ -107,9 +156,10 @@ func (kf *keyFinder) getKey(record []byte) ([]byte, error) {
return kf.key, err
}

// pull in the bytes from a desired field
// gather pulls in the bytes from a desired field, and leaves index positioned at the first white-space
// character following the field, or at the end of the record, i.e. len(record)
func gather(key []byte, record []byte, index int) ([]byte, int, error) {
// eat leading space
// eat leading space - if we're already at the end of the record, the loop is a no-op
for index < len(record) && (record[index] == ' ' || record[index] == '\t') {
index++
}
Expand All @@ -118,13 +168,49 @@ func gather(key []byte, record []byte, index int) ([]byte, int, error) {
}

// copy Key bytes
for index < len(record) && record[index] != ' ' && record[index] != '\t' && record[index] != '\n' {
key = append(key, record[index])
startAt := index
for index < len(record) && record[index] != ' ' && record[index] != '\t' {
index++
}
key = append(key, record[startAt:index]...)
return key, index, nil
}

// same semantics as gather, but respects quoted fields that might create spaces. Leaves the index
// value pointing at the closing quote
func gatherQuoted(key []byte, record []byte, index int) ([]byte, int, error) {
// eat leading space
for index < len(record) && (record[index] == ' ' || record[index] == '\t') {
index++
}
if index >= len(record) {
return nil, 0, errors.New(NER)
}

if record[index] == '"' {
index++
startAt := index
for index < len(record) && record[index] != '"' {
index++
}
key = append(key, record[startAt:index]...)
// if we hit end-of-record before the closing quote, that's an error
if index == len(record) {
return nil, 0, errors.New(NER)
}
} else {
startAt := index
for index < len(record) && record[index] != ' ' && record[index] != '\t' {
index++
}
key = append(key, record[startAt:index]...)
}
return key, index, nil
}

// pass moves the index variable past any white space and a space-separated field,
// leaving index pointing at the first white-space character after the field or
// at the end of record, i.e. == len(record)
func pass(record []byte, index int) (int, error) {
// eat leading space
for index < len(record) && (record[index] == ' ' || record[index] == '\t') {
Expand All @@ -138,3 +224,30 @@ func pass(record []byte, index int) (int, error) {
}
return index, nil
}

// same semantics as pass, but for quoted fields. Leaves the index value pointing at the
// closing "
func passQuoted(record []byte, index int) (int, error) {
// eat leading space
for index < len(record) && (record[index] == ' ' || record[index] == '\t') {
index++
}
if index == len(record) {
return 0, errors.New(NER)
}
if record[index] == '"' {
index++
for index < len(record) && record[index] != '"' {
index++
}
// if we hit end of record before the closing quote, that's a bug
if index >= len(record) {
return 0, errors.New(NER)
}
} else {
for index < len(record) && record[index] != ' ' && record[index] != '\t' {
index++
}
}
return index, nil
}
Loading