
Here are some tips for handling log file extracts. Suppose we're looking at an Enterprise Splunk extract. We can use Splunk to explore the data, or we can get a simple extract and play around with the data in Python.

Running different experiments in Python seems more efficient than trying to do this exploratory work in Splunk, mainly because we can do whatever we want with the data, without limits. We can build very sophisticated statistical models all in one place.

In theory, we can explore a lot in Splunk. It has various reporting and analysis functions.

But…

Using Splunk presumes we already know what we are looking for. In many cases, we don't know what we're looking for: we're exploring. There may be some indication that a few RESTful APIs are slow, but there's more to it than that. How do we proceed?

The first step is to get the raw data in CSV format. How do we do that?

Read raw data

We'll start by wrapping a csv.DictReader object with some additional functions.

Object-oriented purists will object to this strategy. "Why not extend DictReader?" they ask. I don't have a great answer. I lean toward functional programming and component orthogonality. A purely object-oriented approach would force us into more complex mixins to achieve the same thing.
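
Purely for contrast, here is a hypothetical sketch of the subclass-based alternative; the class name and the whitespace-stripping behaviour are invented for illustration, and the rest of this article sticks with small functions instead.

import csv

class CleanedDictReader(csv.DictReader):
    """Hypothetical subclass: each new behaviour needs another override or mixin."""
    def __next__(self):
        row = super().__next__()
        # strip stray whitespace from every string value
        return {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}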

Our general framework for dealing with logging looks like this.

with open("somefile.csv") as source:
rdr = csv.DictReader(source)Copy the code

This allows us to read Splunk extracts in CSV format. We can iterate over the rows in the reader. This is trick #1. It's not terribly tricky, but I like it.

with open("somefile.csv") as source:
    rdr = csv.DictReader(source)
    for row in rdr:
        print( "{host} {ResponseTime} {source}{Service}".format_map(row) )Copy the code

We can – up to a point – report raw data in a useful format. If we want to prettify the output, we can change the format string. It might be "{host:30s} {ResponseTime:8s} {source:s}" or something similar.

Filter

It's very common to extract too much when we only need to look at a subset. We could change the Splunk filter, but fiddling with filters before we've finished exploring is annoying. Filtering is much easier in Python. Once we know what we need, we can put it into Splunk.

with open("somefile.csv") as source:
    rdr = csv.DictReader(source)
    rdr_perf_log = (row for row in rdr if row['source'] = ='perf_log')
    for row in rdr_perf_log:
        print( "{host} {ResponseTime} {Service}".format_map(row) )Copy the code

We've added a generator expression to filter the source rows, so that we can work with a meaningful subset.

Projection

In some cases, the source data has additional columns that we don't want to use, so we'll eliminate that data by projecting each row.

In principle, Splunk never produces an empty column. However, RESTful API logs can lead to datasets with a large number of column headings based on surrogate keys that are part of request URIs. Each of these columns contains one row of data, from the single request that used that key; for every other row, the column is empty and useless. So we have to delete these empty columns.

We could also do this with a generator expression, but it would get a little longer. Generator functions are easier to read.

def project(reader):
    for row in reader:
        # keep only the columns that actually have a value
        yield {k: v for k, v in row.items() if v}

This builds new row dictionaries from the original rows, keeping only the populated columns. We can use it to wrap the output of our filter.

with open("somefile.csv") as source:
    rdr = csv.DictReader(source)
    rdr_perf_log = (row for row in rdr if row['source'] = ='perf_log')
    for row in project(rdr_perf_log):
        print( "{host} {ResponseTime} {Service}".format_map(row) )Copy the code

This reduces the number of unused columns visible inside the for statement.

Notation change

The row['source'] notation becomes cumbersome. Working with a types.SimpleNamespace is nicer than a dictionary: it lets us write row.source.

It’s a cool trick to create something more useful.

rdr_ns = (types.SimpleNamespace(**row) for row in reader)

We can fold it into a sequence of steps like this.

with open("somefile.csv") as source:
    rdr = csv.DictReader(source)
    rdr_perf_log = (row for row in rdr if row['source'] = ='perf_log')
    rdr_proj = project(rdr_perf_log)
    rdr_ns = (types.SimpleNamespace(**row) for row in rdr_proj)
    for row in rdr_ns:
        print( "{host} {ResponseTime} {Service}".format_map(vars(row)) )Copy the code

Notice the small change to format_map(): we added the vars() function to extract a dictionary from the SimpleNamespace's attributes.

We can write this as a function, to keep it syntactically symmetric with the other steps.

def ns_reader(reader):
    return (types.SimpleNamespace(**row) for row in reader)

Indeed, we can write it as a lambda construct that is used just like a function.

ns_reader = lambda reader: (types.SimpleNamespace(**row) for row in reader)

While the ns_reader() function and the ns_reader() lambda are used the same way, writing a docstring and a doctest unit test for a lambda is slightly more difficult. For that reason, the lambda construct is probably best avoided.
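
As a sketch of what that difference looks like, the function form carries a docstring and a doctest directly (the sample row here is made up):

import types

def ns_reader(reader):
    """Wrap each dict row as a SimpleNamespace.

    >>> rows = [{'host': 'h1', 'ResponseTime': '0.25'}]
    >>> list(ns_reader(rows))[0].host
    'h1'
    """
    return (types.SimpleNamespace(**row) for row in reader)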

We could also use map(lambda row: types.SimpleNamespace(**row), reader). Some people prefer this over the generator expression.
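
Dropped into the pipeline from earlier, that step might read:

rdr_ns = map(lambda row: types.SimpleNamespace(**row), rdr_proj)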

We could even use a proper for statement with an internal yield statement, but writing a big statement where a small one will do doesn't seem to buy us much.
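
For completeness, a sketch of that for-and-yield version:

def ns_reader(reader):
    # the explicit generator-function form of the same one-line idea
    for row in reader:
        yield types.SimpleNamespace(**row)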

We have many choices because Python offers so many functional programming features. We don't often point to Python as a functional language, yet there are several ways to handle a simple mapping.

Map: Transform and derive data

We often have a fairly obvious list of data transformations, plus a growing list of derived data items. The derived items are dynamic, based on whichever hypotheses we're currently testing. Every time we have an experiment or a question, we may change the derived data.

Each of these steps: filtering, projection, transformation, and derivation, is a stage in the "map" portion of a map-reduce pipeline. We could create smaller functions and apply them with map(). Because we're updating a stateful object, though, we can't use the general map() function. If we wanted a purer functional programming style, we would use an immutable namedtuple instead of a mutable SimpleNamespace.

import datetime

def convert(reader):
    for row in reader:
        row._time = datetime.datetime.strptime(row.Time, "%Y-%m-%dT%H:%M:%S.%f%Z")
        row.response_time = float(row.ResponseTime)
        yield row
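
For contrast, here is a sketch of the purer, immutable style mentioned above, with a hypothetical fixed schema (namedtuple field names can't start with an underscore, so the derived timestamp can't literally be called _time):

import datetime
from typing import NamedTuple, Optional

class PerfRow(NamedTuple):
    # hypothetical columns; a real extract would have more
    host: str
    Time: str
    ResponseTime: str
    Service: str
    timestamp: Optional[datetime.datetime] = None
    response_time: Optional[float] = None

def convert_immutable(reader):
    # build a new, immutable row instead of updating a stateful object
    for row in reader:
        yield row._replace(
            timestamp=datetime.datetime.strptime(row.Time, "%Y-%m-%dT%H:%M:%S.%f%Z"),
            response_time=float(row.ResponseTime),
        )

The rest of this article keeps the mutable SimpleNamespace approach.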

As we explore, we'll tweak the body of this transformation function. Maybe we'll start with some minimal transformations and derivations, and continue exploring with questions like "Are these correct?" When we find something that doesn't work, we'll take it back out.

Our overall processing process is as follows:

with open("somefile.csv") as source:
    rdr = csv.DictReader(source)
    rdr_perf_log = (row for row in rdr if row['source'] = ='perf_log')
    rdr_proj = project(rdr_perf_log)
    rdr_ns = (types.SimpleNamespace(**row) for row in rdr_proj)
    rdr_converted = convert(rdr_ns)
    for row in rdr_converted:
        row.start_time = row._time - datetime.timedelta(seconds=row.response_time)
        row.service = some_mapping(row.Service)
        print( "{host: 30 S} {start_time: % H: % M: % S} {response_time: 6.3 f} {service}".format_map(vars(row)) )Copy the code

Notice the change in the loop body. The convert() function produces the values we're sure about. We've added some extra variables inside the for loop that we're not yet 100% sure about. We'll see whether they're useful (or even correct) before updating the convert() function.

Reductions

In terms of reduction, we can take a slightly different processing approach. We need to refactor our previous example and turn it into a generator function.

def converted_log(some_file):
    with open(some_file) as source:
        rdr = csv.DictReader(source)
        rdr_perf_log = (row for row in rdr if row['source'] == 'perf_log')
        rdr_proj = project(rdr_perf_log)
        rdr_ns = (types.SimpleNamespace(**row) for row in rdr_proj)
        rdr_converted = convert(rdr_ns)
        for row in rdr_converted:
            row.start_time = row._time - datetime.timedelta(seconds=row.response_time)
            row.service = some_mapping(row.Service)
            yield row

We've replaced the print() with a yield.

Here's the other part of the refactoring.

for row in converted_log("somefile.csv"):
    print( "{host:30s} {start_time:%H:%M:%S} {response_time:6.3f} {service}".format_map(vars(row)) )

Ideally, all of our programming would look like this. We use generator functions to generate data. The final display of the data remains completely separate. This gives us more freedom to refactor and change processing.

Now we can do things like collect rows into a Counter() object, or perhaps compute some statistics. We can group rows by service using a defaultdict(list).

from collections import defaultdict
import statistics

by_service = defaultdict(list)
for row in converted_log("somefile.csv"):
    by_service[row.service].append(row.response_time)
for svc in sorted(by_service):
    m = statistics.mean( by_service[svc] )
    print( "{svc:15s} {m:.2f}".format_map(vars()) )

We decided to create concrete list objects here. We could use itertools to group the response times by service. It looks like proper functional programming, but the implementation points out some limitations in the Pythonic form of functional programming: either we have to sort the data (creating a list object), or we have to build lists as we group the data. To do several different statistics, it's often easier to group the data by creating concrete lists.
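
A sketch of the itertools alternative shows the point: groupby() only groups adjacent rows, so we have to sort first, which materializes a list anyway.

import statistics
from itertools import groupby
from operator import attrgetter

rows = sorted(converted_log("somefile.csv"), key=attrgetter('service'))
for svc, group in groupby(rows, key=attrgetter('service')):
    m = statistics.mean(r.response_time for r in group)
    print( "{svc:15s} {m:.2f}".format_map(vars()) )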

Instead of simply printing row objects, we're now doing two things.

Creating local variables, such as svc and m. We can easily add variance or other measures.

Using the vars() function with no arguments, which creates a dictionary out of the local variables.

Using vars() with no arguments, which behaves like locals(), is a handy trick. It lets us simply create whatever local variables we want and include them in the formatted output. We can hack in all kinds of statistical measures that we think might be relevant.
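
For instance, adding another statistic is just another local variable; the median here is a hypothetical addition to the earlier summary loop.

for svc in sorted(by_service):
    m = statistics.mean( by_service[svc] )
    md = statistics.median( by_service[svc] )
    print( "{svc:15s} {m:8.2f} {md:8.2f}".format_map(vars()) )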

Since our basic processing loop is for row in converted_log("somefile.csv"), we can explore a lot of processing alternatives with a small, easily modified script. We can explore a number of hypotheses to determine why some RESTful APIs are slow and others are fast.


Published with the author's authorization by the Tencent Cloud+ Community. Original link: https://cloud.tencent.com/developer/article/1007247?fromSource=waitui