Contents

  • collections
    • Count the occurrences of list elements
    • Get the top-N most common list elements
    • Subtract element counts
    • Arithmetic and set operations on counters
  • defaultdict
    • Provide a default for missing dict keys
  • deque
    • A queue with a fixed maximum length
  • namedtuple
    • Tuples with named fields
  • pandas + numpy
    • Read and write file data
    • read_csv parameter reference

Details

collections

Official documentation: https://docs.python.org/2/library/collections.html#collections.Counter

Count the occurrences of list elements

from collections import Counter

test_list = [1, 2, 3, 3, 2, 1, 1, 1, 2, 2, 3, 1, 2, 1, 1]

counter = Counter(test_list)
# counter: Counter({1: 7, 2: 5, 3: 3})

value = counter[2]
# value: 5

# The above is equivalent to the following dict comprehension:
counter = {i: test_list.count(i) for i in set(test_list)}
# counter: {1: 7, 2: 5, 3: 3}

Get the top-N most common list elements

from collections import Counter

test_list = [1, 2, 3, 3, 2, 1, 1, 1, 2, 2, 3, 1, 2, 1, 1]

counter = Counter(test_list)
result = counter.most_common(2)  # the two most common elements, ranked by count
# result: [(1, 7), (2, 5)]

Subtract element counts

from collections import Counter

test1 = Counter(a=4, b=2, c=0, d=-2)
test2 = Counter(a=1, b=2, c=3, d=4, e=4)
test1.subtract(test2)  # in-place; counts may drop to zero or below
# test1: Counter({'a': 3, 'b': 0, 'c': -3, 'e': -4, 'd': -6})
# test2: Counter({'d': 4, 'e': 4, 'c': 3, 'b': 2, 'a': 1})

Arithmetic and set operations on counters

from collections import Counter

test1 = Counter(a=4, b=2, c=0, d=-2)
test2 = Counter(a=1, b=2, c=3, d=4, e=4)

result1 = test1 + test2  # addition: counts for the same key are added; non-positive results are dropped
result2 = test1 - test2  # subtraction: counts are subtracted; only positive results are kept
result3 = test1 & test2  # intersection: keys present in both, with the smaller count
result4 = test1 | test2  # union: all keys, with the larger count for common keys
# result1: Counter({'a': 5, 'b': 4, 'e': 4, 'c': 3, 'd': 2})
# result2: Counter({'a': 3})
# result3: Counter({'b': 2, 'a': 1})
# result4: Counter({'a': 4, 'd': 4, 'e': 4, 'c': 3, 'b': 2})

defaultdict

Official documentation: https://docs.python.org/2/library/collections.html#collections.defaultdict

Provide a default for missing dict keys

from collections import defaultdict

test = defaultdict(str)
test['key1'] = '1'
test['key2'] = '2'

# Accessing a missing key inserts and returns the factory type's empty value:
# str -> "" | int -> 0 | list -> list() | dict -> dict() | set -> set() | tuple -> tuple()
v = test['medusa']

# v: ""
# test: defaultdict(<class 'str'>, {'key1': '1', 'key2': '2', 'medusa': ''})
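A common use of the factory argument is grouping values; below is a minimal sketch (the word list is made up for illustration):

from collections import defaultdict

# The word list is made up for illustration
words = ['apple', 'avocado', 'banana', 'cherry', 'blueberry']
groups = defaultdict(list)  # missing keys start as an empty list
for word in words:
    groups[word[0]].append(word)

print(dict(groups))
# {'a': ['apple', 'avocado'], 'b': ['banana', 'blueberry'], 'c': ['cherry']}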

deque

Official documentation: https://docs.python.org/2/library/collections.html#collections.deque

A queue with a fixed maximum length

# First In, First Out (FIFO)
from collections import deque

my_queue = deque(maxlen=10)

for i in range(10):
    my_queue.append(i + 1)

print(my_queue)
# output: deque([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], maxlen=10)

print(my_queue.popleft())
# output: 1

# Appending to a full deque silently discards items from the opposite end
for i in range(5):
    my_queue.append(i + 1)
print(my_queue)
# output: deque([6, 7, 8, 9, 10, 1, 2, 3, 4, 5], maxlen=10)
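Although the example above uses it as a FIFO queue, a deque supports fast appends and pops at both ends; a short sketch:

from collections import deque

d = deque([1, 2, 3])
d.appendleft(0)  # add at the left end
d.append(4)      # add at the right end
print(d)         # deque([0, 1, 2, 3, 4])

print(d.pop())      # 4, removed from the right end
print(d.popleft())  # 0, removed from the left end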

namedtuple

Official documentation: https://docs.python.org/2/library/collections.html#collections.namedtuple

Tuples with named fields

from collections import namedtuple

# Create a template named Person with the fields: name | description | forever | size
Person = namedtuple('Person', 'name description forever size')

# Create two instances from the template; the two methods are equivalent
Medusa = Person(name='Medusa', description='Medusa blog', forever=True, size='Max')
You = Person._make(['You', '... ', True, 'Max'])

print(Medusa)
print(You)
# Medusa: Person(name='Medusa', description='Medusa blog', forever=True, size='Max')
# You: Person(name='You', description='... ', forever=True, size='Max')

# "Changing" a field actually creates a new object; the original is unchanged
update_Medusa = Medusa._replace(description='https://juejin.cn/user/2805609406139950')
print(Medusa)
print(update_Medusa)
# Medusa: Person(name='Medusa', description='Medusa blog', forever=True, size='Max')
# update_Medusa: Person(name='Medusa', description='https://juejin.cn/user/2805609406139950', forever=True, size='Max')

# Convert to a dictionary
print(Medusa._asdict())
# OrderedDict([('name', 'Medusa'), ('description', 'Medusa blog'), ('forever', True), ('size', 'Max')])
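The fields can then be read by name as well as by position; a short sketch repeating the Person template from above:

from collections import namedtuple

Person = namedtuple('Person', 'name description forever size')
Medusa = Person(name='Medusa', description='Medusa blog', forever=True, size='Max')

print(Medusa.name)  # access by field name: Medusa
print(Medusa[0])    # positional indexing still works: Medusa
name, description, forever, size = Medusa  # unpacks like a regular tuple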

pandas + numpy

Official documentation: https://www.pypandas.cn/

Read and write file data

import pandas as pd

# read_csv / read_excel already return a DataFrame; no extra wrapping is needed
df = pd.read_csv('csv_name.csv', header=1)
df = pd.read_excel('xlsx_name.xlsx')
Read                          Write
read_csv                      to_csv
read_excel                    to_excel
read_hdf                      to_hdf
read_sql                      to_sql
read_json                     to_json
read_msgpack (experimental)   to_msgpack (experimental)
read_html                     to_html
read_gbq (experimental)       to_gbq (experimental)
read_stata                    to_stata
read_sas
read_clipboard                to_clipboard
read_pickle                   to_pickle
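The writers in the table mirror the readers; a minimal round-trip sketch (the DataFrame contents and file name are made up for illustration):

import pandas as pd

# The DataFrame contents and file name are made up for illustration
df = pd.DataFrame({'name': ['Medusa', 'You'], 'size': ['Max', 'Max']})

df.to_csv('csv_name.csv', index=False)  # write; index=False omits the row index column
df2 = pd.read_csv('csv_name.csv')       # read it back
print(df.equals(df2))                   # True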

read_csv parameter reference

  • filepath_or_buffer: str. A string path or any readable file object, including URLs
  • sep: str. The data separator; defaults to ",". Separators longer than one character (other than "\s+") are treated as regular expressions, force the Python parsing engine, and tend to ignore quoted data
  • delimiter: str. Alternative name for the separator; if specified, the sep parameter is ignored
  • delim_whitespace: bool. Whether to use whitespace as the separator, equivalent to sep="\s+". If set to True, the delimiter parameter is ignored
  • header: int or list of ints. Row number(s) to use as column names. Explicitly passing header=0 replaces existing column names; passing a list uses those rows as a multi-level header, and intervening rows are skipped. If skip_blank_lines=True, header ignores comment lines and blank lines, so header=0 means the first line of data rather than the first line of the file
  • names: array-like. List of column names to use for the result. If the data file has no header row, header=None should also be passed. Duplicates in this list are not allowed unless mangle_dupe_cols=True
  • index_col: int or sequence or False. Column number or name to use as the row index; a sequence produces a MultiIndex. If the file is irregular, with a delimiter at the end of each row, index_col=False forces pandas not to use the first column as the index
  • usecols: array-like. Return a subset of the columns, given either as positions in the file (e.g. [0, 1, 2]) or as names from the file header (e.g. ['foo', 'bar', 'baz']). Can speed up loading and reduce memory consumption
  • as_recarray: bool. Deprecated: this parameter will be removed in a future version; use pd.read_csv(...).to_records() instead. Returns a NumPy recarray in place of a DataFrame. If True, it takes priority over the squeeze parameter, row indexes are unavailable, and the index column is ignored
  • squeeze: bool. If the parsed data contains only one column, return a Series
  • prefix: str. Prefix to add to generated column numbers when there is no header
  • mangle_dupe_cols: bool. Rename duplicate columns as "X.0"..."X.N" instead of letting later columns overwrite earlier ones
  • dtype: type name or dict of column -> type. The data type for each column
  • engine: "c" or "python". The parsing engine: the C engine is faster, while the Python engine is more feature-complete
  • converters: dict. A dictionary of column conversion functions; keys can be column names or column indexes
  • true_values: list. Values to consider as True
  • false_values: list. Values to consider as False
  • skipinitialspace: bool. Skip whitespace after the delimiter
  • skiprows: list-like or integer. Number of lines to skip at the start of the file, or a list of line numbers to skip
  • skipfooter: int. Number of lines to skip at the end of the file
  • skip_footer: int. Same as skipfooter (deprecated)
  • nrows: int. Number of rows to read from the start of the file
  • na_values: scalar, str, list-like, or dict. Additional values to recognize as NA/NaN; a dict specifies per-column values. The defaults include "1.#IND", "1.#QNAN", "N/A", "NA", "NULL", "NaN", "nan"
  • keep_default_na: bool. If na_values is specified and keep_default_na=False, the default NaN values are overridden; otherwise the specified values are added to the defaults
  • na_filter: bool. Whether to detect missing values (empty strings or NA values). For large files known to contain no missing data, na_filter=False can speed up reading
  • verbose: bool. Print extra parser output
  • skip_blank_lines: bool. If True, skip blank lines instead of interpreting them as NaN
  • parse_dates: boolean or list of ints or names or list of lists or dict (see the sketch after this list)
    • Passing True will parse the index
    • Passing a list of ints or names (e.g. [1, 2, 3]) will parse columns 1, 2 and 3 each as a separate date column
    • Passing a list of lists (e.g. [[1, 3]]) will combine columns 1 and 3 into a single date column
    • Passing a dict (e.g. {"foo": [1, 3]}) combines columns 1 and 3 and names the merged column "foo"
  • infer_datetime_format: bool. If True and parse_dates is enabled, pandas will attempt to infer the datetime format and, if possible, switch to a faster parsing method, in some cases five to ten times faster
  • keep_date_col: bool. If multiple columns are combined to parse dates, keep the original columns
  • date_parser: function. The function used to parse dates; dateutil.parser.parser is used by default. pandas tries three different calling conventions, advancing to the next if one raises an error:
    • Call with one or more arrays (as specified by parse_dates) as arguments
    • Call with the string values from the columns specified by parse_dates concatenated into a single array
    • Call once per row, with one or more strings (from the columns specified by parse_dates) as arguments
  • dayfirst: bool. Parse dates as DD/MM format
  • iterator: bool. Return a TextFileReader object for processing the file chunk by chunk (see the sketch after this list)
  • chunksize: int. Number of rows per chunk
  • compression: "infer" or "gzip" or "bz2" or "zip" or "xz" or None. Read compressed files on disk directly; "infer" chooses the decompression method from the file extension
  • thousands: str. Thousands separator
  • decimal: str. Character to recognize as the decimal point
  • float_precision: str. Which converter the C engine should use for floating-point values: None for the ordinary converter, "high" for the high-precision converter, and "round_trip" for the round-trip converter
  • lineterminator: str. Line separator; only used with the C parser
  • quotechar: str. Character used to mark the start and end of a quoted item; delimiters inside quotes are ignored
  • quoting: int or csv.QUOTE_* instance. Control field quoting behavior, one of QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3)
  • doublequote: bool. When quotechar is specified and quoting is not QUOTE_NONE, indicates whether two consecutive quotechar characters inside a field should be read as a single quotechar
  • escapechar: str. One-character string used to escape the delimiter when quoting is QUOTE_NONE
  • comment: str. If this character appears at the beginning of a line, the line is ignored entirely; it must be a single character. Like blank lines (when skip_blank_lines=True), fully commented lines are ignored by the header parameter but not by skiprows. For example, parsing "#empty\na,b,c\n1,2,3" with comment='#' and header=0 returns 'a,b,c' as the header
  • encoding: str. Character set to use when reading, e.g. "utf-8"
  • dialect: str or csv.Dialect instance. If None, defaults to the Excel dialect; ignored if sep is longer than one character
  • tupleize_cols: bool. Leave a list of tuples on columns as-is (the default is to convert them to a MultiIndex on the columns)
  • error_bad_lines: bool. By default, lines with too many columns raise an error and no DataFrame is returned; if set to False, these "bad lines" are dropped instead (only valid with the C parser)
  • warn_bad_lines: bool. If error_bad_lines=False and warn_bad_lines=True, a warning is printed for each "bad line" (only valid with the C parser)
  • low_memory: bool. Process the file in chunks internally, lowering memory use while parsing but possibly producing mixed type inference. To avoid mixed types, set low_memory=False or specify the type with the dtype parameter. Note that the whole file is still read into a single DataFrame; use chunksize or iterator to get the data in chunks (only valid with the C parser)
  • buffer_lines: int. Deprecated: this parameter will be removed in a future version because its value is not respected by the parser
  • compact_ints: bool. Deprecated: this parameter will be removed in a future version. If compact_ints=True, any column with an integer dtype is stored as the smallest integer type, signed or unsigned depending on the use_unsigned parameter
  • use_unsigned: bool. Deprecated: this parameter will be removed in a future version. Specifies whether compacted integer columns (i.e. when compact_ints=True) are signed or unsigned
  • memory_map: bool. If a file path is given, map the file object directly onto memory and access the data from there, avoiding further I/O overhead
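To make a few of these parameters concrete, here is a minimal sketch; the file name, column names and data are hypothetical, chosen only to illustrate sep, usecols, dtype, na_values, parse_dates and chunksize:

import pandas as pd

# Suppose data.csv (hypothetical) looks like:
# id,name,score,day
# 1,Medusa,88,2020/01/02
# 2,You,NULL,2020/01/03

df = pd.read_csv(
    'data.csv',
    sep=',',                                 # explicit separator (also the default)
    usecols=['id', 'name', 'score', 'day'],  # load only these columns
    dtype={'id': int},                       # force the dtype of the id column
    na_values=['NULL'],                      # treat NULL as NaN in addition to the defaults
    parse_dates=['day'],                     # parse the day column as datetime
)

# For large files, read chunk by chunk instead of all at once
for chunk in pd.read_csv('data.csv', chunksize=1000):
    print(len(chunk))  # each chunk is a DataFrame with up to 1000 rows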