Contents

  • collections
    • Count the occurrences of list elements
    • Get the top-N most common list elements
    • Subtract element counts
    • Arithmetic and set operations on counters
  • defaultdict
    • Provide a default for missing dict keys
  • deque
    • A queue with a fixed maximum length
  • namedtuple
    • Tuples with named fields
  • pandas + numpy
    • Read and write file data
    • read_csv parameter reference

Details

collections

Official documentation: https://docs.python.org/2/library/collections.html#collections.Counter

Count the occurrences of list elements

from collections import Counter

test_list = [1, 2, 3, 3, 2, 1, 1, 1, 2, 2, 3, 1, 2, 1, 1]

counter = Counter(test_list)
# counter: Counter({1: 7, 2: 5, 3: 3})

value = counter[2]
# value: 5

# The above is equivalent to the following dict comprehension:
counter = {i: test_list.count(i) for i in set(test_list)}
# counter: {1: 7, 2: 5, 3: 3}

Get the top-N most common list elements

from collections import Counter

test_list = [1, 2, 3, 3, 2, 1, 1, 1, 2, 2, 3, 1, 2, 1, 1]

counter = Counter(test_list)
result = counter.most_common(2)  # the two most common elements, ranked by count
# result: [(1, 7), (2, 5)]

Subtract element counts

from collections import Counter

test1 = Counter(a=4, b=2, c=0, d=-2)
test2 = Counter(a=1, b=2, c=3, d=4, e=4)
test1.subtract(test2)  # in-place; counts may drop to zero or below
# test1: Counter({'a': 3, 'b': 0, 'c': -3, 'e': -4, 'd': -6})
# test2: Counter({'d': 4, 'e': 4, 'c': 3, 'b': 2, 'a': 1})

Arithmetic and set operations on counters

from collections import Counter

test1 = Counter(a=4, b=2, c=0, d=-2)
test2 = Counter(a=1, b=2, c=3, d=4, e=4)

result1 = test1 + test2  # addition: counts for the same key are added; non-positive results are dropped
result2 = test1 - test2  # subtraction: counts are subtracted; only positive results are kept
result3 = test1 & test2  # intersection: keys present in both, with the smaller count
result4 = test1 | test2  # union: all keys, with the larger count for common keys
# result1: Counter({'a': 5, 'b': 4, 'e': 4, 'c': 3, 'd': 2})
# result2: Counter({'a': 3})
# result3: Counter({'b': 2, 'a': 1})
# result4: Counter({'a': 4, 'd': 4, 'e': 4, 'c': 3, 'b': 2})

defaultdict

Official documentation: https://docs.python.org/2/library/collections.html#collections.defaultdict

Provide a default for missing dict keys

from collections import defaultdict

test = defaultdict(str)
test['key1'] = '1'
test['key2'] = '2'

# Accessing a missing key inserts and returns the factory type's empty value:
# str -> "" | int -> 0 | list -> list() | dict -> dict() | set -> set() | tuple -> tuple()
v = test['medusa']

# v: ""
# test: defaultdict(<class 'str'>, {'key1': '1', 'key2': '2', 'medusa': ''})
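A common use of the factory argument is grouping values; below is a minimal sketch (the word list is made up for illustration):

from collections import defaultdict

# The word list is made up for illustration
words = ['apple', 'avocado', 'banana', 'cherry', 'blueberry']
groups = defaultdict(list)  # missing keys start as an empty list
for word in words:
    groups[word[0]].append(word)

print(dict(groups))
# {'a': ['apple', 'avocado'], 'b': ['banana', 'blueberry'], 'c': ['cherry']}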

deque

Official documentation: https://docs.python.org/2/library/collections.html#collections.deque

A queue with a fixed maximum length

# First In, First Out (FIFO)
from collections import deque

my_queue = deque(maxlen=10)

for i in range(10):
    my_queue.append(i + 1)

print(my_queue)
# output: deque([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], maxlen=10)

print(my_queue.popleft())
# output: 1

# Appending to a full deque silently discards items from the opposite end
for i in range(5):
    my_queue.append(i + 1)
print(my_queue)
# output: deque([6, 7, 8, 9, 10, 1, 2, 3, 4, 5], maxlen=10)
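Although the example above uses it as a FIFO queue, a deque supports fast appends and pops at both ends; a short sketch:

from collections import deque

d = deque([1, 2, 3])
d.appendleft(0)  # add at the left end
d.append(4)      # add at the right end
print(d)         # deque([0, 1, 2, 3, 4])

print(d.pop())      # 4, removed from the right end
print(d.popleft())  # 0, removed from the left end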

namedtuple

Official documentation: https://docs.python.org/2/library/collections.html#collections.namedtuple

Tuples with named fields

from collections import namedtuple

# Create a template named Person with the fields: name | description | forever | size
Person = namedtuple('Person', 'name description forever size')

# Create two instances from the template; the two methods are equivalent
Medusa = Person(name='Medusa', description='Medusa blog', forever=True, size='Max')
You = Person._make(['You', '... ', True, 'Max'])

print(Medusa)
print(You)
# Medusa: Person(name='Medusa', description='Medusa blog', forever=True, size='Max')
# You: Person(name='You', description='... ', forever=True, size='Max')

# "Changing" a field actually creates a new object; the original is unchanged
update_Medusa = Medusa._replace(description='https://juejin.cn/user/2805609406139950')
print(Medusa)
print(update_Medusa)
# Medusa: Person(name='Medusa', description='Medusa blog', forever=True, size='Max')
# update_Medusa: Person(name='Medusa', description='https://juejin.cn/user/2805609406139950', forever=True, size='Max')

# Convert to a dictionary
print(Medusa._asdict())
# OrderedDict([('name', 'Medusa'), ('description', 'Medusa blog'), ('forever', True), ('size', 'Max')])
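The fields can then be read by name as well as by position; a short sketch repeating the Person template from above:

from collections import namedtuple

Person = namedtuple('Person', 'name description forever size')
Medusa = Person(name='Medusa', description='Medusa blog', forever=True, size='Max')

print(Medusa.name)  # access by field name: Medusa
print(Medusa[0])    # positional indexing still works: Medusa
name, description, forever, size = Medusa  # unpacks like a regular tuple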

pandas + numpy

Official documentation: https://www.pypandas.cn/

Read and write file data

import pandas as pd

# read_csv / read_excel already return a DataFrame; no extra wrapping is needed
df = pd.read_csv('csv_name.csv', header=1)
df = pd.read_excel('xlsx_name.xlsx')
Read                          Write
read_csv                      to_csv
read_excel                    to_excel
read_hdf                      to_hdf
read_sql                      to_sql
read_json                     to_json
read_msgpack (experimental)   to_msgpack (experimental)
read_html                     to_html
read_gbq (experimental)       to_gbq (experimental)
read_stata                    to_stata
read_sas
read_clipboard                to_clipboard
read_pickle                   to_pickle
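The writers in the table mirror the readers; a minimal round-trip sketch (the DataFrame contents and file name are made up for illustration):

import pandas as pd

# The DataFrame contents and file name are made up for illustration
df = pd.DataFrame({'name': ['Medusa', 'You'], 'size': ['Max', 'Max']})

df.to_csv('csv_name.csv', index=False)  # write; index=False omits the row index column
df2 = pd.read_csv('csv_name.csv')       # read it back
print(df.equals(df2))                   # True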

read_csv parameter reference

  • filepath_or_buffer: str. A string path or any readable file object, including URLs
  • sep: str. The data separator; defaults to ",". Separators longer than one character (other than "\s+") are treated as regular expressions, force the Python parsing engine, and tend to ignore quoted data
  • delimiter: str. Alternative name for the separator; if specified, the sep parameter is ignored
  • delim_whitespace: bool. Whether to use whitespace as the separator, equivalent to sep="\s+". If set to True, the delimiter parameter is ignored
  • header: int or list of ints. Row number(s) to use as column names. Explicitly passing header=0 replaces existing column names; passing a list uses those rows as a multi-level header, and intervening rows are skipped. If skip_blank_lines=True, header ignores comment lines and blank lines, so header=0 means the first line of data rather than the first line of the file
  • names: array-like. List of column names to use for the result. If the data file has no header row, header=None should also be passed. Duplicates in this list are not allowed unless mangle_dupe_cols=True
  • index_col: int or sequence or False. Column number or name to use as the row index; a sequence produces a MultiIndex. If the file is irregular, with a delimiter at the end of each row, index_col=False forces pandas not to use the first column as the index
  • usecols: array-like. Return a subset of the columns, given either as positions in the file (e.g. [0, 1, 2]) or as names from the file header (e.g. ['foo', 'bar', 'baz']). Can speed up loading and reduce memory consumption
  • as_recarray: bool. Deprecated: this parameter will be removed in a future version; use pd.read_csv(...).to_records() instead. Returns a NumPy recarray in place of a DataFrame. If True, it takes priority over the squeeze parameter, row indexes are unavailable, and the index column is ignored
  • squeeze: bool. If the parsed data contains only one column, return a Series
  • prefix: str. Prefix to add to generated column numbers when there is no header
  • mangle_dupe_cols: bool. Rename duplicate columns as "X.0"..."X.N" instead of letting later columns overwrite earlier ones
  • dtype: type name or dict of column -> type. The data type for each column
  • engine: "c" or "python". The parsing engine: the C engine is faster, while the Python engine is more feature-complete
  • converters: dict. A dictionary of column conversion functions; keys can be column names or column indexes
  • true_values: list. Values to consider as True
  • false_values: list. Values to consider as False
  • skipinitialspace: bool. Skip whitespace after the delimiter
  • skiprows: list-like or integer. Number of lines to skip at the start of the file, or a list of line numbers to skip
  • skipfooter: int. Number of lines to skip at the end of the file
  • skip_footer: int. Same as skipfooter (deprecated)
  • nrows: int. Number of rows to read from the start of the file
  • na_values: scalar, str, list-like, or dict. Additional values to recognize as NA/NaN; a dict specifies per-column values. The defaults include "1.#IND", "1.#QNAN", "N/A", "NA", "NULL", "NaN", "nan"
  • keep_default_na: bool. If na_values is specified and keep_default_na=False, the default NaN values are overridden; otherwise the specified values are added to the defaults
  • na_filter: bool. Whether to detect missing values (empty strings or NA values). For large files known to contain no missing data, na_filter=False can speed up reading
  • verbose: bool. Print extra parser output
  • skip_blank_lines: bool. If True, skip blank lines instead of interpreting them as NaN
  • parse_dates: boolean or list of ints or names or list of lists or dict (see the sketch after this list)
    • Passing True will parse the index
    • Passing a list of ints or names (e.g. [1, 2, 3]) will parse columns 1, 2 and 3 each as a separate date column
    • Passing a list of lists (e.g. [[1, 3]]) will combine columns 1 and 3 into a single date column
    • Passing a dict (e.g. {"foo": [1, 3]}) combines columns 1 and 3 and names the merged column "foo"
  • infer_datetime_format: bool. If True and parse_dates is enabled, pandas will attempt to infer the datetime format and, if possible, switch to a faster parsing method, in some cases five to ten times faster
  • keep_date_col: bool. If multiple columns are combined to parse dates, keep the original columns
  • date_parser: function. The function used to parse dates; dateutil.parser.parser is used by default. pandas tries three different calling conventions, advancing to the next if one raises an error:
    • Call with one or more arrays (as specified by parse_dates) as arguments
    • Call with the string values from the columns specified by parse_dates concatenated into a single array
    • Call once per row, with one or more strings (from the columns specified by parse_dates) as arguments
  • dayfirst: bool. Parse dates as DD/MM format
  • iterator: bool. Return a TextFileReader object for processing the file chunk by chunk (see the sketch after this list)
  • chunksize: int. Number of rows per chunk
  • compression: "infer" or "gzip" or "bz2" or "zip" or "xz" or None. Read compressed files on disk directly; "infer" chooses the decompression method from the file extension
  • thousands: str. Thousands separator
  • decimal: str. Character to recognize as the decimal point
  • float_precision: str. Which converter the C engine should use for floating-point values: None for the ordinary converter, "high" for the high-precision converter, and "round_trip" for the round-trip converter
  • lineterminator: str. Line separator; only used with the C parser
  • quotechar: str. Character used to mark the start and end of a quoted item; delimiters inside quotes are ignored
  • quoting: int or csv.QUOTE_* instance. Control field quoting behavior, one of QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3)
  • doublequote: bool. When quotechar is specified and quoting is not QUOTE_NONE, indicates whether two consecutive quotechar characters inside a field should be read as a single quotechar
  • escapechar: str. One-character string used to escape the delimiter when quoting is QUOTE_NONE
  • comment: str. If this character appears at the beginning of a line, the line is ignored entirely; it must be a single character. Like blank lines (when skip_blank_lines=True), fully commented lines are ignored by the header parameter but not by skiprows. For example, parsing "#empty\na,b,c\n1,2,3" with comment='#' and header=0 returns 'a,b,c' as the header
  • encoding: str. Character set to use when reading, e.g. "utf-8"
  • dialect: str or csv.Dialect instance. If None, defaults to the Excel dialect; ignored if sep is longer than one character
  • tupleize_cols: bool. Leave a list of tuples on columns as-is (the default is to convert them to a MultiIndex on the columns)
  • error_bad_lines: bool. By default, lines with too many columns raise an error and no DataFrame is returned; if set to False, these "bad lines" are dropped instead (only valid with the C parser)
  • warn_bad_lines: bool. If error_bad_lines=False and warn_bad_lines=True, a warning is printed for each "bad line" (only valid with the C parser)
  • low_memory: bool. Process the file in chunks internally, lowering memory use while parsing but possibly producing mixed type inference. To avoid mixed types, set low_memory=False or specify the type with the dtype parameter. Note that the whole file is still read into a single DataFrame; use chunksize or iterator to get the data in chunks (only valid with the C parser)
  • buffer_lines: int. Deprecated: this parameter will be removed in a future version because its value is not respected by the parser
  • compact_ints: bool. Deprecated: this parameter will be removed in a future version. If compact_ints=True, any column with an integer dtype is stored as the smallest integer type, signed or unsigned depending on the use_unsigned parameter
  • use_unsigned: bool. Deprecated: this parameter will be removed in a future version. Specifies whether compacted integer columns (i.e. when compact_ints=True) are signed or unsigned
  • memory_map: bool. If a file path is given, map the file object directly onto memory and access the data from there, avoiding further I/O overhead
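To make a few of these parameters concrete, here is a minimal sketch; the file name, column names and data are hypothetical, chosen only to illustrate sep, usecols, dtype, na_values, parse_dates and chunksize:

import pandas as pd

# Suppose data.csv (hypothetical) looks like:
# id,name,score,day
# 1,Medusa,88,2020/01/02
# 2,You,NULL,2020/01/03

df = pd.read_csv(
    'data.csv',
    sep=',',                                 # explicit separator (also the default)
    usecols=['id', 'name', 'score', 'day'],  # load only these columns
    dtype={'id': int},                       # force the dtype of the id column
    na_values=['NULL'],                      # treat NULL as NaN in addition to the defaults
    parse_dates=['day'],                     # parse the day column as datetime
)

# For large files, read chunk by chunk instead of all at once
for chunk in pd.read_csv('data.csv', chunksize=1000):
    print(len(chunk))  # each chunk is a DataFrame with up to 1000 rows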