Purpose of this paper: Since xiaocuai has tens of thousands of Excel format data to be duplicated, traverse the Excel table and compare whether ID and name are the same. If there is a duplicate id and name, put the duplicate ID data into a new Excel table 1. If there is no duplicate, put it into another new Excel table 2. The table data is shown as follows:This file is 56,495 lines, more than 150 meters long. Here’s the idea:

  1. Read the target file using the Perl Spreadsheet::ParseExcel module, comparing each line
  2. Use the Perl Spreadsheet::WriteExcel module to create Excel. Write the desired target data into the created Excel

Perl to TCGA database after the download file merge and conversion. The installation procedure for Windows is as follows:

cpan Spreadsheet::ParseExcel
cpan Spreadsheet::WriteExcel
Copy the code

The Mac, the installation

sudo perl -MCPAN -e "install 'Spreadsheet::ParseExcel'"
sudo perl -MCPAN -e "install 'Spreadsheet::WriteExcel'"
Copy the code

Step 1: Read the target Excel file

Target: First traverse the target Excel line with the following code:

#! /usr/bin/perl -w use strict; use warnings; use Spreadsheet::ParseExcel; my $parser = Spreadsheet::ParseExcel->new(); my $workbook = $parser->parse('test.xls'); if ( ! defined $workbook ) { die $parser->error(), ".\n"; } for my $worksheet ( $workbook->worksheets() ) { my ( $row_min, $row_max ) = $worksheet->row_range(); for my $row ( $row_min .. $row_max ) { print "Row = ($row)\n"; print "\n"; }}Copy the code

As a result, the memory overflow is caused by too many lines, as shown in the following figure:

(Note: Spreadsheet::ParseExcel only reads Excel documents in 95-2003 format. For Office 2007 Excel, install Spreadsheet::XLSX.)

The reason for checking data is that the program directly reads all file contents into the array when writing and reading file contents, so other ideas are needed to solve this problem.

Combined with online articles, there are the following ways of thinking to solve problems:

  1. Windows, it is recommended to use Win32: : OLE, efficiency will be much higher

The code is as follows:

use Win32::OLE qw(in with); use Win32::OLE::Const 'Microsoft Excel'; my $Excel = Win32::OLE->GetActiveObject('Excel.Application') || Win32::OLE->new('Excel.Application', 'Quit'); my $Book = $Excel->Workbooks->Open("$file",0,'True'); my $Sheet = $Book->Worksheets(1); #$maxCol = $Sheet-> UsedRange -> Columns->Count; $maxRow = $Sheet-> UsedRange -> Rows->Count; $i = 0; foreach my $row (1.. $maxRow) { # foreach my $col (1.. $maxCol) #{ $xm = $Sheet->Cells($row,2)->{'Value'}; $cj = $Sheet->Cells($row,3)->{'Value'}; $i++; }Copy the code
  1. If you use Spreadsheet::ParseExcel
  • Get the number of rows
  • And then we use the while loop
  • Don’t use foreach, don’t grep, use re.
  • Don’t use “for” either
  1. Exce is too big
  • Convert to CSV format (Save as Excel, you can choose CSV format)
  • It is then read line by line
  • The while () loop processes the data

Idea 1 does not have cross-platform features. The features of compatible platforms are poor. Thought 2 also consumes too much performance and is discarded. Train of thought 3:

  • Try traversing the converted CSV data as follows:
#! /usr/bin/perl use warnings; use strict; my $file = "mRNAsymbol.csv"; open my $fl, "< $file" or die "can not open the file:$file\n$!" ; while (<$fl>) { next if /^Name/; chomp; #my ($id,$name,$type) = split ",", $_; my ($id,$name,$type) = split ","; $_last unless $id; print "id:$id\n name:$name\n type:$type\n\n\n"; } close $fl;Copy the code

Or the following code (not currently available, only for this notation) :

#! /usr/bin/perl use warnings; use strict; my $file = "mRNAsymbol.csv"; open (FILE,"<$file") or die "Cannot open file $! \n"; while (defined (my $fl = <FILE>)) { next if /^Name/; chomp; #my ($id,$name,$type) = split ",", $_; my ($id,$name,$type) = split ","; $_last unless $id; print "id:$id\n name:$name\n type:$type\n\n\n"; } close (FILE);Copy the code