Lecture 6 File Processing¶
6.1 Viewing text files¶
6.1.1 Viewing complete files¶
-
cat
: 显示空行的行号 -
nl
:空行被忽略 -
more/less
:分页方式查看文本文件的所有内容 -
od
: 查看文本文件中的不可见字符(格式分析:默认以8进制显示文件)-
-c
以转义字符 -
-h
以十六进制显示
s193157@GOJ:~$ od -h /tmp/hello.cpp 0000000 6923 636e 756c 6564 3c20 6f69 7473 6572 0000020 6d61 0a3e 7375 6e69 2067 616e 656d 7073 0000040 6361 2065 7473 3b64 0a0a 6e69 2074 616d 0000060 6e69 2928 7b20 090a 6f63 7475 3c20 203c 0000100 6822 6c65 6f6c 2022 3c3c 6520 646e 3b6c 0000120 090a 6572 7574 6e72 3020 0a3b 0a7d 0000136
-
6.1.2 Viewing the head or tail of a file¶
head
s193157@GOJ:~$ head /etc/passwd #默认显示前10行
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
bin:x:2:2:bin:/bin:/usr/sbin/nologin
sys:x:3:3:sys:/dev:/usr/sbin/nologin
sync:x:4:65534:sync:/bin:/bin/sync
games:x:5:60:games:/usr/games:/usr/sbin/nologin
man:x:6:12:man:/var/cache/man:/usr/sbin/nologin
lp:x:7:7:lp:/var/spool/lpd:/usr/sbin/nologin
mail:x:8:8:mail:/var/mail:/usr/sbin/nologin
news:x:9:9:news:/var/spool/news:/usr/sbin/nologin
s193157@GOJ:~$ head -3 /etc/passwd #显示前3行
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
bin:x:2:2:bin:/bin:/usr/sbin/nologin
-
tail
- 实时的监控日志信息
6.1.3 Determining file size¶
wc
wc /tmp/student_records
5 20 92 /tmp/student_records
# 行数,单词数,字符数
wc -l /tmp/student_records
5 /tmp/student_records
wc -w /tmp/student_records
20 /tmp/student_records
wc -c /tmp/student_records
92 /tmp/student_records
# 查找系统中有多少用户
s193157@GOJ:/var/log$ wc -l /etc/passwd
476 /etc/passwd
# 一共476个用户
-
du
-
ls -l
6.2 Comparing Files¶
6.2.1 Comparing two files¶
diff
s193157@GOJ:/var/log$ diff /tmp/Fall_OH /tmp/Spring_OH
1c1 #change 改变
< Office Hours for Fall 2017
---
> Office Hours for Spring 2018
6a7 #append 增加
> 1:00 - 2:00 P.M.
9d9 #delte 删除
< 3:00 - 4:00 P.M.
12,13d11
< 2:00 - 3:00 P.M.
< 4:00 - 4:30 P.M.
通过diff可以进行增量修改从而进行版本管理
6.2.2 Patch¶
patch
文件是一个文件差异文件, 指明了两个文件的版本差异
# hello1 内容
s193157@GOJ:/var/log$ cat /tmp/hello1
Hello, world!
Welcome to the patch command.
# hello2 内容
s193157@GOJ:/var/log$ cat /tmp/hello2
Hello, world!
Welcome to the diff and patch commands.
Goodbye!
s193157@GOJ:~$ cp /tmp/hello1 .
s193157@GOJ:~$ cp /tmp/hello.patch .
s193157@GOJ:~$ ls
hello.patch hello1
s193157@GOJ:~$ patch hello1 hello.patch
patching file hello1
# 根据patch文件修改内容应用到源文件,记录增量更新
s193157@GOJ:~$ cat hello1
Hello, world!
Welcome to the diff and patch commands.
Goodbye!
6.2.3 Removing repeated lines¶
只能去掉出现连续出现的重复行,但是并不会修改源文件的内容, 只是将输出结果到屏幕上
s193157@GOJ:~$ cat /tmp/sample
This is a test file for the uniq command.
It contains some repeated and some nonrepeated lines.
Some of the repeated lines are consecutive, like this.
Some of the repeated lines are consecutive, like this.
Some of the repeated lines are consecutive, like this.
And, some are not consecutive, like the following.
Some of the repeated lines are consecutive, like this.
The above line, therefore, will not be considered a repeated
line by the uniq command, but this will be considered repeated!
line by the uniq command, but this will be considered repeated!
s193157@GOJ:~$ uniq /tmp/sample
This is a test file for the uniq command.
It contains some repeated and some nonrepeated lines.
Some of the repeated lines are consecutive, like this.
And, some are not consecutive, like the following.
Some of the repeated lines are consecutive, like this.
The above line, therefore, will not be considered a repeated
line by the uniq command, but this will be considered repeated!
s193157@GOJ:~$ cat /tmp/sample
This is a test file for the uniq command.
It contains some repeated and some nonrepeated lines.
Some of the repeated lines are consecutive, like this.
Some of the repeated lines are consecutive, like this.
Some of the repeated lines are consecutive, like this.
And, some are not consecutive, like the following.
Some of the repeated lines are consecutive, like this.
The above line, therefore, will not be considered a repeated
line by the uniq command, but this will be considered repeated!
line by the uniq command, but this will be considered repeated!
#统计重复出现的次数
s193157@GOJ:~$ uniq -c /tmp/sample
1 This is a test file for the uniq command.
1 It contains some repeated and some nonrepeated lines.
3 Some of the repeated lines are consecutive, like this.
1 And, some are not consecutive, like the following.
1 Some of the repeated lines are consecutive, like this.
1 The above line, therefore, will not be considered a repeated
2 line by the uniq command, but this will be considered repeated!
6.3 Compressing Files¶
- The
gzip
command- 压缩后源文件自动删除,解压后压缩文件删除,两个文件不会同时存在
- 解压
gizp -d filename
或者gunzip
- The
bzip2
command- 压缩比更高
6.4 Searching for Commands and Files¶
6.4.1 Find directory-list expression¶
-
-name pattern
-
-size +/-N
-
–exec or -ok CMD
#查看文件的详细信息
s193157@GOJ:~$ find . -name 'hello*' -exec ls -l {} \;
# -exec 添加额外的命令 -ok询问是否执行(添加了确认过程)
# {} 占位符
\( \)
6.4.2 whereis/which
¶
#列出对于的可执行文件在哪里
#根据PATH环境变量中的信息去查找
s193157@GOJ:~$ which rm
/bin/rm
# whereis 额外的列出了帮助文件
s193157@GOJ:~$ whereis rm
rm: /bin/rm /usr/share/man/man1/rm.1.gz
6.5 Sorting¶
- 字段分隔符默认是空格和定位符
sort: Ordering a set of items according to some criteria
-b Ignore leading blanks
-f Consider lowercases and uppercase letters to be equivalent
-t field separator
-r Sort in reverse order
-k Specify a field as the sort key
-n Compare according to string numerical value
Notice the LANG!
#第一关键字按字典序排序
s193157@GOJ:~$ sort /tmp/student_records
Amy Nash ECE 2.38
Jason Kim ECE 3.97
Jim DAVIS CS 12.71
Jonh Doe ECE 3.54
Pam Meyer CS 3.61
#按第四关键字排序(默认是字符串)
s193157@GOJ:~$ sort -k4 /tmp/student_records
Jim DAVIS CS 12.71
Amy Nash ECE 2.38
Jonh Doe ECE 3.54
Pam Meyer CS 3.61
Jason Kim ECE 3.97
#按第四关键字(数值)
s193157@GOJ:~$ sort -n -k4 /tmp/student_records
Amy Nash ECE 2.38
Jonh Doe ECE 3.54
Pam Meyer CS 3.61
Jason Kim ECE 3.97
Jim DAVIS CS 12.71
#按第四关键字逆序(数值)
s193157@GOJ:~$ sort -n -r -k4 /tmp/student_records
Jim DAVIS CS 12.71
Jason Kim ECE 3.97
Pam Meyer CS 3.61
Jonh Doe ECE 3.54
Amy Nash ECE 2.38
#按第四关键字,分隔符为(:)
s193157@GOJ:~$ sort -t: -k4 /etc/passwd
root:x:0:0:root:/root:/bin/bash
zq:x:1000:1000:zq,,,:/home/zq:/bin/bash
uuidd:x:100:101::/run/uuidd:/bin/false
syslog:x:101:104::/home/syslog:/bin/false
mysql:x:102:106:MySQL Server,,,:/nonexistent:/bin/false
messagebus:x:103:107::/var/run/dbus:/bin/false
uucp:x:10:10:uucp:/var/spool/uucp:/usr/sbin/nologin
postfix:x:104:111::/var/spool/postfix:/bin/false
6.6 Cutting and Pasting¶
6.6.1 Cut¶
-
截取文件
-
cut
默认字段分割符只能是定位符-t
s193157@GOJ:/tmp$ cut -f1-3 /tmp/student
-f filed
Jonh Doe ECE
Pam Meyer CS
Jim DAVIS CS
Jason Kim ECE
Amy Nash ECE
Jim DAVIS EECS
6.6.2 Pasting¶
- 拼接文件
s193157@GOJ:/tmp$ cut /tmp/a.txt
1 one
2 two
3 three
s193157@GOJ:/tmp$ cut /tmp/b.txt
1 yi
2 er
3 san
s193157@GOJ:/tmp$ paste /tmp/a.txt /tmp/b.txt
1 one 1 yi
2 two 2 er
3 three 3 san
6.7 正则表达式¶
-
用一个字符串描述满足某些特征的字符串集合
-
shell 元字符
*
:表示任意多个任意字符?
[]
-
正则表达式 正则表达式手册 (oschina.net)
|
:表示或者.
:表示匹配任意字符
^
:表示位于一行的开始
$
:表示一行的结束
-
\
:转义 -
?
:?
前的一个字符至多出现一次
-
*
:代表一个字符不出现或者出现任意多次
6.8 grep¶
- 找出包含查找单位的行
- 支持正则表达式
-w
全字匹配-i
忽略大小写-n
列出行号-c
只返回匹配行数-l
只列出文件名-v
输出不匹配的行egrep
增强版,支持|
s193157@GOJ:/tmp$ grep Kim /tmp/student
Jason Kim ECE 3.97
s193157@GOJ:/tmp$ grep CS /tmp/student
Pam Meyer CS 3.61
Jim DAVIS CS 12.71
Jim DAVIS EECS 12.71
s193157@GOJ:/tmp$ grep '\<CS' /tmp/student
Pam Meyer CS 3.61
Jim DAVIS CS 12.71
s193157@GOJ:/tmp$ grep -w CS /tmp/student
Pam Meyer CS 3.61
Jim DAVIS CS 12.71
6.9 Stream editor - Sed¶
流水线,非交互的编辑器
- 格式
-
注意
- 不会修改源文件的内容,只是将修改后的内容输出到屏幕
- 给定地址或者正则表达式后,根据地址修改,不给定地址,默认对每一行修改
sed
遵循最长匹配原则
-
选项
- 示例文件
jonh Doe ECE 3.54 Pam Meyer CS 3.61 Jim DAVIS CS 12.71 Jason Kim ECE 3.97 Amy Nash ECE 2.38 Jim DAVIS EECS 12.71
- 删除
-d
- 替换
's/seg/string'
把CS替换成EECS sed 's/\<CS/EECS/' student 对这行每一个匹配的都进行替换 g命令 sed 's/[0-9\.]/*/g' student 搜索包含某字符串的 /要包含的字符串/s/要替换的/替换成的/
- 打印
-p
(-n
不默认输出缓冲区的内容)
看中间的2-4行 s193157@GOJ:/tmp$ sed '2,4p' student 此时为 jonh Doe ECE 3.54 Pam Meyer CS 3.61 Pam Meyer CS 3.61 Jim DAVIS CS 12.71 Jim DAVIS CS 12.71 Jason Kim ECE 3.97 Jason Kim ECE 3.97 Amy Nash ECE 2.38 Jim DAVIS EECS 12.71 观察到2,4行输出了两次 因为sed会默认输出缓冲区的内容 -n 不默认输出缓冲区的内容 s193157@GOJ:/tmp$ sed -n '2,4p' student 此时为 Pam Meyer CS 3.61 Jim DAVIS CS 12.71 Jason Kim ECE 3.97
- 插入行
-i
在特定行前面加一个注释 s193157@GOJ:/tmp$ sed '/\<CS/i high_salary/' student jonh Doe ECE 3.54 high_salary/ Pam Meyer CS 3.61 high_salary/ Jim DAVIS CS 12.71 Jason Kim ECE 3.97 Amy Nash ECE 2.38 Jim DAVIS EECS 12.71 在后面加 s193157@GOJ:/tmp$ sed '/\<CS/i high_salary/a' student jonh Doe ECE 3.54 high_salary/a Pam Meyer CS 3.61 high_salary/a Jim DAVIS CS 12.71 Jason Kim ECE 3.97 Amy Nash ECE 2.38 Jim DAVIS EECS 12.71
!
取反
s193157@GOJ:/tmp$ sed -n '/\<CS/!p' student jonh Doe ECE 3.54 Jason Kim ECE 3.97 Amy Nash ECE 2.38 Jim DAVIS EECS 12.71
- 全部匹配
-g
- 禁用账号:使用
sed
更改 password 的最后一个字段 - 转换换行符:使用
sed
将\r
去掉- win
\r\n
- Linux
\n
- win
6.10 Awk¶
A programming language used for manipulating data and generating reports
- Form of awk command
awk ‘pattern’ filename
awk ‘{action}’ filename
awk ‘pattern {action}’ filename
不默认输出非匹配行
s193157@GOJ:/tmp$ awk '/Kim/' student //找出包含Kim的一行
s193157@GOJ:/tmp$ awk '{print $1,$2}' student //输出文件的第一个和第二个字段
// 默认的分隔符是空格或者定位符\t
// PS:cut默认的字段分割符是定位符 要用-d转换
//awk使用-F指定字段分隔符
s193157@GOJ:/tmp$ awk -F: '{print $1,$2}' /etc/passwd
//用空格和冒号同时做字段分隔符
s193157@GOJ:/tmp$ awk -F'[ :]' '{print $2}' databook //'[ ]' 表示一个集合
//输出CS的行的第一个和第二个字段
s193157@GOJ:/tmp$ awk '/\<CS/{print $1,$2}' student
//如果需要查普通的逗号和$ 要加双引号
- Awk's pattern is similar to C expression
Relationship expression
s193157@GOJ:/tmp$ awk '$4>3.5{print $1,$2}' student
s193157@GOJ:/tmp$ awk '$3=="CS"' student //也可以进行字典序比较
s193157@GOJ:/tmp$ awk '$2~/^D/' student //输出lastname以大写D开头的
//~为匹配运算符
//!~为不匹配
Conditional expression
s193157@GOJ:/tmp$ awk '$3=="CS"&&$4>4' student
Arithmetic expression
+,-,*,/,%,^(求幂次)
s193157@GOJ:/tmp$ awk '($2+$3+$4)/3>70{print $0,($2+$3+$4)/3}' score
Compound patterns
Range patterns
内置变量
$0 //一整行
NR //行号
s193157@GOJ:~$ awk 'NR>=2&&NR<=4' /tmp/student
s193157@GOJ:/tmp$ awk '{print NR,$0}' hello.cpp //添加行号
Last update:
August 27, 2023
Created: August 27, 2023
Created: August 27, 2023