While ODBC is API for accessing databases Microsoft Text ODBC driver preinstalled in Windows can be used for processing text files (txt and csv). That means SQL queries like SELECT can be executed to txt and csv files!
Using SELECT queries makes some advanced features all possible from command line without scripting!
- processing data as Numbers, Dates, Strings and other supported data types
- sorting rows with ORDER BY clause and grouping with GROUP BY clause
- selecting rows with conditions and LIKE clause
- finding minimum, maximum, and average values
- finding distinct values with SELECT DISTINCT query
In this article:
- Microsoft Text Driver connection string
- Schema.ini file to describe txt/csv file format
- examples of SQL queries to txt/csv files
Microsoft Text Driver connection string
Driver={Microsoft Text driver (*.txt; *.csv)};DBQ=C:\ProgramData\My Program\Data\;DriverId=27;MaxBufferSize=2048;PageTimeout=5
DBQ parameter
This is the only mandatory parameter telling the driver what folder to use. It may contain spaces requiring no quots:
DBQ=C:\ProgramData\My Program\Data\;
One can tell the driver to use current folder with dot character (.) as path value:
DBQ=.
Command line for nhdbt program:
Connect to files in current folder and run all SQL queries from test.sql
:
nhdbt "Microsoft Text driver (*.txt; *.csv)" "dbq=." test.sql
Microsoft Text Driver Schema.ini file
A Schema.ini file is required to describe data format in txt or csv files.
Schema.ini File (Text File Driver)
When the Text driver is used, the format of the text file is determined by using a schema information file. The schema information file is always named Schema.ini and always kept in the same directory as the text data source. The schema information file provides the IISAM with information about the general format of the file, the column name and data type information, and several other data characteristics. A Schema.ini file is always required for accessing fixed-length data. You should use a Schema.ini file when your text table contains DateTime, Currency, or Decimal data, or any time that you want more control over the handling of the data in the table.
Example of Schema.ini file for apache_access.csv
data file:
[apache_access.csv] Format=TabDelimited ColNameHeader=false CharacterSet=ANSI Col1=IPAddress char width 15 Col2=AccessTime date Col3=Request char width 250 Col4=Code Integer
As soon as Schema.ini file is saved in the folder the driver is able to accept connection.
DateTime format
Date and time formats is not explained on Schema.ini File (Text File Driver) so it can be a tricky thing.
By default the driver applies user’s short format.
For example default setting for English US will be dd/MM/yyyy HH:mm, but dd/MM/yyyy HH:mm:ss will also be accepted. Thus processing following log file with default English US setting will work while same attempt with European regional setting will fail because of date/time format mismatch:
70.69.152.165 02/22/2017 22:13:23 GET /blog/index.php HTTP/1.1... 128.196.108.201 02/22/2017 23:21:23 GET /favicon.ico HTTP/1.1... 147.106.118.104 02/22/2017 15:27:00 GET /company/ HTTP/1.1...
Date and time format can be changed with DateTimeFormat parameter. The custom format is applied to all Date columns in this file:
[apache_access.csv] ... DateTimeFormat=mm/dd/yyyy hh:nn:ss ...
It is important that formatting rules don’t follow standards used in Windows:
- nn has to be used instead of mm for minutes
- AM/PM not supported so it is case insensitive
US format | DateTimeFormat=mm/dd/yyyy hh:nn:ss |
European format | DateTimeFormat=dd/mm/yyyy hh:nn:ss |
SQL format | DateTimeFormat=yyyy-mm-dd hh:nn:ss |
When data format in the file does not match expected date/time format NULL value will be used.
Examples
Below are some examples of executing SQL queries to a Tab delimited file:
70.69.152.165 2017-02-22 19:39:27 GET /_media/logo_my_v2.png HTTP/1.1 304... 147.106.118.104 2017-02-22 18:00:37 GET /_css/master.1334356838.css HTTP/1.1 200... 65.98.119.36 2017-02-22 16:32:25 GET /_js/master.js HTTP/1.1 200...
Schema.ini file is as follows (not all columns have to be declared):
[apache_access.csv] Format=TabDelimited ColNameHeader=false CharacterSet=ANSI DateTimeFormat=yyyy-mm-dd hh:nn:ss Col1=IPAddress char width 15 Col2=AccessTime date Col3=Request char width 250 Col4=Code Integer
nhdbt program used for testing. All the files available to download at the end of this page.
Retrieve count of rows
SELECT COUNT allows easily find number of all or specific lines in the file.
How to count all rows in the file:
select count(*) from apache_access.csv Expr1000 ----------- 9946
How to count rows by Code:
select Code,count(Code) from apache_access.csv group by Code order by Code Code Expr1001 ----------- ----------- 200 7059 302 407 304 1525 401 224 403 5 404 527 500 75 503 124
Finding aggregated values
How to find minimum and maximum values:
select min(AccessTime),max(AccessTime) from apache_access.csv Expr1000 Expr1001 ------------------- ------------------- 2017-02-22 15:00:22 2017-02-22 23:33:31
Of course AVG and SUM also supported for Integer values.
Retrieve distinct values
How to list only distinct IP addresses sorted alphabetically:
select distinct(IPAddress) from apache_access.csv order by IPAddress asc IPAddress --------------- 101.92.120.16 128.196.108.201 14.47.165.89 147.106.118.104 161.71.8.142 19.174.45.8 221.125.19.252 30.75.225.192 34.87.4.6 46.139.54.184 5.35.225.115 52.87.131.109 65.98.119.36 70.69.152.165 78.235.33.64
Searching for text values
How to list users on Windows:
select distinct(UserAgent) from apache_access.csv where UserAgent like '%Windows%' UserAgent ------------------------------------------------------------------------------------------------------------ Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:23.0) Gecko/20131011 Firefox/23.0 Mozilla/5.0 (Windows NT 6.2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1467.0 Safari/537.36 Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/4.0; InfoPath.2; SV1; .NET CLR 2.0.50727; WOW64)
How to list Opera user agents:
select distinct(UserAgent) from apache_access.csv where UserAgent like '%Opera%' UserAgent ------------------------------------------------------------------------------------- Opera/9.80 (iPad; Opera Mini/7.1.32694/27.1407; U; en) Presto/2.8.119 Version/11.10 Opera/9.80 (iPhone; Opera Mini/7.1.32694/27.1407; U; en) Presto/2.8.119 Version/11.10
Limitations
A delimited text line contains one or more data values separated by delimiters: commas, tabs, or a custom delimiter. The same delimiter must be used throughout the file.
Tables are limited to a maximum of 255 fields. Field names are limited to 64 characters, and field widths are limited to 32,766 characters. Records are limited to 65,000 bytes.
A text file can be opened only for a single user. Multiple users are not supported.
Downloads
All files include sample log file, Schema.ini, SQL file, test.cmd, and nhdbt.exe are available for downloading from Google Drive:
processing-text-files-with-microsoft-text-driver.zip
See also
More information on Microsoft Text driver can be found at Microsoft:
Schema.ini File (Text File Driver)
Text File Format (Text File Driver)
nhdbt is freeware program for Windows to execute SQL queries from command line.
Leave a Reply