Tips to get faster results with a data archive search engine

by SirHugs - 24 February, 2022 - 08:22 PM
SirHugs  
#1
I'm currently building a data archive search engine with my dev, and right now we're only able to search through a 4 GB file in 0.7 seconds, and we'd like to get results faster. Some people in the shoutbox were saying you need a good .sh file and that they can search through 15 billion records in 10 seconds, but they didn't elaborate. So if you could help me with this, I'll figure out a way to pay you back somehow. Just let me know, I'll be very grateful!
UberFuck  
#2
Hate to respond with questions, but could you elaborate on what you mean by "data search archive"?  What kind of data are you looking at?  What technology stack are you currently using?  Also, is this a private project (where it's only going to be run on your hardware)?

In general...something to consider when reading large files is that you have to read the data from disk into memory before you can perform any operations on it. That's the reason why you now have database servers with terabytes of RAM...if they can keep their most-used datasets in memory then you don't have to wait for any disk read operations. If you can, lower the disk IO as much as possible.
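
To make the disk IO point concrete, here's a rough Node.js sketch (just an illustration; db.txt and the search term are placeholders, not anything from this thread) of keeping the dataset in memory: pay the disk read once, then answer every search from RAM.

Code:
// Rough sketch: read the file from disk once, keep it in memory,
// and serve every later search from RAM with no disk IO per query.
const fs = require('fs');

// One-time disk read (the slow part).
const lines = fs.readFileSync('db.txt', 'utf8').split('\n');

// Pure in-memory scan; repeated calls never touch the disk again.
function search(term) {
  return lines.filter(line => line.includes(term));
}

console.log(search('someone@example.com'));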
215B5D  
#3
I presume by "good sh file" they mean a shell script that takes some arguments and may look a little like this:
 
Code:
grep <string> <file>    # e.g. grep 'username' db.txt  (same as cat db.txt | grep username, without the extra cat)

There are some good programs for searching through DBs; one I recommend is ripgrep (written in Rust).
However, it'd probably be more beneficial and lightweight to write your own; try using C/C++ for it :)

For communicating between your webserver and the C/C++ program, look into IPC (inter-process communication).
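
To make the IPC idea concrete, here's a minimal Node.js sketch (assuming the webserver is Node and ripgrep is on PATH; db.txt and port 3000 are placeholders, and a custom C/C++ binary could be swapped in for rg): the server spawns the search tool as a child process and streams its stdout back, which is the simplest form of inter-process communication.

Code:
// Minimal sketch: web server that shells out to ripgrep per request.
const http = require('http');
const { spawn } = require('child_process');

http.createServer((req, res) => {
  const term = new URL(req.url, 'http://localhost').searchParams.get('q') || '';
  // -F = fixed-string match, -N = no line numbers; stdout is piped straight to the response.
  const rg = spawn('rg', ['-F', '-N', term, 'db.txt']);
  res.writeHead(200, { 'Content-Type': 'text/plain' });
  rg.stdout.pipe(res);
  rg.on('error', () => res.end('search failed'));
}).listen(3000);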
SirHugs  
#4
(This post was last modified: 25 February, 2022 - 05:51 AM by SirHugs.)
(24 February, 2022 - 09:55 PM)foxegado Wrote:
Hate to respond with questions, but could you elaborate on what you mean by "data search archive"?  What kind of data are you looking at?  What technology stack are you currently using?  Also, is this a private project (where it's only going to be run on your hardware)?

In general...something to consider when reading large files is that you have to read the data from disk into memory before you can perform any operations on it. That's the reason why you now have database servers with terabytes of RAM...if they can keep their most-used datasets in memory then you don't have to wait for any disk read operations. If you can, lower the disk IO as much as possible.

Ah okay, thank you so much! What I mean by that is searching databases, like finding an email and password with an email search. He's already indexed one of the DBs, and this is being run on InterServer. And jeez, yeah, I don't know how expensive terabytes of RAM is gonna be lmfao, but thank you for the advice about lowering the disk IO. If you have any other advice I'd be really grateful. Also, I have no clue what technology stack he's using; my dev is doing most of it. We're just trying to figure out a way to search through it faster, as it currently takes 0.7s for a 30M file.

(24 February, 2022 - 10:47 PM)215B5D Wrote:
I presume by "good sh file" they mean a shell script that takes some arguments and may look a little like this:
 
Code:
grep <string> <file>    # e.g. grep 'username' db.txt  (same as cat db.txt | grep username, without the extra cat)

There are some good programs for searching through DBs; one I recommend is ripgrep (written in Rust).
However, it'd probably be more beneficial and lightweight to write your own; try using C/C++ for it :)

For communicating between your webserver and the C/C++ program, look into IPC (inter-process communication).

Thank you sir, I'll try it out! I really appreciate the advice!
UberFuck  
#5
(25 February, 2022 - 05:51 AM)hugging Wrote:
And jeez, yeah, I don't know how expensive terabytes of RAM is gonna be lmfao, but thank you for the advice about lowering the disk IO.

Lol, yeah...in my last job I had a few database servers that cost over $100k each for just the hardware.  Wasn't saying you need that, just making a point that you want to reduce disk IO (and definitely network IO if not using DAS).

Not sure if we are talking apples to apples though.  It almost sounds like he's reading directly from a plain text file (0.7s for a 30M file)...not from a database service (ie Microsoft SQL, Oracle, MySQL, Postgres, MongoDB).  You should be able to import and parse a combo file into a database table, and once it's in the table create indexes on the username and password columns for querying.
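
As a sketch of what that looks like (using MongoDB only because it's one of the services named above; the connection string, database/collection names and the email are all placeholders):

Code:
// Sketch: once the combo data is in a collection, indexes turn lookups
// from full scans into fast point queries.
const { MongoClient } = require('mongodb');

async function main() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const creds = client.db('archive').collection('creds');

  // Build the indexes once, after the import.
  await creds.createIndex({ email: 1 });
  await creds.createIndex({ password: 1 });

  // Indexed lookup instead of grepping a flat file.
  console.log(await creds.findOne({ email: 'someone@example.com' }));

  await client.close();
}

main();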

I would recommend looking at the code for existing projects for the Breach Compilation leaks (ie collection #1, collection #2, etc).  Here's a few I found in a couple minutes of googling:  
https://www.tevora.com/threat-blog/diy-l...redential/
https://github.com/sensepost/Frack
https://github.com/petercunha/skidloader
SirHugs  
#6
(This post was last modified: 26 February, 2022 - 04:27 AM by SirHugs.)
(25 February, 2022 - 01:25 PM)foxegado Wrote:
(25 February, 2022 - 05:51 AM)hugging Wrote:
And jeez, yeah, I don't know how expensive terabytes of RAM is gonna be lmfao, but thank you for the advice about lowering the disk IO.

Lol, yeah...in my last job I had a few database servers that cost over $100k each for just the hardware.  Wasn't saying you need that, just making a point that you want to reduce disk IO (and definitely network IO if not using DAS).

Not sure if we are talking apples to apples though.  It almost sounds like he's reading directly from a plain text file (0.7s for a 30M file)...not from a database service (ie Microsoft SQL, Oracle, MySQL, Postgres, MongoDB).  You should be able to import and parse a combo file into a database table, and once it's in the table create indexes on the username and password columns for querying.

I would recommend looking at the code for existing projects for the Breach Compilation leaks (ie collection #1, collection #2, etc).  Here's a few I found in a couple minutes of googling:  
https://www.tevora.com/threat-blog/diy-l...redential/
https://github.com/sensepost/Frack
https://github.com/petercunha/skidloader

Yeah, thank you! He said he's already using MongoDB, and that he made scripts in Node.js to download the data and insert it into MongoDB.

He's currently looking through those GitHub repos, thank you for all your help!

He said he really thanks you; apparently it's super helpful.

Also, he wanted me to ask this, if you have an answer to it: "Does reading the file in batches (say 100 lines per batch, for example) into memory and then doing the necessary operations have any performance advantage over just reading the file line by line?"
UberFuck  
#7
Glad I could help.
 
(26 February, 2022 - 01:43 AM)hugging Wrote:
Also, he wanted me to ask this, if you have an answer to it: "Does reading the file in batches (say 100 lines per batch, for example) into memory and then doing the necessary operations have any performance advantage over just reading the file line by line?"

Generally yes, reading in large chunks will perform better, but a lot depends on what programming language you're using, what libraries are being used to read, and what operations you're performing on each line (i.e. between each read). The only way to really tell which is going to perform best for you is to try different methods and benchmark them. I'm not a Node.js developer, so I don't really have any recommendations regarding methods or third-party libraries to try.
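
If it helps with the benchmarking, here's a rough Node.js comparison to start from (db.txt and the search term are placeholders): one pass reads line by line with readline, the other reads the stream in 1 MiB chunks and splits lines itself. Only timing both on the real data will tell you which wins on your setup.

Code:
// Rough benchmark sketch: line-by-line vs chunked reads of the same file.
const fs = require('fs');
const readline = require('readline');

async function lineByLine(path, term) {
  let hits = 0;
  const rl = readline.createInterface({ input: fs.createReadStream(path) });
  for await (const line of rl) {
    if (line.includes(term)) hits++;
  }
  return hits;
}

async function chunked(path, term) {
  let hits = 0, tail = '';
  const stream = fs.createReadStream(path, { encoding: 'utf8', highWaterMark: 1024 * 1024 });
  for await (const chunk of stream) {
    const lines = (tail + chunk).split('\n');
    tail = lines.pop(); // carry the partial last line into the next chunk
    for (const line of lines) if (line.includes(term)) hits++;
  }
  if (tail.includes(term)) hits++;
  return hits;
}

async function main() {
  console.time('line-by-line');
  console.log(await lineByLine('db.txt', 'someone@example.com'));
  console.timeEnd('line-by-line');

  console.time('chunked');
  console.log(await chunked('db.txt', 'someone@example.com'));
  console.timeEnd('chunked');
}

main();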

For inserting into MongoDB you want to use bulk inserts, or use the mongoimport utility if you can (depending on the format of the text files). Avoid inserting/updating/upserting one row at a time (aka RBAR, "row by agonizing row").
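
Here's a hedged sketch of the bulk-insert side in Node.js (the connection string, db/collection names, batch size, and the email:password line format are all assumptions, not anything confirmed in the thread):

Code:
// Sketch: batch a combo file into insertMany calls instead of one insert per line (RBAR).
const fs = require('fs');
const readline = require('readline');
const { MongoClient } = require('mongodb');

async function importCombo(path) {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const creds = client.db('archive').collection('creds');

  let batch = [];
  const rl = readline.createInterface({ input: fs.createReadStream(path) });
  for await (const line of rl) {
    const [email, password] = line.split(':'); // assumes email:password combo lines
    if (!email || !password) continue;
    batch.push({ email, password });
    if (batch.length >= 10000) {               // one round trip per 10k docs, not per doc
      await creds.insertMany(batch, { ordered: false });
      batch = [];
    }
  }
  if (batch.length) await creds.insertMany(batch, { ordered: false });
  await client.close();
}

importCombo('combo.txt');

If the files are clean enough, mongoimport can replace all of that with something like: mongoimport --db archive --collection creds --type csv --fields email,password --file combo.csv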
