Collections
At this point, you have worked with collections a little bit. This chapter covers them in more detail. A collection is a group of records: it can be either a description of which models to fetch from the database or a list of the records themselves.
Iterating through Records
The most common thing to do with a collection is get the records within it. Collections behave the same way as Python lists, so you can iterate through them just as any other list:
>>> from intro import *
>>> users = User.all()
>>> for user in users:
... print user.get('username')
'john'
'jane'
...
Collection.records
Under the hood, this actually calls the records method for iteration. The same could have been written as:
>>> from intro import *
>>> users = User.all()
>>> for user in users.records():
... print user.get('username')
'john'
'jane'
...
Why is it important to know that? The first time a collection's records method is called, it checks whether the records have already been loaded from the database or whether it needs to load them. Every subsequent call returns the data that was loaded previously.
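The load-once behavior can be pictured with a small stand-alone sketch. This is plain Python and not ORB's implementation; the LazyCollection class, its loader argument, and the load_count attribute are hypothetical names used only for illustration.

```python
class LazyCollection(object):
    """Toy stand-in for a collection; not ORB's actual class."""

    def __init__(self, loader):
        self._loader = loader   # hypothetical callable that hits the "database"
        self._cache = None      # nothing has been loaded yet
        self.load_count = 0     # how many times the backend was asked

    def records(self):
        # Load on the first call only; reuse the cached list afterwards.
        if self._cache is None:
            self._cache = list(self._loader())
            self.load_count += 1
        return self._cache

    def __iter__(self):
        # Iterating the collection delegates to records().
        return iter(self.records())


users = LazyCollection(lambda: ['john', 'jane'])
first_pass = [name for name in users]   # triggers the single load
second_pass = users.records()           # served from the cache
```

After both passes, load_count is still 1, which is the property the paragraph above describes.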
Collection.iterate
Often, this is a perfectly acceptable way to iterate through a list of users. Sometimes, however, particularly with large data sets, you will want to speed up the process by loading only a subset of the records at a time. You can do this with the iterate method:
>>> from intro import *
>>> users = User.all()
>>> for user in users.iterate():
... print user.get('username')
'john'
'jane'
...
Great. So what is different here?
In the call to records, all the records of the collection are fetched first, then iterated through.
In the call to iterate, the results are batched. The iterator takes an optional batch keyword argument. As you loop through and the iterator reaches the end of its current batch, it selects the next subset of data. The default batch size is 100 records, so a new query is sent to the backend every 100 records. You can change the batch size with the batch keyword:
>>> from intro import *
>>> users = User.all()
>>> # select 10 users at a time vs. 100
>>> for user in users.iterate(batch=10):
... print user.get('username')
'john'
'jane'
...
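To make the batching idea concrete, here is a minimal sketch of a batched iterator in plain Python. It is an illustration under stated assumptions, not ORB's code: the fetch(start, limit) callable and the fake_fetch helper are hypothetical stand-ins for a backend query.

```python
def iterate(fetch, batch=100):
    """Yield records one at a time, fetching `batch` records per query.

    `fetch(start, limit)` is a hypothetical callable standing in for a
    backend query; it returns at most `limit` records starting at `start`.
    """
    start = 0
    while True:
        chunk = fetch(start, batch)
        for record in chunk:
            yield record
        # A short (or empty) chunk means the data set is exhausted.
        if len(chunk) < batch:
            break
        start += batch


DATA = list(range(1, 251))   # pretend table with 250 rows
queries = []                 # log of (start, limit) pairs sent to the backend


def fake_fetch(start, limit):
    queries.append((start, limit))
    return DATA[start:start + limit]


seen = list(iterate(fake_fetch, batch=100))
```

With 250 rows and a batch of 100, the log holds three queries, and only one batch is held in memory at a time.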
Pagination
Similarly to iteration, you can also limit the records you fetch via paging. Paging with ORB is very easy -- just define the size of the page and which page you want to get. Assume we have a database with 100 users, with IDs 1-100. If I were to query that using paging, it would look like:
>>> from intro import *
>>> users = User.all()
>>> print len(users)
100
>>> print users.pageCount(pageSize=25)
4
>>> page_1 = users.page(1, pageSize=25)
>>> page_2 = users.page(2, pageSize=25)
>>> print page_1.first().id(), page_2.first().id()
1 26
If you want to pre-define your page size, you can do so in the all or select methods as well:
>>> from intro import *
>>> users = User.all(pageSize=25)
>>> page_2 = users.page(2)
>>> print page_2.context().start, page_2.context().limit
25 25
The pre-defined page size is passed via the context, which we will cover in more detail later.
Start and Limit
Pagination is often convenient to work with, but what really drives it is the start and limit options. These determine which record number to start at and how many records to fetch. Paging works by saying "limit to pageSize, and start at (page number - 1) * pageSize". You can also set these values directly:
>>> from intro import *
>>> page_2 = User.select(start=25, limit=25)
>>> print page_2.context().start, page_2.context().limit
25 25
The above would be the same way to represent page 2 in the paging structure.
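The page 2 example corresponds to a start of 25 and a limit of 25 with a page size of 25, which implies 1-based page numbers. The arithmetic can be sketched as follows; page_window and page_count are hypothetical helper names, not part of ORB.

```python
import math


def page_window(page, page_size):
    """Translate a 1-based page number into a (start, limit) pair."""
    return (page - 1) * page_size, page_size


def page_count(total, page_size):
    """Number of pages needed to hold `total` records."""
    return int(math.ceil(total / float(page_size)))


window = page_window(2, 25)   # page 2 with 25 records per page
pages = page_count(100, 25)   # 100 users split into pages of 25
```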
Slicing
In addition to paging and manually setting start/limit, collections also support Python's slicing syntax. Before a collection has fetched records from the backend, you can slice it down and that will update the start and limit values accordingly.
To represent page 2 again using slicing:
>>> from intro import *
>>> users = User.all()
>>> page_2 = users[25:50]
>>> print page_2.context().start, page_2.context().limit
25 25
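The mapping from a slice to start and limit values can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic; slice_window is a hypothetical helper, and it deliberately ignores stepped and negative slices.

```python
def slice_window(s):
    """Turn a simple slice (no step, non-negative bounds) into (start, limit)."""
    if s.step not in (None, 1):
        raise ValueError('stepped slices are not handled in this sketch')
    start = s.start or 0
    # An open-ended slice has no limit; otherwise limit is the slice length.
    limit = None if s.stop is None else max(s.stop - start, 0)
    return start, limit


page_2 = slice_window(slice(25, 50))   # same window as users[25:50]
```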
Reading values only
Sometimes, all you need are the values from a collection rather than full object models. You can improve performance by requesting only the data your system needs from the backend. To do this, you can ask specifically for values -- or even just unique values -- through the collection.
>>> users = User.all()
>>> print users.values('username')
'john'
'jane'
...
If you request only a single column, you will get a list containing the values. If you request more than one column, you will get a list of lists back, where each inner list follows the order of the columns requested.
>>> users = User.all()
>>> print dict(users.values('id', 'username'))
{
1: 'john',
2: 'jane',
...
}
If you are looking at non-unique columns and just want to get a sense of all the unique values within a column, you can use the distinct method to fetch only unique entries:
>>> users = User.all()
>>> print sorted(users.values('first_name'))
'Jane'
'John'
'John'
...
>>> print sorted(users.distinct('first_name'))
'Jane'
'John'
...
As you can see, the values method will return all values for a column within your collection -- including the duplicate first name of 'John'. Using distinct, we just get one instance per value.
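The shapes returned for one column, several columns, and distinct values can be mimicked with plain dictionaries. The sketch below is not ORB code; ROWS, values, and distinct are hypothetical stand-ins that only mirror the behavior described above.

```python
ROWS = [
    {'id': 1, 'username': 'john', 'first_name': 'John'},
    {'id': 2, 'username': 'jane', 'first_name': 'Jane'},
    {'id': 3, 'username': 'johnny', 'first_name': 'John'},
]


def values(rows, *columns):
    """One column -> flat list; several columns -> list of per-row lists."""
    if len(columns) == 1:
        return [row[columns[0]] for row in rows]
    return [[row[c] for c in columns] for row in rows]


def distinct(rows, column):
    """Unique values for one column, keeping first-seen order."""
    seen = []
    for value in values(rows, column):
        if value not in seen:
            seen.append(value)
    return seen


all_names = sorted(values(ROWS, 'first_name'))       # duplicates kept
unique_names = sorted(distinct(ROWS, 'first_name'))  # duplicates dropped
by_id = dict(values(ROWS, 'id', 'username'))         # pairs become a mapping
```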
Note:
If you request a reference by name from the collection, it will return an inflated model object unless you specify inflate=False. If you request a field, you will get the raw reference value.
>>> users = User.all()
>>> users.values('role')
<intro.Role>
<intro.Role>
...
>>> users.values('role_id')
1
1
...
Grouping
As we just showed, you can create a dictionary based on the returned data from the values method. You can also use the built-in grouped method for a collection to group data via multiple columns into a nested tree structure.
>>> users = User.all()
>>> users.grouped('role')
{<intro.Role>: <orb.Collection>, <intro.Role>: <orb.Collection>, ..}
If you pass in the optional preload=True keyword, this method will fetch all the records from the database first and then group them together. The resulting collections within the groups will already be loaded, and no further queries will be made. This is useful when you are not loading a large set of data.
If you choose preload=False (which is the default), then only the values of the grouping columns will be selected. The resulting collections within the groups will be un-fetched collections with query logic built in to filter on the column values that define each group. This results in more queries but smaller data sets, which can be useful, for instance, when generating a tree that loads its children dynamically.
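As a rough picture of the nested structure, here is a plain-Python sketch that groups in-memory rows by several columns. It corresponds loosely to the preloaded case, where all rows are already available. The grouped function and USERS data are hypothetical and do not reproduce ORB's collections or queries.

```python
from collections import OrderedDict

USERS = [
    {'username': 'john', 'role': 'admin', 'active': True},
    {'username': 'jane', 'role': 'editor', 'active': True},
    {'username': 'tim', 'role': 'admin', 'active': False},
]


def grouped(rows, *columns):
    """Group rows into a nested mapping keyed by each column in turn."""
    if not columns:
        return list(rows)
    head, rest = columns[0], columns[1:]
    buckets = OrderedDict()
    for row in rows:
        buckets.setdefault(row[head], []).append(row)
    # Recurse so that each bucket is grouped by the remaining columns.
    return OrderedDict(
        (key, grouped(bucket, *rest)) for key, bucket in buckets.items()
    )


tree = grouped(USERS, 'role', 'active')
```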
Retrieving Records
Short of iterating through the entire record set, there are a few different ways that you can retrieve a record from a collection.
Collection.first and Collection.last
Often, you will want to get the first and/or last record of a collection. Maybe you are just looking at the last comment that was made in a thread, or the original edit made to a record. The first and last methods will allow you to do this easily:
>>> comments = Comment.all()
>>> # get the first comment made
>>> first_comment = comments.first()
>>> # get the last comment made
>>> last_comment = comments.last()
How does the system know what is first or last? If not specified, it defaults to ordering by record ID: first returns the record with the lowest ID, while last returns the one with the highest.
This doesn't work for non-sequential IDs, however, or if you want to sort on some other criteria. To handle this, simply provide the ordering information:
>>> comments = Comment.all()
>>> comments.ordered('+created_by').first()
>>> comments.ordered('+created_by').last()
The ordered method will return a copy of the collection with a new order context (see the Contexts page for more detail on ordering).
You could also pre-order the collection:
>>> comments = Comment.all(order='+created_by')
>>> comments.first()
>>> comments.last()
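The effect of an order on first and last can be sketched with in-memory rows. This is an illustration only: the helper functions are hypothetical, and treating a leading '-' as descending is an assumption here, since the examples above only show the '+' prefix.

```python
COMMENTS = [
    {'id': 3, 'created_by': 'jane'},
    {'id': 1, 'created_by': 'john'},
    {'id': 2, 'created_by': 'adam'},
]


def ordered(rows, spec):
    """Sort rows by a '+column' (ascending) or '-column' (assumed descending) spec."""
    column = spec.lstrip('+-')
    return sorted(rows, key=lambda row: row[column], reverse=spec.startswith('-'))


def first(rows, spec='+id'):
    """First row under the given order; defaults to the lowest ID."""
    result = ordered(rows, spec)
    return result[0] if result else None


def last(rows, spec='+id'):
    """Last row under the given order; defaults to the highest ID."""
    result = ordered(rows, spec)
    return result[-1] if result else None


lowest_id = first(COMMENTS)
highest_id = last(COMMENTS)
first_author = first(COMMENTS, '+created_by')
```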
Collection.at
You can also look up a record by its index in the collection, using either the at method or __getitem__ (square-bracket indexing):
>>> comments = Comment.all()
>>> first = comments.at(0)
>>> last = comments.at(-1)
>>> first = comments[0]
>>> last = comments[-1]
Warning:
This is more expensive than using first and last because it will query all the records and then return the record by index. You should only use this if you already have, or are going to get, all the records.
For instance, you may want to do something like:
>>> users = User.all()
>>> for i in range(len(users)):
... user = users[i]
... print user.get('username')
'john'
'jane'
...
In this case, the first time users[i] is called, all the records are looked up. Every subsequent request by index will be served from the already cached records, and no more database calls will be made.
Modifying Records
After a collection has been defined, there are a number of ways to modify the records.
Collection.refine
The most common modification is to refine a selection. This lets you build narrower selections on top of an existing collection while keeping the base collection available for reuse.
>>> users = User.all()
>>> johns = users.refine(where=Q('first_name') == 'John')
>>> janes = users.refine(where=Q('first_name') == 'Jane')
The refined collection will join the query of the base collection with any modifications and return a new query.
Collection.ordered
>>> users = User.all()
>>> alpha = users.ordered('+username')
The ordered method will update the order-by properties of the collection's context and return a newly updated collection.
Collection.reversed
>>> users = User.all()
>>> alpha = users.ordered('+username')
>>> rev_alpha = alpha.reversed()
The reversed method will invert the order-by property and return a new collection in reverse order.
Editing Records
In addition to using collections as a means of selecting and retrieving records, they are also used for bulk creation, editing and deletion.
Collection initialization
So far, we've seen two ways of accessing collections -- from the Model selectors (all and select) and from collectors. You can also just create an empty collection.
>>> empty = orb.Collection()
When you create a new blank collection, it will be null by default (no information whatsoever):
>>> empty = orb.Collection()
>>> empty.isNull()
True
You can also initialize a collection with a list of models.
>>> user_a = User({'username': 'tim', 'password': 'pass1'})
>>> user_b = User({'username': 'tam', 'password': 'pass2'})
>>> users = orb.Collection([user_a, user_b])
What can you do with this? Working and iterating through these records in memory is really not much different than using a regular list. However, you can use this to save these records to your backend data store with a single save call now:
>>> users.save()
Calling the save method will create 2 new records in the data store.
Collection.add
In addition to initialization, you can dynamically add records to collections using the add method. One thing to note: this method is context aware, meaning that it behaves slightly differently depending on where the collection came from.
>>> users = orb.Collection()
>>> users.add(User({'username': 'tim', 'password': 'pass1'}))
>>> users.add(User({'username': 'tam', 'password': 'pass2'}))
>>> users.save()
This example starts with a blank collection, and then adds users to a list in memory, and then saves. This is functionally the same as initializing the collection with a list of records.
When you access a collection via a collector, such as a ReverseLookup or Pipe, the add method will actually associate the given record in the data store directly.
>>> group = Group.byName('admins')
>>> john = User.byUsername('john')
>>> # add a new record through a pipe
>>> john.groups().add(group)
>>> # add a new record through a reverse lookup
>>> role = Role.byName('Software Engineer')
>>> role.users().add(john)
In these examples, the add method will dynamically associate the records through the collectors in the data store. These calls do not require a save afterwards.
For a pipe, a new record is created in the intermediary table (in this case, a new GroupUser record with the user and group relation). If the relation already exists, a RecordNotFound will be raised.
For a reverse lookup, the add method will set the reference to the calling record. In this example, we have a Role model, a role reference on the User model, and a ReverseLookup that goes through the User.role column. Calling role.users().add(john) would set the role column on john to the calling role instance and save the record. Again, this does not require saving afterwards.
Collection.create
The create method acts the same way as the add method, except that instead of accepting a record, it accepts a dictionary of column/value pairs and will dynamically create a record rather than associate an existing one.
For this method to work, you will need to have defined what model is associated with a collection. If the collection is returned as a part of a collector, then it will automatically be associated.
>>> # create a generic user
>>> users = orb.Collection(model=User)
>>> users.create({'username': 'tim', 'password': 'pass1'})
>>> # create a pipe record, and associate it
>>> user = User.byUsername('john')
>>> user.groups().create({'name': 'reporters'})
>>> # create a reversed lookup
>>> role = Role.byName('Software Engineer')
>>> role.users().create({'username': 'tam', 'password': 'pass2'})
In these examples, we first created a basic User record. The second example will create a Group called 'reporters', and then associate it with the user john. The final example will create a new user, automatically setting its role to the 'Software Engineer' record.
Creating vs. Updating
The save function on a collection can be used for bulk updates in addition to bulk creation. If you are dealing with multiple records, some of which are new and some of which are updates, the system will automatically separate them out for you and create two queries.
>>> users = User.all()
>>> users.at(0).set('username', 'peter')
>>> users.at(1).set('username', 'paul')
>>> users.add(User({'username': 'mary', 'password': 'pass2'}))
>>> users.save()
This will modify 2 records and create 1 -- with a total of 2 queries that will be made to the backend data store.
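The split into creates and updates can be sketched as a simple partition. This is a hedged illustration, not ORB's logic: plan_save is a hypothetical helper, and treating a record without an id as new is an assumption made for the example.

```python
def plan_save(records):
    """Split records into those to insert and those to update.

    Assumption for this sketch: a record without an 'id' has never been
    saved, so it needs an insert; anything with an 'id' needs an update.
    """
    to_create = [r for r in records if r.get('id') is None]
    to_update = [r for r in records if r.get('id') is not None]
    return to_create, to_update


records = [
    {'id': 1, 'username': 'peter'},
    {'id': 2, 'username': 'paul'},
    {'username': 'mary', 'password': 'pass2'},
]
to_create, to_update = plan_save(records)

# One statement per non-empty batch: here, one insert and one update.
statements = [batch for batch in (to_create, to_update) if batch]
```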
Deleting Records
Collection.delete
Bulk deletions can also be performed using collections. Again, depending on where the collection comes from, different deletion options will be available to you.
>>> users = User.select(where=Q('first_name') == 'John')
>>> users.delete()
Calling this will delete all the users whose first name is 'John' from the database in one go.
If you are working with the results of a collector (reverse lookup or pipe), you have a few more options for removing records.
Collection.remove
>>> # remove from a pipe
>>> g = Group.byName('editors')
>>> u = User(1)
>>> u.groups().remove(g)
>>> # remove from a reverse lookup
>>> r = Role(1)
>>> r.users().remove(u)
In this example, removing the 'editors' group from the pipe will not delete the group record. Rather, it will remove the record from the intermediary table. Both the user and group will still exist at the end of the remove.
The second example, removing from a reverse lookup, will not actually remove either the Role or the User record, but instead will set the User.role column value to null. This is how records are found for reverse lookups, so removing is simply setting the value to an empty value.
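The difference between the two kinds of removal can be modeled with plain data structures. The sketch below is not ORB code: the group_user set stands in for the intermediary table, and both helper functions are hypothetical.

```python
# (user_id, group_id) pairs standing in for rows of the intermediary table
group_user = {(1, 10), (1, 11)}
groups = {10: 'editors', 11: 'admins'}
users = {1: {'username': 'john', 'role_id': 1}}


def remove_from_pipe(through, user_id, group_id):
    """Drop only the linking row; the user and group records are untouched."""
    through.discard((user_id, group_id))


def remove_from_reverse_lookup(user_rows, user_id):
    """Clear the reference column; no record is deleted."""
    user_rows[user_id]['role_id'] = None


remove_from_pipe(group_user, 1, 10)
remove_from_reverse_lookup(users, 1)
```

After both calls, the 'editors' group and the user still exist; only the link and the reference value are gone.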
Collection.empty
This method is closer to the Collection.remove method than to the Collection.delete method. It will remove the intermediary records for pipes, disassociate reverse lookups from their references, and do nothing for empty collections.
>>> # empty a pipe
>>> u = User(1)
>>> u.groups().empty()
>>> # empty a reverse lookup
>>> r = Role(1)
>>> r.users().empty()
In these two examples, the first will remove all GroupUser records where the user_id is 1; it will not remove any groups or users. The second will not remove any records at all; instead, it will update all User records whose role_id is 1 and set it to None.
The next time the groups or users call is made for these records, the collection will be empty.
How expensive are collections?
As we discussed in the Models chapter, doing something like User.all() will return a collection of all the user records.
But what does that mean?
>>> from intro import *
>>> users = User.all()
At this point, the users variable does not actually have any records in it. It just knows that, when asked, it should go and fetch all User records and return them. Depending on the action you take next, ORB will optimize how it interacts with your backend store.
For instance, if all I care about is the number of users I have, and not the users themselves, doing:
>>> users = User.all()
>>> len(users)
Will query the database for just the count of the users based on the selection criteria (in this case, all records) and return that number.
But doing:
>>> users = User.all()
>>> list(users)
>>> len(users)
Will query the database for all the user records and cache the results. Then, when I call len, it will not query the database at all -- it will determine that the records are already loaded and just return the length of the loaded list.
It is important to understand what functions are available for collections, but it is also important to keep in mind their relative expense.
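The relative cost of these calls can be illustrated with a toy model that logs each query. FakeBackend and ToyCollection are hypothetical classes written for this sketch; they only imitate the behavior described above and are not ORB's implementation.

```python
class FakeBackend(object):
    """Hypothetical backend that logs the kind of each query it receives."""

    def __init__(self, rows):
        self.rows = rows
        self.log = []

    def count(self):
        self.log.append('count')
        return len(self.rows)

    def select(self):
        self.log.append('select')
        return list(self.rows)


class ToyCollection(object):
    """Toy collection: counts cheaply until its records have been loaded."""

    def __init__(self, backend):
        self._backend = backend
        self._records = None

    def records(self):
        if self._records is None:
            self._records = self._backend.select()
        return self._records

    def __len__(self):
        # Use the cached list when available; otherwise ask only for a count.
        if self._records is not None:
            return len(self._records)
        return self._backend.count()


backend = FakeBackend(['john', 'jane'])

unloaded = ToyCollection(backend)
count_only = len(unloaded)    # issues a single 'count' query

loaded = ToyCollection(backend)
loaded.records()              # issues a single 'select' query
cached_len = len(loaded)      # answered from the cache, no query
```

Each new ToyCollection starts with an empty cache, so the second instance cannot reuse anything the first one learned.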
Warning:
Each time Model.all or Model.select is called, a new collection object will be returned. This means all previous caching will not exist for the new instance.